[DRBD-user] drbd module crashes at BUG at lru_cache.c:570 when the peer node goes diskless
Evzen Demcenko
demcenko at cldn.cz
Tue May 14 18:40:53 CEST 2019
I backported the hack from 9.0.17-1; now, under the same conditions, the node
still floods the log with "al_complete_io()" and "LOGIC BUG" messages,
but at least it does not crash.
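A simple loop like the one below is enough to keep an eye on the flood
(a rough sketch only; the grep pattern is just the two messages quoted
above, and "r1" is the resource from the config in the quoted report):

while true ; do
    # count the drbd warnings accumulated in the kernel ring buffer so far
    dmesg | grep -cE 'al_complete_io\(\) called on inactive extent|LOGIC BUG for enr='
    # confirm the resource is still up and Primary on both sides
    drbdadm role r1
    sleep 60
done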
Ing. Evzen Demcenko
Senior Linux Administrator
Cluster Design s.r.o.
On 5/13/19 6:43 PM, Evzen Demcenko wrote:
> If a drbd node in a primary-primary setup loses its disk for any reason
> (faulty disk, controller failure, etc., or even a manual detach) and there is R/W
> activity on both nodes, the second node eventually crashes with a kernel
> panic: "kernel BUG at
> /root/rpmbuild/BUILD/drbd-8.4.11-1/drbd/lru_cache.c:570!"
> Before the crash there are a lot of messages in kernel.log on the good
> node (the one with the attached disk), like:
>
> block drbd1: al_complete_io() called on inactive extent 57
> block drbd1: LOGIC BUG for enr=74
>
> Eventually, the "good" node crashes within minutes or hours depending
> on disk activity leaving the cluster without any data.
> I tested different versions from the 8.4 tree (8.4.6, 8.4.7-1, 8.4.9-1,
> 8.4.11-1), always with the same result.
> There is also no difference between real hardware and virtualized machines.
> The kernel is 2.6.32-754.12.1.el6.x86_64 on CentOS 6.10 with the latest
> updates. Other kernels were also tested, with the same outcome.
> A vmcore-dmesg.txt is attached to this email; the vmcore itself is
> available as well for every tested version. The core files are 35-50 MB, so
> I can't attach them to an email, but I'll be glad to share them in some
> other way if needed.
>
> [root at drtest-11 ~]# cat /etc/drbd.d/global_common.conf
> global {
>     usage-count yes;
> }
>
> common {
>     protocol C;
>
>     handlers {
>         pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh";
>         pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh";
>         local-io-error    "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh";
>     }
>
>     startup {
>         # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
>     }
>
>     disk {
>         resync-rate   100M;
>         on-io-error   detach;
>         al-extents    1447;
>         c-plan-ahead  32;
>         c-max-rate    1000M;
>         c-min-rate    80M;
>         c-fill-target 65536k;
>     }
>
>     net {
>         sndbuf-size    4096k;
>         rcvbuf-size    4096k;
>         timeout        100;   # 10 seconds (unit = 0.1 seconds)
>         connect-int    15;    # 15 seconds (unit = 1 second)
>         ping-int       15;    # 15 seconds (unit = 1 second)
>         ping-timeout   50;    # 5000 ms (unit = 0.1 seconds)
>         max-buffers    131072;
>         max-epoch-size 20000;
>         ko-count       0;
>         after-sb-0pri  discard-younger-primary;
>         after-sb-1pri  consensus;
>         after-sb-2pri  disconnect;
>         rr-conflict    disconnect;
>     }
> }
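> (For reference, the configuration drbdadm actually parses can be dumped
> with the stock "drbdadm dump" and compared against the files above;
> rough example:)
>
> [root at drtest-11 ~]# drbdadm dump r1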
>
> [root at drtest-11 ~]# cat /etc/drbd.d/r1.res
> resource r1 {
>     net {
>         protocol C;
>         allow-two-primaries;
>         verify-alg crc32c;
>         csums-alg  crc32c;
>     }
>     startup {
>         become-primary-on both;
>     }
>     disk {
>         disk-timeout 1200;
>     }
>     on drtest-11.uvt.internal {
>         device  /dev/drbd1;
>         disk    "/dev/vdb";
>         address 10.0.11.201:7790;
>         flexible-meta-disk internal;
>     }
>     on drtest-12.uvt.internal {
>         device  /dev/drbd1;
>         disk    "/dev/vdb";
>         address 10.0.11.202:7790;
>         flexible-meta-disk internal;
>     }
> }
>
> How to reproduce:
> After create-md, connect, primary etc.:
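> Both nodes should be in the usual dual-primary state before the load is
> started; a quick sanity check (same output expected on drtest-12):
>
> [root at drtest-11 ~]# drbdadm role r1
> Primary/Primary
> [root at drtest-11 ~]# drbdadm dstate r1
> UpToDate/UpToDate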
>
> On drtest-11:
> pvcreate /dev/drbd1
> vgcreate test /dev/drbd1
> lvcreate -n t1 -L20g test
> lvcreate -n t2 -L20g test
> mkfs.ext4 /dev/test/t1
> mount /dev/test/t1 /mnt/t1
> mkdir -m 777 /mnt/t1/test
> while true ; do bonnie++ -u nobody -d /mnt/t1/test/ -n 8192 -s8192 ; done
>
> On drtest-12:
> vgchange -aly
> mkfs.ext4 /dev/test/t2
> mount /dev/test/t2 /mnt/t2
> mkdir -m 777 /mnt/t2/test
> while true ; do bonnie++ -u nobody -d /mnt/t2/test/ -n 8192 -s8192 ; done
> drbdadm detach r1
>
> After the detach on drtest-12, drtest-11 almost instantly starts
> flooding the log with "al_complete_io() called on inactive extent"
> and "LOGIC BUG for enr=" messages, and crashes within a couple of minutes.
>
> Thanks in advance for investigating this issue.
> Sincerely,
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: drbd-cldn-actlog_refcnt.patch
Type: text/x-patch
Size: 491 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20190514/920cb278/attachment-0001.bin>