[DRBD-user] drbd module crashes at BUG at lru_cache.c:570 when the peer mode goes diskless

Evzen Demcenko demcenko at cldn.cz
Tue May 14 18:40:53 CEST 2019


I backported the hack from 9.0.17-1; under the same conditions the node 
still floods the log with "al_complete_io()" and "LOGIC BUG" messages, 
but at least it no longer crashes.

Ing. Evzen Demcenko
Senior Linux Administrator
Cluster Design s.r.o.

On 5/13/19 6:43 PM, Evzen Demcenko wrote:
> If a drbd node in a primary-primary setup loses its disk for any reason 
> (faulty disk, controller, etc., or even a manual detach) and there is R/W 
> activity on both nodes, the second node eventually crashes with kernel 
> panic "kernel BUG at 
> /root/rpmbuild/BUILD/drbd-8.4.11-1/drbd/lru_cache.c:570!"
> Before the crash, the kernel log on the good node (the one with the 
> attached disk) fills with messages like
>
> block drbd1: al_complete_io() called on inactive extent 57
> block drbd1: LOGIC BUG for enr=74
>
> Eventually, the "good" node crashes within minutes or hours, depending 
> on disk activity, leaving the cluster without any data.
> I tested several versions from the 8.4 tree (8.4.6, 8.4.7-1, 8.4.9-1, 
> 8.4.11-1), always with the same result. There is also no difference 
> between real hardware and virtualized machines. The kernel is 
> 2.6.32-754.12.1.el6.x86_64 on CentOS 6.10 with the latest updates; 
> other kernels show the same outcome.
> A vmcore-dmesg.txt is attached to this email. The vmcore itself is 
> also available for every tested version; the core files are 35-50 MB, 
> so I can't attach them to email, but I'll be glad to share them some 
> other way if needed.
>
> [root at drtest-11 ~]# cat /etc/drbd.d/global_common.conf
> global {
>         usage-count yes;
> }
>
> common {
>         protocol C;
>
>         handlers {
>                 pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh";
>                 pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh";
>                 local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh";
>         }
>
>         startup {
>                 # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
>         }
>
>         disk {
>                 resync-rate 100M;
>                 on-io-error detach;
>                 al-extents 1447;
>                 c-plan-ahead 32;
>                 c-max-rate 1000M;
>                 c-min-rate 80M;
>                 c-fill-target 65536k;
>         }
>
>         net {
>                 sndbuf-size 4096k;
>                 rcvbuf-size 4096k;
>                 timeout       100;    #  10 seconds  (unit = 0.1 seconds)
>                 connect-int   15;    # 15 seconds  (unit = 1 second)
>                 ping-int      15;    # 15 seconds  (unit = 1 second)
>                 ping-timeout  50;    # 5000 ms (unit = 0.1 seconds)
>                 max-buffers     131072;
>                 max-epoch-size  20000;
>                 ko-count 0;
>                 after-sb-0pri discard-younger-primary;
>                 after-sb-1pri consensus;
>                 after-sb-2pri disconnect;
>                 rr-conflict disconnect;
>         }
> }
>
> [root at drtest-11 ~]# cat /etc/drbd.d/r1.res
> resource r1 {
>     net {
>         protocol C;
>         allow-two-primaries;
>         verify-alg crc32c;
>         csums-alg crc32c;
>     }
>     startup {
>         become-primary-on both;
>     }
>     disk {
>         disk-timeout 1200;
>     }
>     on drtest-11.uvt.internal {
>         device      /dev/drbd1;
>         disk        "/dev/vdb";
>         address     10.0.11.201:7790;
>         flexible-meta-disk internal;
>     }
>     on drtest-12.uvt.internal {
>         device      /dev/drbd1;
>         disk        "/dev/vdb";
>         address     10.0.11.202:7790;
>         flexible-meta-disk internal;
>     }
> }
>
> How to reproduce:
> After create-md, connect, primary etc.:
>
> On drtest-11:
> pvcreate /dev/drbd1
> vgcreate test /dev/drbd1
> lvcreate -n t1 -L20g test
> lvcreate -n t2 -L20g test
> mkfs.ext4 /dev/test/t1
> mount /dev/test/t1 /mnt/t1
> mkdir -m 777 /mnt/t1/test
> while true ; do bonnie++ -u nobody -d /mnt/t1/test/ -n 8192 -s8192 ; done
>
> On drtest-12
> vgchange -aly
> mkfs.ext4 /dev/test/t2
> mount /dev/test/t2 /mnt/t2
> mkdir -m 777 /mnt/t2/test
> while true ; do bonnie++ -u nobody -d /mnt/t2/test/ -n 8192 -s8192 ; done
> drbdadm detach r1
>
> After the detach on drtest-12, drtest-11 almost instantly starts 
> flooding the log with "al_complete_io() called on inactive extent" 
> and "LOGIC BUG for enr=" messages, and crashes within a couple of minutes.
>
> Thanks in advance for investigating this issue.
> Sincerely,
>
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user

-------------- next part --------------
A non-text attachment was scrubbed...
Name: drbd-cldn-actlog_refcnt.patch
Type: text/x-patch
Size: 491 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20190514/920cb278/attachment-0001.bin>
