Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Sun, Jan 17, 2016 at 05:59:20PM +0100, Francois Baligant wrote:
> Hi,
>
> We run 2 Proxmox 4 nodes with KVM in a dual-primary scenario with
> protocol C on DRBD9.
>
> Hardware is PowerEdge R730 with tg3 NIC and H730P RAID card with the
> megaraid_sas driver, with the latest firmware for iDRAC, BIOS and RAID.
> Storage is SSD.
>
> When doing heavy I/O in a VM, we get a kernel panic in the drbd module
> on the node running the VM.
>
> We get the kernel panic using the latest Proxmox kernel (drbd9
> 360c65a035fc2dec2b93e839b5c7fae1201fa7d9) and using drbd9 git master
> as well (a48a43a73ebc01e398ca1b755a7006b96ccdfb28).
>
> We have a kdump crash dump if that can be of any help.
>
> Virtualization: KVM guest with virtio for net and disk, using the
> writethrough caching strategy for the guest VM. Backing storage for
> the VM is LVM on top of DRBD.
>
> Tried both versions:
>
> # cat /proc/drbd
> version: 9.0.0 (api:2/proto:86-110)
> GIT-hash: 360c65a035fc2dec2b93e839b5c7fae1201fa7d9 build by root@elsa,
> 2016-01-10 15:26:34
> Transports (api:10): tcp (1.0.0)
>
> # cat /proc/drbd
> version: 9.0.0 (api:2/proto:86-110)
> GIT-hash: a48a43a73ebc01e398ca1b755a7006b96ccdfb28 build by
> root@sd-84686, 2016-01-17 16:31:20
> Transports (api:13): tcp (1.0.0)
>
> Running in the VM: dd if=/dev/zero of=dd1 bs=65536 count=1M
>
> Node:
>
> Linux version 4.2.6-1-pve (root@sd-84686) (gcc version 4.9.2 (Debian
> 4.9.2-10) ) #1 SMP Sun Jan 17 13:39:16 CET 2016
>
> [  861.968976] drbd r0/0 drbd0: LOGIC BUG for enr=64243

This is the real problem ^^

I will add a fix to the "LOGIC BUG" path there that at least will not
return "Success" for a failed operation, so it won't later trigger the
BUG_ON() below. That BUG_ON() is only a follow-up failure.

But the interesting thing will be to figure out where the logic is
wrong: if, within a protected critical region, I first check that at
least N "slots" are available, and then a few lines later, still within
the same protected region, suddenly some of them are not available...
As they say, this "can not happen" ;-)
(A minimal sketch of that pattern follows further below.)

> [  862.065397] ------------[ cut here ]------------
> [  862.065442] kernel BUG at /usr/src/drbd-9.0/drbd/lru_cache.c:571!
> [  862.067277] Call Trace:
> [  862.067316]  [<ffffffffc0553b5a>] put_actlog+0x6a/0x120 [drbd]
> [  862.067360]  [<ffffffffc0554060>] drbd_al_complete_io+0x30/0x40 [drbd]
> [  862.067406]  [<ffffffffc054e192>] drbd_req_destroy+0x442/0x880 [drbd]
> [  862.067451]  [<ffffffff81734640>] ? tcp_recvmsg+0x390/0xb90
> [  862.067493]  [<ffffffffc054ead8>] mod_rq_state+0x508/0x7c0 [drbd]
> [  862.067537]  [<ffffffffc054f084>] __req_mod+0x214/0x8d0 [drbd]
> [  862.067582]  [<ffffffffc0558c4b>] tl_release+0x1db/0x320 [drbd]
> [  862.067626]  [<ffffffffc053c3c2>] got_BarrierAck+0x32/0xc0 [drbd]

...

> # drbdsetup show
> resource r0 {
>     _this_host {
>         node-id          0;
>         volume 0 {
>             device       minor 0;
>             disk         "/dev/sda4";
>             meta-disk    internal;
>             disk {
>                 disk-flushes    no;
>             }
>         }
>     }
>     connection {
>         _peer_node_id 1;
>         path {
>             _this_host   ipv4 10.0.0.197:7788;
>             _remote_host ipv4 10.0.0.140:7788;
>         }
>         net {
>             allow-two-primaries    yes;
>             cram-hmac-alg          "sha1";
>             shared-secret          "xxxxxxxx";
>             after-sb-0pri          discard-zero-changes;
>             after-sb-1pri          discard-secondary;
>             verify-alg             "md5";
>             _name                  "proxmox1";
>         }
>         volume 0 {
>             disk {
>                 resync-rate    40960k; # bytes/second
>             }
>         }
>     }
> }
>
> Shortly after, the tg3 watchdog triggers; it's probably a consequence
> of the drbd kernel panic, but maybe not?
> See here: https://pastebin.synalabs.hosting/#cI5nWLuuD37_yN6ii8RLtg
>
> Is this a known problem for this kind of configuration?
> (kvm->virtio->lvm->drbd->h730p+tg3)
>
> Best regards,
> Francois

--
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA and Pacemaker support and consulting

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed