[DRBD-user] Kernel panic with DRBD 9.0 on Kernel 4.2.6 "LOGIC BUG for enr=x"

Francois Baligant fbaligant at synalabs.com
Tue Jan 19 11:53:58 CET 2016

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi Lars,

Thanks for your analysis!

If it can be of any help: we downgraded this cluster to DRBD 8.4
(kernel module and metadata only; nothing else was touched, not even
the DRBD configuration). After the same kind of heavy stress tests we
have yet to make it crash, whereas on DRBD 9.0 it took only about 10
minutes. A rough sketch of the downgrade steps follows the status
output below.

version: 8.4.7-1 (api:1/proto:86-101)
GIT-hash: aff41b8a77838faac8f4e8f8ee843e182d4e4bcc build by
root at sd-84686, 2016-01-17 20:32:22
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:798510236 nr:54885336 dw:853395572 dr:54193620 al:630522 bm:0
lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
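
The downgrade itself went roughly like this on each node (a sketch
only, not a verbatim transcript; the resource name r0 matches our
configuration, and drbd-utils should prompt before touching the
metadata):

  drbdadm down r0                # stop the DRBD 9 resource (nothing may use it)
  rmmod drbd_transport_tcp drbd  # unload the DRBD 9 modules
  modprobe drbd                  # load the freshly installed 8.4 module
  drbdadm create-md r0           # should offer to convert v09 metadata to v08
  drbdadm up r0
  drbdadm primary r0             # dual-primary again once both nodes are UpToDate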

Best regards,
Francois

2016-01-19 11:46 GMT+01:00 Lars Ellenberg <lars.ellenberg at linbit.com>:
> On Sun, Jan 17, 2016 at 05:59:20PM +0100, Francois Baligant wrote:
>> Hi,
>>
>> We run two Proxmox 4 nodes with KVM in a dual-primary setup with
>> protocol C on DRBD 9.
>>
>> Hardware is a PowerEdge R730 with a tg3 NIC and an H730P RAID card
>> (megaraid_sas driver), running the latest iDRAC, BIOS and RAID
>> firmware. Storage is SSD.
>>
>> When doing heavy I/O in a VM, we get a kernel panic in the drbd
>> module on the node running the VM.
>>
>> We get the kernel panic both with the latest Proxmox kernel (drbd9
>> 360c65a035fc2dec2b93e839b5c7fae1201fa7d9) and with drbd9 git master
>> (a48a43a73ebc01e398ca1b755a7006b96ccdfb28).
>>
>> We have a kdump crash dump if that can be of any help.
>>
>> Virtualization: KVM guest with virtio for network and disk, using a
>> writethrough caching strategy for the guest. Backing storage for the
>> VM is LVM on top of DRBD.
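>>
>> In qemu terms the disk attachment is equivalent to something like the
>> following command-line fragment (illustrative only: the VM is managed
>> through Proxmox and the LV name below is made up; the LV sits on top
>> of /dev/drbd0):
>>
>>   -drive file=/dev/vg0/vm-100-disk-1,if=virtio,cache=writethrough,format=raw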
>>
>> Tried both versions:
>>
>> # cat /proc/drbd
>> version: 9.0.0 (api:2/proto:86-110)
>> GIT-hash: 360c65a035fc2dec2b93e839b5c7fae1201fa7d9 build by root at elsa,
>> 2016-01-10 15:26:34
>> Transports (api:10): tcp (1.0.0)
>>
>> # cat /proc/drbd
>> version: 9.0.0 (api:2/proto:86-110)
>> GIT-hash: a48a43a73ebc01e398ca1b755a7006b96ccdfb28 build by
>> root at sd-84686, 2016-01-17 16:31:20
>> Transports (api:13): tcp (1.0.0)
>>
>> In the VM we run: dd if=/dev/zero of=dd1 bs=65536 count=1M
>>
>> Node:
>>
>> Linux version 4.2.6-1-pve (root at sd-84686) (gcc version 4.9.2 (Debian
>> 4.9.2-10) ) #1 SMP Sun Jan 17 13:39:16 CET 2016
>>
>> [  861.968976] drbd r0/0 drbd0: LOGIC BUG for enr=64243
>
> This is the real problem ^^
>
> I will add a fix to the "LOGIC BUG" path there
> that at least will not return "Success" for a failed operation,
> so it won't later trigger the BUG_ON() below.
>
> This BUG_ON() is only a followup failure.
>
> But the interesting thing will be to figure out
> where the logic is wrong: if, within a protected critical region,
> I first check that at least N "slots" are available,
> and then a few lines later, still within the same protected region,
> suddenly some of them are not available...
> As they say, this "can not happen" ;-)
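>
> In (made-up) code, the pattern is essentially the following. This is
> a toy illustration of the invariant, not the actual lru_cache.c code;
> all names in it are invented:
>
>   #include <assert.h>
>   #include <pthread.h>
>
>   struct slot_cache {
>           pthread_mutex_t lock;
>           int free_slots;
>   };
>
>   /* Reserve 'needed' slots, or none at all.  Check and take both
>    * happen under one lock, so nothing can steal slots in between. */
>   static int reserve_slots(struct slot_cache *c, int needed)
>   {
>           pthread_mutex_lock(&c->lock);       /* protected critical region */
>           if (c->free_slots < needed) {       /* 1: check availability */
>                   pthread_mutex_unlock(&c->lock);
>                   return -1;                  /* caller retries later */
>           }
>           /* ... a few lines later, same lock still held ... */
>           while (needed--) {
>                   /* "can not happen", unless the accounting is wrong,
>                    * which is what the LOGIC BUG message complains about */
>                   assert(c->free_slots > 0);
>                   c->free_slots--;
>           }
>           pthread_mutex_unlock(&c->lock);
>           return 0;
>   }
>
>   int main(void)
>   {
>           struct slot_cache c = { PTHREAD_MUTEX_INITIALIZER, 4 };
>
>           reserve_slots(&c, 3);                       /* succeeds, one slot left */
>           return reserve_slots(&c, 3) == -1 ? 0 : 1;  /* correctly refused */
>   }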
>
>> [  862.065397] ------------[ cut here ]------------
>> [  862.065442] kernel BUG at /usr/src/drbd-9.0/drbd/lru_cache.c:571!
>
>> [  862.067277] Call Trace:
>> [  862.067316]  [<ffffffffc0553b5a>] put_actlog+0x6a/0x120 [drbd]
>> [  862.067360]  [<ffffffffc0554060>] drbd_al_complete_io+0x30/0x40 [drbd]
>> [  862.067406]  [<ffffffffc054e192>] drbd_req_destroy+0x442/0x880 [drbd]
>> [  862.067451]  [<ffffffff81734640>] ? tcp_recvmsg+0x390/0xb90
>> [  862.067493]  [<ffffffffc054ead8>] mod_rq_state+0x508/0x7c0 [drbd]
>> [  862.067537]  [<ffffffffc054f084>] __req_mod+0x214/0x8d0 [drbd]
>> [  862.067582]  [<ffffffffc0558c4b>] tl_release+0x1db/0x320 [drbd]
>> [  862.067626]  [<ffffffffc053c3c2>] got_BarrierAck+0x32/0xc0 [drbd]
>
> ...
>
>> # drbdsetup show
>> resource r0 {
>>     _this_host {
>>         node-id      0;
>>         volume 0 {
>>             device     minor 0;
>>             disk       "/dev/sda4";
>>             meta-disk     internal;
>>             disk {
>>                 disk-flushes       no;
>>             }
>>         }
>>     }
>>     connection {
>>         _peer_node_id 1;
>>         path {
>>             _this_host ipv4 10.0.0.197:7788;
>>             _remote_host ipv4 10.0.0.140:7788;
>>         }
>>         net {
>>             allow-two-primaries yes;
>>             cram-hmac-alg       "sha1";
>>             shared-secret       "xxxxxxxx";
>>             after-sb-0pri       discard-zero-changes;
>>             after-sb-1pri       discard-secondary;
>>             verify-alg          "md5";
>>             _name               "proxmox1";
>>         }
>>         volume 0 {
>>             disk {
>>                 resync-rate        40960k; # bytes/second
>>             }
>>         }
>>     }
>> }
>>
>> Shortly after, the tg3 watchdog triggers; that is probably a
>> consequence of the drbd kernel panic, but maybe not?
>>
>> See here: https://pastebin.synalabs.hosting/#cI5nWLuuD37_yN6ii8RLtg
>>
>> Is this a known problem for this kind of configuration?
>> (kvm->virtio->lvm->drbd->h730p+tg3)
>>
>> Best regards,
>> Francois
>
> --
> : Lars Ellenberg
> : http://www.LINBIT.com | Your Way to High Availability
> : DRBD, Linux-HA  and  Pacemaker support and consulting
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> __
> please don't Cc me, but send to list   --   I'm subscribed



-- 
       François BALIGANT
Managing Director
+33 (0) 811 69 65 60
http://www.synalabs.com


