Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Sun, Jan 17, 2016 at 05:59:20PM +0100, Francois Baligant wrote:
> Hi,
>
> We run 2 Proxmox 4 nodes with KVM in a dual-primary scenario with
> protocol C on DRBD9.
>
> Hardware is PowerEdge R730 with tg3 NIC and H730P RAID card with the
> megaraid_sas driver, with the latest firmware for iDRAC, BIOS and RAID.
> Storage is SSD.
>
> When doing heavy I/O in a VM, we get a kernel panic in the drbd module
> on the node running the VM.
>
> We get the kernel panic using the latest Proxmox kernel (drbd9
> 360c65a035fc2dec2b93e839b5c7fae1201fa7d9) and using drbd9 git master
> as well (a48a43a73ebc01e398ca1b755a7006b96ccdfb28).
>
> We have a kdump crash dump if that can be of any help.
>
> Virtualization: KVM guest with virtio for net and disk, using the
> writethrough caching strategy for the guest VM. Backing storage for
> the VM is LVM on top of DRBD.
>
> Tried both versions:
>
> # cat /proc/drbd
> version: 9.0.0 (api:2/proto:86-110)
> GIT-hash: 360c65a035fc2dec2b93e839b5c7fae1201fa7d9 build by root@elsa,
> 2016-01-10 15:26:34
> Transports (api:10): tcp (1.0.0)
>
> # cat /proc/drbd
> version: 9.0.0 (api:2/proto:86-110)
> GIT-hash: a48a43a73ebc01e398ca1b755a7006b96ccdfb28 build by
> root@sd-84686, 2016-01-17 16:31:20
> Transports (api:13): tcp (1.0.0)
>
> Running in the VM: dd if=/dev/zero of=dd1 bs=65536 count=1M
>
> Node:
>
> Linux version 4.2.6-1-pve (root@sd-84686) (gcc version 4.9.2 (Debian
> 4.9.2-10) ) #1 SMP Sun Jan 17 13:39:16 CET 2016
>
> [  861.968976] drbd r0/0 drbd0: LOGIC BUG for enr=64243

This is the real problem ^^

I will add a fix to the "LOGIC BUG" path there that at least will not
return "Success" for a failed operation, so it won't later trigger the
BUG_ON() below. That BUG_ON() is only a follow-up failure.

But the interesting thing will be to figure out where the logic is
wrong: if, within a protected critical region, I first check that at
least N "slots" are available, and then a few lines later, still within
the same protected region, suddenly some of them are not available...
As they say, this "can not happen" ;-)
(A minimal sketch of that pattern follows further below.)

> [  862.065397] ------------[ cut here ]------------
> [  862.065442] kernel BUG at /usr/src/drbd-9.0/drbd/lru_cache.c:571!
> [  862.067277] Call Trace:
> [  862.067316]  [<ffffffffc0553b5a>] put_actlog+0x6a/0x120 [drbd]
> [  862.067360]  [<ffffffffc0554060>] drbd_al_complete_io+0x30/0x40 [drbd]
> [  862.067406]  [<ffffffffc054e192>] drbd_req_destroy+0x442/0x880 [drbd]
> [  862.067451]  [<ffffffff81734640>] ? tcp_recvmsg+0x390/0xb90
> [  862.067493]  [<ffffffffc054ead8>] mod_rq_state+0x508/0x7c0 [drbd]
> [  862.067537]  [<ffffffffc054f084>] __req_mod+0x214/0x8d0 [drbd]
> [  862.067582]  [<ffffffffc0558c4b>] tl_release+0x1db/0x320 [drbd]
> [  862.067626]  [<ffffffffc053c3c2>] got_BarrierAck+0x32/0xc0 [drbd]

...

> # drbdsetup show
> resource r0 {
>     _this_host {
>         node-id          0;
>         volume 0 {
>             device       minor 0;
>             disk         "/dev/sda4";
>             meta-disk    internal;
>             disk {
>                 disk-flushes    no;
>             }
>         }
>     }
>     connection {
>         _peer_node_id 1;
>         path {
>             _this_host   ipv4 10.0.0.197:7788;
>             _remote_host ipv4 10.0.0.140:7788;
>         }
>         net {
>             allow-two-primaries    yes;
>             cram-hmac-alg          "sha1";
>             shared-secret          "xxxxxxxx";
>             after-sb-0pri          discard-zero-changes;
>             after-sb-1pri          discard-secondary;
>             verify-alg             "md5";
>             _name                  "proxmox1";
>         }
>         volume 0 {
>             disk {
>                 resync-rate    40960k; # bytes/second
>             }
>         }
>     }
> }
>
> Shortly after, the tg3 watchdog triggers; it's probably a consequence
> of the drbd kernel panic, but maybe not?
> See here: https://pastebin.synalabs.hosting/#cI5nWLuuD37_yN6ii8RLtg
>
> Is this a known problem for this kind of configuration?
> (kvm->virtio->lvm->drbd->h730p+tg3)
>
> Best regards,
> Francois

--
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA and Pacemaker support and consulting

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed