Hi,
We run two Proxmox 4 nodes with KVM in a dual-primary setup, using
protocol C on DRBD9.
Hardware is a PowerEdge R730 with a tg3 NIC and an H730P RAID card
(megaraid_sas driver), with the latest firmware for iDRAC, BIOS and RAID.
Storage is SSD.
When doing heavy I/O in a VM, we get a kernel panic in the drbd module on
the node running the VM.
We get the panic both with the latest Proxmox kernel (drbd9
360c65a035fc2dec2b93e839b5c7fae1201fa7d9) and with drbd9 git master
(a48a43a73ebc01e398ca1b755a7006b96ccdfb28).
We have a kdump crash dump if that would be of any help.
Virtualization: KVM guest with virtio for net and disk, using the
writethrough cache mode for the guest. Backing storage for the VM is
LVM on top of DRBD.
Tried both versions:
# cat /proc/drbd
version: 9.0.0 (api:2/proto:86-110)
GIT-hash: 360c65a035fc2dec2b93e839b5c7fae1201fa7d9 build by root@elsa, 2016-01-10 15:26:34
Transports (api:10): tcp (1.0.0)
# cat /proc/drbd
version: 9.0.0 (api:2/proto:86-110)
GIT-hash: a48a43a73ebc01e398ca1b755a7006b96ccdfb28 build by root@sd-84686, 2016-01-17 16:31:20
Transports (api:13): tcp (1.0.0)
Command run in the VM: dd if=/dev/zero of=dd1 bs=65536 count=1M
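For a sense of scale, here is a rough sketch of how much data that dd
invocation writes. It assumes GNU dd semantics, where the suffix M in
count=1M means 1024*1024; the function name is purely illustrative.

```python
# Rough size of the write issued by:
#   dd if=/dev/zero of=dd1 bs=65536 count=1M
# Assumption: GNU dd semantics, where the suffix M means 1024*1024.
BLOCK_SIZE = 65536           # bs=65536, i.e. 64 KiB per write
COUNT = 1 * 1024 * 1024      # count=1M


def total_bytes(block_size: int, count: int) -> int:
    """Total number of bytes dd writes if it runs to completion."""
    return block_size * count


total = total_bytes(BLOCK_SIZE, COUNT)
print(f"{total} bytes = {total // 2**30} GiB")
```

So the test is a sustained sequential write of about 64 GiB of zeroes,
which is the load under which the panic shows up.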
Node:
Linux version 4.2.6-1-pve (root@sd-84686) (gcc version 4.9.2 (Debian
4.9.2-10) ) #1 SMP Sun Jan 17 13:39:16 CET 2016
[ 861.968976] drbd r0/0 drbd0: LOGIC BUG for enr=64243
[ 862.065397] ------------[ cut here ]------------
[ 862.065442] kernel BUG at /usr/src/drbd-9.0/drbd/lru_cache.c:571!
[ 862.065484] invalid opcode: 0000 [#1] SMP
[ 862.065529] Modules linked in: drbd_transport_tcp(O) drbd(O)
netconsole configfs ip_set ip6table_filter ip6_tables iptable_filter
ip_tables softdog x_tables ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad
ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
ip_gre ip_tunnel vport_gre gre openvswitch libcrc32c nfnetlink_log
nfnetlink ipmi_ssif ipmi_devintf intel_rapl iosf_mbi
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dcdbas kvm
crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul
glue_helper ablk_helper cryptd snd_pcm snd_timer snd soundcore pcspkr
joydev input_leds sb_edac edac_core mei_me ioatdma mei shpchp lpc_ich
dca wmi ipmi_si 8250_fintek ipmi_msghandler mac_hid acpi_power_meter
vhost_net vhost macvtap macvlan autofs4 hid_generic usbkbd usbmouse
usbhid hid ahci libahci tg3 ptp pps_core megaraid_sas
[last unloaded: drbd]
[ 862.066319] CPU: 0 PID: 2343 Comm: drbd_a_r0 Tainted: G O
4.2.6-1-pve #1
[ 862.066386] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS
1.5.4 10/002/2015
[ 862.066451] task: ffff881fee1d0000 ti: ffff881fecd78000 task.ti:
ffff881fecd78000
[ 862.066517] RIP: 0010:[<ffffffffc0556e30>] [<ffffffffc0556e30>]
lc_put+0x90/0xa0 [drbd]
[ 862.066594] RSP: 0000:ffff881fecd7bb08 EFLAGS: 00010046
[ 862.066633] RAX: 0000000000000000 RBX: 000000000000faf3 RCX: ffff881fe7b8cab0
[ 862.066677] RDX: ffff881fe65dc000 RSI: ffff881fe7b8cab0 RDI: ffff881fdc2eca80
[ 862.066721] RBP: ffff881fecd7bb08 R08: 0000000000000484 R09: 0000000000000000
[ 862.066765] R10: ffff883cf8428870 R11: 0000000000000000 R12: ffff881fec93ec00
[ 862.066808] R13: 0000000000000000 R14: 000000000000faf3 R15: 0000000000000001
[ 862.066852] FS: 0000000000000000(0000) GS:ffff881ffec00000(0000)
knlGS:0000000000000000
[ 862.066919] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 862.066959] CR2: 00007f1f0a44f028 CR3: 0000000001e0d000 CR4: 00000000001426f0
[ 862.067003] Stack:
[ 862.067034] ffff881fecd7bb58 ffffffffc0553b5a 0000000000000046
ffff881fec93eeb0
[ 862.067115] ffff881fecd7bb68 ffff883cf8428438 ffff881fec93ec00
ffff883cf8428448
[ 862.067196] 0000000000000800 0000000000004000 ffff881fecd7bb68
ffffffffc0554060
[ 862.067277] Call Trace:
[ 862.067316] [<ffffffffc0553b5a>] put_actlog+0x6a/0x120 [drbd]
[ 862.067360] [<ffffffffc0554060>] drbd_al_complete_io+0x30/0x40 [drbd]
[ 862.067406] [<ffffffffc054e192>] drbd_req_destroy+0x442/0x880 [drbd]
[ 862.067451] [<ffffffff81734640>] ? tcp_recvmsg+0x390/0xb90
[ 862.067493] [<ffffffffc054ead8>] mod_rq_state+0x508/0x7c0 [drbd]
[ 862.067537] [<ffffffffc054f084>] __req_mod+0x214/0x8d0 [drbd]
[ 862.067582] [<ffffffffc0558c4b>] tl_release+0x1db/0x320 [drbd]
[ 862.067626] [<ffffffffc053c3c2>] got_BarrierAck+0x32/0xc0 [drbd]
[ 862.067670] [<ffffffffc054cdc0>] drbd_ack_receiver+0x160/0x5c0 [drbd]
[ 862.067716] [<ffffffffc05571d0>] ? w_complete+0x20/0x20 [drbd]
[ 862.067760] [<ffffffffc0557234>] drbd_thread_setup+0x64/0x120 [drbd]
[ 862.067804] [<ffffffffc05571d0>] ? w_complete+0x20/0x20 [drbd]
[ 862.067847] [<ffffffff8109acaa>] kthread+0xea/0x100
[ 862.067886] [<ffffffff8109abc0>] ? kthread_create_on_node+0x1f0/0x1f0
[ 862.067930] [<ffffffff8180875f>] ret_from_fork+0x3f/0x70
[ 862.067970] [<ffffffff8109abc0>] ? kthread_create_on_node+0x1f0/0x1f0
[ 862.068012] Code: 89 42 08 48 89 56 10 48 89 7e 18 48 89 07 83 6f
64 01 f0 80 a7 90 00 00 00 f7 f0 80 a7 90 00 00 00 fe 8b 46 20 5d c3
0f 0b 0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
00 00
[ 862.068414] RIP [<ffffffffc0556e30>] lc_put+0x90/0xa0 [drbd]
[ 862.068459] RSP <ffff881fecd7bb08>
[ 862.069000] ---[ end trace b005772103543ee2 ]---
[ 872.163694] ------------[ cut here ]------------
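A small cross-check on the trace above, offered as a sketch rather than
a diagnosis: the extent number in the "LOGIC BUG for enr=64243" line is
the same value as RBX and R14 in the register dump (0xfaf3). Assuming
DRBD's activity log uses 4 MiB extents (an assumption here, the log does
not state it), that extent starts roughly 251 GiB into the device. The
helper name below is hypothetical.

```python
# Cross-check the extent number from the log against the register dump,
# and map it to a byte range on the DRBD device.
# Assumption: one activity-log extent covers 4 MiB.
AL_EXTENT_SIZE = 4 * 1024 * 1024

enr = 64243                 # from "LOGIC BUG for enr=64243"
rbx = 0x000000000000FAF3    # RBX (and R14) in the register dump


def extent_byte_range(extent_nr: int, extent_size: int = AL_EXTENT_SIZE):
    """Return (first_byte, last_byte) covered by the given extent."""
    start = extent_nr * extent_size
    return start, start + extent_size - 1


start, end = extent_byte_range(enr)
print("enr matches RBX:", enr == rbx)
print(f"extent {enr} covers bytes {start}..{end}")
print(f"offset into device: {start / 2**30:.2f} GiB")
```

That the registers carry the same extent number suggests lc_put was
called for exactly the extent named in the LOGIC BUG message.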
# drbdsetup show
resource r0 {
    _this_host {
        node-id 0;
        volume 0 {
            device minor 0;
            disk "/dev/sda4";
            meta-disk internal;
            disk {
                disk-flushes no;
            }
        }
    }
    connection {
        _peer_node_id 1;
        path {
            _this_host ipv4 10.0.0.197:7788;
            _remote_host ipv4 10.0.0.140:7788;
        }
        net {
            allow-two-primaries yes;
            cram-hmac-alg "sha1";
            shared-secret "xxxxxxxx";
            after-sb-0pri discard-zero-changes;
            after-sb-1pri discard-secondary;
            verify-alg "md5";
            _name "proxmox1";
        }
        volume 0 {
            disk {
                resync-rate 40960k; # bytes/second
            }
        }
    }
}
Shortly afterwards the tg3 watchdog triggers. This is probably a
consequence of the drbd kernel panic, but maybe not.
See here: https://pastebin.synalabs.hosting/#cI5nWLuuD37_yN6ii8RLtg
Is this a known problem for this kind of configuration?
(kvm->virtio->lvm->drbd->h730p+tg3)
Best regards,
Francois