Hello there,
I am not sure whether this has already been reported, but here goes:
I have a 3-node setup built following the manual, with all settings at
their defaults except for two:
[GLOBAL]
storage-plugin = drbdmanage.storage.lvm.Lvm

common {
    net {
        verify-alg crc32c;
    }
}
My workload (KVM with guest disk cache=none) every now and then triggers
the "Digest mismatch, buffer modified by upper layers during write:" error.
According to the documentation, a digest mismatch should immediately cause
a disconnect/reconnect/resync.
First observed problem: in my case the disconnect happens immediately
only some of the time. At other times many of those messages are repeated,
and a couple of minutes can pass before the connection is actually dropped.
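To put numbers on that delay, here is a minimal Python sketch (not part of
DRBD; the function name and the regular expression are my own assumptions
about the dmesg line format shown further down). For each resource it counts
the "Digest mismatch" messages in a burst and reports how many seconds passed
between the first of them and the following "Connection closed" line.

```python
import re
from collections import defaultdict

# Assumed line shapes, taken from the dmesg excerpt in this report:
#   [114257.391726] drbd vm-10030-disk-1/0 drbd104: Digest mismatch, ...
#   [114257.408687] drbd vm-10030-disk-1 hn51: Connection closed
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+drbd\s+(?P<res>[^\s/]+)(?:/\d+)?\s+\S+:\s+(?P<msg>.*)$"
)


def mismatch_to_disconnect_delays(lines):
    """Return {resource: [(mismatch_count, seconds_until_close), ...]}."""
    first_seen = {}            # resource -> timestamp of first mismatch in the burst
    counts = defaultdict(int)  # resource -> mismatches seen in the current burst
    delays = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a per-resource drbd message
        ts, res, msg = float(m["ts"]), m["res"], m["msg"]
        if msg.startswith("Digest mismatch"):
            first_seen.setdefault(res, ts)
            counts[res] += 1
        elif msg.startswith("Connection closed") and res in first_seen:
            elapsed = round(ts - first_seen.pop(res), 6)
            delays[res].append((counts.pop(res), elapsed))
    return dict(delays)
```

It expects one kernel message per line (for example the lines of a saved
dmesg dump), so output that a mail client has wrapped needs to be rejoined
first.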
Second observed problem: sometimes, when such a disconnect happens, the
drbd thread on the primary node crashes. After that the kernel keeps
working for a moment before all CPUs finally lock up and the node needs a
power cycle.
Disabling verify-alg helps, but maybe simply because no more disconnects
happen that could trigger the crash?
software versions:
---
[ 0.000000] Linux version 4.2.8-1-pve (root@elsa) (gcc version 4.9.2 (Debian 4.9.2-10) ) #1 SMP Fri Feb 26 16:37:36 CET 2016 ()
(...)
[ 116.274693] drbd: initialized. Version: 9.0.1-1 (api:2/proto:86-111)
[ 116.281869] drbd: GIT-hash: 3d38916489fac62b036d8e79d3fcd81d318ca4cb build by root@elsa, 2016-02-26 16:42:55
---
crash relevant dmesg output:
---
[114257.391726] drbd vm-10030-disk-1/0 drbd104: Digest mismatch, buffer modified by upper layers during write: 3534072s +163840
[114257.408687] drbd vm-10030-disk-1 hn51: Connection closed
[114257.414970] drbd vm-10030-disk-1 hn51: conn( Disconnecting -> StandAlone )
[114257.422821] drbd vm-10030-disk-1 hn51: Terminating receiver thread
[114257.450955] BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
[114257.459878] IP: [<ffffffffc04b0ffa>] _tl_restart+0xaa/0xf0 [drbd]
[114257.466834] PGD 0
[114257.469214] Oops: 0000 [#1] SMP
[114257.472968] Modules linked in: veth act_police cls_u32 sch_ingress sch_htb drbd_transport_tcp(O) drbd(O) ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_mac ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment xt_tcpudp xt_mark xt_addrtype ip_set_hash_net softdog iptable_filter nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc 8021q garp mrp openvswitch libcrc32c bonding xt_set ip_set xt_multiport xt_conntrack ip6table_filter ip6_tables xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables nfnetlink_log nfnetlink intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul snd_pcm crc32_pclmul snd_timer aesni_intel snd aes_x86_64 soundcore lrw uas gf128mul glue_helper ablk_helper cdc_ether cryptd pcspkr sb_edac
[114257.559773] usbnet usb_storage mii edac_core lpc_ich 8250_fintek shpchp mac_hid ioatdma wmi ipmi_ssif vhost_net vhost macvtap macvlan ipmi_si ipmi_poweroff ipmi_devintf ipmi_msghandler autofs4 igb i2c_algo_bit dca ahci ptp libahci pps_core megaraid_sas
[114257.583826] CPU: 5 PID: 10103 Comm: drbd_r_vm-10030 Tainted: G O 4.2.8-1-pve #1
[114257.593406] Hardware name: IBM System x3650 M4 -[7915E3G]-/00W2665, BIOS -[VVE146AUS-2.00]- 09/17/2015
[114257.603957] task: ffff881003872940 ti: ffff88201b2ec000 task.ti: ffff88201b2ec000
[114257.612467] RIP: 0010:[<ffffffffc04b0ffa>] [<ffffffffc04b0ffa>] _tl_restart+0xaa/0xf0 [drbd]
[114257.622159] RSP: 0018:ffff88201b2efcd8 EFLAGS: 00010082
[114257.628219] RAX: ffff88103573d1e0 RBX: ffff881fc0d56db0 RCX: 0000000000000000
[114257.636342] RDX: ffff881fc0d56e08 RSI: 0000000000000092 RDI: 0000000000000092
[114257.644464] RBP: ffff88201b2efd38 R08: 0000000000000000 R09: 0000000180190014
[114257.652587] R10: ffff88103fb60720 R11: ffff8810010f7200 R12: ffff882035dbd930
[114257.660709] R13: ffff881035e22000 R14: 0000000000000000 R15: ffff881fc0d56db0
[114257.668832] FS: 0000000000000000(0000) GS:ffff88103fb40000(0000) knlGS:0000000000000000
[114257.678022] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[114257.684567] CR2: 0000000000000068 CR3: 0000000001e0d000 CR4: 00000000000426e0
[114257.692689] Stack:
[114257.695055] ffff881035e22048 0000000b00000286 ffff88201b2efd18 ffff880f7a01df00
[114257.703525] 0000000000000000 0000000016746d9c ffff882035dbdbb0 ffff882035dbd918
[114257.711992] ffff881035e22000 000000000000000b ffff882034cc9800 ffff882034cc9800
[114257.720456] Call Trace:
[114257.723318] [<ffffffffc04b1082>] tl_restart+0x42/0x60 [drbd]
[114257.729860] [<ffffffffc04b10b3>] tl_clear+0x13/0x20 [drbd]
[114257.736215] [<ffffffffc04a3681>] conn_disconnect+0x281/0x830 [drbd]
[114257.743447] [<ffffffffc04d30b6>] ? change_cstate+0x86/0xc0 [drbd]
[114257.750483] [<ffffffffc0499590>] ? got_IsInSync+0x300/0x300 [drbd]
[114257.757617] [<ffffffffc04a4887>] drbd_receiver+0x177/0x5e0 [drbd]
[114257.764654] [<ffffffffc04af390>] ? w_complete+0x20/0x20 [drbd]
[114257.771397] [<ffffffffc04af3f4>] drbd_thread_setup+0x64/0x120 [drbd]
[114257.778723] [<ffffffffc04af390>] ? w_complete+0x20/0x20 [drbd]
[114257.785466] [<ffffffff8109b1fa>] kthread+0xea/0x100
[114257.791140] [<ffffffff8109b110>] ? kthread_create_on_node+0x1f0/0x1f0
[114257.798562] [<ffffffff8180af1f>] ret_from_fork+0x3f/0x70
[114257.804718] [<ffffffff8109b110>] ? kthread_create_on_node+0x1f0/0x1f0
[114257.812139] Code: 75 b8 4c 89 f7 e8 87 5f ff ff 48 8b 43 58 48 8d 53 58 49 89 df 48 83 e8 58 49 39 d4 74 29 48 89 c3 49 8b 45 48 4d 8b 37 48 85 c0 <41> 8b 76 68 74 a8 89 f2 30 d2 3b 10 75 a0 40 0f b6 f6 48 8d 04
[114257.834267] RIP [<ffffffffc04b0ffa>] _tl_restart+0xaa/0xf0 [drbd]
[114257.841312] RSP <ffff88201b2efcd8>
[114257.845331] CR2: 0000000000000068
[114257.849575] ---[ end trace 7c993d7d40ff47ee ]---
[114271.130266] ------------[ cut here ]------------
---
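For triage, the key fields of the oops above can be pulled out mechanically.
The following is only a rough Python sketch: the function name, the regular
expressions and the 4096-byte "small address" threshold are my own
assumptions, not anything provided by DRBD or the kernel. It extracts the
faulting symbol, the fault address from CR2, and the general-purpose
registers that were zero at the time. In this trace the fault address is
0x68 and R14 is zero. As far as I can read the Code: line, the marked bytes
"<41> 8b 76 68" are a load from 0x68(%r14), which would fit a NULL structure
pointer being dereferenced at member offset 0x68 inside _tl_restart.

```python
import re

# Patterns assume the x86_64 oops layout printed by the 4.2 kernel above.
SYMBOL_RE = re.compile(
    r"IP: \[<(?P<addr>[0-9a-f]+)>\] (?P<sym>\S+)\+0x(?P<off>[0-9a-f]+)"
    r"/0x(?P<size>[0-9a-f]+)(?: \[(?P<mod>\w+)\])?"
)
CR2_RE = re.compile(r"CR2: (?P<cr2>[0-9a-f]{16})")
# General-purpose registers only (RAX..R15); RSP/RIP carry a segment prefix
# and CR0..CR4 do not start at a word boundary, so neither is matched here.
REG_RE = re.compile(r"\b(R[A-Z0-9]{2}): ([0-9a-f]{16})")


def summarize_oops(text):
    """Return a dict describing the fault, or None if no oops is found."""
    # Collapse all whitespace so that mail-wrapped lines are rejoined.
    flat = " ".join(text.split())
    ip = SYMBOL_RE.search(flat)
    cr2 = CR2_RE.search(flat)
    if ip is None or cr2 is None:
        return None
    regs = {name: int(value, 16) for name, value in REG_RE.findall(flat)}
    fault = int(cr2["cr2"], 16)
    return {
        "symbol": ip["sym"],
        "offset": int(ip["off"], 16),    # offset of the faulting instruction
        "size": int(ip["size"], 16),     # total size of the function
        "module": ip["mod"],
        "fault_address": fault,
        # Assumption: an address inside the first page means NULL + offset.
        "looks_like_null_deref": fault < 4096,
        "zero_registers": sorted(n for n, v in regs.items() if v == 0),
    }
```

Passing the whole dmesg excerpt as one string is enough; the function does
not care whether the lines were wrapped by the mail client.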
--
Jan Janicki