Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,

I have been testing a DRBD setup on two nodes on top of an 80 TB logical volume. nfs1 is configured as the primary node and nfs2 as the secondary node. It was working properly until DRBD and heartbeat were stopped and started manually on both nodes.

On nfs1 I see these repetitive errors in the logs:

block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0%
d-con storage: sock was shut down by peer

And on nfs2 there are these repetitive errors in the logs:

block drbd0: bitmap overflow (e:42060721984381164) while decoding bm RLE packet
d-con storage: error receiving CBitmap, e: -5 l: 32!

I tried disabling data-integrity-alg and verify-alg, but that has not fixed the issue (a sketch of what I changed is included after the configuration below). I also searched the mailing list archives, but it does not look like anyone has encountered this issue before. See the logs and configuration below, and let me know if I need to provide any additional information. Any help is really appreciated. Thank you.

========== Logical volume configuration ===============

The partition and LVM scheme is identical on both servers.

# lvdisplay | grep 'LV Path\|LV Name\|VG Name\|LV Size\|Current LE\|Logical volume'
  --- Logical volume ---
  LV Path                /dev/vg_storage1/lv_storage1
  LV Name                lv_storage1
  VG Name                vg_storage1
  LV Size                80.18 TiB
  Current LE             21019223
  --- Logical volume ---
  LV Path                /dev/vg_storage1/lv_metadata1
  LV Name                lv_metadata1
  VG Name                vg_storage1
  LV Size                8.00 GiB
  Current LE             2048

=============== /etc/drbd.conf ==========================

global {
    usage-count no;
}

resource storage {
    protocol C;

    startup {
        wfc-timeout          300;
        degr-wfc-timeout     240;
        outdated-wfc-timeout 180;
    }

    handlers {
        fence-peer           "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
        split-brain          "/usr/lib/drbd/notify-split-brain.sh root";
        out-of-sync          "/usr/lib/drbd/notify-out-of-sync.sh root";
        pri-on-incon-degr    "/usr/lib/drbd/notify-pri-on-incon-degr.sh root";
        pri-lost-after-sb    "/usr/lib/drbd/notify-pri-lost-after-sb.sh root";
        local-io-error       "/usr/lib/drbd/notify-io-error.sh root";
        before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
        after-resync-target  /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
    }

    disk {
        on-io-error detach;
        fencing     resource-only;
        resync-rate 1024M;
    }

    net {
        cram-hmac-alg      sha1;
        shared-secret      "secret";
        verify-alg         sha1;
        data-integrity-alg sha1;
        after-sb-0pri      disconnect;
        after-sb-1pri      disconnect;
        after-sb-2pri      disconnect;
        sndbuf-size        0;
    }

    on nfs1 {
        address 192.168.35.121:7788;
        volume 0 {
            device             /dev/drbd0;
            disk               /dev/vg_storage1/lv_storage1;
            flexible-meta-disk /dev/vg_storage1/lv_metadata1;
        }
    }

    on nfs2 {
        address 192.168.35.122:7788;
        volume 0 {
            device             /dev/drbd0;
            disk               /dev/vg_storage1/lv_storage1;
            flexible-meta-disk /dev/vg_storage1/lv_metadata1;
        }
    }
}
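
For completeness, this is roughly how data-integrity-alg and verify-alg were disabled when I tested without them (a sketch from memory; the exact sequence may have differed slightly, and the change was made on both nodes):

# commented out in the net section of /etc/drbd.conf on both nodes:
#   verify-alg         sha1;
#   data-integrity-alg sha1;
# then reloaded the configuration and forced a reconnect on both nodes:
drbdadm adjust storage
drbdadm disconnect storage
drbdadm connect storage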

=============== Logs from nfs1 ==================================

Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Connecting channel
Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Client outdater (0xcbd710) connected
Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: invoked: outdater
Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Processing msg from outdater
Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Got message from (drbd-peer-outdater). (peer: nfs2, res :storage)
Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Starting node walk
Jul 11 23:08:46 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: node nfs2 found
Jul 11 23:08:46 nfs1 /usr/lib/heartbeat/dopd: [20492]: info: sending start_outdate message to the other node nfs1 -> nfs2
Jul 11 23:08:46 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: sending [start_outdate res: storage] to node: nfs2
Jul 11 23:08:46 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Processed 1 messages
Jul 11 23:08:46 nfs1 kernel: [ 5102.663357] d-con storage: Handshake successful: Agreed network protocol version 101
Jul 11 23:08:46 nfs1 kernel: [ 5102.663539] d-con storage: Peer authenticated using 20 bytes HMAC
Jul 11 23:08:46 nfs1 kernel: [ 5102.663587] d-con storage: conn( WFConnection -> WFReportParams )
Jul 11 23:08:46 nfs1 kernel: [ 5102.663591] d-con storage: Starting asender thread (from drbd_r_storage [12110])
Jul 11 23:08:46 nfs1 kernel: [ 5102.672679] block drbd0: drbd_sync_handshake:
Jul 11 23:08:46 nfs1 kernel: [ 5102.672685] block drbd0: self 72C2A2A1843576E3:29CAF8920E473135:3218CF664B075199:3217CF664B075198 bits:15 flags:0
Jul 11 23:08:46 nfs1 kernel: [ 5102.672689] block drbd0: peer 29CAF8920E473134:0000000000000000:3218CF664B075198:3217CF664B075198 bits:11 flags:0
Jul 11 23:08:46 nfs1 kernel: [ 5102.672692] block drbd0: uuid_compare()=1 by rule 70
Jul 11 23:08:46 nfs1 kernel: [ 5102.672699] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
Jul 11 23:08:47 nfs1 kernel: [ 5103.344839] block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0%
Jul 11 23:08:47 nfs1 kernel: [ 5103.377945] d-con storage: sock was shut down by peer
Jul 11 23:08:47 nfs1 kernel: [ 5103.377962] d-con storage: peer( Secondary -> Unknown ) conn( WFBitMapS -> BrokenPipe ) pdsk( Consistent -> DUnknown )
Jul 11 23:08:47 nfs1 kernel: [ 5103.377966] d-con storage: meta connection shut down by peer.
Jul 11 23:08:47 nfs1 kernel: [ 5103.383302] d-con storage: short read (expected size 16)
Jul 11 23:08:47 nfs1 kernel: [ 5103.383438] d-con storage: asender terminated
Jul 11 23:08:47 nfs1 kernel: [ 5103.383449] d-con storage: Terminating drbd_a_storage
Jul 11 23:08:47 nfs1 kernel: [ 5103.387969] d-con storage: Connection closed
Jul 11 23:08:47 nfs1 kernel: [ 5103.388141] d-con storage: conn( BrokenPipe -> Unconnected )
Jul 11 23:08:47 nfs1 kernel: [ 5103.388146] d-con storage: receiver terminated
Jul 11 23:08:47 nfs1 kernel: [ 5103.388148] d-con storage: Restarting receiver thread
Jul 11 23:08:47 nfs1 kernel: [ 5103.388149] d-con storage: receiver (re)started
Jul 11 23:08:47 nfs1 kernel: [ 5103.388157] d-con storage: conn( Unconnected -> WFConnection )
Jul 11 23:08:47 nfs1 kernel: [ 5103.388166] d-con storage: helper command: /sbin/drbdadm fence-peer storage

========= Logs from nfs2 ==================

Jul 11 23:08:45 nfs2 /usr/lib/heartbeat/dopd: [14301]: debug: msg_start_outdate: command: drbdadm outdate storage
Jul 11 23:08:45 nfs2 /usr/lib/heartbeat/dopd: [14301]: debug: msg_start_outdate: nfs1, command rc: 0, rc: 4
Jul 11 23:08:45 nfs2 /usr/lib/heartbeat/dopd: [14301]: info: sending return code: 4, nfs2 -> nfs1
Jul 11 23:08:45 nfs2 kernel: [ 5192.997534] block drbd0: bitmap overflow (e:42060721984381164) while decoding bm RLE packet
Jul 11 23:08:45 nfs2 kernel: [ 5193.019524] d-con storage: error receiving CBitmap, e: -5 l: 32!
Jul 11 23:08:45 nfs2 kernel: [ 5193.030454] d-con storage: peer( Primary -> Unknown ) conn( WFBitMapT -> ProtocolError ) pdsk( UpToDate -> DUnknown )
Jul 11 23:08:45 nfs2 kernel: [ 5193.030470] d-con storage: asender terminated
Jul 11 23:08:45 nfs2 kernel: [ 5193.030480] d-con storage: Terminating drbd_a_storage
Jul 11 23:08:45 nfs2 kernel: [ 5193.035377] d-con storage: Connection closed
Jul 11 23:08:45 nfs2 kernel: [ 5193.035396] d-con storage: conn( ProtocolError -> Unconnected )
Jul 11 23:08:45 nfs2 kernel: [ 5193.035398] d-con storage: receiver terminated
Jul 11 23:08:45 nfs2 kernel: [ 5193.035400] d-con storage: Restarting receiver thread
Jul 11 23:08:45 nfs2 kernel: [ 5193.035402] d-con storage: receiver (re)started
Jul 11 23:08:45 nfs2 kernel: [ 5193.035407] d-con storage: conn( Unconnected -> WFConnection )
Jul 11 23:08:46 nfs2 kernel: [ 5193.534605] d-con storage: Handshake successful: Agreed network protocol version 101
Jul 11 23:08:46 nfs2 kernel: [ 5193.534848] d-con storage: Peer authenticated using 20 bytes HMAC
Jul 11 23:08:46 nfs2 kernel: [ 5193.534876] d-con storage: conn( WFConnection -> WFReportParams )
Jul 11 23:08:46 nfs2 kernel: [ 5193.534880] d-con storage: Starting asender thread (from drbd_r_storage [10342])
Jul 11 23:08:46 nfs2 kernel: [ 5193.542628] block drbd0: drbd_sync_handshake:
Jul 11 23:08:46 nfs2 kernel: [ 5193.542634] block drbd0: self 29CAF8920E473134:0000000000000000:3218CF664B075198:3217CF664B075198 bits:11 flags:0
Jul 11 23:08:46 nfs2 kernel: [ 5193.542637] block drbd0: peer 72C2A2A1843576E3:29CAF8920E473135:3218CF664B075199:3217CF664B075198 bits:15 flags:0
Jul 11 23:08:46 nfs2 kernel: [ 5193.542640] block drbd0: uuid_compare()=-1 by rule 50
Jul 11 23:08:46 nfs2 kernel: [ 5193.542647] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Jul 11 23:08:47 nfs2 kernel: [ 5194.216363] block drbd0: bitmap overflow (e:42060721984381164) while decoding bm RLE packet
Jul 11 23:08:47 nfs2 kernel: [ 5194.238177] d-con storage: error receiving CBitmap, e: -5 l: 32!
Jul 11 23:08:47 nfs2 kernel: [ 5194.249163] d-con storage: peer( Primary -> Unknown ) conn( WFBitMapT -> ProtocolError ) pdsk( UpToDate -> DUnknown )
Jul 11 23:08:47 nfs2 kernel: [ 5194.249177] d-con storage: asender terminated
Jul 11 23:08:47 nfs2 kernel: [ 5194.249187] d-con storage: Terminating drbd_a_storage
Jul 11 23:08:47 nfs2 kernel: [ 5194.254026] d-con storage: Connection closed
Jul 11 23:08:47 nfs2 kernel: [ 5194.254045] d-con storage: conn( ProtocolError -> Unconnected )
Jul 11 23:08:47 nfs2 kernel: [ 5194.254048] d-con storage: receiver terminated
Jul 11 23:08:47 nfs2 kernel: [ 5194.254050] d-con storage: Restarting receiver thread
Jul 11 23:08:47 nfs2 kernel: [ 5194.254051] d-con storage: receiver (re)started
Jul 11 23:08:47 nfs2 kernel: [ 5194.254057] d-con storage: conn( Unconnected -> WFConnection )
Jul 11 23:08:47 nfs2 /usr/lib/heartbeat/dopd: [14301]: debug: msg_start_outdate: command: drbdadm outdate storage
Jul 11 23:08:47 nfs2 /usr/lib/heartbeat/dopd: [14301]: debug: msg_start_outdate: nfs1, command rc: 0, rc: 4
Jul 11 23:08:47 nfs2 /usr/lib/heartbeat/dopd: [14301]: info: sending return code: 4, nfs2 -> nfs1
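
P.S. If it would help, I can also compare the raw sizes of the backing data and metadata LVs on both nodes (in case the bitmap exchange is failing because of a size difference) and post the output of something like the following, run on nfs1 and nfs2:

blockdev --getsize64 /dev/vg_storage1/lv_storage1
blockdev --getsize64 /dev/vg_storage1/lv_metadata1
cat /proc/drbd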