<div dir="ltr">Hello,<br><br>I have a test DRBD setup on two nodes, on top of an 80 TiB logical volume.<br>nfs1 is configured as the primary node and nfs2 as the secondary node.<br><br>It was working properly until DRBD and heartbeat were stopped and started manually on both nodes.<br>
<br>On nfs1 I see these errors repeating in the logs:<br>block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0%<br>d-con storage: sock was shut down by peer<br><br>And on nfs2 these errors repeat in the logs:<br>
block drbd0: bitmap overflow (e:42060721984381164) while decoding bm RLE packet<br>d-con storage: error receiving CBitmap, e: -5 l: 32!<br><br>I tried disabling data-integrity-alg and verify-alg, but that did not fix the issue. I also searched the mailing list, but it doesn't look like anyone has encountered this issue.<br>
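<br>As a rough sanity check on the overflow value, the sketch below compares the "e:" number from the nfs2 log with the number of bitmap bits this volume should need. It relies on DRBD tracking one bitmap bit per 4 KiB of data. The 4 MiB extent size is an assumption (the LVM default; it matches the 80.18 TiB reported for 21019223 extents), and the function name is only illustrative. It is arithmetic only and does not identify the cause.<br>

```python
# Figures copied from the lvdisplay output and the nfs2 kernel log.
CURRENT_LE = 21_019_223            # "Current LE" of lv_storage1
REPORTED_E = 42060721984381164     # "e:" value in the bitmap overflow message

# ASSUMPTION: default LVM extent size of 4 MiB (not shown in the output).
EXTENT_BYTES = 4 * 1024 * 1024
# DRBD's bitmap covers 4 KiB of backing storage per bit.
BYTES_PER_BIT = 4096


def expected_bitmap_bits(extents, extent_bytes=EXTENT_BYTES,
                         bytes_per_bit=BYTES_PER_BIT):
    """Return the number of bitmap bits needed to cover the volume."""
    size_bytes = extents * extent_bytes
    # Round up so that a partial trailing block still gets a bit.
    return -(-size_bytes // bytes_per_bit)


if __name__ == "__main__":
    bits = expected_bitmap_bits(CURRENT_LE)
    size_tib = CURRENT_LE * EXTENT_BYTES / 1024 ** 4
    print(f"volume size:          {size_tib:.2f} TiB")
    print(f"expected bitmap bits: {bits}")
    print(f"reported e value:     {REPORTED_E}")
    print(f"e exceeds bit count:  {REPORTED_E > bits}")
```

Under these assumptions the reported value is several orders of magnitude larger than the bit count the volume needs, which is consistent with the receiver rejecting the packet as an overflow.<br>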
<br>See the logs and configuration below. Let me know if I need to provide any additional information.<br><br>Any help is really appreciated.<br><br>Thank you.<br><br><br>========== Logical volume configuration ===============<br>
The partition and LVM scheme is identical on both servers.<br><br># lvdisplay | grep 'LV Path\|LV Name\|VG Name\|LV Size\|Current LE\|Logical volume'<br> --- Logical volume ---<br> LV Path /dev/vg_storage1/lv_storage1<br>
LV Name lv_storage1<br> VG Name vg_storage1<br> LV Size 80.18 TiB<br> Current LE 21019223<br> --- Logical volume ---<br> LV Path /dev/vg_storage1/lv_metadata1<br>
LV Name lv_metadata1<br> VG Name vg_storage1<br> LV Size 8.00 GiB<br> Current LE 2048<br><br>=============== /etc/drbd.conf ==========================<br><br>global { usage-count no; }<br>
resource storage {<br> protocol C;<br> startup {<br> wfc-timeout 300;<br> degr-wfc-timeout 240;<br> outdated-wfc-timeout 180;<br> }<br> handlers {<br>
fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";<br> split-brain "/usr/lib/drbd/notify-split-brain.sh root";<br> out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";<br>
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh root";<br> pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh root";<br> local-io-error "/usr/lib/drbd/notify-io-error.sh root";<br>
before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";<br> after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;<br> }<br> disk {<br>
on-io-error detach;<br> fencing resource-only;<br> resync-rate 1024M;<br> }<br> net {<br> cram-hmac-alg sha1;<br> shared-secret "secret";<br>
verify-alg sha1;<br> data-integrity-alg sha1;<br> after-sb-0pri disconnect;<br> after-sb-1pri disconnect;<br> after-sb-2pri disconnect;<br> sndbuf-size 0;<br>
}<br> on nfs1 {<br> address 192.168.35.121:7788;<br> volume 0 {<br> device /dev/drbd0;<br> disk /dev/vg_storage1/lv_storage1;<br>
flexible-meta-disk /dev/vg_storage1/lv_metadata1;<br> }<br> }<br> on nfs2 {<br> address 192.168.35.122:7788;<br>
volume 0 {<br> device /dev/drbd0;<br> disk /dev/vg_storage1/lv_storage1;<br> flexible-meta-disk /dev/vg_storage1/lv_metadata1;<br> }<br>
}<br>} <br><br>=============== Logs from nfs1 ==================================<br>Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Connecting channel<br>Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Client outdater (0xcbd710) connected<br>
Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: invoked: outdater<br>Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Processing msg from outdater<br>Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Got message from (drbd-peer-outdater). (peer: nfs2, res :storage)<br>
Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Starting node walk<br>Jul 11 23:08:46 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: node nfs2 found<br>Jul 11 23:08:46 nfs1 /usr/lib/heartbeat/dopd: [20492]: info: sending start_outdate message to the other node nfs1 -> nfs2<br>
Jul 11 23:08:46 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: sending [start_outdate res: storage] to node: nfs2<br>Jul 11 23:08:46 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Processed 1 messages<br>Jul 11 23:08:46 nfs1 kernel: [ 5102.663357] d-con storage: Handshake successful: Agreed network protocol version 101<br>
Jul 11 23:08:46 nfs1 kernel: [ 5102.663539] d-con storage: Peer authenticated using 20 bytes HMAC<br>Jul 11 23:08:46 nfs1 kernel: [ 5102.663587] d-con storage: conn( WFConnection -> WFReportParams ) <br>Jul 11 23:08:46 nfs1 kernel: [ 5102.663591] d-con storage: Starting asender thread (from drbd_r_storage [12110])<br>
Jul 11 23:08:46 nfs1 kernel: [ 5102.672679] block drbd0: drbd_sync_handshake:<br>Jul 11 23:08:46 nfs1 kernel: [ 5102.672685] block drbd0: self 72C2A2A1843576E3:29CAF8920E473135:3218CF664B075199:3217CF664B075198 bits:15 flags:0<br>
Jul 11 23:08:46 nfs1 kernel: [ 5102.672689] block drbd0: peer 29CAF8920E473134:0000000000000000:3218CF664B075198:3217CF664B075198 bits:11 flags:0<br>Jul 11 23:08:46 nfs1 kernel: [ 5102.672692] block drbd0: uuid_compare()=1 by rule 70<br>
Jul 11 23:08:46 nfs1 kernel: [ 5102.672699] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent ) <br>Jul 11 23:08:47 nfs1 kernel: [ 5103.344839] block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0%<br>
Jul 11 23:08:47 nfs1 kernel: [ 5103.377945] d-con storage: sock was shut down by peer<br>Jul 11 23:08:47 nfs1 kernel: [ 5103.377962] d-con storage: peer( Secondary -> Unknown ) conn( WFBitMapS -> BrokenPipe ) pdsk( Consistent -> DUnknown ) <br>
Jul 11 23:08:47 nfs1 kernel: [ 5103.377966] d-con storage: meta connection shut down by peer.<br>Jul 11 23:08:47 nfs1 kernel: [ 5103.383302] d-con storage: short read (expected size 16)<br>Jul 11 23:08:47 nfs1 kernel: [ 5103.383438] d-con storage: asender terminated<br>
Jul 11 23:08:47 nfs1 kernel: [ 5103.383449] d-con storage: Terminating drbd_a_storage<br>Jul 11 23:08:47 nfs1 kernel: [ 5103.387969] d-con storage: Connection closed<br>Jul 11 23:08:47 nfs1 kernel: [ 5103.388141] d-con storage: conn( BrokenPipe -> Unconnected ) <br>
Jul 11 23:08:47 nfs1 kernel: [ 5103.388146] d-con storage: receiver terminated<br>Jul 11 23:08:47 nfs1 kernel: [ 5103.388148] d-con storage: Restarting receiver thread<br>Jul 11 23:08:47 nfs1 kernel: [ 5103.388149] d-con storage: receiver (re)started<br>
Jul 11 23:08:47 nfs1 kernel: [ 5103.388157] d-con storage: conn( Unconnected -> WFConnection ) <br>Jul 11 23:08:47 nfs1 kernel: [ 5103.388166] d-con storage: helper command: /sbin/drbdadm fence-peer storage<br><br>========= Logs from nfs2 ==================<br>
<br>Jul 11 23:08:45 nfs2 /usr/lib/heartbeat/dopd: [14301]: debug: msg_start_outdate: command: drbdadm outdate storage<br>Jul 11 23:08:45 nfs2 /usr/lib/heartbeat/dopd: [14301]: debug: msg_start_outdate: nfs1, command rc: 0, rc: 4<br>
Jul 11 23:08:45 nfs2 /usr/lib/heartbeat/dopd: [14301]: info: sending return code: 4, nfs2 -> nfs1<br>Jul 11 23:08:45 nfs2 kernel: [ 5192.997534] block drbd0: bitmap overflow (e:42060721984381164) while decoding bm RLE packet<br>
Jul 11 23:08:45 nfs2 kernel: [ 5193.019524] d-con storage: error receiving CBitmap, e: -5 l: 32!<br>Jul 11 23:08:45 nfs2 kernel: [ 5193.030454] d-con storage: peer( Primary -> Unknown ) conn( WFBitMapT -> ProtocolError ) pdsk( UpToDate -> DUnknown )<br>
Jul 11 23:08:45 nfs2 kernel: [ 5193.030470] d-con storage: asender terminated<br>Jul 11 23:08:45 nfs2 kernel: [ 5193.030480] d-con storage: Terminating drbd_a_storage<br>Jul 11 23:08:45 nfs2 kernel: [ 5193.035377] d-con storage: Connection closed<br>
Jul 11 23:08:45 nfs2 kernel: [ 5193.035396] d-con storage: conn( ProtocolError -> Unconnected )<br>Jul 11 23:08:45 nfs2 kernel: [ 5193.035398] d-con storage: receiver terminated<br>Jul 11 23:08:45 nfs2 kernel: [ 5193.035400] d-con storage: Restarting receiver thread<br>
Jul 11 23:08:45 nfs2 kernel: [ 5193.035402] d-con storage: receiver (re)started<br>Jul 11 23:08:45 nfs2 kernel: [ 5193.035407] d-con storage: conn( Unconnected -> WFConnection )<br>Jul 11 23:08:46 nfs2 kernel: [ 5193.534605] d-con storage: Handshake successful: Agreed network protocol version 101<br>
Jul 11 23:08:46 nfs2 kernel: [ 5193.534848] d-con storage: Peer authenticated using 20 bytes HMAC<br>Jul 11 23:08:46 nfs2 kernel: [ 5193.534876] d-con storage: conn( WFConnection -> WFReportParams )<br>Jul 11 23:08:46 nfs2 kernel: [ 5193.534880] d-con storage: Starting asender thread (from drbd_r_storage [10342])<br>
Jul 11 23:08:46 nfs2 kernel: [ 5193.542628] block drbd0: drbd_sync_handshake:<br>Jul 11 23:08:46 nfs2 kernel: [ 5193.542634] block drbd0: self 29CAF8920E473134:0000000000000000:3218CF664B075198:3217CF664B075198 bits:11 flags:0<br>
Jul 11 23:08:46 nfs2 kernel: [ 5193.542637] block drbd0: peer 72C2A2A1843576E3:29CAF8920E473135:3218CF664B075199:3217CF664B075198 bits:15 flags:0<br>Jul 11 23:08:46 nfs2 kernel: [ 5193.542640] block drbd0: uuid_compare()=-1 by rule 50<br>
Jul 11 23:08:46 nfs2 kernel: [ 5193.542647] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )<br>Jul 11 23:08:47 nfs2 kernel: [ 5194.216363] block drbd0: bitmap overflow (e:42060721984381164) while decoding bm RLE packet<br>
Jul 11 23:08:47 nfs2 kernel: [ 5194.238177] d-con storage: error receiving CBitmap, e: -5 l: 32!<br>Jul 11 23:08:47 nfs2 kernel: [ 5194.249163] d-con storage: peer( Primary -> Unknown ) conn( WFBitMapT -> ProtocolError ) pdsk( UpToDate -> DUnknown )<br>
Jul 11 23:08:47 nfs2 kernel: [ 5194.249177] d-con storage: asender terminated<br>Jul 11 23:08:47 nfs2 kernel: [ 5194.249187] d-con storage: Terminating drbd_a_storage<br>Jul 11 23:08:47 nfs2 kernel: [ 5194.254026] d-con storage: Connection closed<br>
Jul 11 23:08:47 nfs2 kernel: [ 5194.254045] d-con storage: conn( ProtocolError -> Unconnected )<br>Jul 11 23:08:47 nfs2 kernel: [ 5194.254048] d-con storage: receiver terminated<br>Jul 11 23:08:47 nfs2 kernel: [ 5194.254050] d-con storage: Restarting receiver thread<br>
Jul 11 23:08:47 nfs2 kernel: [ 5194.254051] d-con storage: receiver (re)started<br>Jul 11 23:08:47 nfs2 kernel: [ 5194.254057] d-con storage: conn( Unconnected -> WFConnection )<br>Jul 11 23:08:47 nfs2 /usr/lib/heartbeat/dopd: [14301]: debug: msg_start_outdate: command: drbdadm outdate storage<br>
Jul 11 23:08:47 nfs2 /usr/lib/heartbeat/dopd: [14301]: debug: msg_start_outdate: nfs1, command rc: 0, rc: 4<br>Jul 11 23:08:47 nfs2 /usr/lib/heartbeat/dopd: [14301]: info: sending return code: 4, nfs2 -> nfs1<br><br></div>