[DRBD-user] Bitmap overflow while decoding bm RLE packet after DRBD restart.

Viktar Sakovich viktar at serverstack.com
Fri Jul 12 05:24:37 CEST 2013

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello,

I have testing DRBD setup on two nodes on top on 80Tb logical volume.
nfs1 is configured as primary node nfs2 and secondary node.

It was working properly until DRBD and heartbeat were stopped and started
manually on both nodes.

On nfs1 I see in logs repetitive errors :
block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1),
total 48; compression: 100.0%
d-con storage: sock was shut down by peer

And on nfs2 there are repetitive errors in logs:
block drbd0: bitmap overflow (e:42060721984381164) while decoding bm RLE
packet
d-con storage: error receiving CBitmap, e: -5 l: 32!

I tried to disable data-integrity-alg, verify-alg but it hasn't fixed the
issue. Also searched mailing list, but doesn't look like anyone encountered
with such issue.

See logs and configuration bellow. Let me know if I need to provide some
additional information.

Any help really appreciated.

Thank you.


========== Logical volume configuration ===============
Partition and LVM scheme is identical on both servers

# lvdisplay  | grep 'LV Path\|LV Name\|VG Name\|LV Size\|Current
LE\|Logical volume'
  --- Logical volume ---
  LV Path                /dev/vg_storage1/lv_storage1
  LV Name                lv_storage1
  VG Name                vg_storage1
  LV Size                80.18 TiB
  Current LE             21019223
  --- Logical volume ---
  LV Path                /dev/vg_storage1/lv_metadata1
  LV Name                lv_metadata1
  VG Name                vg_storage1
  LV Size                8.00 GiB
  Current LE             2048

=============== /etc/drbd.conf ==========================

global { usage-count no; }
resource storage {
        protocol C;
        startup {
                wfc-timeout  300;
                degr-wfc-timeout 240;
                outdated-wfc-timeout 180;
        }
        handlers {
                fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
                split-brain "/usr/lib/drbd/notify-split-brain.sh root";
                out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
                pri-on-incon-degr
"/usr/lib/drbd/notify-pri-on-incon-degr.sh root";
                pri-lost-after-sb
"/usr/lib/drbd/notify-pri-lost-after-sb.sh root";
                local-io-error "/usr/lib/drbd/notify-io-error.sh root";
                before-resync-target
"/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
                after-resync-target
/usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
        }
        disk {
                on-io-error detach;
                fencing resource-only;
                resync-rate 1024M;
        }
        net {
                cram-hmac-alg sha1;
                shared-secret "secret";
                verify-alg sha1;
                data-integrity-alg sha1;
                after-sb-0pri disconnect;
                after-sb-1pri disconnect;
                after-sb-2pri disconnect;
                sndbuf-size 0;
        }
        on nfs1 {
                address 192.168.35.121:7788;
                volume 0 {
                        device /dev/drbd0;
                        disk /dev/vg_storage1/lv_storage1;
                        flexible-meta-disk /dev/vg_storage1/lv_metadata1;
                }
        }
        on nfs2 {
                address 192.168.35.122:7788;
                volume 0 {
                        device /dev/drbd0;
                        disk /dev/vg_storage1/lv_storage1;
                        flexible-meta-disk /dev/vg_storage1/lv_metadata1;
                }
        }
}

=============== Logs from nfs1 ==================================
Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Connecting
channel
Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Client
outdater (0xcbd710) connected
Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: invoked:
outdater
Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Processing
msg from outdater
Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Got message
from (drbd-peer-outdater). (peer: nfs2, res :storage)
Jul 11 23:08:45 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Starting node
walk
Jul 11 23:08:46 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: node nfs2
found
Jul 11 23:08:46 nfs1 /usr/lib/heartbeat/dopd: [20492]: info: sending
start_outdate message to the other node nfs1 -> nfs2
Jul 11 23:08:46 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: sending
[start_outdate res: storage] to node: nfs2
Jul 11 23:08:46 nfs1 /usr/lib/heartbeat/dopd: [20492]: debug: Processed 1
messages
Jul 11 23:08:46 nfs1 kernel: [ 5102.663357] d-con storage: Handshake
successful: Agreed network protocol version 101
Jul 11 23:08:46 nfs1 kernel: [ 5102.663539] d-con storage: Peer
authenticated using 20 bytes HMAC
Jul 11 23:08:46 nfs1 kernel: [ 5102.663587] d-con storage: conn(
WFConnection -> WFReportParams )
Jul 11 23:08:46 nfs1 kernel: [ 5102.663591] d-con storage: Starting asender
thread (from drbd_r_storage [12110])
Jul 11 23:08:46 nfs1 kernel: [ 5102.672679] block drbd0:
drbd_sync_handshake:
Jul 11 23:08:46 nfs1 kernel: [ 5102.672685] block drbd0: self
72C2A2A1843576E3:29CAF8920E473135:3218CF664B075199:3217CF664B075198 bits:15
flags:0
Jul 11 23:08:46 nfs1 kernel: [ 5102.672689] block drbd0: peer
29CAF8920E473134:0000000000000000:3218CF664B075198:3217CF664B075198 bits:11
flags:0
Jul 11 23:08:46 nfs1 kernel: [ 5102.672692] block drbd0: uuid_compare()=1
by rule 70
Jul 11 23:08:46 nfs1 kernel: [ 5102.672699] block drbd0: peer( Unknown ->
Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown ->
Consistent )
Jul 11 23:08:47 nfs1 kernel: [ 5103.344839] block drbd0: send bitmap stats
[Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0%
Jul 11 23:08:47 nfs1 kernel: [ 5103.377945] d-con storage: sock was shut
down by peer
Jul 11 23:08:47 nfs1 kernel: [ 5103.377962] d-con storage: peer( Secondary
-> Unknown ) conn( WFBitMapS -> BrokenPipe ) pdsk( Consistent -> DUnknown )
Jul 11 23:08:47 nfs1 kernel: [ 5103.377966] d-con storage: meta connection
shut down by peer.
Jul 11 23:08:47 nfs1 kernel: [ 5103.383302] d-con storage: short read
(expected size 16)
Jul 11 23:08:47 nfs1 kernel: [ 5103.383438] d-con storage: asender
terminated
Jul 11 23:08:47 nfs1 kernel: [ 5103.383449] d-con storage: Terminating
drbd_a_storage
Jul 11 23:08:47 nfs1 kernel: [ 5103.387969] d-con storage: Connection closed
Jul 11 23:08:47 nfs1 kernel: [ 5103.388141] d-con storage: conn( BrokenPipe
-> Unconnected )
Jul 11 23:08:47 nfs1 kernel: [ 5103.388146] d-con storage: receiver
terminated
Jul 11 23:08:47 nfs1 kernel: [ 5103.388148] d-con storage: Restarting
receiver thread
Jul 11 23:08:47 nfs1 kernel: [ 5103.388149] d-con storage: receiver
(re)started
Jul 11 23:08:47 nfs1 kernel: [ 5103.388157] d-con storage: conn(
Unconnected -> WFConnection )
Jul 11 23:08:47 nfs1 kernel: [ 5103.388166] d-con storage: helper command:
/sbin/drbdadm fence-peer storage

========= Logs from nfs2 ==================

Jul 11 23:08:45 nfs2 /usr/lib/heartbeat/dopd: [14301]: debug:
msg_start_outdate: command: drbdadm outdate storage
Jul 11 23:08:45 nfs2 /usr/lib/heartbeat/dopd: [14301]: debug:
msg_start_outdate: nfs1, command rc: 0, rc: 4
Jul 11 23:08:45 nfs2 /usr/lib/heartbeat/dopd: [14301]: info: sending return
code: 4, nfs2 -> nfs1
Jul 11 23:08:45 nfs2 kernel: [ 5192.997534] block drbd0: bitmap overflow
(e:42060721984381164) while decoding bm RLE packet
Jul 11 23:08:45 nfs2 kernel: [ 5193.019524] d-con storage: error receiving
CBitmap, e: -5 l: 32!
Jul 11 23:08:45 nfs2 kernel: [ 5193.030454] d-con storage: peer( Primary ->
Unknown ) conn( WFBitMapT -> ProtocolError ) pdsk( UpToDate -> DUnknown )
Jul 11 23:08:45 nfs2 kernel: [ 5193.030470] d-con storage: asender
terminated
Jul 11 23:08:45 nfs2 kernel: [ 5193.030480] d-con storage: Terminating
drbd_a_storage
Jul 11 23:08:45 nfs2 kernel: [ 5193.035377] d-con storage: Connection closed
Jul 11 23:08:45 nfs2 kernel: [ 5193.035396] d-con storage: conn(
ProtocolError -> Unconnected )
Jul 11 23:08:45 nfs2 kernel: [ 5193.035398] d-con storage: receiver
terminated
Jul 11 23:08:45 nfs2 kernel: [ 5193.035400] d-con storage: Restarting
receiver thread
Jul 11 23:08:45 nfs2 kernel: [ 5193.035402] d-con storage: receiver
(re)started
Jul 11 23:08:45 nfs2 kernel: [ 5193.035407] d-con storage: conn(
Unconnected -> WFConnection )
Jul 11 23:08:46 nfs2 kernel: [ 5193.534605] d-con storage: Handshake
successful: Agreed network protocol version 101
Jul 11 23:08:46 nfs2 kernel: [ 5193.534848] d-con storage: Peer
authenticated using 20 bytes HMAC
Jul 11 23:08:46 nfs2 kernel: [ 5193.534876] d-con storage: conn(
WFConnection -> WFReportParams )
Jul 11 23:08:46 nfs2 kernel: [ 5193.534880] d-con storage: Starting asender
thread (from drbd_r_storage [10342])
Jul 11 23:08:46 nfs2 kernel: [ 5193.542628] block drbd0:
drbd_sync_handshake:
Jul 11 23:08:46 nfs2 kernel: [ 5193.542634] block drbd0: self
29CAF8920E473134:0000000000000000:3218CF664B075198:3217CF664B075198 bits:11
flags:0
Jul 11 23:08:46 nfs2 kernel: [ 5193.542637] block drbd0: peer
72C2A2A1843576E3:29CAF8920E473135:3218CF664B075199:3217CF664B075198 bits:15
flags:0
Jul 11 23:08:46 nfs2 kernel: [ 5193.542640] block drbd0: uuid_compare()=-1
by rule 50
Jul 11 23:08:46 nfs2 kernel: [ 5193.542647] block drbd0: peer( Unknown ->
Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Jul 11 23:08:47 nfs2 kernel: [ 5194.216363] block drbd0: bitmap overflow
(e:42060721984381164) while decoding bm RLE packet
Jul 11 23:08:47 nfs2 kernel: [ 5194.238177] d-con storage: error receiving
CBitmap, e: -5 l: 32!
Jul 11 23:08:47 nfs2 kernel: [ 5194.249163] d-con storage: peer( Primary ->
Unknown ) conn( WFBitMapT -> ProtocolError ) pdsk( UpToDate -> DUnknown )
Jul 11 23:08:47 nfs2 kernel: [ 5194.249177] d-con storage: asender
terminated
Jul 11 23:08:47 nfs2 kernel: [ 5194.249187] d-con storage: Terminating
drbd_a_storage
Jul 11 23:08:47 nfs2 kernel: [ 5194.254026] d-con storage: Connection closed
Jul 11 23:08:47 nfs2 kernel: [ 5194.254045] d-con storage: conn(
ProtocolError -> Unconnected )
Jul 11 23:08:47 nfs2 kernel: [ 5194.254048] d-con storage: receiver
terminated
Jul 11 23:08:47 nfs2 kernel: [ 5194.254050] d-con storage: Restarting
receiver thread
Jul 11 23:08:47 nfs2 kernel: [ 5194.254051] d-con storage: receiver
(re)started
Jul 11 23:08:47 nfs2 kernel: [ 5194.254057] d-con storage: conn(
Unconnected -> WFConnection )
Jul 11 23:08:47 nfs2 /usr/lib/heartbeat/dopd: [14301]: debug:
msg_start_outdate: command: drbdadm outdate storage
Jul 11 23:08:47 nfs2 /usr/lib/heartbeat/dopd: [14301]: debug:
msg_start_outdate: nfs1, command rc: 0, rc: 4
Jul 11 23:08:47 nfs2 /usr/lib/heartbeat/dopd: [14301]: info: sending return
code: 4, nfs2 -> nfs1
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20130711/646cf89c/attachment.htm>


More information about the drbd-user mailing list