[DRBD-user] Initial sync stalls forever with many drbd disks

Fri Apr 13 18:19:36 CEST 2012

Hey all,

I am currently running into an issue using DRBD in a Xen cluster
(managed by Ganeti).

When adding drbd instances, I can add up to 17 without issue, but the
18th instance stalls on initial sync:

block drbd17: Starting worker thread (from cqueue/2 [261])
block drbd17: disk( Diskless -> Attaching )
block drbd17: No usable activity log found.
block drbd17: Method to ensure write ordering: barrier
block drbd17: max_segment_size ( = BIO size ) = 32768
block drbd17: drbd_bm_resize called with capacity == 419430400
block drbd17: resync bitmap: bits=52428800 words=819200
block drbd17: size = 200 GB (209715200 KB)
block drbd17: Writing the whole bitmap, size changed
block drbd17: 200 GB (52428800 bits) marked out-of-sync by on disk bit-map.
block drbd17: recounting of set bits took additional 2 jiffies
block drbd17: 200 GB (52428800 bits) marked out-of-sync by on disk bit-map.
block drbd17: disk( Attaching -> Inconsistent )
block drbd17: Barriers not supported on meta data device - disabling
block drbd17: conn( StandAlone -> Unconnected )
block drbd17: Starting receiver thread (from drbd17_worker [21794])
block drbd17: receiver (re)started
block drbd17: conn( Unconnected -> WFConnection )
block drbd17: Handshake successful: Agreed network protocol version 94
block drbd17: Peer authenticated using 16 bytes of 'md5' HMAC
block drbd17: conn( WFConnection -> WFReportParams )
block drbd17: Starting asender thread (from drbd17_receiver [21799])
block drbd17: data-integrity-alg: <not-used>
block drbd17: drbd_sync_handshake:
block drbd17: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:52428800 flags:0
block drbd17: peer 4829B58EB3A8FE8D:0000000000000004:0000000000000000:0000000000000000 bits:52428800 flags:0
block drbd17: uuid_compare()=-2 by rule 20
block drbd17: Becoming sync target due to disk states.
block drbd17: Writing the whole bitmap, full sync required after drbd_sync_handshake.
block drbd17: 200 GB (52428800 bits) marked out-of-sync by on disk bit-map.
block drbd17: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
block drbd17: conn( WFBitMapT -> WFSyncUUID )
block drbd17: helper command: /bin/true before-resync-target minor-17
block drbd17: helper command: /bin/true before-resync-target minor-17 exit code 0 (0x0)
block drbd17: conn( WFSyncUUID -> SyncTarget )
block drbd17: Began resync as SyncTarget (will sync 209715200 KB [52428800 bits set]).
block drbd17: peer( Primary -> Unknown ) conn( SyncTarget -> Disconnecting ) pdsk( UpToDate -> DUnknown )
block drbd17: short read expecting header on sock: r=-512
block drbd17: meta connection shut down by peer.
block drbd17: asender terminated
block drbd17: Terminating asender thread
block drbd17: Connection closed
block drbd17: conn( Disconnecting -> StandAlone )
block drbd17: receiver terminated
block drbd17: Terminating receiver thread
block drbd17: disk( Inconsistent -> Diskless )
block drbd17: drbd_bm_resize called with capacity == 0
block drbd17: worker terminated
block drbd17: Terminating worker thread

I am running CentOS 5 with Xen and DRBD 8.3.8. I have tried multiple
kernel/DRBD (8.3.2/8)/BIOS combinations to no avail. This behavior is
consistent across all nodes (currently 5). I have even swapped out the
switch that carries the DRBD traffic.

Currently Xen is running with 4 GB of RAM allocated to dom0, with over
2 GB free on each node.

Do I simply not have enough RAM allocated to dom0, or am I missing
something else?
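
As a rough sanity check on the RAM question, the bitmap figures in the log
above can be reproduced with a little arithmetic. This is only a sketch: it
assumes 512-byte sectors, one bitmap bit per 4 KiB of storage, and 64-bit
bitmap words. Those values are not stated in the log, but they do reproduce
its "bits" and "words" numbers. The function name is made up for illustration.

```python
SECTOR_BYTES = 512      # assumption: capacity is reported in 512-byte sectors
BYTES_PER_BIT = 4096    # assumption: one bitmap bit covers 4 KiB
BITS_PER_WORD = 64      # assumption: 64-bit bitmap words


def bitmap_footprint(capacity_sectors):
    """Estimate the resync bitmap size for one device (hypothetical helper)."""
    size_bytes = capacity_sectors * SECTOR_BYTES
    # Round up so a partial trailing block or word still gets counted.
    bits = -(-size_bytes // BYTES_PER_BIT)
    words = -(-bits // BITS_PER_WORD)
    return {
        "size_kb": size_bytes // 1024,
        "bits": bits,
        "words": words,
        "bitmap_bytes": words * (BITS_PER_WORD // 8),
    }


# Capacity taken from "drbd_bm_resize called with capacity == 419430400".
fp = bitmap_footprint(419430400)
print(fp)
print(18 * fp["bitmap_bytes"] / 2**20, "MiB of bitmap for 18 such devices")
```

Under those assumptions the bitmap for one 200 GB device is about 6.25 MiB,
or roughly 112.5 MiB for 18 devices of that size. That would suggest the
bitmaps alone do not account for 2 GB of free dom0 memory, though it says
nothing about other per-device allocations.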

Any thoughts/assistance is appreciated.

-- 
Andrew Maldonado
Systems Administrator
Pictage, Inc.