[Drbd-dev] DRBD Reconnection Bug

John Muth muth at parascale.com
Tue Jul 1 08:29:13 CEST 2008


I've been testing DRBD 8.2.6 on top of Centos 5.1 for use in a lights-out
environment.

I think I ran across a bug.Ha sanyone seen anything like this:

Background:

I'm running DRBD on top of LVM. I have a LVM volume named 'sm-vol' on two
nodes and I've got /dev/drbd0 mapped to it. I replicate using 'protocol C'.
At the beginning of the test node A is the primary and node B is the
secondary. My test script does:

  1. Write verifiable data to /dev/drbd0 on node A. Create a LVM snapshot on
node B and verify the data.
  2. Bring down node B using 'service drbd stop' Write more data frm node A.
Bring the node B back up with 'service drbd start'. When re-sync is
complete, create a LVM snapshot on the node B and verify that the contents
match what was written by the node A.
  3. Bring down node A with 'service drbd stop'. Make the node B into the
primary (drbdadm primary sm-vol). Write more data to the /dev/drbd0 from
node B.. Bring node A back with 'service drbd start'. Wait for resync to
complete. Create LVM snapshot on node A and erify data.
  4. Run 'drbdadm secondary sm-vol' on node B followed by 'drbdadm primary
sm-vol' on node A.
  5  Repeat

Once every 100 or so loops, DRBD gets wedged during a reconnection. When the
system gets wedged, I do 'cat /etc/drbd' on both nodes and I see the
following:

On primary:

[root at vm5 ~]# cat /proc/drbd
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by
buildsvn at c5-i386-build, 2008-06-26 16:40:17
 0: cs:WFBitMapS st:Primary/Secondary ds:UpToDate/Outdated C r---
    ns:288064 nr:94516 dw:613836 dr:845384 al:122 bm:31 lo:0 pe:0 ua:0 ap:0
oos:95888

On secondary:

[root at vm6 ~]# cat /proc/drbd
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by
buildsvn at c5-i386-build, 2008-06-26 16:40:17
 0: cs:WFBitMapT st:Secondary/Primary ds:Negotiating/UpToDate C r---
    ns:0 nr:0 dw:0 dr:0 al:106 bm:93 lo:0 pe:0 ua:0 ap:0 oos:95888

/var/log/messages looks like:

On primary:

Jun 30 13:30:08 vm5 kernel: drbd0: peer( Secondary -> Unknown ) conn(
Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Jun 30 13:30:08 vm5 kernel: drbd0: Writing meta data super block now.
Jun 30 13:30:08 vm5 kernel: drbd0: Creating new current UUID
Jun 30 13:30:08 vm5 kernel: drbd0: Writing meta data super block now.
Jun 30 13:30:08 vm5 kernel: drbd0: asender terminated
Jun 30 13:30:08 vm5 kernel: drbd0: Terminating asender thread
Jun 30 13:30:08 vm5 kernel: drbd0: tl_clear()
Jun 30 13:30:08 vm5 kernel: drbd0: Connection closed
Jun 30 13:30:08 vm5 kernel: drbd0: conn( TearDown -> Unconnected )
Jun 30 13:30:08 vm5 kernel: drbd0: receiver terminated
Jun 30 13:30:08 vm5 kernel: drbd0: receiver (re)started
Jun 30 13:30:08 vm5 kernel: drbd0: conn( Unconnected -> WFConnection )
Jun 30 13:30:08 vm5 kernel: drbd0: Handshake successful: Agreed network
protocol version 88
Jun 30 13:30:08 vm5 kernel: drbd0: conn( WFConnection -> WFReportParams )
Jun 30 13:30:08 vm5 kernel: drbd0: Starting asender thread (from
drbd0_receiver [12327])
Jun 30 13:30:08 vm5 kernel: drbd0: data-integrity-alg: <not-used>
Jun 30 13:30:08 vm5 kernel: drbd0: peer( Unknown -> Secondary ) conn(
WFReportParams -> Connected ) pdsk( DUnknown -> Diskless )
Jun 30 13:30:08 vm5 kernel: drbd0: Writing meta data super block now.
Jun 30 13:30:17 vm5 kernel: drbd0: real peer disk state = Consistent
Jun 30 13:30:17 vm5 kernel: drbd0: conn( Connected -> WFBitMapS ) pdsk(
Diskless -> Outdated )
Jun 30 13:30:17 vm5 kernel: drbd0: Writing meta data super block now.


On secondary:

Jun 30 13:28:46 vm6 kernel: drbd0: peer( Primary -> Unknown ) conn(
Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jun 30 13:28:46 vm6 kernel: drbd0: asender terminated
Jun 30 13:28:46 vm6 kernel: drbd0: Terminating asender thread
Jun 30 13:28:46 vm6 kernel: drbd0: Writing meta data super block now.
Jun 30 13:28:46 vm6 kernel: drbd0: sock was shut down by peer
Jun 30 13:28:46 vm6 kernel: drbd0: short read expecting header on sock: r=0
Jun 30 13:28:46 vm6 kernel: drbd0: tl_clear()
Jun 30 13:28:46 vm6 kernel: drbd0: Connection closed
Jun 30 13:28:46 vm6 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Jun 30 13:28:46 vm6 kernel: drbd0: receiver terminated
Jun 30 13:28:46 vm6 kernel: drbd0: receiver (re)started
Jun 30 13:28:46 vm6 kernel: drbd0: conn( Unconnected -> WFConnection )
Jun 30 13:28:46 vm6 kernel: drbd0: disk( UpToDate -> Diskless )
Jun 30 13:28:46 vm6 kernel: drbd0: Handshake successful: Agreed network
protocol version 88
Jun 30 13:28:46 vm6 kernel: drbd0: conn( WFConnection -> WFReportParams )
Jun 30 13:28:46 vm6 kernel: drbd0: Starting asender thread (from
drbd0_receiver [9457])
Jun 30 13:28:46 vm6 kernel: drbd0: data-integrity-alg: <not-used>
Jun 30 13:28:46 vm6 kernel: drbd0: peer( Unknown -> Primary ) conn(
WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
Jun 30 13:28:55 vm6 kernel: drbd0: disk( Diskless -> Attaching )
Jun 30 13:28:55 vm6 kernel: drbd0: Found 4 transactions (145 active extents)
in activity log.
Jun 30 13:28:55 vm6 kernel: drbd0: max_segment_size ( = BIO size ) = 32768
Jun 30 13:28:55 vm6 kernel: drbd0: reading of bitmap took 1 jiffies
Jun 30 13:28:55 vm6 kernel: drbd0: recounting of set bits took additional 0
jiffies
Jun 30 13:28:55 vm6 kernel: drbd0: 0 KB (0 bits) marked out-of-sync by on
disk bit-map.
Jun 30 13:28:55 vm6 kernel: drbd0: disk( Attaching -> Negotiating )
Jun 30 13:28:55 vm6 kernel: drbd0: Writing meta data super block now.
Jun 30 13:28:55 vm6 kernel: drbd0: unexpected cstate (Connected) in
receive_bitmap
Jun 30 13:28:55 vm6 kernel: drbd0: conn( Connected -> WFBitMapT )

Has anyone seen this before?

-John Muth
Parascale Software
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.linbit.com/pipermail/drbd-dev/attachments/20080630/faac1c23/attachment.html


More information about the drbd-dev mailing list