I've been testing DRBD 8.2.6 on top of Centos 5.1 for use in a lights-out environment.<br><br>I think I ran across a bug.Ha sanyone seen anything like this:<br><br>Background:<br><br>I'm running DRBD on top of LVM. I have a LVM volume named 'sm-vol' on two nodes and I've got /dev/drbd0 mapped to it. I replicate using 'protocol C'. At the beginning of the test node A is the primary and node B is the secondary. My test script does:<br>
<br> 1. Write verifiable data to /dev/drbd0 on node A. Create a LVM snapshot on node B and verify the data.<br> 2. Bring down node B using 'service drbd stop' Write more data frm node A. Bring the node B back up with 'service drbd start'. When re-sync is complete, create a LVM snapshot on the node B and verify that the contents match what was written by the node A.<br>
3. Bring down node A with 'service drbd stop'. Make the node B into the primary (drbdadm primary sm-vol). Write more data to the /dev/drbd0 from node B.. Bring node A back with 'service drbd start'. Wait for resync to complete. Create LVM snapshot on node A and erify data.<br>
4. Run 'drbdadm secondary sm-vol' on node B followed by 'drbdadm primary sm-vol' on node A.<br> 5 Repeat<br><br>Once every 100 or so loops, DRBD gets wedged during a reconnection. When the system gets wedged, I do 'cat /etc/drbd' on both nodes and I see the following:<br>
<br>On primary:<br><br>[root@vm5 ~]# cat /proc/drbd<br>version: 8.2.6 (api:88/proto:86-88)<br>GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn@c5-i386-build, 2008-06-26 16:40:17<br> 0: cs:WFBitMapS st:Primary/Secondary ds:UpToDate/Outdated C r---<br>
ns:288064 nr:94516 dw:613836 dr:845384 al:122 bm:31 lo:0 pe:0 ua:0 ap:0 oos:95888<br><br>On secondary:<br><br>[root@vm6 ~]# cat /proc/drbd<br>version: 8.2.6 (api:88/proto:86-88)<br>GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn@c5-i386-build, 2008-06-26 16:40:17<br>
0: cs:WFBitMapT st:Secondary/Primary ds:Negotiating/UpToDate C r---<br> ns:0 nr:0 dw:0 dr:0 al:106 bm:93 lo:0 pe:0 ua:0 ap:0 oos:95888<br><br>/var/log/messages looks like:<br><br>On primary:<br><br>Jun 30 13:30:08 vm5 kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )<br>
Jun 30 13:30:08 vm5 kernel: drbd0: Writing meta data super block now.<br>Jun 30 13:30:08 vm5 kernel: drbd0: Creating new current UUID<br>Jun 30 13:30:08 vm5 kernel: drbd0: Writing meta data super block now.<br>Jun 30 13:30:08 vm5 kernel: drbd0: asender terminated<br>
Jun 30 13:30:08 vm5 kernel: drbd0: Terminating asender thread<br>Jun 30 13:30:08 vm5 kernel: drbd0: tl_clear()<br>Jun 30 13:30:08 vm5 kernel: drbd0: Connection closed<br>Jun 30 13:30:08 vm5 kernel: drbd0: conn( TearDown -> Unconnected )<br>
Jun 30 13:30:08 vm5 kernel: drbd0: receiver terminated<br>Jun 30 13:30:08 vm5 kernel: drbd0: receiver (re)started<br>Jun 30 13:30:08 vm5 kernel: drbd0: conn( Unconnected -> WFConnection )<br>Jun 30 13:30:08 vm5 kernel: drbd0: Handshake successful: Agreed network protocol version 88<br>
Jun 30 13:30:08 vm5 kernel: drbd0: conn( WFConnection -> WFReportParams )<br>Jun 30 13:30:08 vm5 kernel: drbd0: Starting asender thread (from drbd0_receiver [12327])<br>Jun 30 13:30:08 vm5 kernel: drbd0: data-integrity-alg: <not-used><br>
Jun 30 13:30:08 vm5 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Diskless )<br>Jun 30 13:30:08 vm5 kernel: drbd0: Writing meta data super block now.<br>Jun 30 13:30:17 vm5 kernel: drbd0: real peer disk state = Consistent<br>
Jun 30 13:30:17 vm5 kernel: drbd0: conn( Connected -> WFBitMapS ) pdsk( Diskless -> Outdated )<br>Jun 30 13:30:17 vm5 kernel: drbd0: Writing meta data super block now.<br><br><br>On secondary:<br><br>Jun 30 13:28:46 vm6 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )<br>
Jun 30 13:28:46 vm6 kernel: drbd0: asender terminated<br>Jun 30 13:28:46 vm6 kernel: drbd0: Terminating asender thread<br>Jun 30 13:28:46 vm6 kernel: drbd0: Writing meta data super block now.<br>Jun 30 13:28:46 vm6 kernel: drbd0: sock was shut down by peer<br>
Jun 30 13:28:46 vm6 kernel: drbd0: short read expecting header on sock: r=0<br>Jun 30 13:28:46 vm6 kernel: drbd0: tl_clear()<br>Jun 30 13:28:46 vm6 kernel: drbd0: Connection closed<br>Jun 30 13:28:46 vm6 kernel: drbd0: conn( NetworkFailure -> Unconnected )<br>
Jun 30 13:28:46 vm6 kernel: drbd0: receiver terminated<br>Jun 30 13:28:46 vm6 kernel: drbd0: receiver (re)started<br>Jun 30 13:28:46 vm6 kernel: drbd0: conn( Unconnected -> WFConnection )<br>Jun 30 13:28:46 vm6 kernel: drbd0: disk( UpToDate -> Diskless )<br>
Jun 30 13:28:46 vm6 kernel: drbd0: Handshake successful: Agreed network protocol version 88<br>Jun 30 13:28:46 vm6 kernel: drbd0: conn( WFConnection -> WFReportParams )<br>Jun 30 13:28:46 vm6 kernel: drbd0: Starting asender thread (from drbd0_receiver [9457])<br>
Jun 30 13:28:46 vm6 kernel: drbd0: data-integrity-alg: <not-used><br>Jun 30 13:28:46 vm6 kernel: drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )<br>Jun 30 13:28:55 vm6 kernel: drbd0: disk( Diskless -> Attaching )<br>
Jun 30 13:28:55 vm6 kernel: drbd0: Found 4 transactions (145 active extents) in activity log.<br>Jun 30 13:28:55 vm6 kernel: drbd0: max_segment_size ( = BIO size ) = 32768<br>Jun 30 13:28:55 vm6 kernel: drbd0: reading of bitmap took 1 jiffies<br>
Jun 30 13:28:55 vm6 kernel: drbd0: recounting of set bits took additional 0 jiffies<br>Jun 30 13:28:55 vm6 kernel: drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.<br>Jun 30 13:28:55 vm6 kernel: drbd0: disk( Attaching -> Negotiating )<br>
Jun 30 13:28:55 vm6 kernel: drbd0: Writing meta data super block now.<br>Jun 30 13:28:55 vm6 kernel: drbd0: unexpected cstate (Connected) in receive_bitmap<br>Jun 30 13:28:55 vm6 kernel: drbd0: conn( Connected -> WFBitMapT )<br>
<br>Has anyone seen this before?<br><br>-John Muth<br>Parascale Software<br><br><br>