[Drbd-dev] DRBD connection fails with unexpected packet

Graham, Simon Simon.Graham at stratus.com
Sat May 19 02:58:27 CEST 2007


We are seeing an occasional problem where DRBD connection fails with
this error:

May 13 04:25:12 ben kernel: drbd2: expected HandShake packet, received:
ReportState (0x000e)

I believe this is a timing issue between the thread setting up the
connection and the thread for a different drbd device pausing the resync
for the device being connected.

Here's the full trace on the side that sends the ReportState (I've left
out a lot of stuff here, but it shows drbd0 starting to resync and
causing drbd1 and drbd2 to set aftr_isp:

May 13 04:22:31 jerry kernel: drbd2: receiver (re)started
May 13 04:22:31 jerry kernel: drbd2: conn( Unconnected -> WFConnection )

...
May 13 04:25:07 jerry kernel: drbd2: conn( WFConnection ->
WFReportParams ) 
...
May 13 04:25:08 jerry kernel: drbd0: Writing meta data super block now.
May 13 04:25:08 jerry kernel: drbd0:  uuid[History_start] now
9349CA8853B42403
May 13 04:25:08 jerry kernel: drbd0:  uuid[Bitmap] now BD4768B8E7A944E2
May 13 04:25:09 jerry kernel: drbd0: conn( WFBitMapS -> SyncSource )
pdsk( UpToDate -> Inconsistent ) 
May 13 04:25:09 jerry kernel: drbd1: aftr_isp( 0 -> 1 )
May 13 04:25:09 jerry kernel: drbd2: aftr_isp( 0 -> 1 ) 
May 13 04:25:12 jerry kernel: drbd2: sock_sendmsg returned -104
May 13 04:25:12 jerry kernel: drbd2: conn( WFReportParams -> BrokenPipe
) 

and on the other side:

May 13 04:25:09 ben kernel: drbd2: receiver (re)started
May 13 04:25:09 ben kernel: drbd2: conn( Unconnected -> WFConnection ) 
May 13 04:25:11 ben kernel: drbd2: conn( WFConnection -> WFReportParams
) 
May 13 04:25:12 ben kernel: drbd2: aftr_isp( 0 -> 1 ) 
May 13 04:25:12 ben kernel: drbd2: expected HandShake packet, received:
ReportState (0x000e)
May 13 04:25:12 ben kernel: drbd2: Discarding network configuration.
May 13 04:25:12 ben kernel: drbd2: conn( WFReportParams -> Disconnecting
) 
May 13 04:25:12 ben kernel: drbd2: tl_clear()
May 13 04:25:12 ben kernel: drbd2: Connection closed

Here's what I think the sequence is:

1. We start connecting drbd2 - we get to the point of having the data
connection
   up but have not done the handshake yet - state is now WFReportParams.
2. DRBD0 thread decides it needs to pause the sync of drbd2 - calls 
   _drbd_pause_after - this sets the aftr_isp flag amd THIS causes the
ReportState
   to be sent
3. Now we come back to the original thread and start the handshake which
fails 
   because of the ReportState message that was sent that is sitting in
the pipe.

I think the fix is to check that (ns.conn is >= Connected) in
after_state_ch before sending the current state. I haven't tested this
yet, but I've attached a proposed patch that does this for the two cases
I can see where this might be an issue...

Simon

-------------- next part --------------
A non-text attachment was scrubbed...
Name: drbd-2311.patch
Type: application/octet-stream
Size: 790 bytes
Desc: drbd-2311.patch
Url : http://lists.linbit.com/pipermail/drbd-dev/attachments/20070518/12fa113c/drbd-2311.obj


More information about the drbd-dev mailing list