[Drbd-dev] DRBD8: failed to complete sync due to receiving bitmap in unexpected cstate

Graham, Simon Simon.Graham at stratus.com
Tue Dec 19 20:36:40 CET 2006


> 
> My theory was that there is a timing window relative to moving from the
> PausedSync{T|S} state such that one side can get there first and restart
> syncing before the other side.
> 

Not sure if you've had any thoughts on this, but I have a theory, sparked
by the problem I found today where we can still be in the PausedSyncX
state when sync finishes...

If you recall, the problem was that the sync source side would get into
WFBitMapS and never exit, and the target side would output:

unexpected cstate (PausedSyncT) in receive_bitmap

Here's my theory in a time sequence...

          Source                 Target
             |                      |
         <PausedSyncS>          <PausedSyncT>
             |                      |
       resync completes             |
             |                      |
          <Connected>               |
             |                      |
       high priority group          |
           finishes sync            |
         <aftr_isp->0>              |
             |                      |
     drbd_send_state                |
             |       ReportState    |
             +--------------------->|
             |     UUIDs            | *** Note: UUIDs haven't been updated
             +<---------------------|     here yet, so still look out of date
             |     ReportState      |
             +<---------------------|
             |                      |
        drbd_sync_handshake         |
          hg>1                      |
      <WFBitMapS>                   |
             |    Bitmap            |
             +--------------------->| *** get unexpected cstate message,
             |                      |     plus never return bitmap
             |                      |
             |               Now we notice resync complete
             |                 <Connected>
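
To make the sequence concrete, here is a toy replay in Python. This is
not DRBD code: the Node class, the replay() function and the
target_uuids_current flag are made up for illustration, and only the
cstate names and the log message come from the trace above. It steps
through the events in the order shown and ends with the source still in
WFBitMapS.

```python
# Toy replay of the time sequence above. Not DRBD code: Node, replay()
# and target_uuids_current are hypothetical names for illustration.

class Node:
    def __init__(self, name, cstate):
        self.name = name
        self.cstate = cstate
        self.log = []

    def receive_bitmap(self):
        # Models the target-side check: a bitmap is only expected while
        # waiting for one; otherwise complain and send nothing back.
        if self.cstate != "WFBitMapT":
            self.log.append(
                "unexpected cstate (%s) in receive_bitmap" % self.cstate
            )
            return False
        return True


def replay():
    source = Node("source", "PausedSyncS")
    target = Node("target", "PausedSyncT")

    # The source notices that the resync completed; the target has not yet.
    source.cstate = "Connected"
    target_uuids_current = False  # target's UUIDs not updated yet

    # aftr_isp drops to 0 on the source, state and UUIDs are exchanged.
    # The stale UUIDs make the handshake decide a resync is needed.
    if not target_uuids_current:
        source.cstate = "WFBitMapS"

    # The source sends its bitmap while the target is still PausedSyncT.
    if target.receive_bitmap():
        source.cstate = "SyncSource"

    # Only now does the target notice that the resync completed.
    target.cstate = "Connected"
    return source, target
```

After replay(), the source's cstate is WFBitMapS, the target's is
Connected, and the target's log holds the one unexpected-cstate line.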

Obviously this requires a lot of things to happen all together and
somewhat out of sequence, but I think it's feasible. As I see it, there
are actually several problems here:

1. When aftr_isp went to 0 we still initiated the resumption of resync
   even though we were already in Connected state.
2. We ended up deciding to restart the resync because we got stale UUID
   info from the target.
3. The target side did not reply to the Bitmap, leaving the source stuck
   in WFBitMapS.
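
To spell out what problem 1 amounts to, here is a small Python sketch
of the condition I would have expected. It is only an illustration under
my own assumptions: should_resume_resync() and PAUSED_STATES are
hypothetical names, not anything in the DRBD source.

```python
# Hypothetical check for problem 1; not taken from the DRBD source.
PAUSED_STATES = {"PausedSyncS", "PausedSyncT"}


def should_resume_resync(cstate, aftr_isp):
    # Resume only when the sync-after dependency has cleared AND we are
    # actually sitting in a paused sync state. A node that has already
    # reached Connected has nothing to resume.
    return aftr_isp == 0 and cstate in PAUSED_STATES
```

With this condition the source in the sequence above (Connected, aftr_isp
going to 0) would not try to resume, while a node that is really in
PausedSyncS would.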

I somehow don't think that putting a test for PausedSyncX in
receive_bitmap() is the correct solution here, but I'm not sure what
would be better... Any ideas?

Simon

