[DRBD-user] can drbd be made to detect that it has failed to write to the underlying device in a 'long time'?

Thu Apr 15 23:08:05 CEST 2004

Since you did not really ask a question,
I'll just comment on your "event log" ...

> Today the card reported an error at 12:54:20, and I had some
> data trapped.  seems after the secondary latched, the primary
> thought it snuck in another 37 blocks...

> Also seems a little weird that the Primary pe: does not match
> the Secondary ua:, am I just reading something into the
> relationship there?

well, no, the relationship you see is correct. it is just that reading
a moving atomic_t counter consistently is difficult on ONE node already,
so doing this asynchronously on TWO nodes ... ok, you get the point :)

arbitrarily normalized that garbage:

   primary    ,-Connected/standAlone
          `* /    ns   nr   dw    dr pe     ua
> 12:52:55   C       3272 -500                
> 12:53:05   C       3376 -396                
> 12:53:08 * C 69744         0 68985 2        
> 12:53:15   C       4352  580                
> 12:53:18 * C 70324       580 69025          

probably here the secondary already got stuck,
nr and dw do not move anymore, ua stays "high".

ns on primary does still move by 148 blocks (72940 -> 73088),
which probably just went into the "wire",
i.e. filled up tcp buffers.

> 12:53:25   C       6948 3176              96
> 12:53:28 * C 72940      3196 69045 101      
> 12:53:35   C       6948 3176              96
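
the "stuck secondary" pattern described above (nr and dw frozen while ua
stays non-zero) can be checked mechanically. the following is only a
minimal sketch over the secondary rows quoted so far, not anything drbd
itself does. the names Sample, SECONDARY and first_stall are made up for
this illustration, and a blank ua column is assumed to mean 0.

```python
from typing import List, NamedTuple, Optional


class Sample(NamedTuple):
    time: str  # timestamp as it appears in the log
    nr: int    # network receive counter
    dw: int    # disk write counter
    ua: int    # unacknowledged requests (assumption: blank column == 0)


# Secondary-side rows copied from the normalized log above.
SECONDARY = [
    Sample("12:52:55", 3272, -500, 0),
    Sample("12:53:05", 3376, -396, 0),
    Sample("12:53:15", 4352, 580, 0),
    Sample("12:53:25", 6948, 3176, 96),
    Sample("12:53:35", 6948, 3176, 96),
]


def first_stall(samples: List[Sample]) -> Optional[str]:
    """Return the timestamp of the first sample whose nr and dw did not
    advance since the previous sample although ua is still non-zero."""
    for prev, cur in zip(samples, samples[1:]):
        frozen = cur.nr == prev.nr and cur.dw == prev.dw
        if frozen and cur.ua > 0:
            return cur.time
    return None


if __name__ == "__main__":
    print(first_stall(SECONDARY))  # 12:53:35
```

a single frozen interval is of course weak evidence; a real watchdog would
probably want several consecutive frozen samples before raising an alarm.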

now finally primary can no longer submit any data, and starts to
ko-count the peer, probably at 53:40 or so.
(what is the correct English term for this in boxing?)
with 6s ping-timeout and ko-count starting at 10, this makes ...

> 12:53:58 * C 73088      3348 69045 138      

     54:20  IO ERROR
	but you did not tell it to panic yet

> 12:54:38 * C 73088      3348 69045 138      

... 60 seconds; right.
somewhere around here ko-count reached zero, and Primary decides:
ok, peer no longer cooperates, going to do my job *alone*.
and immediately dw increases again. your load went high for
some seconds... then all pending IO suddenly works again, since it
no longer blocks waiting for the peer.
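
the arithmetic above (6s timeout, ko-count starting at 10) can be written
down as a tiny sketch. it assumes a simplified model in which exactly one
count is used up per expired timeout, and it takes the "53:40 or so"
estimate as the start. give_up_time is a hypothetical helper for this
illustration, not a drbd function.

```python
from datetime import datetime, timedelta


def give_up_time(start: str, timeout_s: int, ko_count: int) -> str:
    """Time at which ko-count hits zero, assuming one count is consumed
    per expired timeout, starting at `start` (HH:MM:SS)."""
    t0 = datetime.strptime(start, "%H:%M:%S")
    t1 = t0 + timedelta(seconds=timeout_s * ko_count)
    return t1.strftime("%H:%M:%S")


if __name__ == "__main__":
    # 10 timeouts of 6 seconds each, starting at the estimated 12:53:40
    print(give_up_time("12:53:40", 6, 10))  # 12:54:40
```

12:54:40 falls between the last sample that still shows C (12:54:38) and
the first one that shows A (12:54:48), which fits the log.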

> 12:54:48 * A 73088      4556 69045          
> 12:54:58 * A 73088      5444 72281          

secondary still has not noticed the socket shutdown,
maybe it does not even know about the io-error yet?

> 12:56:05   C       6948 3176              96
> 12:56:15   C       6948 3176              96

at least, all works as expected.
... "surprisingly" ... for a change, this one has "even" been
tested before we did the release ...
:P

	Lars Ellenberg