[DRBD-user] can drbd be made to detect that it has failed to write to the underlying device in a 'long time'?

Tue Apr 13 23:48:09 CEST 2004

Todd Denniston wrote:
> 
> Lars Ellenberg wrote:
> >
<SNIP>
> 
> Unfortunately the lower level has not yet (at that time) declared an IO
> failure, might be a buglet there (adaptec SCSI layer). :{
>
<SNIP>
> > So a "very slow" but "not slow enough" write throughput on the Secondary
> > will throttle the Primary to the same slowness.
> >
> > On the Primary, if it still is responsive, try to watch the "ns:".
> > If it still increases, this is what happens.
> >
> good point, I'll look at it when it latches up today or tomorrow. (seems to
> happen ~14?? local time for 2 working days in a row).
> 
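
For the "ns:" check suggested above, here is a rough Python sketch of how it
could be watched automatically instead of by eye. It is only an illustration:
the function names are made up for this example, and it assumes each device
line in /proc/drbd starts with "<minor>:" and carries an "ns:<count>" field,
which should be checked against the actual output on your version.

```python
import re
import time

# Assumption: a device line starts with "<minor>:" and has an "ns:<count>" field.
DEVICE_LINE = re.compile(r"^[ \t]*(\d+):.*?\bns:(\d+)", re.MULTILINE)


def parse_ns(proc_text):
    """Return {minor: ns counter} for every device line found in the text."""
    return {int(m.group(1)): int(m.group(2))
            for m in DEVICE_LINE.finditer(proc_text)}


def ns_delta(before, after):
    """Return {minor: growth of ns} for devices present in both samples."""
    return {minor: after[minor] - before[minor]
            for minor in before if minor in after}


def watch(path="/proc/drbd", interval=5.0, samples=3):
    """Sample the file a few times and print how much ns grew per device."""
    with open(path) as f:
        previous = parse_ns(f.read())
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as f:
            current = parse_ns(f.read())
        for minor, delta in sorted(ns_delta(previous, current).items()):
            state = "still sending" if delta > 0 else "no data sent"
            print(f"drbd{minor}: ns +{delta} ({state})")
        previous = current
```

Calling watch() on the Primary while it feels slow would show whether ns is
still creeping up (throttled by the Secondary) or has stopped entirely.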

<SNIP>
Darn, I was not running anything that used the disk when the problem hit
today, so I missed the slow spot.

On a positive note, about 22 seconds into the secondary's card dump (I think
this is when the lockup starts), I got the following on the active primary:
Apr 13 15:36:23 foo kernel: drbd1: sock_sendmsg time expired on sock
Apr 13 15:36:23 foo kernel: drbd1: no data sent since 10 ping intervals, peer seems knocked out: going to StandAlone.
Apr 13 15:36:23 foo kernel: drbd1: Connection lost.

So were the 'ko count down' messages something that you might have added or
fixed in 0.6.1[12]?

-- 
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane) 
Harnessing the Power of Technology for the Warfighter