[DRBD-user] can drbd be made to detect that it has failed to write to the underlying device in a 'long time'?

Tue Apr 13 14:15:56 CEST 2004

Lars Ellenberg wrote:
> 
> / 2004-04-12 16:51:25 -0500
> \ Todd Denniston:
> > can drbd be made to detect that it has failed to write to the underlying
> > device in a 'long time'?
> > I am experiencing a problem where the external raid box I have {Promise
> > RM8000} stops responding on the scsi bus and the card {adaptec} is unable to
> > reset the Promise box.
> > I was wondering if in this situation where drbd has been unable to actually
> > get ANY data synced to the disk on the secondary node (and because the
> > secondary node can't sync any data to disk in proto C, the primary is stuck
> > too) for about 10 minutes, drbd could be made to consider this a lower level
> > failure and do the drbd-panic in 0.6.10 (or other options I believe are
> > available in 0.7.x)?
> 
> did you look at DRBDs ko-count option?
> 
> in your scenario, you should have plenty of 'ko count down' messages in
> the syslog. just set it to some value > 0, and when it hits 0,
> Primary goes into StandAlone, because it figures that its peer has
> severe IO problems, and is unlikely to recover soonish.
> 

When I run `grep -i -2 "count" /var/log/messages` on both machines, I do not
see anything relating to drbd.

> once "operator" sorted out the problems, do a "drbd reconnect" on the
> StandAlone Primary...
> 
That's what I expected.
BTW I am running drbd version 0.6.10, if that makes any difference.

all of my drbd device net sections contain (and did at the time of the lockup
too):
    sync-nice  = -1  
    sync-min    = 1M
    sync-max    = 20M   # maximal average syncer bandwidth
    tl-size     = 5000  # transfer log size, ensures strict write ordering
    timeout     = 60    # unit: 0.1 seconds
    connect-int = 10    # unit: seconds
    ping-int    = 10    # unit: seconds
    ko-count    = 10    # if some block send times out this many times,
    sync-group  = 0     # note: this changes with each drbd device so they
                        # don't thrash the heads

Which I thought meant that in ~60 seconds[1] I would get a failover.

What I suspect is that the two drbd instances can still talk to one another,
and the failing node's drbd is blocking (but not failing) on the write to the
SCSI layer, because the SCSI layer is stuck in a loop retrying the reset on
the Promise box ''hard drive''.

please clue-by-four me if I am still missing the point here. :)

[1] the way I processed the information:
 timeout * ko-count = (60 * 0.1 seconds) * 10 = 6.0 seconds * 10 = 60 seconds
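
The same estimate written out as a small Python sketch, in case the unit
conversion is where I went wrong. It assumes (and this is only my assumption,
not something I have confirmed for 0.6.10) that each ko-count decrement takes
exactly one full timeout period. The function name is made up for
illustration and is not part of drbd.

```python
# Rough estimate of how long it takes for ko-count to reach zero.
# ASSUMPTION: every ko-count decrement costs one full "timeout" period.

def estimated_failover_seconds(timeout_tenths: int, ko_count: int) -> float:
    """Return the estimated wait in seconds.

    timeout_tenths: the "timeout" setting, in units of 0.1 seconds
    ko_count:       the "ko-count" setting
    """
    timeout_seconds = timeout_tenths / 10  # convert 0.1 s units to seconds
    return timeout_seconds * ko_count


# Values from my config: timeout = 60, ko-count = 10
print(estimated_failover_seconds(60, 10))  # prints 60.0
```

If the countdown only starts once a block send actually times out on the
network side, this number would not apply to a peer that still answers but is
stuck on local disk I/O.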

-- 
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane) 
Harnessing the Power of Technology for the Warfighter