Lars Ellenberg wrote:
>
> / 2004-04-12 16:51:25 -0500
> \ Todd Denniston:
> > can drbd be made to detect that it has failed to write to the underlying
> > device in a 'long time'?
> > I am experiencing a problem where the external raid box I have {Promise
> > RM8000} stops responding on the scsi bus and the card {adaptec} is unable to
> > reset the Promise box.
> > I was wondering if in this situation where drbd has been unable to actually
> > get ANY data synced to the disk on the secondary node (and because the
> > secondary node can't sync any data to disk in proto C, the primary is stuck
> > too) for about 10 minutes, drbd could be made to consider this a lower level
> > failure and do the drbd-panic in 0.6.10 (or other options I believe are
> > available in 0.7.x)?
>
> did you look at DRBDs ko-count option?
>
> in your scenario, you should have plenty of 'ko count down' messages in
> the syslog. just set it to some value > 0, and when it hits 0,
> Primary goes into StandAlone, because it figures that its peer has
> severe IO problems, and is unlikely to recover soonish.
>
When I run `grep -i -2 "count" /var/log/messages` on both machines, I do not see
anything relating to drbd.
> once "operator" sorted out the problems, do a "drbd reconnect" on the
> StandAlone Primary...
>
That's what I expected.
BTW, I am running drbd version 0.6.10, if that makes any difference.
All of my drbd device net sections contain (and did at the time of the lockup
too):
sync-nice = -1
sync-min = 1M
sync-max = 20M # maximal average syncer bandwidth
tl-size = 5000 # transfer log size, ensures strict write ordering
timeout = 60 # unit: 0.1 seconds
connect-int = 10 # unit: seconds
ping-int = 10 # unit: seconds
ko-count = 10 # if some block send times out this many times,
sync-group = 0 # Note: this changes with each drbd device so they don't thrash the heads
Which I thought meant that in ~60 seconds[1] I would get a failover.
What I suspect is that the two drbds can still talk to one another, and that the
failing node's drbd is blocking (but not failing) on the write to the SCSI
layer, because the SCSI layer is stuck in a loop retrying the reset on the
Promise box "hard drive".
Please clue-by-four me if I am still missing the point here. :)
[1] The way I processed the information:
timeout * ko-count = 6.0 seconds * 10 = 60 seconds
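
The same estimate as a small Python sketch. It assumes, as the footnote does,
that ko-count drops by one for each expired timeout; whether drbd 0.6.10
actually counts that way is not confirmed here. The function name is
hypothetical and only illustrates the arithmetic.

```python
def failover_delay_seconds(timeout_tenths: int, ko_count: int) -> float:
    """Estimate the seconds until ko-count reaches zero.

    Assumption: one ko-count decrement per expired timeout.
    timeout_tenths uses the config file's unit of 0.1 seconds.
    """
    timeout_seconds = timeout_tenths / 10.0
    return timeout_seconds * ko_count


# Values from the net section above: timeout = 60, ko-count = 10
print(failover_delay_seconds(60, 10))  # prints 60.0
```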
--
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane)
Harnessing the Power of Technology for the Warfighter