Lars Ellenberg wrote:
>
> / 2004-04-12 16:51:25 -0500
> \ Todd Denniston:
> > can drbd be made to detect that it has failed to write to the underlying
> > device in a 'long time'?
> > I am experiencing a problem where the external raid box I have {Promise
> > RM8000} stops responding on the scsi bus and the card {adaptec} is unable to
> > reset the Promise box.
> > I was wondering if in this situation where drbd has been unable to actually
> > get ANY data synced to the disk on the secondary node (and because the
> > secondary node can't sync any data to disk in proto C, the primary is stuck
> > too) for about 10 minutes, drbd could be made to consider this a lower level
> > failure and do the drbd-panic in 0.6.10 (or other options I believe are
> > available in 0.7.x)?
>
> did you look at DRBDs ko-count option?
>
> in your scenario, you should have plenty of 'ko count down' messages in
> the syslog. just set it to some value > 0, and when it hits 0,
> Primary goes into StandAlone, because it figures that its peer has
> severe IO problems, and is unlikely to recover soonish.
>

when I `grep -i -2 "count" /var/log/messages` on both machines, I do not see
anything relating to drbd.

> once "operator" sorted out the problems, do a "drbd reconnect" on the
> StandAlone Primary...
>

That's what I expected.

BTW I am running drbd version 0.6.10 if that might make any diff.
all of my drbd device net sections contain (and did at the time of the lockup
too):
    sync-nice = -1
    sync-min = 1M
    sync-max = 20M    # maximal average syncer bandwidth
    tl-size = 5000    # transfer log size, ensures strict write ordering
    timeout = 60      # unit: 0.1 seconds
    connect-int = 10  # unit: seconds
    ping-int = 10     # unit: seconds
    ko-count = 10     # if some block send times out this many times,
    sync-group = 0    #Note this changes with each drbd device so they don't thrash the heads

Which I thought meant that in ~60 seconds[1] I would get a failover.
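The ~60 second estimate can be written out as a small Python sketch. It assumes that every timed-out block send costs exactly one timeout interval and decrements ko-count once, which is the reading given in this message and not confirmed DRBD behaviour. The function name is made up for illustration.

```python
# Values copied from the net section quoted in this message.
TIMEOUT_TENTHS = 60  # timeout = 60, unit: 0.1 seconds
KO_COUNT = 10        # ko-count = 10


def expected_ko_window_seconds(timeout_tenths: int, ko_count: int) -> float:
    """Hypothetical helper: seconds until ko-count would reach 0.

    Assumption (not verified against DRBD): one timeout interval elapses
    per ko-count decrement, so the window is timeout * ko-count.
    """
    timeout_seconds = timeout_tenths / 10.0  # config unit is 0.1 seconds
    return timeout_seconds * ko_count


print(expected_ko_window_seconds(TIMEOUT_TENTHS, KO_COUNT))  # prints 60.0
```

The countdown is described above as driven by block sends timing out, so if the peer keeps answering on the network while its local write blocks, this window may never start.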
What I was suspecting is the 2 drbd's can still talk to one another, and the
failing node's drbd is blocking (but not failing) on the write to the SCSI
layer, because the scsi layer is in a loop retrying the reset on the Promise
box ''hard drive''.

please clue-by-four me if I am still missing the point here. :)

[1]the way I processed the information:
timeout * ko-count = 6.0 seconds * 10 = 60 seconds

-- 
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane)
Harnessing the Power of Technology for the Warfighter