[DRBD-user] Secondary SCSI Errors causing Primary Unresponsiveness

Lars Ellenberg Lars.Ellenberg at linbit.com
Mon Sep 20 02:03:59 CEST 2004



/ 2004-09-18 11:42:24 -0400
\ Tony Willoughby:
> On Wed, 2004-09-15 at 11:57, Todd Denniston wrote:
> > Tony Willoughby wrote:
> > > 
> > > Greetings,
> > > 
> > > We've had an incident that I am trying to understand.
> > > 
> > > Configuration:
> > >   Two IBM E-Server x330's running Heartbeat/DRBD (0.6.4).
> > >   Redhat 7.3
> > >   Protocol C
> > >   Crossover Ethernet
> > > 
> > > (I know that 0.6.4 is old, but we have a rather staggered release
> > > cycle and our customers tend to upgrade infrequently.)
> > > 
> > > At some point the secondary machine started reporting SCSI errors (the
> > > disk eventually failed).  It is not known how long the system was
> > > having these errors.
> > > 
> > > The primary machine started to become unresponsive.
> > > 
> > > Here is the odd thing:  Any command that accessed the filesystem above
> > > DRBD  (e.g. "ls /the/mirrored/partition") would hang.  Once the
> > > secondary was shutdown the commands that were hung suddenly
> > > completed.
> > > 
> > > I'm not necessarily looking for a fix (although if I were told this
> > > was fixed in a later release you'd make my day :^), I'm trying to
> > > understand why this would happen.
> > > 
> > > Anyone have any ideas?
> > Note: I am a user not a writer of drbd, and I have some Promise raid boxes
> > that put me in the above situation ALL too often.
> > 
> > 0.6.10 behaves the same way.
> > Proto C requires that before the primary returns "data written", both
> > hosts' storage subsystems have to return "data written".  IIRC, ls (and
> > many other commands) may at a minimum end up updating things like access
> > times on some file/directory entries; that's a write which requires a
> > "data written" on both systems, so you get to wait until Proto C is
> > satisfied.
> 
> A failing secondary bringing down a primary kind of defeats the whole
> purpose of redundancy.  :^)
> 
> Any developers care to comment on this?  Would protocol B be a better
> choice with respect to increasing the availability of the cluster? 
> Would the mount flag "sync" be required with protocol B?  
> 
> See this thread for my experience of the sync flag and the reason that I
> switched to protocol C in the first place:
> 
> http://sourceforge.net/mailarchive/message.php?msg_id=5668764
> 
> Would moving to the 0.7 code base make things better?
> 
> I'm very concerned about this issue.  My customer would have had a more
> available system with just one node.

no, you don't want to use anything but protocol C if you care about
transactions (and even a mere journalling file system does...)

for an HA system you also need monitoring.  you monitor the box,
you see it has problems, you take it down (out of the cluster, at least).
and if you had configured it to panic on lower-level io errors, it
should have taken itself down...

since 0.6.10 or .12, we have the ko-count.
yes we have it in 0.7, too.

what it means is: if we cannot get any data transferred to our peer,
but it still answers "drbd ping" packets, we normally would retry
(actually, tcp will do the retry for us), or continue to wait for ACK
packets.  but we also start the ko countdown.  once this counter hits
zero, we consider the peer dead even though it is still partially
responsive, and we do not try to connect to it again until explicitly
told to do so.
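The ko-count decision described above can be sketched as a small loop. This is a simplified illustration with hypothetical names, not the actual kernel logic; it only models the countdown itself:

```python
def run_ko_count(events, ko_count):
    """Simplified ko-count: each event is a pair (data_acked, ping_answered).

    If data moves, the countdown resets.  If the peer answers pings but
    never ACKs data, we count down; at zero we declare it dead.  If it
    stops answering pings entirely, ordinary disconnect handling (not
    modeled here) takes over.
    """
    remaining = ko_count
    for data_acked, ping_answered in events:
        if data_acked:
            remaining = ko_count      # real progress: restart the countdown
        elif ping_answered:
            remaining -= 1            # alive but not taking data: count down
            if remaining == 0:
                return "peer dead"    # half-dead peer: give up on it
        else:
            return "peer dead"        # no pings either: plain disconnect
    return "connected"
```

A peer that ACKs data now and then keeps resetting the counter, which is exactly why a merely slow (rather than dead) secondary is not caught by this mechanism.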

however, if your secondary just becomes very slow and does not fail
completely, this mechanism will not kick in, and the secondary will
indeed slow down the primary, too.  sorry about that.
btw, linbit takes sponsors and support contracts.
if you don't think you need our support,
think of it as you supporting us instead!

and yes, 0.7 improves here too, because it has the concept of
"NegAck"s and of "detaching" the lower-level device on io errors,
continuing in "diskless" mode.  that makes it possible for your
monitoring system to do a _graceful_ failover once it recognizes that
the primary went into diskless state because of underlying io errors.
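The detach-on-error behavior can be sketched like this. Again an illustrative sketch with hypothetical names, not real DRBD code; it shows only the state transition that a monitoring system would watch for:

```python
class Drbd07Device:
    """Sketch of 0.7's detach-on-io-error behavior (illustrative only).

    On a lower-level io error the device detaches its backing store and
    continues "diskless", serving I/O from the peer over the network,
    instead of hanging or panicking the box.
    """
    def __init__(self, local_disk, peer):
        self.local_disk = local_disk
        self.peer = peer
        self.state = "up-to-date"

    def read(self, block):
        if self.state == "diskless":
            return self.peer.read(block)  # local disk is gone; use the peer
        try:
            return self.local_disk.read(block)
        except IOError:
            self.detach()                 # lower-level io error: drop the disk
            return self.peer.read(block)  # satisfy this request from the peer

    def detach(self):
        # a monitoring system can watch for this state and trigger
        # a graceful failover to the healthy node
        self.state = "diskless"
```

The point is that the node with the bad disk stays responsive (all I/O is redirected to the peer), so failover can happen on the administrator's terms rather than as a hang.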

we are better than you think...
but we have to improve our documentation obviously.

	Lars Ellenberg

-- 
please use the "List-Reply" function of your email client.


