[DRBD-user] Secondary SCSI Errors causing Primary Unresponsiveness

Mon Sep 20 16:13:51 CEST 2004

On Sun, 2004-09-19 at 20:03, Lars Ellenberg wrote:
> / 2004-09-18 11:42:24 -0400
> \ Tony Willoughby:
> > On Wed, 2004-09-15 at 11:57, Todd Denniston wrote:
> > > Tony Willoughby wrote:
> > > > 
> > > > Greetings,
> > > > 
> > > > We've had an incident that I am trying to understand.
> > > > 
> > > > Configuration:
> > > >   Two IBM E-Server x330's running Heartbeat/DRBD (0.6.4).
> > > >   Redhat 7.3
> > > >   Protocol C
> > > >   Crossover Ethernet
> > > > 
> > > > (I know that 0.6.4 is old, but we have a rather staggered release
> > > > cycle and our customers tend to upgrade infrequently.)
> > > > 
> > > > At some point the secondary machine started reporting SCSI errors (the
> > > > disk eventually failed).  It is not known how long the system was
> > > > having these errors.
> > > > 
> > > > The primary machine started to become unresponsive.
> > > > 
> > > > Here is the odd thing:  Any command that accessed the filesystem above
> > > > DRBD  (e.g. "ls /the/mirrored/partition") would hang.  Once the
> > > > secondary was shut down the commands that were hung suddenly
> > > > completed.
> > > > 
> > > > I'm not necessarily looking for a fix (although if I were told this
> > > > was fixed in a later release you'd make my day :^), I'm trying to
> > > > understand why this would happen.
> > > > 
> > > > Anyone have any ideas?
> > > Note: I am a user not a writer of drbd, and I have some Promise raid boxes
> > > that put me in the above situation ALL too often.
> > > 
> > > 0.6.10 behaves the same way.
> > > Proto C requires that before the primary returns "data written", both hosts'
> > > subsystems have to return "data written".  IIRC ls (and many other commands)
> > > at a minimum may end up updating things like access time on some
> > > file/directory entries, that's a write that requires a "data written" on both
> > > systems, so you get to wait until Proto C is satisfied.
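
The waiting rule described above can be sketched as a small timing model. This is only an illustration, not DRBD code: the function name `protocol_c_completion_ms`, the millisecond figures, and the 1 ms round trip are assumptions made for the example. It shows why a stalled secondary disk stalls the primary, and why the hung commands completed once the secondary was shut down.

```python
import math
from typing import Optional


def protocol_c_completion_ms(local_disk_ms: float,
                             peer_disk_ms: Optional[float],
                             round_trip_ms: float = 1) -> float:
    """Milliseconds until the primary may report "data written".

    peer_disk_ms=None models a secondary that is shut down or
    disconnected, so only the local disk has to finish.
    (Hypothetical helper for illustration; not part of DRBD.)
    """
    if peer_disk_ms is None:
        return local_disk_ms
    # Protocol C: wait for the local write AND the peer's write acknowledgement.
    return max(local_disk_ms, round_trip_ms + peer_disk_ms)


healthy = protocol_c_completion_ms(5, 5)         # both disks answer promptly
stalled = protocol_c_completion_ms(5, math.inf)  # peer disk never completes
alone = protocol_c_completion_ms(5, None)        # secondary shut down

print(healthy, stalled, alone)
```

In this model a write that touches only an access time still inherits the peer's latency, which is consistent with "ls" hanging on the mirrored partition.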
> > 
> > A failing secondary bringing down a primary kind of defeats the whole
> > purpose of redundancy.  :^)
> > 
> > Any developers care to comment on this?  Would protocol B be a better
> > choice with respect to increasing the availability of the cluster? 
> > Would the mount flag "sync" be required with protocol B?  
> > 
> > See this thread for my experience of the sync flag and the reason that I
> > switched to protocol C in the first place:
> > 
> > http://sourceforge.net/mailarchive/message.php?msg_id=5668764
> > 
> > Would moving to the 0.7 code base make things better?
> > 
> > I'm very concerned about this issue.  My customer would have had a more
> > available system with just one node.
> 
> no you don't want to use anything but protocol C if you care about
> transactions (and even a mere journalling file system does...)
> 
> for an HA system you also need monitoring.  you monitor the box,
> you see it has problems, you take it down (out of the cluster at least).
> and if you had configured it for panic on lower level io error, it
> should have taken down itself...

Here is my configuration.  Is "do-panic" what you are referring to?  I
have that enabled.

resource drbd0 {
  protocol=C
  fsckcmd=fsck -p -y
  inittimeout=180
  disk {
    do-panic
    disk-size=2048256
  }

  net {
    sync-rate=5000
    tl-size=5000
    timeout=60
    connect-int=10
    ping-int=10
  }

  on basfbpm-1 {
    device=/dev/nb0
    disk=/dev/sda5
    address=192.0.2.2
    port=7788
  }

  on basfbpm-2 {
    device=/dev/nb0
    disk=/dev/sda5
    address=192.0.2.1
    port=7788
  }
}

> 
> since 0.6.10 or .12, we have the ko-count.
> yes we have it in 0.7, too.

Excellent.  I will dig into ko-count.  Thanks for the tip.

> 
> what it means is: if we cannot get any data transferred to our peer,
> but it still answers to "drbd ping packets", we normally would retry
> (actually, tcp will do the retry for us), or continue to wait for ACK
> packets. but we start the ko count down. once this counter hits zero, we
> consider the peer dead even though it is still partially responsive, and
> we do not try to connect there again until explicitly told to do so.
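
A minimal sketch of the countdown described above, under stated assumptions: the function `peer_state_after`, the one-boolean-per-timeout-interval input, and the rule that the counter resets whenever data gets through are all hypothetical simplifications for illustration, not the actual DRBD implementation.

```python
from typing import Iterable


def peer_state_after(ko_count: int, data_progress: Iterable[bool]) -> str:
    """Return "dead" or "connected" after a series of timeout intervals.

    Each element of data_progress says whether any data reached the peer
    during that interval (the peer is assumed to keep answering pings).
    Assumption: the counter resets to ko_count whenever data gets through.
    """
    remaining = ko_count
    for progressed in data_progress:
        if progressed:
            remaining = ko_count   # peer made progress, start over
            continue
        remaining -= 1             # another interval without progress
        if remaining <= 0:
            return "dead"          # give up until told to reconnect
    return "connected"


# Peer answers pings but never accepts data: declared dead.
print(peer_state_after(3, [False, False, False]))
# Peer is slow but occasionally makes progress: never declared dead.
print(peer_state_after(3, [False, False, True, False, False]))
```

Under the reset assumption, a peer that is merely slow keeps refilling the counter, so the countdown only helps when data transfer stops entirely.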

Any tips on how to tune the ko-count?  

Any tips on how to simulate a failing disk in the lab?

> 
> however, if your secondary just becomes very slooow and does not fail
> completely, this mechanism will not work and indeed slow down the
> primary, too. sorry about that.
> btw, linbit takes sponsors and support contracts.
> if you don't think you need our support,
> think of it as you supporting us instead!

We have!  :^)

My company had a service contract with Linbit for several years.

Thanks for your input, Lars.

> 
> and yes, 0.7 improves here too, because it has the concept of
> "NegAck"s and "detaching" the lower level device on io errors,
> continuing in "diskless" mode. which makes it possible for your
> monitoring system to do a _graceful_ failover once it recognizes that
> the primary went into diskless state because of underlying io errors.
> 
> we are better than you think...
> but we have to improve our documentation obviously.
> 
> 	Lars Ellenberg
-- 
Tony Willoughby
Bigband Networks
mailto:tony.willoughby at bigbandnet.com