[DRBD-user] Secondary SCSI Errors causing Primary Unresponsiveness

Sat Sep 18 17:42:24 CEST 2004

On Wed, 2004-09-15 at 11:57, Todd Denniston wrote:
> Tony Willoughby wrote:
> > 
> > Greetings,
> > 
> > We've had an incident that I am trying to understand.
> > 
> > Configuration:
> >   Two IBM E-Server x330's running Heartbeat/DRBD (0.6.4).
> >   Redhat 7.3
> >   Protocol C
> >   Crossover Ethernet
> > 
> > (I know that 0.6.4 is old, but we have a rather staggered release
> > cycle and our customers tend to upgrade infrequently.)
> > 
> > At some point the secondary machine started reporting SCSI errors (the
> > disk eventually failed).  It is not known how long the system was
> > having these errors.
> > 
> > The primary machine started to become unresponsive.
> > 
> > Here is the odd thing:  Any command that accessed the filesystem above
> > DRBD (e.g. "ls /the/mirrored/partition") would hang.  Once the
> > secondary was shut down, the commands that were hung suddenly
> > completed.
> > 
> > I'm not necessarily looking for a fix (although if I were told this
> > was fixed in a later release you'd make my day :^); I'm trying to
> > understand why this would happen.
> > 
> > Anyone have any ideas?
> Note: I am a user not a writer of drbd, and I have some Promise raid boxes
> that put me in the above situation ALL too often.
> 
> 0.6.10 behaves the same way.
> Proto C requires that both hosts' subsystems return "data written"
> before the primary returns "data written" to the caller.  IIRC ls (and
> many other commands) may, at a minimum, end up updating things like the
> access time on some file/directory entries.  That is a write, and it needs
> a "data written" from both systems, so you wait until Proto C is satisfied.
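
For illustration only, here is a toy Python model of when the primary
reports "data written" under each protocol.  It is not DRBD code: the
Latency fields and completion_time() are hypothetical names, and the
acknowledgement points for protocols A, B and C are the commonly
documented ones, assumed here to apply to 0.6.x as well.

```python
from dataclasses import dataclass


@dataclass
class Latency:
    """Hypothetical per-write timings, in seconds."""
    local_disk: float  # primary's disk finishes the write
    network: float     # one-way trip between the two nodes
    peer_disk: float   # secondary's disk finishes the write


def completion_time(protocol: str, lat: Latency) -> float:
    """Return when the primary would report 'data written' to the caller."""
    if protocol == "A":
        # Assumed rule: local write done, data handed to the send buffer.
        return lat.local_disk
    if protocol == "B":
        # Assumed rule: local write done and the peer has acknowledged
        # receiving the block (not yet written to its disk).
        return max(lat.local_disk, 2 * lat.network)
    if protocol == "C":
        # Assumed rule: local write done and the peer has acknowledged
        # that the block is on its disk.
        return max(lat.local_disk, 2 * lat.network + lat.peer_disk)
    raise ValueError(f"unknown protocol: {protocol!r}")


# A healthy pair versus a secondary whose disk takes 30 s per write.
healthy = Latency(local_disk=0.005, network=0.0005, peer_disk=0.005)
failing = Latency(local_disk=0.005, network=0.0005, peer_disk=30.0)

for proto in ("A", "B", "C"):
    print(proto, completion_time(proto, healthy), completion_time(proto, failing))
```

In this model only protocol C inherits the secondary's disk latency, which
matches the hang described above.  The model ignores back-pressure: a peer
that stops draining its receive buffer could still stall the other
protocols in practice, so this is a sketch of the acknowledgement rules and
not a prediction of how a real cluster behaves.
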

A failing secondary bringing down a primary kind of defeats the whole
purpose of redundancy.  :^)

Any developers care to comment on this?  Would protocol B be a better
choice with respect to increasing the availability of the cluster? 
Would the mount flag "sync" be required with protocol B?  

See this thread for my experience of the sync flag and the reason that I
switched to protocol C in the first place:

http://sourceforge.net/mailarchive/message.php?msg_id=5668764

Would moving to the 0.7 code base make things better?

I'm very concerned about this issue.  My customer would have had a more
available system with just one node.

-- 
Tony Willoughby
Bigband Networks
mailto:tony.willoughby at bigbandnet.com