[DRBD-user] Secondary SCSI Errors causing Primary Unresponsiveness

Wed Sep 15 17:57:37 CEST 2004

Tony Willoughby wrote:
> 
> Greetings,
> 
> We've had an incident that I am trying to understand.
> 
> Configuration:
>   Two IBM E-Server x330's running Heartbeat/DRBD (0.6.4).
>   Red Hat 7.3
>   Protocol C
>   Crossover Ethernet
> 
> (I know that 0.6.4 is old, but we have a rather staggered release
> cycle and our customers tend to upgrade infrequently.)
> 
> At some point the secondary machine started reporting SCSI errors (the
> disk eventually failed).  It is not known how long the system was
> having these errors.
> 
> The primary machine started to become unresponsive.
> 
> Here is the odd thing:  Any command that accessed the filesystem above
> DRBD (e.g. "ls /the/mirrored/partition") would hang.  Once the
> secondary was shut down, the commands that were hung suddenly
> completed.
> 
> I'm not necessarily looking for a fix (although if I were told this
> was fixed in a later release you'd make my day :^); I'm trying to
> understand why this would happen.
> 
> Anyone have any ideas?

Note: I am a user, not a writer, of DRBD, and I have some Promise RAID boxes
that put me in the above situation ALL too often.

0.6.10 behaves the same way.
Protocol C requires that both hosts' disk subsystems return "data written"
before the primary returns "data written" to the caller.  IIRC ls (and many
other commands) may, at a minimum, end up updating things like the access
time on some file/directory entries.  That is a write, and it needs a "data
written" from both systems, so you get to wait until Protocol C is satisfied.
If the secondary's disk is stuck on SCSI errors, its "data written" comes
late or not at all, and the command on the primary sits there waiting.
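
As a rough illustration of that wait, here is a toy model in Python.  It is
not DRBD code; the names Peer, protocol_c_write and demo are made up for
this sketch, and it assumes the local disk write has already succeeded.  The
writer returns only once the peer has either confirmed the write or gone
away, which matches what was observed when the secondary was shut down.

```python
import threading


class Peer:
    """Hypothetical stand-in for the secondary node (not a DRBD structure)."""

    def __init__(self):
        self.acked = threading.Event()  # peer's disk reported "data written"
        self.gone = threading.Event()   # peer was shut down / link dropped


def protocol_c_write(peer, poll=0.01):
    """Block until the peer confirms the write or disappears.

    Assumption for this sketch: the local write has already completed,
    so only the peer's answer is outstanding.
    """
    while not (peer.acked.is_set() or peer.gone.is_set()):
        peer.gone.wait(poll)
    if peer.acked.is_set():
        return "mirrored"
    return "peer gone, completed locally"


def demo():
    """Model a secondary with a failing disk that never acknowledges."""
    peer = Peer()
    result = []
    writer = threading.Thread(
        target=lambda: result.append(protocol_c_write(peer)),
        daemon=True,
    )
    writer.start()
    writer.join(0.2)          # give the write a moment to finish
    hung = writer.is_alive()  # still waiting on the peer, like the hung "ls"
    peer.gone.set()           # shut the secondary down
    writer.join(2.0)          # the blocked write can now complete
    return hung, result
```

Calling demo() shows the write stuck while the peer is silent and released
as soon as the peer is marked gone.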
-- 
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane) 
Harnessing the Power of Technology for the Warfighter