Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Sun, 2004-09-19 at 20:03, Lars Ellenberg wrote:
> / 2004-09-18 11:42:24 -0400
> \ Tony Willoughby:
> > On Wed, 2004-09-15 at 11:57, Todd Denniston wrote:
> > > Tony Willoughby wrote:
> > > >
> > > > Greetings,
> > > >
> > > > We've had an incident that I am trying to understand.
> > > >
> > > > Configuration:
> > > > Two IBM E-Server x330's running Heartbeat/DRBD (0.6.4).
> > > > Redhat 7.3
> > > > Protocol C
> > > > Crossover Ethernet
> > > >
> > > > (I know that 0.6.4 is old, but we have a rather staggered release
> > > > cycle and our customers tend to upgrade infrequently.)
> > > >
> > > > At some point the secondary machine started reporting SCSI errors
> > > > (the disk eventually failed). It is not known how long the system
> > > > had been having these errors.
> > > >
> > > > The primary machine started to become unresponsive.
> > > >
> > > > Here is the odd thing: any command that accessed the filesystem
> > > > above DRBD (e.g. "ls /the/mirrored/partition") would hang. Once the
> > > > secondary was shut down, the commands that were hung suddenly
> > > > completed.
> > > >
> > > > I'm not necessarily looking for a fix (although if I were told this
> > > > was fixed in a later release you'd make my day :^), I'm trying to
> > > > understand why this would happen.
> > > >
> > > > Anyone have any ideas?
> > >
> > > Note: I am a user, not a writer, of drbd, and I have some Promise raid
> > > boxes that put me in the above situation ALL too often.
> > >
> > > 0.6.10 behaves the same way.
> > > Proto C requires that before the primary returns "data written", both
> > > hosts' subsystems have to return "data written". IIRC, ls (and many
> > > other commands) at a minimum may end up updating things like access
> > > time on some file/directory entries; that's a write that requires a
> > > "data written" on both systems, so you get to wait until Proto C is
> > > satisfied.
> >
> > A failing secondary bringing down a primary kind of defeats the whole
> > purpose of redundancy.
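Todd's explanation of the Protocol C hang can be sketched with a toy model. This is emphatically not DRBD's actual code (DRBD is a kernel driver written in C); it is just a minimal illustration of why, when every write must be acknowledged by both nodes, a secondary whose disk never completes I/O stalls every write on the primary, including the atime update behind a simple "ls":

```python
import queue

class ProtocolCPrimary:
    """Toy model of a Protocol C primary: a write() only returns
    once BOTH the local disk and the peer have acknowledged it."""

    def __init__(self, peer_ack_timeout=None):
        # acks that arrive over the (simulated) replication link
        self.peer_acks = queue.Queue()
        self.peer_ack_timeout = peer_ack_timeout

    def peer_ack(self):
        """Called when the healthy secondary confirms the write hit its disk."""
        self.peer_acks.put("ack")

    def write(self, data):
        local_done = True  # assume the local disk completes promptly
        # Block until the peer acknowledges -- with a failing secondary
        # this is exactly where the primary's writes (and hence commands
        # like "ls" that update access times) hang.
        try:
            self.peer_acks.get(timeout=self.peer_ack_timeout)
        except queue.Empty:
            raise TimeoutError("peer never acknowledged the write")
        return local_done
```

With a responsive peer the write returns immediately; with a silent peer it blocks (or here, times out), which mirrors the hang described above. The timeout parameter is an artifact of the sketch, not something DRBD 0.6 exposed.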
:^)
> >
> > Any developers care to comment on this? Would protocol B be a better
> > choice with respect to increasing the availability of the cluster?
> > Would the mount flag "sync" be required with protocol B?
> >
> > See this thread for my experience with the sync flag and the reason
> > that I switched to protocol C in the first place:
> >
> > http://sourceforge.net/mailarchive/message.php?msg_id=5668764
> >
> > Would moving to the 0.7 code base make things better?
> >
> > I'm very concerned about this issue. My customer would have had a more
> > available system with just one node.
>
> no, you don't want to use anything but protocol C if you care about
> transactions (and even a mere journalling file system does...)
>
> for an HA system you also need monitoring. you monitor the box,
> you see it has problems, you take it down (out of the cluster at least).
> and if you had configured it for panic on lower level io error, it
> should have taken itself down...

Here is my configuration; is "do-panic" what you are referring to?
I have that enabled.

resource drbd0 {
  protocol=C
  fsckcmd=fsck -p -y
  inittimeout=180
  disk {
    do-panic
    disk-size=2048256
  }
  net {
    sync-rate=5000
    tl-size=5000
    timeout=60
    connect-int=10
    ping-int=10
  }
  on basfbpm-1 {
    device=/dev/nb0
    disk=/dev/sda5
    address=192.0.2.2
    port=7788
  }
  on basfbpm-2 {
    device=/dev/nb0
    disk=/dev/sda5
    address=192.0.2.1
    port=7788
  }
}

> since 0.6.10 or .12, we have the ko-count.
> yes, we have it in 0.7, too.

Excellent. I will dig into ko-count. Thanks for the tip.

> what it means is: if we cannot get any data transferred to our peer,
> but it still answers "drbd ping" packets, we normally would retry
> (actually, tcp will do the retry for us), or continue to wait for ACK
> packets. but we start the ko count down. once this counter hits zero, we
> consider the peer dead even though it is still partially responsive, and
> we do not try to connect there again until explicitly told to do so.
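As a rough sketch, the ko-count Lars describes would presumably be set alongside the other net parameters in the resource configuration above. The placement and the value 4 here are illustrative guesses, not verified 0.6 syntax; check the drbdsetup/drbd.conf man pages for your exact version before using it:

```
resource drbd0 {
  ...
  net {
    ...
    # illustrative only: consider the peer dead after roughly
    # 4 timeout intervals with no data transferred (0 = disabled)
    ko-count=4
  }
}
```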
Any tips on how to tune the ko-count? Any tips on how to simulate a
failing disk in the lab?

> however, if your secondary just becomes very slooow and does not fail
> completely, this mechanism will not work, and it will indeed slow down
> the primary, too. sorry about that.
>
> btw, linbit takes sponsors and support contracts.
> if you don't think you need our support,
> think of it as you supporting us instead!

We have! :^) My company had a service contract with Linbit for several
years.

Thanks for your input, Lars.

> and yes, 0.7 improves here too, because it has the concept of
> "NegAck"s and "detaching" the lower level device on io errors,
> continuing in "diskless" mode. which makes it possible for your
> monitoring system to do a _graceful_ failover once it recognizes that
> the primary went into diskless state because of underlying io errors.
>
> we are better than you think...
> but we have to improve our documentation obviously.
>
> Lars Ellenberg

-- 
Tony Willoughby
Bigband Networks
mailto:tony.willoughby at bigbandnet.com