[DRBD-user] Unexplained reboots in DRBD82 + OCFS2 setup

Lars Ellenberg lars.ellenberg at linbit.com
Wed Jun 24 16:28:19 CEST 2009



On Wed, Jun 24, 2009 at 04:10:43PM +0200, Kris Buytaert wrote:
> 
> 
> We're trying to set up a dual-primary DRBD environment with a shared
> disk running either OCFS2 or GFS. The environment is CentOS 5.3 with
> DRBD82 (we also tried DRBD83 from testing).
> 
> Setting up a single-primary disk and running bonnie++ on it works.
> Setting up a dual-primary disk, mounting it on only one node (ext3) and
> running bonnie++ also works.
> 
> When I set up OCFS2 on /dev/drbd0 and mount it on both nodes, basic
> functionality seems to be in place, but usually less than 5-10 minutes
> after I start bonnie++ as a test on one of the nodes, both nodes power
> cycle with no errors in the log files, just a crash.
> 
> Watching the console at the time of the crash, it looks like disk I/O
> blocks (you can type, but nothing happens), then the node reboots: no
> panic, no oops, nothing (the sysctl panic values are set to timeouts etc.).
> Setting up a dual-primary disk with OCFS2, mounting it on only one node
> and starting bonnie++ causes only that node to crash.
> 
> At the DRBD level I get the following errors when that node disappears:
> 
> drbd0: PingAck did not arrive in time.
> drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk(UpToDate -> DUnknown )
> drbd0: asender terminated
> drbd0: Terminating asender thread
> 
> That, however, is expected because of the reboot.
> 
> At first I assumed OCFS2 was the root of this problem, so I moved on
> and set up an iSCSI target on a third node, and used that device with
> the same OCFS2 setup. There, no crashes occurred and bonnie++ completed
> its test run flawlessly.
> 
> So my attention went back to the combination of DRBD and OCFS2.
> 
> I tried both DRBD 8.2 (drbd82-8.2.6-1.el5.centos, kmod-drbd82-8.2.6-2)
> and the 8.3 variant from CentOS Testing.
> 
> At first I was using ocfs2 version 1.4.1-1.el5.i386.rpm, but upgrading
> to 1.4.2-1.el5.i386.rpm didn't change the behaviour.
> 
> 
> Does anyone have an idea about this?

Maybe the OCFS2 disk heartbeat takes too long, and the node self-fences?
If so, increase the OCFS2 heartbeat timeout.
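On CentOS/RHEL this timeout is, as far as I know, controlled by O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb (set via "service o2cb configure"), and it should be identical on all nodes. A minimal Python sketch for converting between the threshold and seconds, assuming the relation given in the OCFS2 FAQ: timeout = (threshold - 1) * 2 seconds. The helper names are hypothetical. Pick a timeout that comfortably covers the worst write latency you see under load such as bonnie++.

```python
# Sketch: convert between O2CB_HEARTBEAT_THRESHOLD and the disk-heartbeat
# timeout. ASSUMPTION: timeout_seconds = (threshold - 1) * 2, i.e. one
# heartbeat write every 2 seconds, as described in the OCFS2 FAQ.

HEARTBEAT_INTERVAL_S = 2  # assumed heartbeat period in seconds


def threshold_for_timeout(timeout_s: int) -> int:
    """Smallest threshold whose timeout is at least timeout_s seconds."""
    if timeout_s <= 0:
        raise ValueError("timeout must be positive")
    # Ceiling division so the resulting timeout is never shorter than asked.
    iterations = -(-timeout_s // HEARTBEAT_INTERVAL_S)
    return iterations + 1


def timeout_for_threshold(threshold: int) -> int:
    """Disk-heartbeat timeout in seconds for a given threshold."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


if __name__ == "__main__":
    for wanted in (60, 120, 181):
        t = threshold_for_timeout(wanted)
        print(f"O2CB_HEARTBEAT_THRESHOLD={t}  # ~{timeout_for_threshold(t)} s")
```

For example, a 120-second timeout corresponds to a threshold of 61 under that formula.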

Also, use the "deadline" I/O scheduler,
or the latencies will kill you!
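On a 2.6 kernel the scheduler can be switched per device at runtime through sysfs: /sys/block/<dev>/queue/scheduler lists the available schedulers with the active one in brackets. A minimal Python sketch of that, assuming the disk backing /dev/drbd0 is called "sdb" (a hypothetical name; presumably you want this on the lower-level device on both nodes). The function names are illustrative, and writing requires root.

```python
# Sketch: inspect and switch the Linux I/O scheduler through sysfs.
# ASSUMPTION: "sdb" (used in the usage note) is a hypothetical name for the
# lower-level disk under DRBD; substitute your own device.
import re
from pathlib import Path

SYSFS_BLOCK = Path("/sys/block")


def active_scheduler(line: str) -> str:
    """Return the bracketed entry, e.g. 'noop deadline [cfq]' -> 'cfq'."""
    match = re.search(r"\[([^\]]+)\]", line)
    if match is None:
        raise ValueError(f"no active scheduler in {line!r}")
    return match.group(1)


def available_schedulers(line: str) -> list[str]:
    """All schedulers offered for this queue, brackets stripped."""
    return [name.strip("[]") for name in line.split()]


def current_scheduler(device: str, sysfs_block: Path = SYSFS_BLOCK) -> str:
    """Read the active scheduler of a block device."""
    path = sysfs_block / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())


def set_scheduler(device: str, name: str = "deadline",
                  sysfs_block: Path = SYSFS_BLOCK) -> None:
    """Write the scheduler name into the device's sysfs file."""
    path = sysfs_block / device / "queue" / "scheduler"
    offered = available_schedulers(path.read_text())
    if name not in offered:
        raise ValueError(f"{name!r} not offered for {device}: {offered}")
    path.write_text(name)
```

As root, set_scheduler("sdb") switches the running system; booting with the kernel parameter elevator=deadline makes deadline the default for all devices.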

> How can we get more debug info from OCFS2, apart from heartbeat
> tracing (which hasn't told me anything useful yet), so that we can
> potentially file a useful bug report?

Use a serial console and attach it to some "monitoring" host
(USB-to-serial adapters are cheap and work), and log on that host.
You'll get the last messages from there.
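A minimal Python sketch of the logging side on the monitoring host. It assumes the crashing nodes boot with a kernel command line like "console=tty0 console=ttyS0,115200", that the adapter appears on the monitoring host as /dev/ttyUSB0 (a hypothetical name), and that the line speed is already set, e.g. with "stty -F /dev/ttyUSB0 115200 raw". The function name log_console is illustrative.

```python
# Sketch: copy a serial console to a log file with timestamps.
# ASSUMPTIONS: the serial line is already configured (speed, raw mode),
# and the device path passed in by the caller is hypothetical.
import time
from typing import BinaryIO, TextIO


def log_console(stream: BinaryIO, out: TextIO) -> int:
    """Copy stream to out line by line, prefixing a timestamp.

    Returns the number of lines written once the stream hits EOF.
    """
    count = 0
    for raw in iter(stream.readline, b""):
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        out.write(f"{stamp} {text}\n")
        out.flush()  # make each line visible at once, e.g. to tail -f
        count += 1
    return count
```

Usage would be something like opening the device with open("/dev/ttyUSB0", "rb", buffering=0) and a log file in append mode, then calling log_console(tty, log); the last lines before a node resets end up in that file.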


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed