On Wed, Jun 24, 2009 at 04:10:43PM +0200, Kris Buytaert wrote:
> We're trying to set up a dual-primary DRBD environment, with a shared
> disk with either OCFS2 or GFS. The environment is a CentOS 5.3 with
> DRBD82 (but also tried with DRBD83 from testing).
>
> Setting up a single primary disk and running bonnie++ on it works.
> Setting up a dual-primary disk, only mounting it on one node (ext3) and
> running bonnie++ works.
>
> When setting up ocfs2 on the /dev/drbd0 disk and mounting it on both
> nodes, basic functionality seems in place, but usually less than 5-10
> minutes after I start bonnie++ as a test on one of the nodes, both
> nodes power cycle with no errors in the logfiles, just a crash.
>
> When at the console at the time of crash it looks like a disk IO (you
> can type , but actions happen) block happens then a reboot, no panics,
> no oops, nothing. (sysctl panic values set to timeouts etc.)
> Setting up a dual-primary disk with ocfs2, only mounting it on one node
> and starting bonnie++, causes only that node to crash.
>
> On DRBD level I get the following error when that node disappears:
>
> drbd0: PingAck did not arrive in time.
> drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure )
> pdsk(UpToDate -> DUnknown )
> drbd0: asender terminated
> drbd0: Terminating asender thread
>
> That however is an expected error because of the reboot.
>
> At first I assumed OCFS2 to be the root of this problem, so I moved
> forward and set up an iSCSI target on a 3rd node, and used that device
> with the same OCFS2 setup. There no crashes occurred and bonnie++
> flawlessly completed its test run.
>
> So my attention went back to the combination of DRBD and OCFS2.
>
> I tried both DRBD 8.2 drbd82-8.2.6-1.el5.centos kmod-drbd82-8.2.6-2 and
> the 83 variant from CentOS Testing.
>
> At first I was trying with the ocfs2 1.4.1-1.el5.i386.rpm version, but
> upgrading to 1.4.2-1.el5.i386.rpm didn't change the behaviour.
>
> Anyone has an idea on this?

OCFS2 heartbeat to disk takes too long, and it self-fences?
Increase the OCFS2 timeout there.

Also, use the "deadline" IO scheduler,
or the latencies will kill you!

> How can we get more debug info from OCFS2 , apart from heartbeat
> tracing which doesn't learn me nothing yet .. in order to potentially
> file a valuable bug report.

Use a serial console, attach that to some "monitoring" host
(you can use USB-to-Serial adapters, they are cheap and work),
and log on that one. You'll get the last messages from there.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
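
The two tuning suggestions above (a longer OCFS2 disk heartbeat timeout, and the "deadline" IO scheduler) could be sketched roughly as follows. This is a minimal sketch, not something taken from the thread: it assumes the usual o2cb relation timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, and it assumes the scheduler is selected through /sys/block/<device>/queue/scheduler on the backing disk below DRBD. The function names and the device name `sdb` in the usage note are hypothetical.

```python
from pathlib import Path


def heartbeat_threshold(timeout_seconds: int) -> int:
    """O2CB_HEARTBEAT_THRESHOLD for a desired disk heartbeat timeout.

    Assumption: timeout = (threshold - 1) * 2 seconds, so
    threshold = timeout / 2 + 1.
    """
    if timeout_seconds < 2:
        raise ValueError("timeout must be at least 2 seconds")
    return timeout_seconds // 2 + 1


def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler marked with [brackets] in the sysfs file content,
    e.g. 'noop anticipatory deadline [cfq]' gives 'cfq'."""
    for word in sysfs_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no active scheduler marked in {sysfs_text!r}")


def set_scheduler(device: str, name: str = "deadline") -> str:
    """Select an IO scheduler for a block device and return the active one.

    Needs root. 'device' is the backing disk name (hypothetical example:
    'sdb'), not the drbd device itself.
    """
    path = Path("/sys/block") / device / "queue" / "scheduler"
    # Strip the brackets so the currently active scheduler is also listed.
    offered = path.read_text().replace("[", "").replace("]", "").split()
    if name not in offered:
        raise ValueError(f"{name!r} not offered for {device}: {offered}")
    path.write_text(name)
    return active_scheduler(path.read_text())
```

Under these assumptions, `heartbeat_threshold(120)` gives the value to put into the o2cb configuration for a two-minute timeout (the same value on every node, followed by a restart of the cluster stack), and `set_scheduler("sdb")` run as root on both nodes switches the backing disk to deadline. The sysfs setting does not survive a reboot, so it would have to be reapplied at boot.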