[DRBD-user] 0.6.10, sync should be in background?

Mon Apr 11 18:54:52 CEST 2005

On Mon, 11 Apr 2005 11:23:46 -0500
Todd Denniston <Todd.Denniston at ssa.crane.navy.mil> threw this fish to the penguins:

> george young wrote:
> > 
> > What happened:  One of a pair of servers (pig-app) hung (after several
> > months uptime), and had to be quickly rebooted.
> > 
> > What went wrong:  On rebooting, pig-app insisted on waiting the 1.7 hours
> > for the "db" partition to sync, even though pig-app does not mount "db".
> > 
> > The configuration: two servers, pig-app and pig-db.  Normally pig-app
> > mounts /home through drbd and pig-db mounts /db through drbd.  The two
> > file systems are mirrored on the other server, so if one dies, the other
> > can take over services.  The problem is that when pig-app rebooted, it
> > should (I think) have come up fully and synced its copy of /db in the
> > background, not held up the boot process (and kept my users waiting!).
> > 
> > Is my configuration wrong?  
> <SNIP>
> 
> Why were your users waiting?

The boot process hung executing "/etc/rc.d/drbd start" until the sync
finished.  Networks, logins, file systems, etc. were not available until
the sync completed and the boot could finish.

> pig-db could have (should have?) taken over pig-app's work (via heartbeat
> configuration) until pig-app was fully ready to come back on line.

Failover did not happen due to an as-yet-undiagnosed problem in pig-db.
I don't think there's any way this could have affected the wait/nowait
behavior on pig-app.

> At worst your users should have seen the system working slowly (so they wait
> a few seconds), not a full work stoppage. Granted, I have mine set up only as
> CVS and NFS servers, but when there is a fault on one (that actually causes
> drbd to panic the kernel, as I instructed it to do), people generally don't
> even notice: there is a 30-50 second burble of no activity and then the
> systems continue to function[1].
> 
> 
> [1] with two exceptions:
> 	1: any cvs commands in operation at the time have to be restarted, no big
> deal.
> 	2: we have a Red Hat 6.2 machine which does NOT like to be in runlevel 3
> (or higher) while a failover is happening; on that box I have to issue
> `telinit 2`, wait for failover to complete, and then issue `telinit 3`.

-- 
"Are the gods not just?"  "Oh no, child.
What would become of us if they were?" (CSL)