On Mon, 11 Apr 2005 11:23:46 -0500
Todd Denniston <Todd.Denniston at ssa.crane.navy.mil> threw this fish to the penguins:

> george young wrote:
> >
> > What happened: One of a pair of servers (pig-app) hung (after several
> > months uptime), and had to be quickly rebooted.
> >
> > What went wrong: On rebooting, pig-app insisted on waiting the 1.7 hours
> > for the "db" partition to sync, even though pig-app does not mount "db".
> >
> > The configuration: two servers, pig-app and pig-db. Normally pig-app
> > mounts /home through drbd and pig-db mounts /db through drbd. The two
> > file systems are mirrored on the other server, so if one dies, the other
> > can take over services. The problem is that when pig-app rebooted, it
> > should (I think) have come up fully and synced its copy of /db in the
> > background, not held up the boot process (and kept my users waiting!).
> >
> > Is my configuration wrong?
> <SNIP>
>
> Why were your users waiting?

The boot process hung executing "/etc/rc.d/drbd start" until the sync
finished. Networks, logins, file systems, etc. were not available until
the sync completed and the boot could finish.

> pig-db could have (should have?) taken over pig-app's work (via heartbeat
> configuration) until pig-app was fully ready to come back on line.

Failover did not happen due to an as yet un-diagnosed problem in pig-db.
I don't think there's any way this could have affected the wait/nowait
behavior on pig-app.

> At worst your users should have seen the system working slowly (so they wait
> a few seconds) not a full work stoppage. Granted I have mine only setup as
> CVS and NFS servers but when there is a fault on one (that actually causes
> drbd to panic the kernel, as I instructed it to do) generally people don't
> even notice, there is a 30-50 second burble of no activity and then the
> systems continue to function[1].
>
>
> [1] with two exceptions:
> 1: any cvs commands in operation at the time have to be restarted, no big
> deal.
> 2: we have a Red Hat 6.2 machine which does NOT like to be in runlevel 3
> (or higher) while a fall over is happening, on that box I have to issue
> `telinit 2` wait for fall-over to complete and then issue `telinit 3`.

--
"Are the gods not just?" "Oh no, child. What would become of us if they were?" (CSL)