[DRBD-user] Unidentified strange DRBD behaviors...

Casey Allen Shobe lists at seattleserver.com
Tue Jun 7 05:19:24 CEST 2005


We had a rather creepy experience with DRBD recently.  Here's the story:

We have two machines set up as redundant mail servers using 4 DRBD partitions.

Both servers are identical:
* Pentium 4 3.0GHz (non-Xeon)
* 1GB ECC RAM
* 73GB Ultra320 10,000rpm Fujitsu SCA disk for O/S (non-DRBD) and PostgreSQL 
transaction log (DRBD partition).
* 73GB with 3 DRBD partitions, one each for PostgreSQL data, Qmail (mail queue 
and such), and VPopmail (filesystem destination of all mail).

Last night, we noticed the system load at the primary to be above 4, though 
the server appeared very responsive.  After shutting down all services, the 
load dropped to an even 4.0 with nothing running.  Something was clearly wrong 
with the load average, so we rebooted the machine.  It hung on reboot due to a request for 
keyboard entry, and the datacenter is far away and charges $150 to reboot a 
machine at night.

Never mind that, DRBD would save us!  The former primary was not pingable and 
hung after all services were shut down.  We set the secondary up as primary 
(we aren't using heartbeat yet), and all seemed well.  At 6:30am (about 7 
hours later), customers started complaining about slow mail delivery and the 
load on the server was above 7 (and this time it meant it: 100% disk I/O), with 
more than 600 emails stuck in the qmail queue (incoming mail is sent to dspam for 
processing, which stores its data in PostgreSQL).  We saw an obscure (and 
only one) SCSI card-related error in dmesg, and a whole bunch of 
grsec-reported Signal 11's from dspam.  I tried catting several mail files 
into dspam manually - some would work fine, some would cause dspam to 
segfault.  We presumed hardware failure.

At 7:30, the former primary was rebooted.  However, the DRBD partitions were 
not syncing.  The inactive machine showed Secondary/Unknown, and the active 
server showed Primary/Unknown.  netstat -nlptu showed the DRBD ports 
listening on the inactive, and not on the active.  So, at a loss, we rebooted 
the active server.  When it came back up, the DRBD shares on both machines 
showed up as Secondary/Secondary, and were syncing.

After the sync finished, we set up the known-good server as the primary and 
brought up services.

Everything looked good; however, when we logged into any mail account, the last 
7 hours' worth of messages were gone (everything since the previous server 
changeover).  Several customers who use IMAP and/or webmail noticed missing 
mail and called.  We apologized and said we'd try to recover their mail from 
a backup.  The last backup had been taken at 1am, so it did not contain many of 
the missed mails.

We went on about our days, and then some 4 hours later, one of the customers 
called up and said he was seeing his missing mails appear one by one, and thanked 
us for restoring his mail.  We didn't know wtf was going on, but when I 
looked on the server, all of the mail messages that were gone, were back!

This is a really good recovery, but creepy as all hell.  Can somebody take a 
stab at explaining this please?

Cheers,
-- 
Casey Allen Shobe | http://casey.shobe.info
cshobe at seattleserver.com | cell 425-443-4653
AIM & Yahoo:  SomeLinuxGuy | ICQ:  1494523
SeattleServer.com, Inc. | http://www.seattleserver.com


