We had a rather creepy experience with DRBD recently. Here's the story:

We have two machines set up as redundant mail servers using 4 DRBD partitions. Both servers are identical:

* Pentium 4 3.0GHz (non-Xeon)
* 1GB ECC RAM
* 73GB Ultra320 10,000rpm Fujitsu SCA disk for O/S (non-DRBD) and PostgreSQL transaction log (DRBD partition).
* 73GB with 3 DRBD partitions, one each for PostgreSQL data, Qmail (mail queue and such), and VPopmail (filesystem destination of all mail).

Last night, we noticed that the system load on the primary was above 4, though the server appeared very responsive. After shutting down all services, the load dropped to an even 4.0 with nothing running. Something was wrong with the load average, so we rebooted the machine. It hung on reboot due to a request for keyboard entry, and the datacenter is far away and charges $150 to reboot a machine at night.

Never mind that, DRBD would save us! The former primary was not pingable and hung after all services were shut down. We set the secondary up as primary (we aren't using heartbeat yet), and all seemed well.

At 6:30am (about 7 hours later), customers started complaining about slow mail delivery. The load on the server was above 7 (and this time it meant it: 100% disk I/O), and more than 600 emails were stuck in the qmail queue (incoming mail is sent to dspam for processing, which stores its data in PostgreSQL). We saw one obscure SCSI card-related error in dmesg, and a whole bunch of grsec-reported signal 11s from dspam. I tried catting several mail files into dspam manually - some would work fine, some would cause dspam to segfault. We presumed hardware failure.

At 7:30, the former primary was rebooted. However, the DRBD partitions were not syncing. The inactive machine showed Secondary/Unknown, and the active server showed Primary/Unknown. netstat -nlptu showed the DRBD ports listening on the inactive machine, and not on the active one. So, at a loss, we rebooted the active server.
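As an aside, the role check described above (looking for an Unknown peer) could be scripted roughly as follows. This is only a sketch: the sample status text, the `cs:`/`st:` line layout, and the function names are assumptions for illustration, and the real /proc/drbd format differs between DRBD versions.

```python
import re

# Hypothetical sample of /proc/drbd-style status lines. The layout is an
# assumption; real output varies by DRBD version and has more fields.
SAMPLE = """\
 0: cs:WFConnection st:Primary/Unknown ld:Consistent
 1: cs:Connected st:Primary/Secondary ld:Consistent
"""

# Matches "<minor>: cs:<state> st:<local>/<peer>"; some versions label the
# role field "ro:" instead of "st:", so both are accepted here.
LINE_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)\s+(?:st|ro):(\w+)/(\w+)")


def parse_drbd_status(text):
    """Return one dict per device line found in the status text."""
    devices = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            minor, cs, local, peer = m.groups()
            devices.append(
                {"minor": int(minor), "cs": cs, "local": local, "peer": peer}
            )
    return devices


def peer_unknown(devices):
    """Return the minor numbers of devices whose peer role is Unknown."""
    return [d["minor"] for d in devices if d["peer"] == "Unknown"]


if __name__ == "__main__":
    for minor in peer_unknown(parse_drbd_status(SAMPLE)):
        print("device %d: peer is Unknown, not replicating" % minor)
```

On a real machine the text would be read from /proc/drbd instead of the sample string. A device reporting an Unknown peer is not replicating, so any writes made in that state exist on one node only.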
When it came back up, the DRBD shares on both machines showed up as Secondary/Secondary, and were syncing. After the sync finished, we set up the known-good server as the primary and brought up services.

Everything looked good. However, when we logged into any mail account, the last 7 hours' worth of messages were gone (everything since the previous server changeover). Several customers who use IMAP and/or webmail noticed missing mail and called. We apologized and said we'd try to recover their mail from a backup. The last backup had been taken at 1am, so it did not contain many of the missing mails.

We went on about our days, and then some 4 hours later, one of the customers called up and said he was seeing his missing mails appear one by one, and thanked us for restoring his mail. We didn't know wtf was going on, but when I looked on the server, all of the mail messages that had been gone were back!

This is a really good recovery, but creepy as all hell. Can somebody take a stab at explaining this please?

Cheers,
-- 
Casey Allen Shobe | http://casey.shobe.info
cshobe at seattleserver.com | cell 425-443-4653
AIM & Yahoo: SomeLinuxGuy | ICQ: 1494523
SeattleServer.com, Inc. | http://www.seattleserver.com