Note: "permalinks" may not be as permanent as we would like;
direct links from old sources may well be a few messages off.
/ 2005-06-07 03:19:24 +0000 \ Casey Allen Shobe:
> We had a rather creepy experience with DRBD recently.  Here's the story:
>
> We have two machines set up as redundant mail servers using 4 DRBD partitions.
>
> Both servers are identical:
> * Pentium 4 3.0GHz (non-Xeon)
> * 1GB ECC RAM
> * 73GB Ultra320 10,000rpm Fujitsu SCA disk for O/S (non-DRBD) and PostgreSQL
>   transaction log (DRBD partition).
> * 73GB with 3 DRBD partitions, one each for PostgreSQL data, Qmail (mail queue
>   and such), and VPopmail (filesystem destination of all mail).
>
> Last night, we noticed the system load at the primary to be above 4, though
> the server appeared very responsive.  After shutting down all services, the
> load dropped to an even 4.0 with nothing running.  Something was wrong with
> loadav, so we rebooted the machine.  It hung on reboot due to a request for
> keyboard entry, and the datacenter is far away and charges $150 to reboot a
> machine at night.

just in case, definition of load:
( see also http://wwnew.linux-ha.org/DRBD/FAQ )

Load average is defined as average number of processes in the
runqueue during a given interval. A process is in the run queue, if it is
 * not waiting for external events (e.g. select on some fd)
 * not waiting on its own (not called "wait" explicitly)
 * not stopped :)

Note that all processes waiting for disk io are counted as runnable!
Therefore, if a lot of processes wait for disk io, the "load average"
goes straight up, though the system actually may be almost idle
cpu-wise ...

E.g. crash your nfs server (or just unplug the cable of your diskless
client work station), and start 100
    ls /path/to/non-cached/dir/on/nfs/mount-point
on a client... you get a "load average" of 100+ for as long as the nfs
timeout, which might be weeks ... though the cpu does nothing.

> Nevermind that, DRBD would save us!  The former primary was not pingable and
> hung after all services were shut down.
> We set the secondary up as primary
> (we aren't using heartbeat yet), and all seemed well.  At 6:30am (about 7
> hours later), customers started complaining about slow mail delivery and the
> load on the server was above 7 (and actually meaning it, 100% disk I/O) and
> >600 emails stuck in the qmail queue (incoming mail is sent to dspam for
> processing, which stores its data in PostgreSQL).  We saw an obscure (and
> only one) SCSI card-related error in dmesg, and a whole bunch of
> grsec-reported Signal 11's from dspam.  I tried catting several mail files
> into dspam manually - some would work fine, some would cause dspam to
> segfault.  We presumed hardware failure.

bad ram? motherboard? southbridge or whatever?

> At 7:30, the former server was rebooted.  However, the DRBD partitions were
> not syncing.  The nonactive machine showed Secondary/Unknown, and the active
> server showed Primary/Unknown.  netstat -nlptu showed the DRBD ports
> listening on the inactive, and not on the active.  So, at a loss, we rebooted
> the active server.

    drbdadm reconnect all
should have worked (on the one that was StandAlone).

when in the logs there is something about
    Current Primary shall become sync TARGET!
    Aborting to prevent data corruption.
then you have a problem anyways, but you can then
    drbdadm invalidate
on the one with the bad data (probably the current Secondary),
then do a
    drbdadm reconnect all
again.

> When it came back up, the DRBD shares on both machines
> showed up as Secondary/Secondary, and were syncing.

and this might now have happened in the wrong direction, because of how
the generation counters currently work in drbd, and because of what
happened in between to those counters...

we have changed this algorithm in the current development branch of
drbd, it now reliably detects these kinds of situations and just refuses
to connect until you resolve it manually...

> After the sync finished, we set up the known-good server as the primary and
> brought up services.

well.
you should have done so before, and more importantly set the known BAD
server to "inconsistent", so it will receive a full sync...

> Everything looked good, however when we logged into any mail account, the last
> 7 hours worth of messages were gone (everything since the previous server
> changeover).  Several customers who use IMAP and/or webmail noticed missing
> mail and called.  We apologized and said we'd try to recover their mail from
> a backup.  The last backup had been taken at 1am, so did not contain many of
> the missed mails.
>
> We went on about our days, and then some 4 hours later, one of the customers
> called up and said he was seeing his missing mails appearing one by one, and
> thanked us for restoring his mail.  We didn't know wtf was going on, but when I
> looked on the server, all of the mail messages that were gone, were back!
>
> This is a really good recovery, but creepy as all hell.  Can somebody take a
> stab at explaining this please?

oops?
hm.
maybe some oddities in imap/maildir/symlink/header cache or some such?
maybe if you expect a mail to be already flagged as seen, you don't see
it anymore if it suddenly is back as unseen again?

or maybe it was just a meteorite shower, cosmic rays, you know :->

in any case I'd recommend a more or less short downtime, fsck, maybe
md5sum the partitions or just do a full sync (invalidate the secondary)...

btw, you are sure the hardware is ok (again) ?

cheers,

--
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe http://www.linbit.com :
__
please use the "List-Reply" function of your email client.
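
The "md5sum the partitions" check suggested above could be sketched
roughly as follows. This is only an illustration, not something from the
original mail: the function names md5_of and same_content are made up
here, and it assumes you run it on each node during the downtime, with
services stopped, so that nothing writes to the device while it is being
read. Any device path you pass in (for example a hypothetical /dev/sdb1)
is your own.

```python
import hashlib

CHUNK = 1024 * 1024  # read 1 MiB at a time so a large partition never has to fit in RAM


def md5_of(path, chunk_size=CHUNK):
    """Return the hex MD5 digest of a file or block device, read in chunks.

    Hypothetical helper: `path` would be the partition on one node.
    """
    digest = hashlib.md5()
    with open(path, "rb") as dev:
        while True:
            block = dev.read(chunk_size)
            if not block:  # empty read means end of device/file
                break
            digest.update(block)
    return digest.hexdigest()


def same_content(local_digest, remote_digest):
    """Compare two hex digests, ignoring case and surrounding whitespace."""
    return local_digest.strip().lower() == remote_digest.strip().lower()
```

Run md5_of on the same partition on both nodes and feed the two results
to same_content. If they differ, the nodes do not hold the same data and
the full sync (invalidate the secondary) is the safer way out.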