I recently had a major unexplained failure of my HA cluster due to DRBD. I could use some help to understand what happened.

Setup: Two systems; hypatia is primary, orestes is secondary. OS is Scientific Linux 5.5: kernel 2.6.18-194.26.1.el5xen; DRBD version drbd-8.3.8.1-30.el5.

The drives on the two systems are configured identically: /dev/sdc1 and /dev/sdd1 are grouped by software RAID1 into /dev/md2. DRBD resource "admin" is device /dev/drbd1 in a Primary/Secondary configuration, formed from /dev/md2 on both systems; I've attached the config file below, with comments stripped out.

In case it's relevant: I use LVM to carve /dev/drbd1 into several partitions. One partition contains the image files for several xen VMs; other partitions are NFS-exported to both the VMs and to other systems in the lab.

All of these resources, including DRBD, are managed by the HA software: corosync-1.2.7; pacemaker-1.0.11; openais-1.1.3; heartbeat-3.0.3.

Here's the problem: I had a hard-drive failure on the secondary:

Jun 8 01:04:04 orestes kernel: ata4.00: exception Emask 0x40 SAct 0x0 SErr 0x800 action 0x6 frozen
Jun 8 01:04:04 orestes kernel: ata4: SError: { HostInt }
Jun 8 01:04:04 orestes kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jun 8 01:04:04 orestes kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x44 (timeout)
Jun 8 01:04:04 orestes kernel: ata4.00: status: { DRDY }
Jun 8 01:04:04 orestes kernel: ata4: hard resetting link
Jun 8 01:04:05 orestes kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 8 01:04:35 orestes kernel: ata4.00: qc timeout (cmd 0xec)
Jun 8 01:04:35 orestes kernel: ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun 8 01:04:35 orestes kernel: ata4.00: revalidation failed (errno=-5)
Jun 8 01:04:35 orestes kernel: ata4: failed to recover some devices, retrying in 5 secs
Jun 8 01:04:40 orestes kernel: ata4: hard resetting link
Jun 8 01:04:42 orestes kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

OK, drives fail. That's what the double-redundancy is for (RAID1+DRBD). But on the primary:

Jun 8 01:04:39 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967295
Jun 8 01:04:45 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967294
Jun 8 01:04:51 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967293
Jun 8 01:04:57 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967292
Jun 8 01:05:01 hypatia lrmd: [3988]: WARN: Nagios:monitor process (PID 23641) timed out (try 1). Killing with signal SIGTERM (15).
Jun 8 01:05:01 hypatia lrmd: [3988]: WARN: operation monitor[75] on ocf::Xen::Nagios for client 3991, its parameters: CRM_meta_interval=[10000] xmfile=[/xen/configs/nagios.cfg] CRM_meta_timeout=[30000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] : pid [23641] timed out

I'll quote more of the log file if asked, but since this is a DRBD mailing list I'll describe the HA errors abstractly:

"Nagios" is the name of one of my virtual machines whose disk image is on drbd1. One after the other, all the virtual machines on drbd1 time out and fail. Corosync tries to transfer the resources from the good primary to the bad secondary; this fails (NFS issues) and the bad secondary STONITHs the good primary!

There's nothing in the log files of the virtual machines; the entries stop at the time of the lrmd timeouts.

My questions are:

- The drive on ata4 failed, but it was part of a software RAID1. Why would that cause a problem as long as the other drive was OK?

- Why would a drive failure on a secondary cause kernel drbd errors on the primary?

My only explanation is that maybe I had the syncer rate too high; it was 50M but I reduced it to 15M. Is that a possible cause for this kind of problem?
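
In case it helps with the diagnosis, here is a small sketch for reading the sock_sendmsg lines from hypatia quoted above. It rests on two assumptions that the logs themselves do not confirm: that the year is 2011 (syslog omits it), and that the ko value is a 32-bit counter printed as unsigned, so that 4294967295 would correspond to -1. The names LOG, to_signed32 and parse_ko_events are made up for this illustration.

```python
import re
from datetime import datetime

# The four DRBD lines from hypatia, copied from the log excerpt above.
LOG = """\
Jun 8 01:04:39 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967295
Jun 8 01:04:45 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967294
Jun 8 01:04:51 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967293
Jun 8 01:04:57 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967292
"""

PATTERN = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: block (\w+): "
    r".*sock_sendmsg time expired, ko = (\d+)$"
)


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as signed (ASSUMPTION about the log format)."""
    return value - (1 << 32) if value >= (1 << 31) else value


def parse_ko_events(text, year=2011):
    """Return (timestamp, device, signed ko) for each matching line.

    The year is an ASSUMPTION; syslog timestamps do not carry one.
    """
    events = []
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        events.append((stamp, match.group(2), to_signed32(int(match.group(3)))))
    return events


events = parse_ko_events(LOG)
# Seconds between consecutive "time expired" messages.
intervals = [(b[0] - a[0]).total_seconds() for a, b in zip(events, events[1:])]

for stamp, device, ko in events:
    print(stamp.time(), device, ko)
print("intervals (s):", intervals)
```

On the four lines quoted, this gives a counter running from -1 to -4 at 6-second intervals. That would be consistent with a countdown that started at 0 and wrapped, but this is only an inference from the log text.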

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/