[DRBD-user] Disk failure on secondary brings down primary

Fri Jun 24 19:27:46 CEST 2011

I recently had a major unexplained failure of my HA cluster due to DRBD. I could
use some help to understand what happened.

Setup: Two systems; hypatia is primary, orestes is secondary. OS is Scientific
Linux 5.5: kernel 2.6.18-194.26.1.el5xen; DRBD version drbd-8.3.8.1-30.el5.

The drives on the two systems are configured identically: /dev/sdc1 and
/dev/sdd1 are grouped by software RAID1 into /dev/md2. DRBD resource "admin" is
device /dev/drbd1 in a Primary/Secondary configuration, formed from /dev/md2 on
both systems; I've attached the config file below, with comments stripped out.

In case it's relevant: I use LVM to carve /dev/drbd1 into several partitions.
One partition contains the image files for several xen VMs; other partitions are
NFS-exported to both the VMs and to other systems in the lab.

All of these resources, including DRBD, are managed by the HA software:
corosync-1.2.7; pacemaker-1.0.11; openais-1.1.3; heartbeat-3.0.3.

Here's the problem: I had a hard-drive failure on the secondary:

Jun  8 01:04:04 orestes kernel: ata4.00: exception Emask 0x40 SAct 0x0 SErr 0x800 action 0x6 frozen
Jun  8 01:04:04 orestes kernel: ata4: SError: { HostInt }
Jun  8 01:04:04 orestes kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jun  8 01:04:04 orestes kernel:          res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x44 (timeout)
Jun  8 01:04:04 orestes kernel: ata4.00: status: { DRDY }
Jun  8 01:04:04 orestes kernel: ata4: hard resetting link
Jun  8 01:04:05 orestes kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun  8 01:04:35 orestes kernel: ata4.00: qc timeout (cmd 0xec)
Jun  8 01:04:35 orestes kernel: ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun  8 01:04:35 orestes kernel: ata4.00: revalidation failed (errno=-5)
Jun  8 01:04:35 orestes kernel: ata4: failed to recover some devices, retrying in 5 secs
Jun  8 01:04:40 orestes kernel: ata4: hard resetting link
Jun  8 01:04:42 orestes kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

OK, drives fail. That's what the double-redundancy is for (RAID1+DRBD). But on
the primary:

Jun  8 01:04:39 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967295
Jun  8 01:04:45 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967294
Jun  8 01:04:51 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967293
Jun  8 01:04:57 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967292
Jun  8 01:05:01 hypatia lrmd: [3988]: WARN: Nagios:monitor process (PID 23641) timed out (try 1).  Killing with signal SIGTERM (15).
Jun  8 01:05:01 hypatia lrmd: [3988]: WARN: operation monitor[75] on ocf::Xen::Nagios for client 3991, its parameters: CRM_meta_interval=[10000] xmfile=[/xen/configs/nagios.cfg] CRM_meta_timeout=[30000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] : pid [23641] timed out
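
In case it helps with reading the "ko" values in the drbd1 messages above, here
is a small sketch that pulls the timestamps and counters out of those four
lines. It assumes the counter is a 32-bit unsigned value (I have not checked
this against the DRBD source), and the helper names are my own.

```python
import re
from datetime import datetime

# The four drbd1 messages from hypatia, with the mail-wrapped lines rejoined.
LOG = """\
Jun  8 01:04:39 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967295
Jun  8 01:04:45 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967294
Jun  8 01:04:51 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967293
Jun  8 01:04:57 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg time expired, ko = 4294967292
"""

PATTERN = re.compile(
    r"^(\w+\s+\d+ \d\d:\d\d:\d\d) .*sock_sendmsg time expired, ko = (\d+)$"
)


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as signed.

    Assumption: the ko counter is 32 bits wide.
    """
    return value - (1 << 32) if value >= (1 << 31) else value


def parse(log):
    """Return (timestamp, ko) pairs for every sock_sendmsg timeout line."""
    events = []
    for line in log.splitlines():
        m = PATTERN.match(line)
        if m:
            # syslog omits the year; 2011 is taken from the date of this post.
            # %b expects English month abbreviations (C/English locale).
            stamp = datetime.strptime("2011 " + m.group(1), "%Y %b %d %H:%M:%S")
            events.append((stamp, int(m.group(2))))
    return events


events = parse(LOG)
intervals = [int((b[0] - a[0]).total_seconds()) for a, b in zip(events, events[1:])]
signed = [to_signed32(ko) for _, ko in events]

print("seconds between timeouts:", intervals)
print("ko as signed 32-bit:", signed)
```

The messages come 6 seconds apart, and read as signed numbers the counters are
-1, -2, -3, -4. That would be consistent with a counter that started at 0 and
was decremented past zero, but that is only my guess.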

I'll quote more of the log file if asked, but since this is a DRBD mailing list
I'll describe the HA errors abstractly: "Nagios" is the name of one of my
virtual machines whose disk image is on drbd1. One after the other, all the
virtual machines on drbd1 time out and fail. Corosync tries to transfer the
resources from the good primary to the bad secondary; this fails (NFS issues)
and the bad secondary STONITHs the good primary!

There's nothing in the log files of the virtual machines; the entries stop at
the time of the lrmd timeouts.

My questions are:

- The drive on ata4 failed, but it was part of a software RAID1. Why would that
cause a problem as long as the other drive was OK?

- Why would a drive failure on a secondary cause kernel drbd errors on the primary?

My only explanation is that maybe I had the syncer rate too high; it was 50M,
but I have reduced it to 15M. Is that a possible cause for this kind of problem?

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
