Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Oops... I omitted the config file.

On 6/24/11 1:27 PM, William Seligman wrote:
> I recently had a major unexplained failure of my HA cluster due to DRBD,
> and I could use some help understanding what happened.
>
> Setup: two systems; hypatia is the primary, orestes the secondary. The OS
> is Scientific Linux 5.5: kernel 2.6.18-194.26.1.el5xen; DRBD version
> drbd-8.3.8.1-30.el5.
>
> The drives on the two systems are configured identically: /dev/sdc1 and
> /dev/sdd1 are grouped by software RAID1 into /dev/md2. DRBD resource
> "admin" is device /dev/drbd1 in a Primary/Secondary configuration, formed
> from /dev/md2 on both systems; I've attached the config file below, with
> comments stripped out.
>
> In case it's relevant: I use LVM to carve /dev/drbd1 into several
> partitions. One partition contains the image files for several Xen VMs;
> other partitions are NFS-exported both to the VMs and to other systems in
> the lab.
>
> All of these resources, including DRBD, are managed by the HA software:
> corosync-1.2.7, pacemaker-1.0.11, openais-1.1.3, heartbeat-3.0.3.
>
> Here's the problem: I had a hard-drive failure on the secondary:
>
> Jun 8 01:04:04 orestes kernel: ata4.00: exception Emask 0x40 SAct 0x0 SErr
> 0x800 action 0x6 frozen
> Jun 8 01:04:04 orestes kernel: ata4: SError: { HostInt }
> Jun 8 01:04:04 orestes kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0
> tag 0
> Jun 8 01:04:04 orestes kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/40
> Emask 0x44 (timeout)
> Jun 8 01:04:04 orestes kernel: ata4.00: status: { DRDY }
> Jun 8 01:04:04 orestes kernel: ata4: hard resetting link
> Jun 8 01:04:05 orestes kernel: ata4: SATA link up 3.0 Gbps (SStatus 123
> SControl 300)
> Jun 8 01:04:35 orestes kernel: ata4.00: qc timeout (cmd 0xec)
> Jun 8 01:04:35 orestes kernel: ata4.00: failed to IDENTIFY (I/O error,
> err_mask=0x4)
> Jun 8 01:04:35 orestes kernel: ata4.00: revalidation failed (errno=-5)
> Jun 8 01:04:35 orestes kernel: ata4: failed to recover some devices,
> retrying in 5 secs
> Jun 8 01:04:40 orestes kernel: ata4: hard resetting link
> Jun 8 01:04:42 orestes kernel: ata4: SATA link up 3.0 Gbps (SStatus 123
> SControl 300)
>
> OK, drives fail. That's what the double redundancy (RAID1 + DRBD) is for.
> But on the primary:
>
> Jun 8 01:04:39 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg
> time expired, ko = 4294967295
> Jun 8 01:04:45 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg
> time expired, ko = 4294967294
> Jun 8 01:04:51 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg
> time expired, ko = 4294967293
> Jun 8 01:04:57 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg
> time expired, ko = 4294967292
> Jun 8 01:05:01 hypatia lrmd: [3988]: WARN: Nagios:monitor process (PID 23641)
> timed out (try 1). Killing with signal SIGTERM (15).
> Jun 8 01:05:01 hypatia lrmd: [3988]: WARN: operation monitor[75] on
> ocf::Xen::Nagios for client 3991, its parameters: CRM_meta_interval=[10000]
> xmfile=[/xen/configs/nagios.cfg] CRM_meta_timeout=[30000]
> crm_feature_set=[3.0.1] CRM_meta_name=[monitor] : pid [23641] timed out
>
> I'll quote more of the log file if asked, but since this is a DRBD mailing
> list I'll describe the HA errors abstractly: "Nagios" is the name of one of
> my virtual machines whose disk image is on drbd1. One after the other, all
> the virtual machines on drbd1 time out and fail. Corosync tries to transfer
> the resources from the good primary to the bad secondary; this fails (NFS
> issues) and the bad secondary STONITHs the good primary!
>
> There's nothing in the log files of the virtual machines; the entries stop
> at the time of the lrmd timeouts.
>
> My questions are:
>
> - The drive on ata4 failed, but it was part of a software RAID1. Why would
>   that cause a problem as long as the other drive was OK?
>
> - Why would a drive failure on the secondary cause kernel DRBD errors on
>   the primary?
>
> My only explanation is that maybe I had the syncer rate too high; it was
> 50M, but I reduced it to 15M. Is that a possible cause for this kind of
> problem?
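A quick note on that last point before the config: on drbd 8.3 a syncer-rate
change can be applied to a running resource, so (if I have the man pages
right; "admin" is the resource name from the config below) something like
this is all it takes, plus a sanity check of each layer of the stack:

  # after changing "rate 50M;" to "rate 15M;" in the syncer section
  # of /etc/drbd.conf, have DRBD re-read the file and apply it:
  drbdadm adjust admin

  # then check each layer:
  cat /proc/mdstat    # md RAID1 state of /dev/md2
  cat /proc/drbd      # DRBD connection and disk states
  lvs                 # LVM volumes carved out of /dev/drbd1

And here is the config file itself: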
global {
    usage-count yes;
}

common {
    protocol A;

    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    }

    startup {
    }

    disk {
    }

    net {
        ping-timeout 11;
    }

    syncer {
        rate 15M;
    }
}

resource admin {
    device /dev/drbd1;
    disk   /dev/md2;

    net {
        after-sb-0pri discard-zero-changes;
        after-sb-1pri consensus;
        after-sb-2pri disconnect;
    }

    startup {
        wfc-timeout          60;
        degr-wfc-timeout     60;
        outdated-wfc-timeout 60;
    }

    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh sysadmin@nevis.columbia.edu";
    }

    meta-disk internal;

    on hypatia.nevis.columbia.edu {
        address 192.168.100.7:7789;
    }

    on orestes.nevis.columbia.edu {
        address 192.168.100.6:7789;
    }
}

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/