Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Oops... I omitted the config file.

On 6/24/11 1:27 PM, William Seligman wrote:
> I recently had a major unexplained failure of my HA cluster due to DRBD,
> and I could use some help understanding what happened.
>
> Setup: two systems; hypatia is the primary, orestes the secondary. The OS
> is Scientific Linux 5.5: kernel 2.6.18-194.26.1.el5xen; DRBD version
> drbd-8.3.8.1-30.el5.
>
> The drives on the two systems are configured identically: /dev/sdc1 and
> /dev/sdd1 are grouped by software RAID1 into /dev/md2. DRBD resource
> "admin" is device /dev/drbd1 in a Primary/Secondary configuration, formed
> from /dev/md2 on both systems; I've attached the config file below, with
> comments stripped out.
>
> In case it's relevant: I use LVM to carve /dev/drbd1 into several
> partitions. One partition contains the image files for several Xen VMs;
> other partitions are NFS-exported both to the VMs and to other systems in
> the lab.
>
> All of these resources, including DRBD, are managed by the HA software:
> corosync-1.2.7, pacemaker-1.0.11, openais-1.1.3, heartbeat-3.0.3.
>
> Here's the problem: I had a hard-drive failure on the secondary:
>
> Jun 8 01:04:04 orestes kernel: ata4.00: exception Emask 0x40 SAct 0x0 SErr
> 0x800 action 0x6 frozen
> Jun 8 01:04:04 orestes kernel: ata4: SError: { HostInt }
> Jun 8 01:04:04 orestes kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0
> tag 0
> Jun 8 01:04:04 orestes kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/40
> Emask 0x44 (timeout)
> Jun 8 01:04:04 orestes kernel: ata4.00: status: { DRDY }
> Jun 8 01:04:04 orestes kernel: ata4: hard resetting link
> Jun 8 01:04:05 orestes kernel: ata4: SATA link up 3.0 Gbps (SStatus 123
> SControl 300)
> Jun 8 01:04:35 orestes kernel: ata4.00: qc timeout (cmd 0xec)
> Jun 8 01:04:35 orestes kernel: ata4.00: failed to IDENTIFY (I/O error,
> err_mask=0x4)
> Jun 8 01:04:35 orestes kernel: ata4.00: revalidation failed (errno=-5)
> Jun 8 01:04:35 orestes kernel: ata4: failed to recover some devices,
> retrying in 5 secs
> Jun 8 01:04:40 orestes kernel: ata4: hard resetting link
> Jun 8 01:04:42 orestes kernel: ata4: SATA link up 3.0 Gbps (SStatus 123
> SControl 300)
>
> OK, drives fail. That's what the double redundancy (RAID1 + DRBD) is for.
> But on the primary:
>
> Jun 8 01:04:39 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg
> time expired, ko = 4294967295
> Jun 8 01:04:45 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg
> time expired, ko = 4294967294
> Jun 8 01:04:51 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg
> time expired, ko = 4294967293
> Jun 8 01:04:57 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg
> time expired, ko = 4294967292
> Jun 8 01:05:01 hypatia lrmd: [3988]: WARN: Nagios:monitor process (PID 23641)
> timed out (try 1). Killing with signal SIGTERM (15).
> Jun 8 01:05:01 hypatia lrmd: [3988]: WARN: operation monitor[75] on
> ocf::Xen::Nagios for client 3991, its parameters: CRM_meta_interval=[10000]
> xmfile=[/xen/configs/nagios.cfg] CRM_meta_timeout=[30000]
> crm_feature_set=[3.0.1] CRM_meta_name=[monitor] : pid [23641] timed out
>
> I'll quote more of the log file if asked, but since this is a DRBD mailing
> list I'll describe the HA errors abstractly: "Nagios" is the name of one of
> my virtual machines whose disk image is on drbd1. One after the other, all
> the virtual machines on drbd1 time out and fail. Corosync tries to transfer
> the resources from the good primary to the bad secondary; this fails (NFS
> issues) and the bad secondary STONITHs the good primary!
>
> There's nothing in the log files of the virtual machines; the entries stop
> at the time of the lrmd timeouts.
>
> My questions are:
>
> - The drive on ata4 failed, but it was part of a software RAID1. Why would
>   that cause a problem as long as the other drive was OK?
>
> - Why would a drive failure on the secondary cause kernel DRBD errors on
>   the primary?
>
> My only explanation is that maybe I had the syncer rate too high; it was
> 50M, but I reduced it to 15M. Is that a possible cause for this kind of
> problem?
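A quick note on that last point before the config: on drbd 8.3 a syncer-rate
change can be applied to a running resource, so (if I have the man pages
right; "admin" is the resource name from the config below) something like
this is all it takes, plus a sanity check of each layer of the stack:

  # after changing "rate 50M;" to "rate 15M;" in the syncer section
  # of /etc/drbd.conf, have DRBD re-read the file and apply it:
  drbdadm adjust admin

  # then check each layer:
  cat /proc/mdstat    # md RAID1 state of /dev/md2
  cat /proc/drbd      # DRBD connection and disk states
  lvs                 # LVM volumes carved out of /dev/drbd1

And here is the config file itself: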
global {
    usage-count yes;
}

common {
    protocol A;

    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    }

    startup {
    }

    disk {
    }

    net {
        ping-timeout 11;
    }

    syncer {
        rate 15M;
    }
}

resource admin {
    device /dev/drbd1;
    disk   /dev/md2;

    net {
        after-sb-0pri discard-zero-changes;
        after-sb-1pri consensus;
        after-sb-2pri disconnect;
    }

    startup {
        wfc-timeout          60;
        degr-wfc-timeout     60;
        outdated-wfc-timeout 60;
    }

    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh sysadmin@nevis.columbia.edu";
    }

    meta-disk internal;

    on hypatia.nevis.columbia.edu {
        address 192.168.100.7:7789;
    }

    on orestes.nevis.columbia.edu {
        address 192.168.100.6:7789;
    }
}

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/