[DRBD-user] Disk failure on secondary brings down primary

Theo Band theo.band at greenpeak.com
Tue Jul 5 16:55:38 CEST 2011

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On 06/24/2011 11:09 PM, William Seligman wrote:
> Oops... I omitted the config file.
>
> On 6/24/11 1:27 PM, William Seligman wrote:
>> I recently had a major unexplained failure of my HA cluster due to DRBD. I could
>> use some help to understand what happened.
>>
>> Setup: Two systems; hypatia is primary, orestes is secondary. OS is Scientific
>> Linux 5.5: kernel 2.6.18-194.26.1.el5xen; DRBD version drbd-8.3.8.1-30.el5.
>>
>> The drives on the two systems are configured identically: /dev/sdc1 and
>> /dev/sdd1 are grouped by software RAID1 into /dev/md2. DRBD resource "admin" is
>> device /dev/drbd1 in a Primary/Secondary configuration, formed from /dev/md2 on
>> both systems; I've attached the config file below, with comments stripped out.
>>
>> In case it's relevant: I use LVM to carve /dev/drbd1 into several partitions.
>> One partition contains the image files for several xen VMs; other partitions are
>> NFS-exported to both the VMs and to other systems in the lab.
>>
>> All of these resources, including DRBD, are managed by the HA software:
>> corosync-1.2.7; pacemaker-1.0.11; openais-1.1.3; heartbeat-3.0.3.
>>
>> Here's the problem: I had a hard-drive failure on the secondary:
>>
>> Jun  8 01:04:04 orestes kernel: ata4.00: exception Emask 0x40 SAct 0x0 SErr
>> 0x800 action 0x6 frozen
>> Jun  8 01:04:04 orestes kernel: ata4: SError: { HostInt }
>> Jun  8 01:04:04 orestes kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0
>> tag 0
>> Jun  8 01:04:04 orestes kernel:          res 40/00:00:00:4f:c2/00:00:00:00:00/40
>> Emask 0x44 (timeout)
>> Jun  8 01:04:04 orestes kernel: ata4.00: status: { DRDY }
>> Jun  8 01:04:04 orestes kernel: ata4: hard resetting link
>> Jun  8 01:04:05 orestes kernel: ata4: SATA link up 3.0 Gbps (SStatus 123
>> SControl 300)
>> Jun  8 01:04:35 orestes kernel: ata4.00: qc timeout (cmd 0xec)
>> Jun  8 01:04:35 orestes kernel: ata4.00: failed to IDENTIFY (I/O error,
>> err_mask=0x4)
>> Jun  8 01:04:35 orestes kernel: ata4.00: revalidation failed (errno=-5)
>> Jun  8 01:04:35 orestes kernel: ata4: failed to recover some devices, retrying
>> in 5 secs
>> Jun  8 01:04:40 orestes kernel: ata4: hard resetting link
>> Jun  8 01:04:42 orestes kernel: ata4: SATA link up 3.0 Gbps (SStatus 123
>> SControl 300)
>>
>> OK, drives fail. That's what the double-redundancy is for (RAID1+DRBD). But on
>> the primary:
>>
>> Jun  8 01:04:39 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg
>> time expired, ko = 4294967295
>> Jun  8 01:04:45 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg
>> time expired, ko = 4294967294
>> Jun  8 01:04:51 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg
>> time expired, ko = 4294967293
>> Jun  8 01:04:57 hypatia kernel: block drbd1: [drbd1_worker/6650] sock_sendmsg
>> time expired, ko = 4294967292
>> Jun  8 01:05:01 hypatia lrmd: [3988]: WARN: Nagios:monitor process (PID 23641)
>> timed out (try 1).  Killing with signal SIGTERM (15).
>> Jun  8 01:05:01 hypatia lrmd: [3988]: WARN: operation monitor[75] on
>> ocf::Xen::Nagios for client 3991, its parameters: CRM_meta_interval=[10000]
>> xmfile=[/xen/configs/nagios.cfg] CRM_meta_timeout=[30000]
>> crm_feature_set=[3.0.1] CRM_meta_name=[monitor] : pid [23641] timed out
>>
>> I'll quote more of the log file if asked, but since this is a DRBD mailing list
>> I'll describe the HA errors abstractly: "Nagios" is the name of one of my
>> virtual machines whose disk image is on drbd1. One after the other, all the
>> virtual machines on drbd1 time out and fail. Corosync tries to transfer the
>> resources from the good primary to the bad secondary; this fails (NFS issues)
>> and the bad secondary STONITHs the good primary!
>>
>> There's nothing in the log files of the virtual machines; the entries stop at
>> the time of the lrmd timeouts.
>>
>> My questions are:
>>
>> - The drive on ata4 failed, but it was part of a software RAID1. Why would that
>> cause a problem as long as the other drive was OK?
>>
>> - Why would a drive failure on a secondary cause kernel drbd errors on the primary?
>>
>> My only explanation is that maybe I had the syncer rate too high; it was 50M but
>> I reduced it to 15M. Is that a possible cause for this kind of problem?
> global {
> 	usage-count yes;
> }
>
> common {
> 	protocol A;
>
> 	handlers {
> 		pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh;
> /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
> 		pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh;
> /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
> 		local-io-error "/usr/lib/drbd/notify-io-error.sh;
> /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
> 	}
> 	startup {
> 	}
> 	disk {
> 	}
> 	net {
> 		ping-timeout 11;
> 	}
> 	syncer {
> 		rate 15M;
> 	}
> }
>
> resource admin {
>   device    /dev/drbd1;
>   disk      /dev/md2;
>
>   net {
>     after-sb-0pri discard-zero-changes;
>     after-sb-1pri consensus;
>     after-sb-2pri disconnect;
>   }
>   startup {
>     wfc-timeout 60;
>     degr-wfc-timeout 60;
>     outdated-wfc-timeout 60;
>   }
>   handlers {
>     split-brain "/usr/lib/drbd/notify-split-brain.sh sysadmin at nevis.columbia.edu";
>   }
>
>   meta-disk internal;
>
>   on hypatia.nevis.columbia.edu {
>     address   192.168.100.7:7789;
>   }
>   on orestes.nevis.columbia.edu {
>     address   192.168.100.6:7789;
>   }
> }
Hi William

> Setup: Two systems; hypatia is primary, orestes is secondary. OS is Scientific
> Linux 5.5: kernel 2.6.18-194.26.1.el5xen; DRBD version drbd-8.3.8.1-30.el5.

I have a setup similar to the one you describe.
OS is CentOS 5.6: kernel 2.6.18-238.12.1.el5xen; DRBD version
kmod-drbd83-xen-8.3.10-2.el5.elrepo.x86_64

Two machines run Xen; each has one RAID1 block device with LVM on top,
and several logical volumes are used as DRBD resources (simple
primary/secondary, failover is done manually). A week ago an online
verify run triggered a lock-up like the one you described. The secondary
node was hit by a bad disk:

Primary:
Jun 27 02:00:02 xenutrecht kernel: block drbd4: conn( Connected -> VerifyT )
Jun 27 02:00:02 xenutrecht kernel: block drbd4: Online Verify start sector: 0
Jun 27 02:00:45 xenutrecht kernel: block drbd4: [drbd4_worker/6525] sock_sendmsg time expired, ko = 4294967295
...this kept repeating...

And the secondary found a disk error:
Jun 27 02:00:02 xenutrecht02 kernel: block drbd4: conn( Connected ->
VerifyS )
Jun 27 02:00:02 xenutrecht02 kernel: block drbd4: Starting Online Verify
from sector 0
Jun 27 02:00:25 xenutrecht02 kernel: ata1.00: exception Emask 0x0 SAct
0x0 SErr 0x0 action 0x0
Jun 27 02:00:25 xenutrecht02 kernel: ata1.00: BMDMA stat 0x4
Jun 27 02:00:25 xenutrecht02 kernel: ata1.00: cmd
25/00:00:bf:b1:a5/00:04:41:00:00/e0 tag 0 dma 524288 in
Jun 27 02:00:25 xenutrecht02 kernel:          res
51/40:da:e5:b3:a5/40:01:41:00:00/e0 Emask 0x9 (media error)
Jun 27 02:00:25 xenutrecht02 kernel: ata1.00: status: { DRDY ERR }
Jun 27 02:00:25 xenutrecht02 kernel: ata1.00: error: { UNC }
Jun 27 02:00:25 xenutrecht02 kernel: ata1.00: configured for UDMA/133
Jun 27 02:00:25 xenutrecht02 kernel: ata1: EH complete
Jun 27 02:00:26 xenutrecht02 kernel: ata1.00: exception Emask 0x0 SAct
0x0 SErr 0x0 action 0x0
Jun 27 02:00:26 xenutrecht02 kernel: ata1.00: BMDMA stat 0x4
Jun 27 02:00:26 xenutrecht02 kernel: ata1.00: cmd
25/00:00:bf:b1:a5/00:04:41:00:00/e0 tag 0 dma 524288 in
Jun 27 02:00:26 xenutrecht02 kernel:          res
51/40:da:e5:b3:a5/40:01:41:00:00/e0 Emask 0x9 (media error)
Jun 27 02:00:26 xenutrecht02 kernel: ata1.00: status: { DRDY ERR }
Jun 27 02:00:26 xenutrecht02 kernel: ata1.00: error: { UNC }
Jun 27 02:00:26 xenutrecht02 kernel: ata1.00: configured for UDMA/133
Jun 27 02:00:26 xenutrecht02 kernel: ata1: EH complete
Jun 27 02:00:27 xenutrecht02 kernel: ata1.00: exception Emask 0x0 SAct
0x0 SErr 0x0 action 0x0
Jun 27 02:00:27 xenutrecht02 kernel: ata1.00: BMDMA stat 0x4
Jun 27 02:00:27 xenutrecht02 kernel: ata1.00: cmd
25/00:00:bf:b1:a5/00:04:41:00:00/e0 tag 0 dma 524288 in
Jun 27 02:00:27 xenutrecht02 kernel:          res
51/40:da:e5:b3:a5/40:01:41:00:00/e0 Emask 0x9 (media error)
Jun 27 02:00:27 xenutrecht02 kernel: ata1.00: status: { DRDY ERR }
Jun 27 02:00:27 xenutrecht02 kernel: ata1.00: error: { UNC }
Jun 27 02:00:28 xenutrecht02 kernel: ata1.00: configured for UDMA/133
Jun 27 02:00:28 xenutrecht02 kernel: ata1: EH complete
Jun 27 02:00:29 xenutrecht02 kernel: ata1.00: exception Emask 0x0 SAct
0x0 SErr 0x0 action 0x0
Jun 27 02:00:29 xenutrecht02 kernel: ata1.00: BMDMA stat 0x4
Jun 27 02:00:29 xenutrecht02 kernel: ata1.00: cmd
25/00:00:bf:b1:a5/00:04:41:00:00/e0 tag 0 dma 524288 in
Jun 27 02:00:29 xenutrecht02 kernel:          res
51/40:da:e5:b3:a5/40:01:41:00:00/e0 Emask 0x9 (media error)
Jun 27 02:00:29 xenutrecht02 kernel: ata1.00: status: { DRDY ERR }
Jun 27 02:00:29 xenutrecht02 kernel: ata1.00: error: { UNC }
Jun 27 02:00:29 xenutrecht02 kernel: ata1.00: configured for UDMA/133
Jun 27 02:00:29 xenutrecht02 kernel: ata1: EH complete
Jun 27 02:00:30 xenutrecht02 kernel: ata1.00: exception Emask 0x0 SAct
0x0 SErr 0x0 action 0x0
Jun 27 02:00:30 xenutrecht02 kernel: ata1.00: BMDMA stat 0x4
Jun 27 02:00:30 xenutrecht02 kernel: ata1.00: cmd
25/00:00:bf:b1:a5/00:04:41:00:00/e0 tag 0 dma 524288 in
Jun 27 02:00:30 xenutrecht02 kernel:          res
51/40:da:e5:b3:a5/40:01:41:00:00/e0 Emask 0x9 (media error)
Jun 27 02:00:30 xenutrecht02 kernel: ata1.00: status: { DRDY ERR }
Jun 27 02:00:30 xenutrecht02 kernel: ata1.00: error: { UNC }
Jun 27 02:00:30 xenutrecht02 kernel: ata1.00: configured for UDMA/133
Jun 27 02:00:30 xenutrecht02 kernel: ata1: EH complete
Jun 27 02:00:31 xenutrecht02 kernel: ata1.00: exception Emask 0x0 SAct
0x0 SErr 0x0 action 0x0
Jun 27 02:00:31 xenutrecht02 kernel: ata1.00: BMDMA stat 0x4
Jun 27 02:00:31 xenutrecht02 kernel: ata1.00: cmd
25/00:00:bf:b1:a5/00:04:41:00:00/e0 tag 0 dma 524288 in
Jun 27 02:00:31 xenutrecht02 kernel:          res
51/40:da:e5:b3:a5/40:01:41:00:00/e0 Emask 0x9 (media error)
Jun 27 02:00:31 xenutrecht02 kernel: ata1.00: status: { DRDY ERR }
Jun 27 02:00:31 xenutrecht02 kernel: ata1.00: error: { UNC }
Jun 27 02:00:32 xenutrecht02 kernel: ata1.00: configured for UDMA/133
Jun 27 02:00:32 xenutrecht02 kernel: sd 1:0:0:0: SCSI error: return code
= 0x08000002
Jun 27 02:00:32 xenutrecht02 kernel: sdb: Current [descriptor]: sense
key: Medium Error
Jun 27 02:00:32 xenutrecht02 kernel:     Add. Sense: Unrecovered read
error - auto reallocate failed
Jun 27 02:00:32 xenutrecht02 kernel:
Jun 27 02:00:32 xenutrecht02 kernel: Descriptor sense data with sense
descriptors (in hex):
Jun 27 02:00:32 xenutrecht02 kernel:         72 03 11 04 00 00 00 0c 00
0a 80 00 00 00 00 00
Jun 27 02:00:32 xenutrecht02 kernel:         41 a5 b3 e5
Jun 27 02:00:32 xenutrecht02 kernel: end_request: I/O error, dev sdb,
sector 1101378533
Jun 27 02:00:32 xenutrecht02 kernel: printk: 1 messages suppressed.
Jun 27 02:00:32 xenutrecht02 kernel: raid1: sdb1: rescheduling sector
1101378464
Jun 27 02:00:32 xenutrecht02 kernel: raid1: sdb1: rescheduling sector
1101378472
Jun 27 02:00:32 xenutrecht02 kernel: raid1: sdb1: rescheduling sector
1101378480
Jun 27 02:00:32 xenutrecht02 kernel: raid1: sdb1: rescheduling sector
1101378488
Jun 27 02:00:32 xenutrecht02 kernel: raid1: sdb1: rescheduling sector
1101378496
Jun 27 02:00:32 xenutrecht02 kernel: raid1: sdb1: rescheduling sector
1101378504
Jun 27 02:00:32 xenutrecht02 kernel: raid1: sdb1: rescheduling sector
1101378512
Jun 27 02:00:32 xenutrecht02 kernel: raid1: sdb1: rescheduling sector
1101378520
Jun 27 02:00:32 xenutrecht02 kernel: raid1: sdb1: rescheduling sector
1101378528
Jun 27 02:00:32 xenutrecht02 kernel: raid1: sdb1: rescheduling sector
1101378536
Jun 27 02:00:32 xenutrecht02 kernel: ata1: EH complete
Jun 27 02:00:32 xenutrecht02 kernel: SCSI device sdb: 1953525168
512-byte hdwr sectors (1000205 MB)
Jun 27 02:00:33 xenutrecht02 kernel: sdb: Write Protect is off
Jun 27 02:00:33 xenutrecht02 kernel: SCSI device sdb: drive cache: write
back
Jun 27 02:00:34 xenutrecht02 kernel: ata1.00: exception Emask 0x0 SAct
0x0 SErr 0x0 action 0x0
Jun 27 02:00:34 xenutrecht02 kernel: ata1.00: BMDMA stat 0x4
Jun 27 02:00:34 xenutrecht02 kernel: ata1.00: cmd
25/00:00:bf:c1:a5/00:04:41:00:00/e0 tag 0 dma 524288 in
Jun 27 02:00:34 xenutrecht02 kernel:          res
51/40:5a:65:c3:a5/40:02:41:00:00/e0 Emask 0x9 (media error)
Jun 27 02:00:34 xenutrecht02 kernel: ata1.00: status: { DRDY ERR }
Jun 27 02:00:34 xenutrecht02 kernel: ata1.00: error: { UNC }
Jun 27 02:00:35 xenutrecht02 kernel: ata1.00: configured for UDMA/133
Jun 27 02:00:35 xenutrecht02 kernel: ata1: EH complete
Jun 27 02:00:36 xenutrecht02 kernel: SCSI device sdb: 1953525168
512-byte hdwr sectors (1000205 MB)
Jun 27 02:00:36 xenutrecht02 kernel: sdb: Write Protect is off
Jun 27 02:00:36 xenutrecht02 kernel: SCSI device sdb: drive cache: write
back
Jun 27 02:00:36 xenutrecht02 kernel: SCSI device sdb: 1953525168
512-byte hdwr sectors (1000205 MB)
Jun 27 02:00:36 xenutrecht02 kernel: sdb: Write Protect is off
Jun 27 02:00:36 xenutrecht02 kernel: SCSI device sdb: drive cache: write
back
Jun 27 02:00:39 xenutrecht02 kernel: ata1.00: exception Emask 0x0 SAct
0x0 SErr 0x0 action 0x0
Jun 27 02:00:39 xenutrecht02 kernel: ata1.00: BMDMA stat 0x4
Jun 27 02:00:39 xenutrecht02 kernel: ata1.00: cmd
25/00:08:df:b3:a5/00:00:41:00:00/e0 tag 0 dma 4096 in
Jun 27 02:00:39 xenutrecht02 kernel:          res
51/40:02:e5:b3:a5/40:00:41:00:00/e0 Emask 0x9 (media error)
Jun 27 02:00:39 xenutrecht02 kernel: ata1.00: status: { DRDY ERR }
Jun 27 02:00:39 xenutrecht02 kernel: ata1.00: error: { UNC }
Jun 27 02:00:39 xenutrecht02 kernel: ata1.00: configured for UDMA/133
Jun 27 02:00:39 xenutrecht02 kernel: ata1: EH complete
Jun 27 02:00:40 xenutrecht02 kernel: ata1.00: exception Emask 0x0 SAct
0x0 SErr 0x0 action 0x0
Jun 27 02:00:40 xenutrecht02 kernel: ata1.00: BMDMA stat 0x4
Jun 27 02:00:40 xenutrecht02 kernel: ata1.00: cmd
25/00:08:df:b3:a5/00:00:41:00:00/e0 tag 0 dma 4096 in
Jun 27 02:00:40 xenutrecht02 kernel:          res
51/40:02:e5:b3:a5/40:00:41:00:00/e0 Emask 0x9 (media error)
Jun 27 02:00:40 xenutrecht02 kernel: ata1.00: status: { DRDY ERR }
Jun 27 02:00:40 xenutrecht02 kernel: ata1.00: error: { UNC }
Jun 27 02:00:40 xenutrecht02 kernel: ata1.00: configured for UDMA/133
Jun 27 02:00:40 xenutrecht02 kernel: ata1: EH complete
Jun 27 02:00:41 xenutrecht02 kernel: printk: 50 messages suppressed.
Jun 27 02:00:41 xenutrecht02 kernel: raid1: sdc1: redirecting sector
1101378464 to another mirror
Jun 27 02:00:41 xenutrecht02 kernel: SCSI device sdb: 1953525168
512-byte hdwr sectors (1000205 MB)
Jun 27 02:00:41 xenutrecht02 kernel: sdb: Write Protect is off
Jun 27 02:00:41 xenutrecht02 kernel: SCSI device sdb: drive cache: write
back
Jun 27 02:00:41 xenutrecht02 kernel: SCSI device sdb: 1953525168
512-byte hdwr sectors (1000205 MB)
Jun 27 02:00:41 xenutrecht02 kernel: sdb: Write Protect is off
Jun 27 02:00:41 xenutrecht02 kernel: SCSI device sdb: drive cache: write
back
Jun 27 02:00:42 xenutrecht02 kernel: printk: 9 messages suppressed.
Jun 27 02:00:42 xenutrecht02 kernel: raid1: sdb1: redirecting sector
1101378544 to another mirror
Jun 27 02:00:49 xenutrecht02 kernel: block drbd4: [drbd4_worker/3598]
sock_sendmsg time expired, ko = 4294967295
Jun 27 02:00:55 xenutrecht02 kernel: block drbd4: [drbd4_worker/3598]
sock_sendmsg time expired, ko = 4294967294

...and this kept repeating...

So RAID1 handled the disk problem, but DRBD locked up. I had to reboot
the machine (the hard way, with the power button, as a normal reboot
stalled). As I was not sure whether this was a real hardware problem
rather than a cabling issue, I put the disk back into the RAID; the
resulting full resync rewrites the entire disk, and no bad sectors
remained. I changed some cables and tried again.
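
For reference, putting the member back was roughly this (the md device
number is a guess on my part, check /proc/mdstat for the real one; sdb1
is the drive from the logs above). The full resync that follows the
re-add is what rewrites every sector:

    mdadm /dev/md2 --fail /dev/sdb1      # mark the flaky member failed
    mdadm /dev/md2 --remove /dev/sdb1    # pull it out of the array
    mdadm /dev/md2 --add /dev/sdb1       # re-add it; md does a full resync
    cat /proc/mdstat                     # watch the resync progress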

This weekend the disk crashed again. As a result DRBD locked up again,
and this time all virtual machines on both hosts stopped working. qemu
used 100% CPU and the load on both machines went up to 10-20. I was
still able to access the two hosts over ssh, but a normal reboot no
longer worked (I managed to enable sysrq and force a reboot that way).
After the reboot all the virtual machines came back up; note that the
other host was never rebooted. This shows that a failing secondary can
pull down a primary. (Yes, I am now convinced the disk is really bad :-) )
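
What I want to try next (untested, just a sketch based on the drbd.conf
man page for 8.3, and the numbers are only an example) is a finite
ko-count. With ko-count set, a peer that fails to complete a single
write request within ko-count * timeout is declared dead and the
connection goes StandAlone, instead of the primary blocking forever. If
I read our logs right, a counter ticking down from 4294967295 is what
you get with the default ko-count of 0 (disabled):

    net {
        timeout   60;   # unit is tenths of a second; 60 = 6 s, the default
        ko-count  4;    # give up on the peer after 4 * timeout
                        # instead of retrying forever
    }
    disk {
        on-io-error detach;   # detach the local backing device on I/O
                              # errors instead of passing them up
    }

Whether that would really have saved our primaries I cannot say;
on-io-error only matters when DRBD itself sees the I/O error, which it
did not here because md masked it. I assume "drbdadm adjust <resource>"
(or a disconnect/connect cycle) will apply the net changes.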

The disappointment for me is that I use block-device mirroring with
DRBD between two Xen hosts precisely so that I can quickly switch
virtual machines over to the surviving node. Without DRBD, the RAID1
alone would have handled this failure properly. Perhaps DRBD and Xen
just don't get along nicely. I am considering dropping DRBD, as it has
turned out not to be reliable in my setup. Instead I plan to use an
iSCSI disk in a RAID1 mirror with a local disk, with the iSCSI disk
marked write-mostly so that it only receives writes and all reads are
served from the local disk. This gives me the same hardware redundancy
as DRBD, and in case of migration I can attach the iSCSI disk to
another machine. The DRBD way is nicer, but only if it is reliable in
my setup, and I have no clue how to achieve that.
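
For what it is worth, the md side of that plan would look roughly like
this (all device names are placeholders, and --write-behind requires
the write-intent bitmap):

    # /dev/sda3 = local partition, /dev/sdx1 = the iSCSI LUN (placeholders).
    # Devices listed after --write-mostly only receive writes; reads are
    # served from the local disk.
    mdadm --create /dev/md10 --level=1 --raid-devices=2 \
          --bitmap=internal --write-behind=256 \
          /dev/sda3 --write-mostly /dev/sdx1

The optional --write-behind lets md acknowledge writes before the
write-mostly (iSCSI) member has completed them, which should hide some
of the network latency.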

Cheers,
Theo




