[DRBD-user] MD sets failing under heavy load in a DRBD/Pacemaker Cluster

CoolCold coolthecold at gmail.com
Tue Oct 4 15:22:36 CEST 2011

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


From my point of view it looks like driver/hardware errors, since you
have records like:
Oct  2 11:01:59 node03 kernel: [7370143.442783] end_request: I/O
error, dev sdf, sector 3907028992
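
If you want to rule the drives in or out first, a quick check could
look like this (a sketch only -- assumes smartmontools is installed;
the device name is taken from your log):

  # SMART health status, error log and reallocated-sector counters
  smartctl -a /dev/sdf

  # md superblock state as recorded on the member disk
  mdadm --examine /dev/sdf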


On Tue, Oct 4, 2011 at 4:01 PM, Caspar Smit <c.smit at truebit.nl> wrote:
> Hi all,
>
> We are having a major problem with one of our clusters.
>
> Here's a description of the setup:
>
> 2 Supermicro servers containing the following hardware:
>
> Chassis: SC846E1-R1200B
> Mainboard: X8DTH-6F rev 2.01 (onboard LSI2008 controller disabled
> through jumper)
> CPU: Intel Xeon E5606 @ 2.13GHz, 4 cores
> Memory: 4x KVR1333D3D4R9S/4G (16GB)
> Backplane: SAS846EL1 rev 1.1
> Ethernet: 2x Intel Pro/1000 PT Quad Port Low Profile
> SAS/SATA Controller: LSI 3081E-R (P20, BIOS 6.34.00.00, Firmware 1.32.00.00-IT)
> SAS/SATA JBOD Controller: LSI 3801E (P20, BIOS 6.34.00.00, Firmware
> 1.32.00.00-IT)
> OS Disk: 30GB SSD
> Hard disks: 24x Western Digital 2TB 7200RPM RE4-GP (WD2002FYPS)
>
> Both machines have Debian Lenny (5.0) installed. Here are the versions
> of the packages involved:
>
> drbd/heartbeat/pacemaker are installed from the backports repository.
>
> linux-image-2.6.26-2-amd64   2.6.26-26lenny3
> mdadm                        2.6.7.2-3
> drbd8-2.6.26-2-amd64         2:8.3.7-1~bpo50+1+2.6.26-26lenny3
> drbd8-source                 2:8.3.7-1~bpo50+1
> drbd8-utils                  2:8.3.7-1~bpo50+1
> heartbeat                    1:3.0.3-2~bpo50+1
> pacemaker                    1.0.9.1+hg15626-1~bpo50+1
> iscsitarget                  1.4.20.2 (compiled from the tar.gz)
>
>
> We created 4 MD sets out of the 24 hard disks (/dev/md0 through /dev/md3).
>
> Each is a RAID5 of 5 disks plus 1 hot spare (8TB usable per MD); the
> metadata version of the MD sets is 0.90.
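>
> For illustration, an array of this shape can be created along these
> lines (a sketch only -- device names are examples, not our actual layout):
>
>   mdadm --create /dev/md0 --level=5 --raid-devices=5 \
>       --spare-devices=1 --metadata=0.90 /dev/sd[b-f] /dev/sdg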
>
> For each MD we created a DRBD device to the second node (/dev/drbd4
> through /dev/drbd7); drbd0 through drbd3 were used by disks from a
> JBOD that has since been disconnected (see below).
> (See the attached drbd.conf.txt; it is the individual *.res files combined.)
>
> Each drbd device has its own dedicated 1GbE NIC port.
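>
> A minimal sketch of one such resource, DRBD 8.3 style (hostnames and
> addresses are examples; the real config is in the attachment):
>
>   resource r4 {
>     protocol C;
>     on node03 {
>       device    /dev/drbd4;
>       disk      /dev/md0;
>       address   192.168.4.1:7789;
>       meta-disk internal;
>     }
>     on node04 {
>       device    /dev/drbd4;
>       disk      /dev/md0;
>       address   192.168.4.2:7789;
>       meta-disk internal;
>     }
>   }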
>
> Each drbd device is then exported through iSCSI using IET, managed by
> Pacemaker (see the attached crm-config.txt for the full Pacemaker config).
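>
> Roughly, each export is a pair of primitives along these lines (names,
> IQNs and parameters are illustrative only; the exact resource
> definitions are in the attached crm-config.txt):
>
>   primitive p_target_md0 ocf:heartbeat:iSCSITarget \
>       params implementation="iet" iqn="iqn.2011-10.example:md0" \
>       op monitor interval="30s"
>   primitive p_lu_md0 ocf:heartbeat:iSCSILogicalUnit \
>       params implementation="iet" target_iqn="iqn.2011-10.example:md0" \
>           lun="1" path="/dev/drbd4" \
>       op monitor interval="30s"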
>
>
> Now for the symptoms we are having:
>
> After a number of days (sometimes weeks) the disks in the MD sets
> start failing one after another.
>
> See the attached syslog.txt for details but here are the main entries:
>
> It starts with:
>
> Oct  2 11:01:59 node03 kernel: [7370143.421999] mptbase: ioc0:
> LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00)
> cb_idx mptbase_reply
> Oct  2 11:01:59 node03 kernel: [7370143.435220] mptbase: ioc0:
> LogInfo(0x31181000): Originator={PL}, Code={IO Cancelled Due to
> Recieve Error}, SubCode(0x1000) cb_idx mptbase_reply
> Oct  2 11:01:59 node03 kernel: [7370143.442141] mptbase: ioc0:
> LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000)
> cb_idx mptbase_reply
> Oct  2 11:01:59 node03 kernel: [7370143.442783] end_request: I/O
> error, dev sdf, sector 3907028992
> Oct  2 11:01:59 node03 kernel: [7370143.442783] md: super_written gets
> error=-5, uptodate=0
> Oct  2 11:01:59 node03 kernel: [7370143.442783] raid5: Disk failure on
> sdf, disabling device.
> Oct  2 11:01:59 node03 kernel: [7370143.442783] raid5: Operation
> continuing on 4 devices.
> Oct  2 11:01:59 node03 kernel: [7370143.442820] end_request: I/O
> error, dev sdb, sector 3907028992
> Oct  2 11:01:59 node03 kernel: [7370143.442820] md: super_written gets
> error=-5, uptodate=0
> Oct  2 11:01:59 node03 kernel: [7370143.442820] raid5: Disk failure on
> sdb, disabling device.
> Oct  2 11:01:59 node03 kernel: [7370143.442820] raid5: Operation
> continuing on 3 devices.
> Oct  2 11:01:59 node03 kernel: [7370143.442820] end_request: I/O
> error, dev sdd, sector 3907028992
> Oct  2 11:01:59 node03 kernel: [7370143.442820] md: super_written gets
> error=-5, uptodate=0
> Oct  2 11:01:59 node03 kernel: [7370143.442820] raid5: Disk failure on
> sdd, disabling device.
> Oct  2 11:01:59 node03 kernel: [7370143.442820] raid5: Operation
> continuing on 2 devices.
> Oct  2 11:01:59 node03 kernel: [7370143.470791] mptbase: ioc0:
> LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00)
> cb_idx mptbase_reply
> <snip>
> Oct  2 11:02:00 node03 kernel: [7370143.968976] Buffer I/O error on
> device drbd4, logical block 1651581030
> Oct  2 11:02:00 node03 kernel: [7370143.969056] block drbd4: p write: error=-5
> Oct  2 11:02:00 node03 kernel: [7370143.969126] block drbd4: Local
> WRITE failed sec=21013680s size=4096
> Oct  2 11:02:00 node03 kernel: [7370143.969203] block drbd4: disk(
> UpToDate -> Failed )
> Oct  2 11:02:00 node03 kernel: [7370143.969276] block drbd4: Local IO
> failed in __req_mod.Detaching...
> Oct  2 11:02:00 node03 kernel: [7370143.969492] block drbd4: disk(
> Failed -> Diskless )
> Oct  2 11:02:00 node03 kernel: [7370143.969492] block drbd4: Notified
> peer that my disk is broken.
> Oct  2 11:02:00 node03 kernel: [7370143.970120] block drbd4: Should
> have called drbd_al_complete_io(, 21013680), but my Disk seems to have
> failed :(
> Oct  2 11:02:00 node03 kernel: [7370144.003730] iscsi_trgt:
> fileio_make_request(63) I/O error 4096, -5
> Oct  2 11:02:00 node03 kernel: [7370144.004931] iscsi_trgt:
> fileio_make_request(63) I/O error 4096, -5
> Oct  2 11:02:00 node03 kernel: [7370144.006820] iscsi_trgt:
> fileio_make_request(63) I/O error 4096, -5
> Oct  2 11:02:01 node03 kernel: [7370144.849344] mptbase: ioc0:
> LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00)
> cb_idx mptscsih_io_done
> Oct  2 11:02:01 node03 kernel: [7370144.849451] mptbase: ioc0:
> LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00)
> cb_idx mptscsih_io_done
> Oct  2 11:02:01 node03 kernel: [7370144.849709] mptbase: ioc0:
> LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00)
> cb_idx mptscsih_io_done
> Oct  2 11:02:01 node03 kernel: [7370144.849814] mptbase: ioc0:
> LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00)
> cb_idx mptscsih_io_done
> Oct  2 11:02:01 node03 kernel: [7370144.850077] mptbase: ioc0:
> LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00)
> cb_idx mptscsih_io_done
> <snip>
> Oct  2 11:02:07 node03 kernel: [7370150.918849] mptbase: ioc0: WARNING
> - IOC is in FAULT state (7810h)!!!
> Oct  2 11:02:07 node03 kernel: [7370150.918929] mptbase: ioc0: WARNING
> - Issuing HardReset from mpt_fault_reset_work!!
> Oct  2 11:02:07 node03 kernel: [7370150.919027] mptbase: ioc0:
> Initiating recovery
> Oct  2 11:02:07 node03 kernel: [7370150.919098] mptbase: ioc0: WARNING
> - IOC is in FAULT state!!!
> Oct  2 11:02:07 node03 kernel: [7370150.919171] mptbase: ioc0: WARNING
> -            FAULT code = 7810h
> Oct  2 11:02:10 node03 kernel: [7370154.041934] mptbase: ioc0:
> Recovered from IOC FAULT
> Oct  2 11:02:16 node03 cib: [5734]: WARN: send_ipc_message: IPC
> Channel to 23559 is not connected
> Oct  2 11:02:21 node03 iSCSITarget[9060]: [9069]: WARNING:
> Configuration parameter "portals" is not supported by the iSCSI
> implementation and will be ignored.
> Oct  2 11:02:22 node03 kernel: [7370166.353087] mptbase: ioc0: WARNING
> - mpt_fault_reset_work: HardReset: success
>
>
> This results in 3 MDs where all disks have failed [_____] and 1 MD
> that survives and rebuilds onto its spare.
> 3 drbd devices are Diskless/UpToDate and the survivor is UpToDate/UpToDate.
> The weird thing about all this is that there is always 1 MD set that
> "survives" the FAULT state of the controller!
> Luckily DRBD redirects all reads/writes to the second node, so there
> is no downtime.
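>
> The state is easy to see in /proc/drbd: a ds:Diskless/UpToDate entry
> means the local disk is detached and all I/O is served by the peer.
> Illustrative output (not a verbatim capture from our nodes):
>
>   cat /proc/drbd
>    4: cs:Connected ro:Primary/Secondary ds:Diskless/UpToDate C r----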
>
>
> Our findings:
>
> 1) It seems to happen only under heavy load.
>
> 2) It seems to happen only when DRBD is connected (luckily, we have
> not had any failing MDs yet while DRBD was disconnected!)
>
> 3) It seems to happen only on the primary node.
>
> 4) It does not look like a hardware problem, because there is always
> one MD that survives; if this were hardware related I would expect
> ALL disks/MDs to fail.
> Furthermore, the disks are not broken: we can assemble the arrays
> again after it happens and they resync just fine (see the reassembly
> sketch after this list).
>
> 5) I see that there is a new kernel version (2.6.26-27) available, and
> its changelog has a fair number of fixes related to MD. Although the
> symptoms we are seeing differ from the described fixes, it could be
> related. Can anyone tell whether these issues are related to the fixes
> in the newest kernel image?
>
> 6) In the past we had a Dell MD1000 JBOD connected to the LSI 3801E
> controller on both nodes and saw the same problem, with every disk
> (from the JBOD only) failing, so we disconnected the JBOD. The
> controller stayed inside the server.
>
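> As noted in finding 4, recovery afterwards is straightforward; roughly
> (a sketch only -- device and resource names are examples):
>
>   # reassemble the failed set from its member disks
>   mdadm --assemble --force /dev/md0 /dev/sd[b-f]
>
>   # reattach the local backing disk so DRBD resyncs from the peer
>   drbdadm attach r4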
>
> Things we tried so far:
>
> 1) We replaced the LSI 3081E-R controller with another, but to no avail
> (and we have another, identical cluster suffering from this problem).
>
> 2) Instead of the stock Lenny mptsas driver (v3.04.06) we used the
> latest official LSI mptsas driver (v4.26.00.00) from the LSI website,
> following KB article 16387
> (kb.lsi.com/KnowledgebaseArticle16387.aspx). Still to no avail: it
> happens with that driver too.
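>
> For anyone comparing notes, the driver version actually loaded can be
> checked with, e.g.:
>
>   modinfo mptsas | grep -i version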
>
>
> Things that might be related:
>
> 1) We are using the deadline IO scheduler, as recommended by IETD (set
> as shown in the sketch after this list).
>
> 2) We suspect that the LSI 3801E controller might interfere with the
> LSI 3081E-R, so we are planning to remove the unused LSI 3801E
> controllers.
> Is there a known issue when both controllers are used in the same
> machine? They have the same firmware/BIOS version, and the Linux
> driver (mptsas) is the same for both controllers.
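>
> The scheduler from point 1 is set per device at runtime; a sketch
> (device name is an example):
>
>   echo deadline > /sys/block/sdb/queue/scheduler
>   cat /sys/block/sdb/queue/scheduler   # the active one is bracketed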
>
> Kind regards,
>
> Caspar Smit
> Systemengineer
> True Bit Resources B.V.
> Ampèrestraat 13E
> 1446 TP  Purmerend
>
> T: +31(0)299 410 475
> F: +31(0)299 410 476
> @: c.smit at truebit.nl
> W: www.truebit.nl
>



-- 
Best regards,
[COOLCOLD-RIPN]


