On 19-01-16 21:38, Dirk Bonenkamp - ProActive wrote:
> On 19-01-16 11:33, Lars Ellenberg wrote:
>> On Sun, Jan 17, 2016 at 09:42:04AM +0100, Dirk Bonenkamp - ProActive Software wrote:
>>> Hi All,
>>>
>>> I've run into some problems on my DRDB cluster this week. This cluster
>>> has been running fine for over a year. All of a sudden the secondary failed:
>>>
>>> Jan 14 08:37:58 data2 kernel: [4895290.318176] drbd r0: meta connection
>>> shut down by peer.
>>> Jan 14 08:37:58 data2 kernel: [4895290.318361] drbd r0: peer( Primary ->
>>> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>>> Jan 14 08:37:58 data2 kernel: [4895290.318532] drbd r0: asender terminated
>>> Jan 14 08:37:58 data2 kernel: [4895290.318534] drbd r0: Terminating
>>> drbd_a_r0
>>> Jan 14 08:38:07 data2 kernel: [4895298.502391] drbd r0: Connection closed
>>> Jan 14 08:38:07 data2 kernel: [4895298.502405] drbd r0: conn(
>>> NetworkFailure -> Unconnected )
>>> Jan 14 08:38:07 data2 kernel: [4895298.502406] drbd r0: receiver terminated
>>> Jan 14 08:38:07 data2 kernel: [4895298.502408] drbd r0: Restarting
>>> receiver thread
>>> Jan 14 08:38:07 data2 kernel: [4895298.502409] drbd r0: receiver (re)started
>>> Jan 14 08:38:07 data2 kernel: [4895298.502415] drbd r0: conn(
>>> Unconnected -> WFConnection )
>>> Jan 14 08:38:07 data2 kernel: [4895299.002586] drbd r0: Handshake
>>> successful: Agreed network protocol version 101
>>> Jan 14 08:38:07 data2 kernel: [4895299.002592] drbd r0: Agreed to
>>> support TRIM on protocol level
>>> Jan 14 08:38:07 data2 kernel: [4895299.002813] drbd r0: Peer
>>> authenticated using 20 bytes HMAC
>>> Jan 14 08:38:07 data2 kernel: [4895299.002848] drbd r0: conn(
>>> WFConnection -> WFReportParams )
>>> Jan 14 08:38:07 data2 kernel: [4895299.002852] drbd r0: Starting asender
>>> thread (from drbd_r_r0 [3400])
>>>
>>> It would reconnect, sync, and disconnect again.
>>> I stopped the node,
>>> checked the hardware (all seems fine), rebooted and tried to start drbd
>>> again:
>>>
>>> root@data2:/var/log# drbdadm connect r0
>>> r0: Failure: (158) Unknown resource
>> As you probably found out by now,
>> You can only "connect" something that is already there.
>> Try "up", or better yet, "adjust".
>>
> Actually, DRBD is controlled by pacemaker - I (mis)used the command line
> tools to keep things a bit simpler - my mistake.
>
> When I put the node on line (like I did dozens of times before with no
> problem), this is the kernel output:
>
> [45742.318104] drbd: module verification failed: signature and/or
> required key missing - tainting kernel
> [45742.322235] drbd: initialized. Version: 8.4.7-1 (api:1/proto:86-101)
> [45742.322238] drbd: GIT-hash: 3a6a769340ef93b1ba2792c6461250790795db49
> build by root@data2, 2016-01-17 08:27:34
> [45742.322239] drbd: registered as block device major 147
> [45742.408595] drbd r0: Starting worker thread (from drbdsetup-84 [4350])
> [45742.408897] block drbd0: disk( Diskless -> Attaching )
> [45742.409114] drbd r0: Method to ensure write ordering: drain
> [45742.409117] block drbd0: max BIO size = 327680
> [45742.409121] block drbd0: drbd_bm_resize called with capacity ==
> 78127009272
> [45742.738885] block drbd0: resync bitmap: bits=9765876159
> words=152591815 pages=298031
> [45742.738890] block drbd0: size = 36 TB (39063504636 KB)
> [45747.413416] block drbd0: md_sync_timer expired! Worker calls
> drbd_md_sync().
>
> This is the command that 'hangs' (from ps fax):
>
> drbdsetup-84 attach 0 /dev/sda1 /dev/sda1 internal --on-io-error=detach
> --fencing=resource-only --c-plan-ahead=200 --c-max-rate=300M
> --c-fill-target=100M --disk-barrier=no --disk-flushes=no --al-extents=3389
>
> And this is how /proc/drbd looks:
>
> root@data2:~# cat /proc/drbd
> version: 8.4.7-1 (api:1/proto:86-101)
> GIT-hash: 3a6a769340ef93b1ba2792c6461250790795db49 build by root@data2,
> 2016-01-17 08:27:34
> 0: cs:StandAlone ro:Secondary/Unknown ds:Attaching/DUnknown r---d-
> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:2 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
>
> It seems as DRBD has trouble attaching the storage device. Which in this
> case is a disk array controlled by an LSI MegaRAID controller. The
> Megacli64 tools report that everything is fine with the controller and
> the (virtual) drives.
>
> Could this however be a sign of faulty hardware? I did replace the BBU
> recently, but it has worked after this.
>

For completeness: it turned out to be faulty hardware. There was a bad
disk on one of the RAID controllers. It was, however, not kicked out by
the controller (strange, we use these LSI MegaRAID controllers a lot and
they have been very reliable for us). Manually removing the drive restored
the responsiveness of DRBD.

Thanks,
Dirk
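
For anyone who wants to catch this state automatically: the symptom in the
/proc/drbd output quoted above is the local disk state staying at
"Attaching". Below is a minimal Python sketch that parses the DRBD 8.4 style
status line and lists devices whose local disk is in that state. The function
names (parse_proc_drbd, stuck_attaching) are made up for this example, and the
sample text is copied from the output in this thread. It assumes the 8.4
/proc/drbd layout and does not talk to DRBD itself; how long "Attaching" has
to persist before it counts as a problem is left to the caller.

```python
import re

# Per-device status line as it appears in the /proc/drbd output quoted above
# (DRBD 8.4 layout assumed): "<minor>: cs:<conn> ro:<local>/<peer> ds:<local>/<peer> ..."
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+ro:(?P<ro>\S+)\s+ds:(?P<ds>\S+)"
)


def parse_proc_drbd(text):
    """Return one dict per device status line found in /proc/drbd text."""
    devices = []
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match is None:
            continue  # version, GIT-hash and counter lines are skipped
        local_role, _, peer_role = match.group("ro").partition("/")
        local_disk, _, peer_disk = match.group("ds").partition("/")
        devices.append({
            "minor": int(match.group("minor")),
            "cs": match.group("cs"),
            "local_role": local_role,
            "peer_role": peer_role,
            "local_disk": local_disk,
            "peer_disk": peer_disk,
        })
    return devices


def stuck_attaching(devices):
    """Return the minor numbers whose local disk state is 'Attaching'."""
    return [d["minor"] for d in devices if d["local_disk"] == "Attaching"]


# Sample taken from the /proc/drbd output earlier in this thread.
SAMPLE = """\
version: 8.4.7-1 (api:1/proto:86-101)
GIT-hash: 3a6a769340ef93b1ba2792c6461250790795db49 build by root@data2,
2016-01-17 08:27:34
 0: cs:StandAlone ro:Secondary/Unknown ds:Attaching/DUnknown r---d-
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:2 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
"""
```

On a live node one could pass the contents of /proc/drbd to
parse_proc_drbd() twice, some seconds apart, and treat a device that is
reported by stuck_attaching() both times as a hint to look at the backing
storage, as turned out to be the cause here.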