Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 19-01-16 11:33, Lars Ellenberg wrote:
> On Sun, Jan 17, 2016 at 09:42:04AM +0100, Dirk Bonenkamp - ProActive Software wrote:
>> Hi All,
>>
>> I've run into some problems on my DRBD cluster this week. This cluster
>> has been running fine for over a year. All of a sudden the secondary failed:
>>
>> Jan 14 08:37:58 data2 kernel: [4895290.318176] drbd r0: meta connection shut down by peer.
>> Jan 14 08:37:58 data2 kernel: [4895290.318361] drbd r0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>> Jan 14 08:37:58 data2 kernel: [4895290.318532] drbd r0: asender terminated
>> Jan 14 08:37:58 data2 kernel: [4895290.318534] drbd r0: Terminating drbd_a_r0
>> Jan 14 08:38:07 data2 kernel: [4895298.502391] drbd r0: Connection closed
>> Jan 14 08:38:07 data2 kernel: [4895298.502405] drbd r0: conn( NetworkFailure -> Unconnected )
>> Jan 14 08:38:07 data2 kernel: [4895298.502406] drbd r0: receiver terminated
>> Jan 14 08:38:07 data2 kernel: [4895298.502408] drbd r0: Restarting receiver thread
>> Jan 14 08:38:07 data2 kernel: [4895298.502409] drbd r0: receiver (re)started
>> Jan 14 08:38:07 data2 kernel: [4895298.502415] drbd r0: conn( Unconnected -> WFConnection )
>> Jan 14 08:38:07 data2 kernel: [4895299.002586] drbd r0: Handshake successful: Agreed network protocol version 101
>> Jan 14 08:38:07 data2 kernel: [4895299.002592] drbd r0: Agreed to support TRIM on protocol level
>> Jan 14 08:38:07 data2 kernel: [4895299.002813] drbd r0: Peer authenticated using 20 bytes HMAC
>> Jan 14 08:38:07 data2 kernel: [4895299.002848] drbd r0: conn( WFConnection -> WFReportParams )
>> Jan 14 08:38:07 data2 kernel: [4895299.002852] drbd r0: Starting asender thread (from drbd_r_r0 [3400])
>>
>> It would reconnect, sync, and disconnect again. I stopped the node,
>> checked the hardware (all seems fine), rebooted and tried to start drbd again:
>>
>> root@data2:/var/log# drbdadm connect r0
>> r0: Failure: (158) Unknown resource
>
> As you probably found out by now,
> You can only "connect" something that is already there.
> Try "up", or better yet, "adjust".

Actually, DRBD is controlled by Pacemaker - I (mis)used the command line tools to keep things a bit simpler, my mistake. When I put the node online (as I have done dozens of times before without any problem), this is the kernel output:

[45742.318104] drbd: module verification failed: signature and/or required key missing - tainting kernel
[45742.322235] drbd: initialized. Version: 8.4.7-1 (api:1/proto:86-101)
[45742.322238] drbd: GIT-hash: 3a6a769340ef93b1ba2792c6461250790795db49 build by root@data2, 2016-01-17 08:27:34
[45742.322239] drbd: registered as block device major 147
[45742.408595] drbd r0: Starting worker thread (from drbdsetup-84 [4350])
[45742.408897] block drbd0: disk( Diskless -> Attaching )
[45742.409114] drbd r0: Method to ensure write ordering: drain
[45742.409117] block drbd0: max BIO size = 327680
[45742.409121] block drbd0: drbd_bm_resize called with capacity == 78127009272
[45742.738885] block drbd0: resync bitmap: bits=9765876159 words=152591815 pages=298031
[45742.738890] block drbd0: size = 36 TB (39063504636 KB)
[45747.413416] block drbd0: md_sync_timer expired! Worker calls drbd_md_sync().
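For the archive: as far as I understand it, the distinction Lars makes comes down to roughly this (drbdadm 8.4 syntax, resource r0 as above) - Pacemaker's start amounts to the "up"/"adjust" path, and the log above shows it stopping in the middle of the attach step:

    drbdadm up r0       # create the resource in the kernel, attach the backing disk, then connect
    drbdadm adjust r0   # reconcile the running state with drbd.conf; safe to re-run
    drbdadm connect r0  # network step only - fails with "(158) Unknown resource" if the resource was never brought up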
This is the command that 'hangs' (from ps fax):

drbdsetup-84 attach 0 /dev/sda1 /dev/sda1 internal --on-io-error=detach --fencing=resource-only --c-plan-ahead=200 --c-max-rate=300M --c-fill-target=100M --disk-barrier=no --disk-flushes=no --al-extents=3389

And this is how /proc/drbd looks:

root@data2:~# cat /proc/drbd
version: 8.4.7-1 (api:1/proto:86-101)
GIT-hash: 3a6a769340ef93b1ba2792c6461250790795db49 build by root@data2, 2016-01-17 08:27:34
 0: cs:StandAlone ro:Secondary/Unknown ds:Attaching/DUnknown r---d-
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:2 pe:0 ua:0 ap:0 ep:1 wo:d oos:0

It seems as if DRBD has trouble attaching the storage device, which in this case is a disk array behind an LSI MegaRAID controller. The MegaCli64 tools report that everything is fine with the controller and the (virtual) drives. Could this nevertheless be a sign of faulty hardware? I did replace the BBU recently, but everything has worked fine since then.

Thanks!

Dirk
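PS: For reference, these are the kind of generic checks against the backing device that might narrow this down while the attach is stuck (device name and process pattern taken from this setup, purely illustrative):

    # does the array still answer direct reads at all?
    dd if=/dev/sda1 of=/dev/null bs=1M count=64 iflag=direct
    # internal meta-data sits at the end of the device, so also read a few MB from the tail
    dd if=/dev/sda1 of=/dev/null bs=1M count=4 iflag=direct skip=$(( $(blockdev --getsize64 /dev/sda1) / 1048576 - 4 ))
    # any megaraid/SCSI errors or "blocked for more than 120 seconds" hung-task warnings?
    dmesg | tail -n 100
    # where in the kernel the attach process is actually blocked (needs /proc/<pid>/stack support)
    cat /proc/$(pgrep -f 'drbdsetup-84 attach')/stack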