Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 19-01-16 11:33, Lars Ellenberg wrote:
> On Sun, Jan 17, 2016 at 09:42:04AM +0100, Dirk Bonenkamp - ProActive Software wrote:
>> Hi All,
>>
>> I've run into some problems on my DRBD cluster this week. This cluster
>> has been running fine for over a year. All of a sudden the secondary failed:
>>
>> Jan 14 08:37:58 data2 kernel: [4895290.318176] drbd r0: meta connection shut down by peer.
>> Jan 14 08:37:58 data2 kernel: [4895290.318361] drbd r0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>> Jan 14 08:37:58 data2 kernel: [4895290.318532] drbd r0: asender terminated
>> Jan 14 08:37:58 data2 kernel: [4895290.318534] drbd r0: Terminating drbd_a_r0
>> Jan 14 08:38:07 data2 kernel: [4895298.502391] drbd r0: Connection closed
>> Jan 14 08:38:07 data2 kernel: [4895298.502405] drbd r0: conn( NetworkFailure -> Unconnected )
>> Jan 14 08:38:07 data2 kernel: [4895298.502406] drbd r0: receiver terminated
>> Jan 14 08:38:07 data2 kernel: [4895298.502408] drbd r0: Restarting receiver thread
>> Jan 14 08:38:07 data2 kernel: [4895298.502409] drbd r0: receiver (re)started
>> Jan 14 08:38:07 data2 kernel: [4895298.502415] drbd r0: conn( Unconnected -> WFConnection )
>> Jan 14 08:38:07 data2 kernel: [4895299.002586] drbd r0: Handshake successful: Agreed network protocol version 101
>> Jan 14 08:38:07 data2 kernel: [4895299.002592] drbd r0: Agreed to support TRIM on protocol level
>> Jan 14 08:38:07 data2 kernel: [4895299.002813] drbd r0: Peer authenticated using 20 bytes HMAC
>> Jan 14 08:38:07 data2 kernel: [4895299.002848] drbd r0: conn( WFConnection -> WFReportParams )
>> Jan 14 08:38:07 data2 kernel: [4895299.002852] drbd r0: Starting asender thread (from drbd_r_r0 [3400])
>>
>> It would reconnect, sync, and disconnect again. I stopped the node,
>> checked the hardware (all seems fine), rebooted and tried to start drbd again:
>>
>> root@data2:/var/log# drbdadm connect r0
>> r0: Failure: (158) Unknown resource
>
> As you probably found out by now,
> You can only "connect" something that is already there.
> Try "up", or better yet, "adjust".

Actually, DRBD is controlled by Pacemaker - I (mis)used the command line tools to keep things a bit simpler, my mistake. When I put the node online (as I have done dozens of times before without any problem), this is the kernel output:

[45742.318104] drbd: module verification failed: signature and/or required key missing - tainting kernel
[45742.322235] drbd: initialized. Version: 8.4.7-1 (api:1/proto:86-101)
[45742.322238] drbd: GIT-hash: 3a6a769340ef93b1ba2792c6461250790795db49 build by root@data2, 2016-01-17 08:27:34
[45742.322239] drbd: registered as block device major 147
[45742.408595] drbd r0: Starting worker thread (from drbdsetup-84 [4350])
[45742.408897] block drbd0: disk( Diskless -> Attaching )
[45742.409114] drbd r0: Method to ensure write ordering: drain
[45742.409117] block drbd0: max BIO size = 327680
[45742.409121] block drbd0: drbd_bm_resize called with capacity == 78127009272
[45742.738885] block drbd0: resync bitmap: bits=9765876159 words=152591815 pages=298031
[45742.738890] block drbd0: size = 36 TB (39063504636 KB)
[45747.413416] block drbd0: md_sync_timer expired! Worker calls drbd_md_sync().
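For the archive: as far as I understand it, the distinction Lars makes comes down to roughly this (drbdadm 8.4 syntax, resource r0 as above) - Pacemaker's start amounts to the "up"/"adjust" path, and the log above shows it stopping in the middle of the attach step:

    drbdadm up r0       # create the resource in the kernel, attach the backing disk, then connect
    drbdadm adjust r0   # reconcile the running state with drbd.conf; safe to re-run
    drbdadm connect r0  # network step only - fails with "(158) Unknown resource" if the resource was never brought up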
This is the command that 'hangs' (from ps fax):

drbdsetup-84 attach 0 /dev/sda1 /dev/sda1 internal --on-io-error=detach --fencing=resource-only --c-plan-ahead=200 --c-max-rate=300M --c-fill-target=100M --disk-barrier=no --disk-flushes=no --al-extents=3389

And this is how /proc/drbd looks:

root@data2:~# cat /proc/drbd
version: 8.4.7-1 (api:1/proto:86-101)
GIT-hash: 3a6a769340ef93b1ba2792c6461250790795db49 build by root@data2, 2016-01-17 08:27:34
 0: cs:StandAlone ro:Secondary/Unknown ds:Attaching/DUnknown r---d-
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:2 pe:0 ua:0 ap:0 ep:1 wo:d oos:0

It seems as if DRBD has trouble attaching the storage device, which in this case is a disk array behind an LSI MegaRAID controller. The MegaCli64 tools report that everything is fine with the controller and the (virtual) drives. Could this nevertheless be a sign of faulty hardware? I did replace the BBU recently, but everything has worked fine since then.

Thanks!

Dirk
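PS: For reference, these are the kind of generic checks against the backing device that might narrow this down while the attach is stuck (device name and process pattern taken from this setup, purely illustrative):

    # does the array still answer direct reads at all?
    dd if=/dev/sda1 of=/dev/null bs=1M count=64 iflag=direct
    # internal meta-data sits at the end of the device, so also read a few MB from the tail
    dd if=/dev/sda1 of=/dev/null bs=1M count=4 iflag=direct skip=$(( $(blockdev --getsize64 /dev/sda1) / 1048576 - 4 ))
    # any megaraid/SCSI errors or "blocked for more than 120 seconds" hung-task warnings?
    dmesg | tail -n 100
    # where in the kernel the attach process is actually blocked (needs /proc/<pid>/stack support)
    cat /proc/$(pgrep -f 'drbdsetup-84 attach')/stack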