[DRBD-user] Trouble with connecting / starting drbd.

Sat Jan 23 09:42:38 CET 2016

On 19-01-16 21:38, Dirk Bonenkamp - ProActive wrote:
> On 19-01-16 11:33, Lars Ellenberg wrote:
>> On Sun, Jan 17, 2016 at 09:42:04AM +0100, Dirk Bonenkamp - ProActive Software wrote:
>>> Hi All,
>>>
>>> I've run into some problems on my DRBD cluster this week. This cluster
>>> has been running fine for over a year. All of a sudden the secondary failed:
>>>
>>> Jan 14 08:37:58 data2 kernel: [4895290.318176] drbd r0: meta connection
>>> shut down by peer.
>>> Jan 14 08:37:58 data2 kernel: [4895290.318361] drbd r0: peer( Primary ->
>>> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>>> Jan 14 08:37:58 data2 kernel: [4895290.318532] drbd r0: asender terminated
>>> Jan 14 08:37:58 data2 kernel: [4895290.318534] drbd r0: Terminating
>>> drbd_a_r0
>>> Jan 14 08:38:07 data2 kernel: [4895298.502391] drbd r0: Connection closed
>>> Jan 14 08:38:07 data2 kernel: [4895298.502405] drbd r0: conn(
>>> NetworkFailure -> Unconnected )
>>> Jan 14 08:38:07 data2 kernel: [4895298.502406] drbd r0: receiver terminated
>>> Jan 14 08:38:07 data2 kernel: [4895298.502408] drbd r0: Restarting
>>> receiver thread
>>> Jan 14 08:38:07 data2 kernel: [4895298.502409] drbd r0: receiver (re)started
>>> Jan 14 08:38:07 data2 kernel: [4895298.502415] drbd r0: conn(
>>> Unconnected -> WFConnection )
>>> Jan 14 08:38:07 data2 kernel: [4895299.002586] drbd r0: Handshake
>>> successful: Agreed network protocol version 101
>>> Jan 14 08:38:07 data2 kernel: [4895299.002592] drbd r0: Agreed to
>>> support TRIM on protocol level
>>> Jan 14 08:38:07 data2 kernel: [4895299.002813] drbd r0: Peer
>>> authenticated using 20 bytes HMAC
>>> Jan 14 08:38:07 data2 kernel: [4895299.002848] drbd r0: conn(
>>> WFConnection -> WFReportParams )
>>> Jan 14 08:38:07 data2 kernel: [4895299.002852] drbd r0: Starting asender
>>> thread (from drbd_r_r0 [3400])
>>>
>>> It would reconnect, sync, and disconnect again. I stopped the node,
>>> checked the hardware (all seems fine), rebooted and tried to start drbd
>>> again:
>>>
>>> root@data2:/var/log# drbdadm connect r0
>>> r0: Failure: (158) Unknown resource
>> As you probably found out by now,
>> you can only "connect" something that is already there.
>> Try "up", or better yet, "adjust".
>>
> Actually, DRBD is controlled by pacemaker - I (mis)used the command line
> tools to keep things a bit simpler - my mistake.
>
> When I put the node online (like I did dozens of times before with no
> problem), this is the kernel output:
>
> [45742.318104] drbd: module verification failed: signature and/or 
> required key missing - tainting kernel
> [45742.322235] drbd: initialized. Version: 8.4.7-1 (api:1/proto:86-101)
> [45742.322238] drbd: GIT-hash: 3a6a769340ef93b1ba2792c6461250790795db49
> build by root@data2, 2016-01-17 08:27:34
> [45742.322239] drbd: registered as block device major 147
> [45742.408595] drbd r0: Starting worker thread (from drbdsetup-84 [4350])
> [45742.408897] block drbd0: disk( Diskless -> Attaching )
> [45742.409114] drbd r0: Method to ensure write ordering: drain
> [45742.409117] block drbd0: max BIO size = 327680
> [45742.409121] block drbd0: drbd_bm_resize called with capacity ==
> 78127009272
> [45742.738885] block drbd0: resync bitmap: bits=9765876159
> words=152591815 pages=298031
> [45742.738890] block drbd0: size = 36 TB (39063504636 KB)
> [45747.413416] block drbd0: md_sync_timer expired! Worker calls
> drbd_md_sync().
>
> This is the command that 'hangs' (from ps fax):
>
> drbdsetup-84 attach 0 /dev/sda1 /dev/sda1 internal --on-io-error=detach
> --fencing=resource-only --c-plan-ahead=200 --c-max-rate=300M
> --c-fill-target=100M --disk-barrier=no --disk-flushes=no --al-extents=3389
>
> And this is how /proc/drbd looks:
>
> root@data2:~# cat /proc/drbd
> version: 8.4.7-1 (api:1/proto:86-101)
> GIT-hash: 3a6a769340ef93b1ba2792c6461250790795db49 build by root@data2,
> 2016-01-17 08:27:34
>  0: cs:StandAlone ro:Secondary/Unknown ds:Attaching/DUnknown   r---d-
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:2 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
>
> It seems as if DRBD has trouble attaching the storage device, which in
> this case is a disk array controlled by an LSI MegaRAID controller. The
> MegaCli64 tools report that everything is fine with the controller and
> the (virtual) drives.
>
> Could this nevertheless be a sign of faulty hardware? I did replace the
> BBU recently, but the cluster has worked since then.
>
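As a side note, the sizes in the kernel output quoted above are consistent
with each other, so nothing looked wrong with the device geometry itself.
The sketch below redoes the arithmetic in Python. The logged values are
copied from the output above; the 512-byte sector, the 4 KiB of storage per
bitmap bit, the 64-bit bitmap word and the 4 KiB page size are my
assumptions about DRBD 8.4 and this machine, and none of them is printed in
the log.

```python
# Values copied from the kernel log quoted above.
capacity_sectors = 78127009272   # "drbd_bm_resize called with capacity =="
logged_bits = 9765876159         # "resync bitmap: bits="
logged_words = 152591815         # "words="
logged_pages = 298031            # "pages="
logged_kb = 39063504636          # "size = 36 TB (39063504636 KB)"

# Assumptions (not stated in the log): 512-byte sectors, one bitmap bit per
# 4 KiB of storage, 64-bit bitmap words, 4 KiB memory pages.
SECTOR_BYTES = 512
BYTES_PER_BIT = 4096
WORD_BITS = 64
PAGE_BYTES = 4096


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


size_bytes = capacity_sectors * SECTOR_BYTES
size_kb = size_bytes // 1024
bits = ceil_div(size_bytes, BYTES_PER_BIT)
words = ceil_div(bits, WORD_BITS)
pages = ceil_div(words * WORD_BITS // 8, PAGE_BYTES)
size_tib = size_kb // 2**30      # the log rounds this down and prints "TB"

print(size_kb == logged_kb, bits == logged_bits,
      words == logged_words, pages == logged_pages, size_tib)
```

Under these assumptions the in-memory bitmap for this device is a bit over
1 GiB (298031 pages of 4 KiB).
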
For completeness: it turned out to be faulty hardware. I had a bad disk
on one of the RAID controllers. It was, however, not kicked out by the
controller (strange, since we use these LSI MegaRAID controllers a lot and
they have been very reliable for us). Manually removing the drive
restored the responsiveness of DRBD.
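
In case it helps someone who hits the same symptom: a device that stays in
the "Attaching" disk state can be spotted from /proc/drbd without running
any drbdsetup command. Below is a minimal Python sketch that parses the
8.4-style output shown earlier in this thread. The function names
parse_drbd_status and is_attaching are made up for this example, and the
sample text is the output from my earlier mail, not a general format
specification.

```python
import re

# Sample taken from the /proc/drbd output earlier in this thread.
SAMPLE = """\
version: 8.4.7-1 (api:1/proto:86-101)
GIT-hash: 3a6a769340ef93b1ba2792c6461250790795db49 build by root@data2,
2016-01-17 08:27:34
 0: cs:StandAlone ro:Secondary/Unknown ds:Attaching/DUnknown   r---d-
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:2 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
"""


def parse_drbd_status(text):
    """Map each minor number to its cs/ro/ds fields (hypothetical helper)."""
    devices = {}
    for line in text.splitlines():
        # Device lines look like " 0: cs:... ro:... ds:... flags".
        m = re.match(r"\s*(\d+):\s+(.*)", line)
        if not m:
            continue
        tokens = m.group(2).split()
        fields = dict(t.split(":", 1) for t in tokens if ":" in t)
        devices[int(m.group(1))] = {k: fields.get(k) for k in ("cs", "ro", "ds")}
    return devices


def is_attaching(device):
    """True if the local disk state (left of the '/') is 'Attaching'."""
    ds = device.get("ds") or ""
    return ds.split("/")[0] == "Attaching"


status = parse_drbd_status(SAMPLE)
print(status[0], is_attaching(status[0]))
```

On a live node one would pass the contents of /proc/drbd instead of SAMPLE
and check whether the state is still "Attaching" after a reasonable wait.
How long that should be is a judgment call that depends on the storage.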

Thanks,

Dirk