[DRBD-user] drbd issue?

Lars Ellenberg lars.ellenberg at linbit.com
Thu Aug 30 10:46:33 CEST 2018


On Wed, Aug 29, 2018 at 11:33:26AM +0000, Nicolas wrote:
> Hello
> 
> Sorry for the misunderstanding of utils version.
> 
> I'm using the kernel : 4.9.88-1+deb9u1 (4.9.0-6-amd64 debian).
> And the module version v8.4.7.
> srcversion: 0904DF2CCF7283ACE07D07A

Not that I think it has anything to do with this particular issue,
but I'd suggest you upgrade to 8.4.11 anyways.

> For example when a node says:
> 
> [Tue Aug 28 14:32:38 2018] drbd resource10: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown ) 
> [Tue Aug 28 14:32:38 2018] drbd resource10: ack_receiver terminated
> [Tue Aug 28 14:32:38 2018] drbd resource10: Terminating drbd_a_resource
> [Tue Aug 28 14:32:38 2018] drbd resource10: Connection closed
> [Tue Aug 28 14:32:38 2018] drbd resource10: conn( Disconnecting -> StandAlone ) 
> [Tue Aug 28 14:32:38 2018] drbd resource10: receiver terminated
> [Tue Aug 28 14:32:38 2018] drbd resource10: Terminating drbd_r_resource
> [Tue Aug 28 14:32:38 2018] block drbd10: disk( UpToDate -> Failed ) 
> [Tue Aug 28 14:32:38 2018] block drbd10: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> [Tue Aug 28 14:32:38 2018] block drbd10: disk( Failed -> Diskless ) 
> [Tue Aug 28 14:32:38 2018] drbd resource10: Terminating drbd_w_resource
> [Tue Aug 28 14:32:40 2018] drbd resource10: Starting worker thread (from drbdsetup-84 [10222])

Okay. So this is "someone or something" doing a "drbdadm down ; drbdadm up"

> The second says:
> 
> [Tue Aug 28 14:35:33 2018] br0: port 8(tap6) entered disabled state
> [Tue Aug 28 14:35:33 2018] device tap6 left promiscuous mode

Uhm, time stamps do not match the excerpt above.

> [Tue Aug 28 14:35:33 2018] br0: port 8(tap6) entered disabled state
> [Tue Aug 28 14:35:37 2018] drbd resource10: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown ) 
> [Tue Aug 28 14:35:37 2018] drbd resource10: ack_receiver terminated
> [Tue Aug 28 14:35:37 2018] drbd resource10: Terminating drbd_a_resource
> [Tue Aug 28 14:35:37 2018] block drbd10: new current UUID 629F1036CD6CA2AF:0748EE11C429D3B5:FDAEFCD2E8D9890B:FDADFCD2E8D9890B
> [Tue Aug 28 14:35:37 2018] drbd resource10: Connection closed
> [Tue Aug 28 14:35:37 2018] drbd resource10: conn( TearDown -> Unconnected ) 
> [Tue Aug 28 14:35:37 2018] drbd resource10: receiver terminated
> [Tue Aug 28 14:35:37 2018] drbd resource10: Restarting receiver thread
> [Tue Aug 28 14:35:37 2018] drbd resource10: receiver (re)started
> [Tue Aug 28 14:35:37 2018] drbd resource10: conn( Unconnected -> WFConnection ) 

This is "peer node disconnected for some reason".

> [Tue Aug 28 14:35:38 2018] block drbd10: role( Primary -> Secondary ) 
> [Tue Aug 28 14:35:38 2018] block drbd10: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> [Tue Aug 28 14:35:38 2018] drbd resource10: conn( WFConnection -> Disconnecting ) 
> [Tue Aug 28 14:35:38 2018] drbd resource10: Discarding network configuration.
> [Tue Aug 28 14:35:38 2018] drbd resource10: Connection closed
> [Tue Aug 28 14:35:38 2018] drbd resource10: conn( Disconnecting -> StandAlone ) 
> [Tue Aug 28 14:35:38 2018] drbd resource10: receiver terminated
> [Tue Aug 28 14:35:38 2018] drbd resource10: Terminating drbd_r_resource
> [Tue Aug 28 14:35:38 2018] block drbd10: disk( UpToDate -> Failed ) 
> [Tue Aug 28 14:35:38 2018] block drbd10: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> [Tue Aug 28 14:35:38 2018] block drbd10: disk( Failed -> Diskless ) 
> [Tue Aug 28 14:35:38 2018] drbd resource10: Terminating drbd_w_resource

And again, this is a "drbdadm down ; drbdadm up"


> And it seems for this example the second node was the origin of this. 
> This night I got another error, saying network failure, but I'm sure there was no network issue:
> 
> First node: 
> 
> [Wed Aug 29 01:39:48 2018] drbd resource0: meta connection shut down by peer.
> [Wed Aug 29 01:39:48 2018] drbd resource0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) 
...

peer node shut down the connection,
and as a result this node goes through a state called NetworkFailure,
then all the motions,
then reconnects,
and syncs up.
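
The three patterns read off the logs so far can be told apart mechanically. Below is a rough Python sketch, not an official tool: it assumes the "name( Old -> New )" format exactly as it appears in the excerpts above, and the function names and labels (transitions, classify, "down", ...) are made up for illustration. Feed it one contiguous teardown sequence at a time.

```python
import re

# One "name( Old -> New )" state change, as printed in the excerpts above.
TRANSITION = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")


def transitions(line):
    """Return {field: (old, new)} for every state change on one log line."""
    return {name: (old, new) for name, old, new in TRANSITION.findall(line)}


def classify(lines):
    """Heuristic label for one teardown sequence (illustrative only)."""
    conn_states = set()
    disk_states = set()
    for line in lines:
        changes = transitions(line)
        if "conn" in changes:
            conn_states.add(changes["conn"][1])
        if "disk" in changes:
            disk_states.add(changes["disk"][1])
    # Network config discarded AND disk detached: looks like "drbdadm down".
    if "StandAlone" in conn_states and "Diskless" in disk_states:
        return "down"
    # This node noticed the connection was lost.
    if "NetworkFailure" in conn_states:
        return "network failure"
    # The peer closed the connection.
    if "TearDown" in conn_states:
        return "peer closed connection"
    return "unknown"


if __name__ == "__main__":
    sample = [
        "drbd resource10: conn( Disconnecting -> StandAlone ) ",
        "block drbd10: disk( Failed -> Diskless ) ",
    ]
    print(classify(sample))
```

This only looks at the new conn and disk states, so it says nothing about who or what triggered the down; for that you still have to look at what was running on the node at that time.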

> Second node:
> 
> [Wed Aug 29 01:42:48 2018] drbd resource0: PingAck did not arrive in time.

Again, time stamps do not match up.
But there is your reason for this incident: "PingAck did not arrive in time".

Find out why, or simply increase the ping ack timeout.
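
For the second option, a minimal drbd.conf sketch, assuming DRBD 8.4 syntax, where ping-timeout goes in the net section and is given in tenths of a second (default 5, that is 500 ms). The value 20 is only an illustrative guess, not a recommendation, and the resource name is taken from the log above.

```
resource resource0 {
    net {
        ping-timeout 20;   # tenths of a second, so 2 s (default: 5 = 500 ms)
    }
    # ... existing options and "on <host>" sections stay unchanged ...
}
```

With hand-written config files, "drbdadm adjust resource0" should apply the change. If the resources are set up by ganeti rather than from such files, the option would have to be passed in through ganeti's own DRBD parameters instead, so this file-based form may not apply directly.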

> -------- Forwarded message -------
> From: "Lars Ellenberg" <lars.ellenberg at linbit.com>
> To: drbd-user at lists.linbit.com
> Sent: 29 August 2018 12:09
> Subject: Re: [DRBD-user] drbd issue?
> 
> On Tue, Aug 28, 2018 at 02:43:47PM +0000, Nicolas wrote:
> > Hi
> >
> > I'm using some servers on debian with ganeti and drbd.
> >
> > Since I've upgraded them to debian 9, and drbd 8.9.10-2 (from debian repo).
>
> "drbd 8.9.10" is the *utils* version
> (drbdadm, drbdsetup, drbdmeta, various scripts ...)
>
> drbd utils version is meanwhile at 9.5.0, btw. And no, that has not
> much to do with what DRBD kernel module driver version you are using,
> since we ship the "unified utils" for both "drbd 8" and "drbd 9",
> which started years ago already, the utils version is decoupled from
> the module versions.
>
> What kernel version,
> and what DRBD module version?
>
> Maybe you want to make sure you use the latest 8.4 version (8.4.11
> currently), and not whatever "ships with the debian kernel"?
>
> > I got a lot of issues with my drbd resources, I got randomly on my dmesg some resources disconnected:
> >
> > today for example:
> >
> > [Tue Aug 28 14:32:38 2018] drbd resource10: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown ) 
>
> Well, what does the other node say?
> Hit some timeouts?
> Some strangeness with the new NIC drivers?
> A bug in the "shipped with the debian kernel" DRBD version?


-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed

