[DRBD-user] drbd issue?

Nicolas nicolas at shivaserv.fr
Thu Aug 30 16:12:18 CEST 2018


Thanks a lot for your answer.

I will check if I can update the drbd module on debian.

For the logs saying "Connected -> TearDown", it appears Ganeti did it.
It happens when someone moves a virtual machine from one node to another, which also explains the tapX interfaces going up/down.

As for the PingAck, I still need to check ;)

Thanks ;)

Nicolas

On 30 August 2018 10:46, "Lars Ellenberg" <lars.ellenberg at linbit.com> wrote:

On Wed, Aug 29, 2018 at 11:33:26AM +0000, Nicolas wrote:
> Hello

> Sorry for the misunderstanding about the utils version.
>
> I'm using the kernel: 4.9.88-1+deb9u1 (4.9.0-6-amd64 Debian).
> And the module version v8.4.7.
> srcversion: 0904DF2CCF7283ACE07D07A

Not that I think it has anything to do with this particular issue,
but I'd suggest you upgrade to 8.4.11 anyway.

> For example when a node says:
>
> [Tue Aug 28 14:32:38 2018] drbd resource10: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
> [Tue Aug 28 14:32:38 2018] drbd resource10: ack_receiver terminated
> [Tue Aug 28 14:32:38 2018] drbd resource10: Terminating drbd_a_resource
> [Tue Aug 28 14:32:38 2018] drbd resource10: Connection closed
> [Tue Aug 28 14:32:38 2018] drbd resource10: conn( Disconnecting -> StandAlone )
> [Tue Aug 28 14:32:38 2018] drbd resource10: receiver terminated
> [Tue Aug 28 14:32:38 2018] drbd resource10: Terminating drbd_r_resource
> [Tue Aug 28 14:32:38 2018] block drbd10: disk( UpToDate -> Failed )
> [Tue Aug 28 14:32:38 2018] block drbd10: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> [Tue Aug 28 14:32:38 2018] block drbd10: disk( Failed -> Diskless )
> [Tue Aug 28 14:32:38 2018] drbd resource10: Terminating drbd_w_resource
> [Tue Aug 28 14:32:40 2018] drbd resource10: Starting worker thread (from drbdsetup-84 [10222])

Okay. So this is "someone or something" doing a "drbdadm down ; drbdadm up"

> The second says:
>
> [Tue Aug 28 14:35:33 2018] br0: port 8(tap6) entered disabled state
> [Tue Aug 28 14:35:33 2018] device tap6 left promiscuous mode

Uhm, time stamps do not match the excerpt above.

> [Tue Aug 28 14:35:33 2018] br0: port 8(tap6) entered disabled state
> [Tue Aug 28 14:35:37 2018] drbd resource10: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
> [Tue Aug 28 14:35:37 2018] drbd resource10: ack_receiver terminated
> [Tue Aug 28 14:35:37 2018] drbd resource10: Terminating drbd_a_resource
> [Tue Aug 28 14:35:37 2018] block drbd10: new current UUID 629F1036CD6CA2AF:0748EE11C429D3B5:FDAEFCD2E8D9890B:FDADFCD2E8D9890B
> [Tue Aug 28 14:35:37 2018] drbd resource10: Connection closed
> [Tue Aug 28 14:35:37 2018] drbd resource10: conn( TearDown -> Unconnected )
> [Tue Aug 28 14:35:37 2018] drbd resource10: receiver terminated
> [Tue Aug 28 14:35:37 2018] drbd resource10: Restarting receiver thread
> [Tue Aug 28 14:35:37 2018] drbd resource10: receiver (re)started
> [Tue Aug 28 14:35:37 2018] drbd resource10: conn( Unconnected -> WFConnection )

This is "peer node disconnected for some reason".

> [Tue Aug 28 14:35:38 2018] block drbd10: role( Primary -> Secondary )
> [Tue Aug 28 14:35:38 2018] block drbd10: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> [Tue Aug 28 14:35:38 2018] drbd resource10: conn( WFConnection -> Disconnecting )
> [Tue Aug 28 14:35:38 2018] drbd resource10: Discarding network configuration.
> [Tue Aug 28 14:35:38 2018] drbd resource10: Connection closed
> [Tue Aug 28 14:35:38 2018] drbd resource10: conn( Disconnecting -> StandAlone )
> [Tue Aug 28 14:35:38 2018] drbd resource10: receiver terminated
> [Tue Aug 28 14:35:38 2018] drbd resource10: Terminating drbd_r_resource
> [Tue Aug 28 14:35:38 2018] block drbd10: disk( UpToDate -> Failed )
> [Tue Aug 28 14:35:38 2018] block drbd10: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> [Tue Aug 28 14:35:38 2018] block drbd10: disk( Failed -> Diskless )
> [Tue Aug 28 14:35:38 2018] drbd resource10: Terminating drbd_w_resource

And again, this is a "drbdadm down ; drbdadm up"
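
To line up what each node logged, one option is to pull only the state transitions (the `field( Old -> New )` groups) out of the dmesg output of both nodes. The following is a minimal Python sketch, assuming the `dmesg -T` line format quoted above; `parse_transitions` and `Transition` are hypothetical names, not part of drbd-utils.

```python
import re
from typing import List, NamedTuple

# Assumed line format (from the `dmesg -T` excerpts in this thread):
# "[Tue Aug 28 14:35:37 2018] drbd resource10: conn( TearDown -> Unconnected )"
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?:drbd|block)\s+(?P<obj>\S+):\s+(?P<msg>.*)$"
)
# One "field( Old -> New )" group, e.g. "pdsk( UpToDate -> DUnknown )".
STATE_RE = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")


class Transition(NamedTuple):
    timestamp: str
    obj: str    # resource ("resource10") or minor device ("drbd10")
    field: str  # conn, peer, pdsk, disk, role, ...
    old: str
    new: str


def parse_transitions(lines: List[str]) -> List[Transition]:
    """Hypothetical helper: extract DRBD state transitions from dmesg lines."""
    transitions: List[Transition] = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a DRBD line (e.g. bridge/tap messages)
        for field, old, new in STATE_RE.findall(m.group("msg")):
            transitions.append(
                Transition(m.group("ts"), m.group("obj"), field, old, new)
            )
    return transitions
```

Run separately on each node's dmesg, this shows whether a node went `Connected -> Disconnecting -> StandAlone` (the "drbdadm down" pattern above) or `Connected -> TearDown` (the peer went away). Since the timestamps of the two nodes do not match up here, comparing the order of transitions is probably more reliable than comparing clock times across nodes.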

> And it seems that in this example the second node was the origin.
> Last night I got another error, saying network failure, but I'm sure there was no network issue:
>
> First node:
>
> [Wed Aug 29 01:39:48 2018] drbd resource0: meta connection shut down by peer.
> [Wed Aug 29 01:39:48 2018] drbd resource0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )

...

The peer node shut down the connection,
and as a result this node goes through a state called NetworkFailure,
then through all the motions,
then reconnects,
and syncs up.

> Second node:
>
> [Wed Aug 29 01:42:48 2018] drbd resource0: PingAck did not arrive in time.

Again, time stamps do not match up.
But there is your reason for this incident: "PingAck did not arrive in time".

Find out why, or simply increase the ping ack timeout.
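
Before deciding whether to raise the timeout, it may help to see how often, and on which resources, the PingAck timeout fires. A minimal Python sketch, assuming the message format quoted above; `count_pingack_timeouts` is a hypothetical helper. In DRBD 8.4 the timeout in question is the `ping-timeout` option of the `net` section, given in tenths of a second.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed message format, copied from the excerpt above:
# "[Wed Aug 29 01:42:48 2018] drbd resource0: PingAck did not arrive in time."
PINGACK_RE = re.compile(
    r"^\[[^\]]+\]\s+drbd\s+(?P<res>\S+):\s+PingAck did not arrive in time"
)


def count_pingack_timeouts(lines: Iterable[str]) -> Counter:
    """Hypothetical helper: count PingAck timeouts per DRBD resource."""
    counts: Counter = Counter()
    for line in lines:
        m = PINGACK_RE.match(line.strip())
        if m is not None:
            counts[m.group("res")] += 1
    return counts
```

If the timeouts cluster on a few resources or at particular times (for example during VM migrations), that narrows down the "find out why" part before simply raising the value.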

-------- Forwarded message -------
From: "Lars Ellenberg" <lars.ellenberg at linbit.com>
To: drbd-user at lists.linbit.com
Sent: 29 August 2018 12:09
Subject: Re: [DRBD-user] drbd issue?

On Tue, Aug 28, 2018 at 02:43:47PM +0000, Nicolas wrote:
> Hi
>
> I'm running some servers on Debian with Ganeti and DRBD.
>
> Since I upgraded them to Debian 9 and drbd 8.9.10-2 (from the Debian repo),

"drbd 8.9.10" is the *utils* version
(drbdadm, drbdsetup, drbdmeta, various scripts ...)

The drbd utils version is meanwhile at 9.5.0, btw. And no, that does not
have much to do with which DRBD kernel module version you are using:
since we have shipped "unified utils" for both "drbd 8" and "drbd 9"
for years now, the utils version is decoupled from the module versions.

What kernel version,
and what DRBD module version?

Maybe you want to make sure you use the latest 8.4 version (8.4.11
currently), and not whatever "ships with the debian kernel"?
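
Because the utils version and the kernel module version are decoupled, the module version has to be checked on its own, for example from the first line of /proc/drbd or from `modinfo drbd` (which also prints the srcversion quoted above). A minimal Python sketch, assuming a line of the form `version: 8.4.7 ...`; `module_version` and `needs_upgrade` are hypothetical helpers. Comparing versions as integer tuples matters here: as plain strings, "8.4.7" sorts after "8.4.11".

```python
import re
from typing import Optional, Tuple

# Assumed formats: first line of /proc/drbd, e.g.
#   "version: 8.4.7 (api:1/proto:86-101)"
# or the "version:" line printed by `modinfo drbd`.
# "^" with re.MULTILINE keeps "srcversion:" lines from matching.
VERSION_RE = re.compile(r"^version:\s*v?(\d+)\.(\d+)\.(\d+)", re.MULTILINE)


def module_version(text: str) -> Optional[Tuple[int, int, int]]:
    """Hypothetical helper: return the DRBD module version as an int tuple."""
    m = VERSION_RE.search(text)
    if m is None:
        return None
    major, minor, patch = (int(part) for part in m.groups())
    return (major, minor, patch)


def needs_upgrade(text: str, wanted: Tuple[int, int, int] = (8, 4, 11)) -> bool:
    """Hypothetical helper: True if the reported version is older than `wanted`."""
    current = module_version(text)
    # Tuple comparison is numeric per component; string comparison is not.
    return current is not None and current < wanted
```

On a node, `module_version(open("/proc/drbd").read())` would then be expected to return a tuple such as `(8, 4, 7)` for the module discussed in this thread.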

> I have had a lot of issues with my drbd resources: in my dmesg I randomly see some resources getting disconnected.
>
> Today, for example:
>
> [Tue Aug 28 14:32:38 2018] drbd resource10: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )

Well, what does the other node say?
Hit some timeouts?
Some strangeness with the new NIC drivers?
A bug in the "shipped with the debian kernel" DRBD version? 
--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
drbd-user at lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user