[DRBD-user] 3-Node DRBD with 2 standalone

Pezzani, Rocco Rocco.Pezzani at wuerth-phoenix.com
Thu Jul 18 11:57:47 CEST 2019


I had already tried disconnecting and reconnecting the resources, and I also used the invalidate command.
Nothing changed.

Before touching the metadata, I restarted the DRBD services on all nodes, and that solved the problem, so I left the metadata alone.

journalctl showed no differences for drbd.service between the nodes.
The only “unusual” thing I noticed was that the restart on the first secondary node hung until the restart on the second secondary had completed. Here is what happened:

1. [Node3] systemctl restart drbd.service; restart OK
2. [Node2] systemctl restart drbd.service; restart hung, but the service seems up and running
3. [Node1] systemctl restart drbd.service; restart OK. Restart on Node2 completed at the same time.

I’ll try to examine the messages log on every node to understand what happened, but I don’t think I’ll find anything useful.


Meanwhile, thank you all.

Best regards,
Rocco Pezzani


From: Gianni Milo <gianni.milo22 at gmail.com>
Sent: Wednesday, 17 July 2019 09:21
To: Pezzani, Rocco <Rocco.Pezzani at wuerth-phoenix.com>
Cc: drbd-user at lists.linbit.com
Subject: Re: [DRBD-user] 3-Node DRBD with 2 standalone

I would try disconnecting or bringing down the resource either on Node1 or Node2. Then write some data on the Primary and finally bring up or connect the resource. This should trigger a sync for the newly created data on this resource/node.
The last option would be to either invalidate the data of the affected resource on Node1 or Node2, or re-create its metadata, but that will trigger a full sync, which may not be desirable.
Once you manage to sort this out, consider implementing the quorum feature in order to avoid split-brain situations in the future.

Gianni


On Wed, 17 Jul 2019 at 06:31, Pezzani, Rocco <Rocco.Pezzani at wuerth-phoenix.com> wrote:
Hi all,

I have a 3-node DRBD cluster that has suffered a split brain. I recovered all resources except one.
For this resource, the connections Node3-Node1 and Node3-Node2 are fine, but the connection Node1-Node2 is not working: each side reports the other as StandAlone.

***Node 3
[root@pbzne4demo-n3 ~]# drbdadm status influxdb
influxdb role:Primary
  disk:UpToDate
  pbzne4demo-n1.wp.lan role:Secondary
    peer-disk:UpToDate
  pbzne4demo-n2.wp.lan role:Secondary
    peer-disk:UpToDate
***Node 2
[root@pbzne4demo-n2 ~]# drbdadm status influxdb
influxdb role:Secondary
  disk:UpToDate
  pbzne4demo-n1.wp.lan connection:StandAlone
  pbzne4demo-n3.wp.lan role:Primary
    peer-disk:UpToDate
***Node 1
[root@pbzne4demo-n1 ~]# drbdadm status influxdb
influxdb role:Secondary
  disk:UpToDate
  pbzne4demo-n2.wp.lan connection:StandAlone
  pbzne4demo-n3.wp.lan role:Primary
    peer-disk:UpToDate
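
When the same check has to be repeated on several nodes, the status text above can be scanned automatically. The following is a minimal Python sketch, assuming the plain-text layout shown in the outputs above; the function name `standalone_peers` and the `sample` variable are hypothetical and not part of DRBD.

```python
def standalone_peers(status_text):
    """Return the peer names whose connection is reported as StandAlone."""
    peers = []
    for line in status_text.splitlines():
        fields = line.split()
        # Peer lines look like: "<peer-host> connection:StandAlone"
        if len(fields) >= 2 and "connection:StandAlone" in fields[1:]:
            peers.append(fields[0])
    return peers


# Status text as reported on Node 2 in this thread.
sample = """\
influxdb role:Secondary
  disk:UpToDate
  pbzne4demo-n1.wp.lan connection:StandAlone
  pbzne4demo-n3.wp.lan role:Primary
    peer-disk:UpToDate
"""

print(standalone_peers(sample))  # ['pbzne4demo-n1.wp.lan']
```

An empty list means no peer connection of that resource is reported as StandAlone on the node where the text was captured.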

I tried disconnecting and reconnecting the resource on every node, but the StandAlone state always remains between the same two nodes.
What I tried:
1. Disconnect from all nodes, connect on the primary node, connect --discard-my-data on both secondary nodes.
Standalone remains.
/var/log/messages reports this on secondary nodes:
***Node 2
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Handshake to peer 1 successful: Agreed network protocol version 114
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Starting ack_recv thread (from drbd_r_influxdb [7948])
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: incompatible discard-my-data settings
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Connecting -> Disconnecting )
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: error receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: ack_receiver terminated
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Terminating ack_recv thread
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Connection closed
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Disconnecting -> StandAlone )
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Terminating receiver thread
Jul 16 12:16:10 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n3.wp.lan: Preparing remote state change 271906619
Jul 16 12:16:10 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n3.wp.lan: Committing remote state change 271906619 (primary_nodes=8)
***Node 1
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( StandAlone -> Unconnected )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Starting receiver thread (from drbd_w_influxdb [6596])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( Unconnected -> Connecting )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: conn( StandAlone -> Unconnected )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: Starting receiver thread (from drbd_w_influxdb [6596])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: conn( Unconnected -> Connecting )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Handshake to peer 2 successful: Agreed network protocol version 114
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Starting ack_recv thread (from drbd_r_influxdb [30208])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: incompatible discard-my-data settings
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( Connecting -> Disconnecting )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: error receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: Handshake to peer 3 successful: Agreed network protocol version 114
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: ack_receiver terminated
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Terminating ack_recv thread
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: Starting ack_recv thread (from drbd_r_influxdb [30210])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Connection closed
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( Disconnecting -> StandAlone )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Terminating receiver thread

2. Tried using drbdadm adjust on both the secondary nodes
Standalone remains.
/var/log/messages reports this on secondary nodes:
***Node 2
Jul 16 12:20:01 pbzne4demo-n2 systemd: Started Session 3741 of user root.
Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( StandAlone -> Unconnected )
Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Starting receiver thread (from drbd_w_influxdb [6563])
Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Unconnected -> Connecting )
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Handshake to peer 1 successful: Agreed network protocol version 114
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Starting ack_recv thread (from drbd_r_influxdb [8026])
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: incompatible discard-my-data settings
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Connecting -> Disconnecting )
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: error receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: ack_receiver terminated
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Terminating ack_recv thread
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Connection closed
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Disconnecting -> StandAlone )
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Terminating receiver thread
***Node 1
Jul 16 12:20:01 pbzne4demo-n1 systemd: Started Session 3754 of user root.
Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( StandAlone -> Unconnected )
Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Starting receiver thread (from drbd_w_influxdb [6596])
Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( Unconnected -> Connecting )
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Handshake to peer 2 successful: Agreed network protocol version 114
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Starting ack_recv thread (from drbd_r_influxdb [30273])
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: incompatible discard-my-data settings
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( Connecting -> Disconnecting )
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: error receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: ack_receiver terminated
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Terminating ack_recv thread
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Connection closed
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( Disconnecting -> StandAlone )
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Terminating receiver thread

3. Disconnect from all nodes, invalidate on both secondary nodes, connect primary node then connect on both secondary nodes
Standalone remains.
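
In both log excerpts above, the transition to StandAlone follows the message "incompatible discard-my-data settings". Attempt 1 set --discard-my-data on both secondary nodes, so the Node1-Node2 connection carried the flag on both ends. The sketch below, in Python, extracts the connection state changes from such log text and lists the node pairs where both ends would carry the flag. It assumes that DRBD refuses a connection when both ends ask to discard their own data, which the log message suggests but the thread does not confirm. All function names are hypothetical.

```python
import re
from itertools import combinations

# Matches kernel lines such as:
#   "... kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Connecting -> Disconnecting )"
CONN_RE = re.compile(r"drbd (\S+) (\S+): conn\( (\w+) -> (\w+) \)")


def conn_transitions(log_text):
    """Return (resource, peer, old_state, new_state) for each conn() line."""
    return [m.groups() for m in CONN_RE.finditer(log_text)]


def has_discard_conflict(log_text):
    """True if the log contains the discard-my-data mismatch message."""
    return "incompatible discard-my-data settings" in log_text


def conflicting_pairs(nodes, discarding):
    """Node pairs where both ends were connected with --discard-my-data.

    Assumption: such a connection is refused by DRBD.
    """
    return [(a, b) for a, b in combinations(nodes, 2)
            if a in discarding and b in discarding]


# Three lines taken from the Node 2 log of attempt 1.
log = """\
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: incompatible discard-my-data settings
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Connecting -> Disconnecting )
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Disconnecting -> StandAlone )
"""

print(has_discard_conflict(log))
for transition in conn_transitions(log):
    print(transition)
print(conflicting_pairs(["Node1", "Node2", "Node3"], {"Node1", "Node2"}))
```

If the assumption holds, the only pair listed is Node1-Node2, which matches the connection that stayed StandAlone while both connections to Node3 came up.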

I think the next step might be working with the metadata, but since I am a novice, I’m asking for suggestions. Could you please help me resolve this issue?
This is not a critical system and I can rebuild it, but I’d like to come up with a procedure and a better understanding of how to handle this kind of case, because I’m sure I will encounter it again.


Best regards,
Rocco Pezzani