[DRBD-user] 3-Node DRBD with 2 standalone

Pezzani, Rocco Rocco.Pezzani at wuerth-phoenix.com
Tue Jul 16 12:27:08 CEST 2019


Hi all,

I have a 3-node DRBD cluster that suffered a split brain. I recovered all resources except one.
For this resource, the connections Node3-Node1 and Node3-Node2 are fine, but the connection Node1-Node2 is not working: each side reports its connection to the other as StandAlone.

***Node 3
[root@pbzne4demo-n3 ~]# drbdadm status influxdb
influxdb role:Primary
  disk:UpToDate
  pbzne4demo-n1.wp.lan role:Secondary
    peer-disk:UpToDate
  pbzne4demo-n2.wp.lan role:Secondary
    peer-disk:UpToDate
***Node 2
[root@pbzne4demo-n2 ~]# drbdadm status influxdb
influxdb role:Secondary
  disk:UpToDate
  pbzne4demo-n1.wp.lan connection:StandAlone
  pbzne4demo-n3.wp.lan role:Primary
    peer-disk:UpToDate
***Node 1
[root@pbzne4demo-n1 ~]# drbdadm status influxdb
influxdb role:Secondary
  disk:UpToDate
  pbzne4demo-n2.wp.lan connection:StandAlone
  pbzne4demo-n3.wp.lan role:Primary
    peer-disk:UpToDate
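
To pick out the broken link without reading every status block by eye, the output can be checked mechanically. This is only a minimal Python sketch that assumes the text layout shown above; the function name standalone_peers is made up for illustration and is not part of any DRBD tool.

```python
def standalone_peers(status_text):
    """Return the peer names whose connection is reported as StandAlone."""
    peers = []
    for line in status_text.splitlines():
        fields = line.split()
        # Peer lines look like: "<peer-name> connection:StandAlone"
        if len(fields) >= 2 and "connection:StandAlone" in fields[1:]:
            peers.append(fields[0])
    return peers


# Status text copied from Node 1 above.
sample = """\
influxdb role:Secondary
  disk:UpToDate
  pbzne4demo-n2.wp.lan connection:StandAlone
  pbzne4demo-n3.wp.lan role:Primary
    peer-disk:UpToDate
"""
print(standalone_peers(sample))  # ['pbzne4demo-n2.wp.lan']
```

Run against the output of each node, this lists only the Node1-Node2 link, in both directions.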

I tried disconnecting and reconnecting the resource on every node, but the StandAlone state always remains on the same two nodes.
What I tried:
1. Disconnected on all nodes, connected on the primary node, then ran connect --discard-my-data on both secondary nodes.
StandAlone remains.
/var/log/messages reports this on the secondary nodes:
***Node 2
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Handshake to peer 1 successful: Agreed network protocol version 114
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Starting ack_recv thread (from drbd_r_influxdb [7948])
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: incompatible discard-my-data settings
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Connecting -> Disconnecting )
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: error receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: ack_receiver terminated
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Terminating ack_recv thread
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Connection closed
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Disconnecting -> StandAlone )
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Terminating receiver thread
Jul 16 12:16:10 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n3.wp.lan: Preparing remote state change 271906619
Jul 16 12:16:10 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n3.wp.lan: Committing remote state change 271906619 (primary_nodes=8)
***Node 1
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( StandAlone -> Unconnected )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Starting receiver thread (from drbd_w_influxdb [6596])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( Unconnected -> Connecting )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: conn( StandAlone -> Unconnected )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: Starting receiver thread (from drbd_w_influxdb [6596])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: conn( Unconnected -> Connecting )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Handshake to peer 2 successful: Agreed network protocol version 114
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Starting ack_recv thread (from drbd_r_influxdb [30208])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: incompatible discard-my-data settings
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( Connecting -> Disconnecting )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: error receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: Handshake to peer 3 successful: Agreed network protocol version 114
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: ack_receiver terminated
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Terminating ack_recv thread
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: Starting ack_recv thread (from drbd_r_influxdb [30210])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Connection closed
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( Disconnecting -> StandAlone )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Terminating receiver thread

2. Ran drbdadm adjust on both secondary nodes.
StandAlone remains.
/var/log/messages reports this on the secondary nodes:
***Node 2
Jul 16 12:20:01 pbzne4demo-n2 systemd: Started Session 3741 of user root.
Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( StandAlone -> Unconnected )
Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Starting receiver thread (from drbd_w_influxdb [6563])
Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Unconnected -> Connecting )
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Handshake to peer 1 successful: Agreed network protocol version 114
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Starting ack_recv thread (from drbd_r_influxdb [8026])
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: incompatible discard-my-data settings
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Connecting -> Disconnecting )
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: error receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: ack_receiver terminated
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Terminating ack_recv thread
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Connection closed
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Disconnecting -> StandAlone )
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: Terminating receiver thread
***Node 1
Jul 16 12:20:01 pbzne4demo-n1 systemd: Started Session 3754 of user root.
Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( StandAlone -> Unconnected )
Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Starting receiver thread (from drbd_w_influxdb [6596])
Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( Unconnected -> Connecting )
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Handshake to peer 2 successful: Agreed network protocol version 114
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Starting ack_recv thread (from drbd_r_influxdb [30273])
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: incompatible discard-my-data settings
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( Connecting -> Disconnecting )
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: error receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: ack_receiver terminated
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Terminating ack_recv thread
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Connection closed
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( Disconnecting -> StandAlone )
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: Terminating receiver thread

3. Disconnected on all nodes, ran invalidate on both secondary nodes, connected on the primary node, then connected on both secondary nodes.
StandAlone remains.
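
In every capture above, the Node1-Node2 link logs "incompatible discard-my-data settings" just before it drops back to StandAlone. To condense logs like these, here is a rough Python sketch that records, per resource and peer, the last connection state and any error lines. It assumes the /var/log/messages line format shown above; the function name summarize and the choice of which messages count as errors are my own assumptions.

```python
import re

# Matches kernel lines such as:
# "... kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Connecting -> Disconnecting )"
LINE_RE = re.compile(r"kernel: drbd (\S+) (\S+): (.*)$")
CONN_RE = re.compile(r"conn\( (\S+) -> (\S+) \)")


def summarize(log_text):
    """Map (resource, peer) to its last connection state and error messages."""
    summary = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # not a DRBD kernel line (e.g. systemd session messages)
        resource, peer, message = match.groups()
        entry = summary.setdefault(
            (resource, peer), {"last_state": None, "errors": []}
        )
        conn = CONN_RE.search(message)
        if conn:
            # Remember the state the connection moved to.
            entry["last_state"] = conn.group(2)
        elif "incompatible" in message or message.startswith("error "):
            entry["errors"].append(message)
    return summary


# A few lines copied from the Node 2 log above.
sample = """\
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: incompatible discard-my-data settings
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Connecting -> Disconnecting )
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: error receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( Disconnecting -> StandAlone )
"""
for (resource, peer), info in summarize(sample).items():
    print(resource, peer, info["last_state"], info["errors"])
```

For the sample lines, the summary ends in StandAlone for the peer pbzne4demo-n1.wp.lan and lists the two error messages in the order they were logged.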

I think the next step might involve working with the metadata, but since I am a novice, I am asking for suggestions. Could you please help me resolve this issue?
This is not a critical system and I can rebuild it, but I would like to come up with a procedure and a better understanding of how to handle this kind of case, because I am sure I will encounter it again.


Best regards,
Rocco Pezzani