[DRBD-user] 3-Node DRBD with 2 standalone

Gianni Milo gianni.milo22 at gmail.com
Wed Jul 17 09:21:14 CEST 2019


I would try disconnecting or bringing down the resource on either Node1 or
Node2, then writing some data on the Primary, and finally reconnecting or
bringing the resource back up. This should trigger a sync of the newly
written data to that node.
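
The sequence above might look roughly like the sketch below. It assumes the
resource name influxdb from the quoted status output and uses the down/up
variant; the RUN variable and the step_down/step_up function names are
made up for illustration. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: RES comes from the status output quoted in this thread.
RES="influxdb"

# Dry-run by default: each command is printed instead of executed.
# Export RUN="" (empty) to run the commands for real.
RUN="${RUN-echo}"

# Step 1, on Node1 (or Node2): take the resource down on that node only.
step_down() {
    $RUN drbdadm down "$RES"
}

# Step 2 is done by hand on the Primary (Node3): write some data to the
# device so that the node taken down in step 1 falls behind.

# Step 3, on the same node as step 1: bring the resource back up and
# check whether the Node1-Node2 connection has left StandAlone.
step_up() {
    $RUN drbdadm up "$RES"
    $RUN drbdadm status "$RES"
}
```

Call step_down, write the data on the Primary, then call step_up on the
same node where step_down was run.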
As a last resort, you could either invalidate the data of the affected
resource on Node1 or Node2, or re-create its metadata, but that will trigger
a full sync, which may not be desirable.
Once you manage to sort this out, consider implementing the quorum feature
in order to avoid split-brain situations in the future.
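
For reference, a minimal sketch of what enabling quorum could look like in
the resource configuration. It assumes a DRBD 9 version with quorum support
and the resource name from this thread; the values shown are one common
choice, not a recommendation tuned to this cluster.

```
resource influxdb {
    options {
        quorum majority;        # writes need more than half of the nodes
        on-no-quorum io-error;  # fail I/O on a node that has lost quorum
    }
    # existing "on <host>" and connection sections stay unchanged
}
```

After changing the configuration on all nodes, drbdadm adjust influxdb
would normally be used to apply it.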

Gianni


On Wed, 17 Jul 2019 at 06:31, Pezzani, Rocco <
Rocco.Pezzani at wuerth-phoenix.com> wrote:

> Hi all,
>
>
>
> I have a 3-node DRBD cluster that has suffered a split brain. I recovered
> all resources except one.
>
> For this resource, the connections Node3-Node1 and Node3-Node2 are fine, but
> the connection Node1-Node2 is not working: each side sees the other as
> StandAlone.
>
>
>
> ***Node 3
>
> [root@pbzne4demo-n3 ~]# drbdadm status influxdb
> influxdb role:Primary
>   disk:UpToDate
>   pbzne4demo-n1.wp.lan role:Secondary
>     peer-disk:UpToDate
>   pbzne4demo-n2.wp.lan role:Secondary
>     peer-disk:UpToDate
>
> ***Node 2
>
> [root@pbzne4demo-n2 ~]# drbdadm status influxdb
> influxdb role:Secondary
>   disk:UpToDate
>   pbzne4demo-n1.wp.lan connection:StandAlone
>   pbzne4demo-n3.wp.lan role:Primary
>     peer-disk:UpToDate
>
> ***Node 1
>
> [root@pbzne4demo-n1 ~]# drbdadm status influxdb
> influxdb role:Secondary
>   disk:UpToDate
>   pbzne4demo-n2.wp.lan connection:StandAlone
>   pbzne4demo-n3.wp.lan role:Primary
>     peer-disk:UpToDate
>
>
>
> I tried disconnecting and reconnecting the resource on every node, but the
> StandAlone state always remains on the same two nodes.
>
> What I tried:
>
> 1. Disconnect on all nodes, connect on the primary node, then connect
> --discard-my-data on both secondary nodes.
>
> StandAlone remains.
>
> /var/log/messages reports this on secondary nodes:
>
> ***Node 2
>
> Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> Handshake to peer 1 successful: Agreed network protocol version 114
>
> Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME
> WRITE_ZEROES.
>
> Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> Starting ack_recv thread (from drbd_r_influxdb [7948])
>
> Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> incompatible discard-my-data settings
>
> Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> conn( Connecting -> Disconnecting )
>
> Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> error receiving P_PROTOCOL, e: -5 l: 1!
>
> Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> ack_receiver terminated
>
> Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> Terminating ack_recv thread
>
> Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> Connection closed
>
> Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> conn( Disconnecting -> StandAlone )
>
> Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> Terminating receiver thread
>
> Jul 16 12:16:10 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n3.wp.lan:
> Preparing remote state change 271906619
>
> Jul 16 12:16:10 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n3.wp.lan:
> Committing remote state change 271906619 (primary_nodes=8)
>
> ***Node 1
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> conn( StandAlone -> Unconnected )
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> Starting receiver thread (from drbd_w_influxdb [6596])
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> conn( Unconnected -> Connecting )
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan:
> conn( StandAlone -> Unconnected )
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan:
> Starting receiver thread (from drbd_w_influxdb [6596])
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan:
> conn( Unconnected -> Connecting )
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> Handshake to peer 2 successful: Agreed network protocol version 114
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME
> WRITE_ZEROES.
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> Starting ack_recv thread (from drbd_r_influxdb [30208])
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> incompatible discard-my-data settings
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> conn( Connecting -> Disconnecting )
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> error receiving P_PROTOCOL, e: -5 l: 1!
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan:
> Handshake to peer 3 successful: Agreed network protocol version 114
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan:
> Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME
> WRITE_ZEROES.
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> ack_receiver terminated
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> Terminating ack_recv thread
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan:
> Starting ack_recv thread (from drbd_r_influxdb [30210])
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> Connection closed
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> conn( Disconnecting -> StandAlone )
>
> Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> Terminating receiver thread
>
>
>
> 2. Ran drbdadm adjust on both secondary nodes.
>
> StandAlone remains.
>
> /var/log/messages reports this on secondary nodes:
>
> ***Node 2
>
> Jul 16 12:20:01 pbzne4demo-n2 systemd: Started Session 3741 of user root.
>
> Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> conn( StandAlone -> Unconnected )
>
> Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> Starting receiver thread (from drbd_w_influxdb [6563])
>
> Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> conn( Unconnected -> Connecting )
>
> Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> Handshake to peer 1 successful: Agreed network protocol version 114
>
> Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME
> WRITE_ZEROES.
>
> Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> Starting ack_recv thread (from drbd_r_influxdb [8026])
>
> Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> incompatible discard-my-data settings
>
> Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> conn( Connecting -> Disconnecting )
>
> Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> error receiving P_PROTOCOL, e: -5 l: 1!
>
> Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> ack_receiver terminated
>
> Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> Terminating ack_recv thread
>
> Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> Connection closed
>
> Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> conn( Disconnecting -> StandAlone )
>
> Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
> Terminating receiver thread
>
> ***Node 1
>
> Jul 16 12:20:01 pbzne4demo-n1 systemd: Started Session 3754 of user root.
>
> Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> conn( StandAlone -> Unconnected )
>
> Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> Starting receiver thread (from drbd_w_influxdb [6596])
>
> Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> conn( Unconnected -> Connecting )
>
> Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> Handshake to peer 2 successful: Agreed network protocol version 114
>
> Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME
> WRITE_ZEROES.
>
> Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> Starting ack_recv thread (from drbd_r_influxdb [30273])
>
> Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> incompatible discard-my-data settings
>
> Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> conn( Connecting -> Disconnecting )
>
> Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> error receiving P_PROTOCOL, e: -5 l: 1!
>
> Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> ack_receiver terminated
>
> Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> Terminating ack_recv thread
>
> Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> Connection closed
>
> Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> conn( Disconnecting -> StandAlone )
>
> Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
> Terminating receiver thread
>
>
>
> 3. Disconnect on all nodes, invalidate on both secondary nodes, connect on
> the primary node, then connect on both secondary nodes.
>
> StandAlone remains.
>
>
>
> I think the next step might be working with the metadata, but since I am a
> novice, I’m asking for suggestions. Could you please help me resolve
> this issue?
>
> This is not a critical system and I can rebuild it, but I’d like to come up
> with a procedure and a better understanding of how to handle this kind of
> case, because I’m sure I will encounter it again.
>
>
>
>
>
> Best regards,
>
> *Rocco Pezzani*