[DRBD-user] DRBD 8.2.4 resources on both nodes "StandAlone" and can not connect.

Bas van Schaik bas at tuxes.nl
Mon Jan 28 13:08:48 CET 2008



Hi Francis,

Francis SOUYRI wrote:
> I recently migrated our test cluster to a VMware environment; both nodes
> are virtual guests with a virtual LAN for the DRBD traffic.
>
> After running for some days without problems, the resource on both nodes
> became "StandAlone" (I suspect the problem is due to a very high load
> on the VMware host), but now I cannot "connect" the resources.
>
> When I try to "connect" the drbd0 resource, I get these messages in
> "/var/log/messages" on one node:
>
> Jan 28 14:00:55 noeud1 kernel: drbd0: conn( StandAlone -> Unconnected )
> Jan 28 14:00:55 noeud1 kernel: drbd0: receiver (re)started
> Jan 28 14:00:55 noeud1 kernel: drbd0: conn( Unconnected -> WFConnection )
> Jan 28 14:00:56 noeud1 kernel: drbd0: Handshake successful: Agreed
> network protocol version 88
> Jan 28 14:00:56 noeud1 kernel: drbd0: Peer authenticated using 64
> bytes of 'sha512' HMAC
> Jan 28 14:00:56 noeud1 kernel: drbd0: conn( WFConnection ->
> WFReportParams )
> Jan 28 14:00:56 noeud1 kernel: drbd0: data-integrity-alg: <not-used>
> Jan 28 14:00:57 noeud1 kernel: drbd0: Split-Brain detected, dropping
> connection!
SPLIT BRAIN!

> Jan 28 14:00:57 noeud1 kernel: drbd0: self
> B3D5FA91510F45DE:D289EE2D946147A1:177216482D86F5DE:332BCB37708E13F1
> Jan 28 14:00:57 noeud1 kernel: drbd0: peer
> 3850AFF7C480B265:D289EE2D946147A1:177216482D86F5DE:332BCB37708E13F1
> Jan 28 14:00:57 noeud1 kernel: drbd0: conn( WFReportParams ->
> Disconnecting )
> Jan 28 14:00:57 noeud1 kernel: drbd0: helper command: /sbin/drbdadm
> split-brain
> Jan 28 14:00:57 noeud1 kernel: drbd0: meta connection shut down by peer.
> Jan 28 14:00:57 noeud1 kernel: drbd0: asender terminated
> Jan 28 14:00:57 noeud1 kernel: drbd0: error receiving ReportState, l: 4!
> Jan 28 14:00:57 noeud1 kernel: drbd0: tl_clear()
> Jan 28 14:00:57 noeud1 kernel: drbd0: Connection closed
> Jan 28 14:00:57 noeud1 kernel: drbd0: conn( Disconnecting -> StandAlone )
> Jan 28 14:00:57 noeud1 kernel: drbd0: receiver terminated
>
>
> On the other node I get these messages in "/var/log/messages":
>
> Jan 28 14:00:40 noeud2 kernel: drbd0: conn( StandAlone -> Unconnected )
> Jan 28 14:00:40 noeud2 kernel: drbd0: receiver (re)started
> Jan 28 14:00:40 noeud2 kernel: drbd0: conn( Unconnected -> WFConnection )
> Jan 28 14:00:53 noeud2 kernel: drbd0: Handshake successful: Agreed
> network protocol version 88
> Jan 28 14:00:53 noeud2 kernel: drbd0: Peer authenticated using 64
> bytes of 'sha512' HMAC
> Jan 28 14:00:54 noeud2 kernel: drbd0: conn( WFConnection ->
> WFReportParams )
> Jan 28 14:00:54 noeud2 kernel: drbd0: data-integrity-alg: <not-used>
> Jan 28 14:00:54 noeud2 kernel: drbd0: Split-Brain detected, dropping
> connection!
SPLIT BRAIN!

> Jan 28 14:00:54 noeud2 kernel: drbd0: self
> 3850AFF7C480B265:D289EE2D946147A1:177216482D86F5DE:332BCB37708E13F1
> Jan 28 14:00:54 noeud2 kernel: drbd0: peer
> B3D5FA91510F45DE:D289EE2D946147A1:177216482D86F5DE:332BCB37708E13F1
> Jan 28 14:00:54 noeud2 kernel: drbd0: conn( WFReportParams ->
> Disconnecting )
> Jan 28 14:00:54 noeud2 kernel: drbd0: helper command: /sbin/drbdadm
> split-brain
> Jan 28 14:00:54 noeud2 kernel: drbd0: error receiving ReportState, l: 4!
> Jan 28 14:00:54 noeud2 kernel: drbd0: asender terminated
> Jan 28 14:00:54 noeud2 kernel: drbd0: tl_clear()
> Jan 28 14:00:54 noeud2 kernel: drbd0: Connection closed
> Jan 28 14:00:54 noeud2 kernel: drbd0: conn( Disconnecting -> StandAlone )
> Jan 28 14:00:54 noeud2 kernel: drbd0: receiver terminated
>
> After stopping DRBD on each node, the DRBD status is as follows.
>
> (... irrelevant information ...)

You suffer from schizophrenic DRBD, better known as "split brain".
Basically it means that, at some point while disconnected, both nodes
were Primary for your test resource and both modified the data. The two
data sets have diverged, so DRBD refuses to reconnect them automatically.
Please read the documentation carefully: you will have to discard the
changes on one of your nodes and let DRBD resynchronize that node from
its peer.
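Before discarding anything, it helps to confirm the state on each node. A quick sketch, assuming your resource is named "r0" (substitute your actual resource name):

```shell
# Overall DRBD status: connection state, roles, disk states
cat /proc/drbd

# Connection state of a single resource (expect "StandAlone" here)
drbdadm cstate r0

# Disk state (e.g. UpToDate/DUnknown while disconnected)
drbdadm dstate r0
```

The node whose data you want to keep is the one whose recent writes you care about; DRBD itself cannot decide that for you.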

This is the way to do it (as posted by Florian Haas on October 31 2007):
> drbdadm -- --discard-my-data connect <resource> (on node with "bad" data)
> drbdadm connect <resource> (on node with "good" data)
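Spelled out as a full procedure, assuming the resource is called "r0" (adjust to your resource name) and that the node to be discarded might still be Primary:

```shell
# --- On the node whose data will be DISCARDED ---
drbdadm secondary r0                      # the victim must not be Primary
drbdadm disconnect r0                     # make sure it is StandAlone
drbdadm -- --discard-my-data connect r0   # reconnect, throwing away local changes

# --- On the node whose data will be KEPT ---
drbdadm connect r0                        # only needed if it is StandAlone too
```

After this the discarded node should resynchronize from its peer; watch /proc/drbd for sync progress.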

You might also want to take a look at combining DRBD with Heartbeat, so
that only one node can become Primary at a time; that helps prevent this
kind of situation.
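For what it's worth, DRBD can also be told how to react to a split brain automatically via the after-sb-* options in the net section, and to notify you via a handler. A sketch for drbd.conf (the handler script path is an assumption and may differ on your installation):

```
resource r0 {
  net {
    # zero primaries after split brain: keep the node that made no changes
    after-sb-0pri discard-zero-changes;
    # one primary: discard the secondary's changes
    after-sb-1pri discard-secondary;
    # two primaries: do not try to recover automatically, just disconnect
    after-sb-2pri disconnect;
  }
  handlers {
    # mail root when a split brain is detected (script path may vary)
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
  }
}
```

Be careful with automatic recovery policies: they trade safety for convenience, so read the drbd.conf man page before enabling them.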

Good luck!

  -- Bas



