[DRBD-user] DRBD 8.2.4 resources on both nodes "StandAlone" and can not connect.

Mon Jan 28 14:26:02 CET 2008

Hi,

Martin, Bas, thank you for your help.

As Bas suggested, I ran:

drbdadm -- --discard-my-data connect all (on node with "bad" data)
drbdadm connect all (on node with "good" data)

And now the resources are connected.
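
For anyone scripting this, here is a minimal sketch that only builds the
two command lines quoted above and prints them; it does not run anything.
The function name is hypothetical, and the sketch assumes the node whose
data is discarded is already Secondary (as it was in this case).

```python
import shlex


def split_brain_recovery_commands(resource="all"):
    """Return (victim_cmd, survivor_cmd) as argv lists.

    victim_cmd   -> to be run on the node with the "bad" data
    survivor_cmd -> to be run on the node with the "good" data
    Hypothetical helper: it only assembles the commands from the thread.
    """
    victim_cmd = ["drbdadm", "--", "--discard-my-data", "connect", resource]
    survivor_cmd = ["drbdadm", "connect", resource]
    return victim_cmd, survivor_cmd


if __name__ == "__main__":
    victim, survivor = split_brain_recovery_commands("all")
    print("node with bad data: ", shlex.join(victim))
    print("node with good data:", shlex.join(survivor))
```

Keeping the commands as argv lists avoids shell quoting problems with the
bare "--" if they are later passed to something like subprocess.run().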

I was already using Heartbeat. After a very high load on the VMware
host (it runs other guests besides the cluster) and an "rpm -Fvh *" on
each node, one cluster node failed and went into standby, and the second
took over the package with a lot of difficulty... in the end I had the
split brain. On the "failed" node the DRBD resources were "StandAlone"
with the status "Secondary/Unknown"; on the "active" node they were
"StandAlone" with the status "Primary/Unknown"...
I suppose the server load was too high and the Heartbeat "deadtime"
setting too small. Luckily it is only a test cluster...
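
The split brain can also be read off the "self" and "peer" lines in the
kernel log quoted below. What follows is a rough sketch, not DRBD's real
decision logic: it assumes the four colon-separated fields are the
current UUID, the bitmap UUID and two history UUIDs, in that order, and
it only checks the one case seen here (different current UUIDs, same
non-zero bitmap UUID). The function names and result strings are
hypothetical.

```python
import re

# The two log lines from noeud1, unwrapped onto one line each.
SAMPLE_LOG = """\
Jan 28 14:00:57 noeud1 kernel: drbd0: self B3D5FA91510F45DE:D289EE2D946147A1:177216482D86F5DE:332BCB37708E13F1
Jan 28 14:00:57 noeud1 kernel: drbd0: peer 3850AFF7C480B265:D289EE2D946147A1:177216482D86F5DE:332BCB37708E13F1
"""

UUID_LINE = re.compile(
    r"drbd\d+: (self|peer)\s+((?:[0-9A-Fa-f]{16}:){3}[0-9A-Fa-f]{16})"
)


def parse_uuids(log_text):
    """Return {'self': [4 UUIDs], 'peer': [4 UUIDs]} found in log_text."""
    found = {}
    for role, uuids in UUID_LINE.findall(log_text):
        found[role] = uuids.upper().split(":")
    return found


def classify(self_uuids, peer_uuids):
    """Simplified comparison; assumes field 0 = current, field 1 = bitmap."""
    if self_uuids[0] == peer_uuids[0]:
        return "same data generation"
    same_bitmap = self_uuids[1] == peer_uuids[1]
    if same_bitmap and int(self_uuids[1], 16) != 0:
        return "split brain (common parent)"
    return "not covered by this sketch"


if __name__ == "__main__":
    uuids = parse_uuids(SAMPLE_LOG)
    print(classify(uuids["self"], uuids["peer"]))
```

With the values from this thread, the first field differs between the
nodes while the other three are identical, which fits both nodes having
written independently since the same common state.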

Best regards.

Francis

Bas van Schaik wrote:
> Hi Francis,
>
> Francis SOUYRI wrote:
>   
>> I recently migrated our test cluster to a VMware environment; both
>> nodes are virtual guests with a virtual LAN for DRBD.
>>
>> After some days running without problems, the resources on both nodes
>> became "StandAlone" (I suppose the problem is due to a very high load
>> on the VMware host), but now I cannot "connect" the resources.
>>
>> When I try to "connect" the drbd0 resource, I get these messages in
>> "/var/log/messages" on one node:
>>
>> Jan 28 14:00:55 noeud1 kernel: drbd0: conn( StandAlone -> Unconnected )
>> Jan 28 14:00:55 noeud1 kernel: drbd0: receiver (re)started
>> Jan 28 14:00:55 noeud1 kernel: drbd0: conn( Unconnected -> WFConnection )
>> Jan 28 14:00:56 noeud1 kernel: drbd0: Handshake successful: Agreed network protocol version 88
>> Jan 28 14:00:56 noeud1 kernel: drbd0: Peer authenticated using 64 bytes of 'sha512' HMAC
>> Jan 28 14:00:56 noeud1 kernel: drbd0: conn( WFConnection -> WFReportParams )
>> Jan 28 14:00:56 noeud1 kernel: drbd0: data-integrity-alg: <not-used>
>> Jan 28 14:00:57 noeud1 kernel: drbd0: Split-Brain detected, dropping connection!
>>     
> SPLIT BRAIN!
>
>   
>> Jan 28 14:00:57 noeud1 kernel: drbd0: self B3D5FA91510F45DE:D289EE2D946147A1:177216482D86F5DE:332BCB37708E13F1
>> Jan 28 14:00:57 noeud1 kernel: drbd0: peer 3850AFF7C480B265:D289EE2D946147A1:177216482D86F5DE:332BCB37708E13F1
>> Jan 28 14:00:57 noeud1 kernel: drbd0: conn( WFReportParams -> Disconnecting )
>> Jan 28 14:00:57 noeud1 kernel: drbd0: helper command: /sbin/drbdadm split-brain
>> Jan 28 14:00:57 noeud1 kernel: drbd0: meta connection shut down by peer.
>> Jan 28 14:00:57 noeud1 kernel: drbd0: asender terminated
>> Jan 28 14:00:57 noeud1 kernel: drbd0: error receiving ReportState, l: 4!
>> Jan 28 14:00:57 noeud1 kernel: drbd0: tl_clear()
>> Jan 28 14:00:57 noeud1 kernel: drbd0: Connection closed
>> Jan 28 14:00:57 noeud1 kernel: drbd0: conn( Disconnecting -> StandAlone )
>> Jan 28 14:00:57 noeud1 kernel: drbd0: receiver terminated
>>
>>
>> On the other node I get these messages in "/var/log/messages":
>>
>> Jan 28 14:00:40 noeud2 kernel: drbd0: conn( StandAlone -> Unconnected )
>> Jan 28 14:00:40 noeud2 kernel: drbd0: receiver (re)started
>> Jan 28 14:00:40 noeud2 kernel: drbd0: conn( Unconnected -> WFConnection )
>> Jan 28 14:00:53 noeud2 kernel: drbd0: Handshake successful: Agreed network protocol version 88
>> Jan 28 14:00:53 noeud2 kernel: drbd0: Peer authenticated using 64 bytes of 'sha512' HMAC
>> Jan 28 14:00:54 noeud2 kernel: drbd0: conn( WFConnection -> WFReportParams )
>> Jan 28 14:00:54 noeud2 kernel: drbd0: data-integrity-alg: <not-used>
>> Jan 28 14:00:54 noeud2 kernel: drbd0: Split-Brain detected, dropping connection!
>>     
> SPLIT BRAIN!
>
>   
>> Jan 28 14:00:54 noeud2 kernel: drbd0: self 3850AFF7C480B265:D289EE2D946147A1:177216482D86F5DE:332BCB37708E13F1
>> Jan 28 14:00:54 noeud2 kernel: drbd0: peer B3D5FA91510F45DE:D289EE2D946147A1:177216482D86F5DE:332BCB37708E13F1
>> Jan 28 14:00:54 noeud2 kernel: drbd0: conn( WFReportParams -> Disconnecting )
>> Jan 28 14:00:54 noeud2 kernel: drbd0: helper command: /sbin/drbdadm split-brain
>> Jan 28 14:00:54 noeud2 kernel: drbd0: error receiving ReportState, l: 4!
>> Jan 28 14:00:54 noeud2 kernel: drbd0: asender terminated
>> Jan 28 14:00:54 noeud2 kernel: drbd0: tl_clear()
>> Jan 28 14:00:54 noeud2 kernel: drbd0: Connection closed
>> Jan 28 14:00:54 noeud2 kernel: drbd0: conn( Disconnecting -> StandAlone )
>> Jan 28 14:00:54 noeud2 kernel: drbd0: receiver terminated
>>
>> After stopping DRBD on each node I have this DRBD status:
>>
>> (... irrelevant information ...)
>>     
>
> You suffer from schizophrenic DRBD, better known as "split brain".
> Basically it means that both nodes became primary on your test resource
> and started writing to it. The result is an inconsistent resource, which
> is then called "split brain". Please read the documentation carefully,
> you will have to discard the data on one of your nodes and let drbd sync
> that node from scratch.
>
> This is the way to do it (as posted by Florian Haas on October 31 2007):
>   
>> drbdadm -- --discard-my-data connect <resource> (on node with "bad" data)
>> drbdadm connect <resource> (on node with "good" data)
>>     
>
> You might want to take a look at combining DRBD with Heartbeat to
> prevent this kind of situation.
>
> Good luck!
>
>   -- Bas
>
>
>   