Hi,

Martin, Bas, thank you for your help.

As Bas suggested, I ran

drbdadm -- --discard-my-data connect all   (on the node with "bad" data)
drbdadm connect all                        (on the node with "good" data)

and now the resources are connected.

I am using Heartbeat, but after a very high load on the VMware host
(which runs other guests besides the cluster) and an "rpm -Fvh *" on
each node, one cluster node failed and went to standby, and the second
took over the package with a lot of difficulty... in the end I had the
split brain. On the "failed" node the DRBD resources were "StandAlone"
with the status "Secondary/Unknown"; on the "active" node the DRBD
resources were "StandAlone" with the status "Primary/Unknown"...

I suppose the server load was too high and the Heartbeat "deadtime"
setting too small. Luckily it is only a test cluster...

Best regards.

Francis

Bas van Schaik wrote:
> Hi Francis,
>
> Francis SOUYRI wrote:
>
>> I recently migrated our test cluster to a VMware environment; both
>> nodes are virtual guests with a virtual LAN for DRBD.
>>
>> After some days running without problems, the resources on both nodes
>> became "StandAlone" (I suppose the problem is due to a very high load
>> on the VMware host), but now I cannot "connect" the resources.
>>
>> When I try to "connect" the drbd0 resource, I get these messages in
>> "/var/log/messages" on one node:
>>
>> Jan 28 14:00:55 noeud1 kernel: drbd0: conn( StandAlone -> Unconnected )
>> Jan 28 14:00:55 noeud1 kernel: drbd0: receiver (re)started
>> Jan 28 14:00:55 noeud1 kernel: drbd0: conn( Unconnected -> WFConnection )
>> Jan 28 14:00:56 noeud1 kernel: drbd0: Handshake successful: Agreed network protocol version 88
>> Jan 28 14:00:56 noeud1 kernel: drbd0: Peer authenticated using 64 bytes of 'sha512' HMAC
>> Jan 28 14:00:56 noeud1 kernel: drbd0: conn( WFConnection -> WFReportParams )
>> Jan 28 14:00:56 noeud1 kernel: drbd0: data-integrity-alg: <not-used>
>> Jan 28 14:00:57 noeud1 kernel: drbd0: Split-Brain detected, dropping connection!
>>
> SPLIT BRAIN!
>
>
>> Jan 28 14:00:57 noeud1 kernel: drbd0: self B3D5FA91510F45DE:D289EE2D946147A1:177216482D86F5DE:332BCB37708E13F1
>> Jan 28 14:00:57 noeud1 kernel: drbd0: peer 3850AFF7C480B265:D289EE2D946147A1:177216482D86F5DE:332BCB37708E13F1
>> Jan 28 14:00:57 noeud1 kernel: drbd0: conn( WFReportParams -> Disconnecting )
>> Jan 28 14:00:57 noeud1 kernel: drbd0: helper command: /sbin/drbdadm split-brain
>> Jan 28 14:00:57 noeud1 kernel: drbd0: meta connection shut down by peer.
>> Jan 28 14:00:57 noeud1 kernel: drbd0: asender terminated
>> Jan 28 14:00:57 noeud1 kernel: drbd0: error receiving ReportState, l: 4!
>> Jan 28 14:00:57 noeud1 kernel: drbd0: tl_clear()
>> Jan 28 14:00:57 noeud1 kernel: drbd0: Connection closed
>> Jan 28 14:00:57 noeud1 kernel: drbd0: conn( Disconnecting -> StandAlone )
>> Jan 28 14:00:57 noeud1 kernel: drbd0: receiver terminated
>>
>>
>> On the other node I have these messages in the "/var/log/messages".
>>
>> Jan 28 14:00:40 noeud2 kernel: drbd0: conn( StandAlone -> Unconnected )
>> Jan 28 14:00:40 noeud2 kernel: drbd0: receiver (re)started
>> Jan 28 14:00:40 noeud2 kernel: drbd0: conn( Unconnected -> WFConnection )
>> Jan 28 14:00:53 noeud2 kernel: drbd0: Handshake successful: Agreed network protocol version 88
>> Jan 28 14:00:53 noeud2 kernel: drbd0: Peer authenticated using 64 bytes of 'sha512' HMAC
>> Jan 28 14:00:54 noeud2 kernel: drbd0: conn( WFConnection -> WFReportParams )
>> Jan 28 14:00:54 noeud2 kernel: drbd0: data-integrity-alg: <not-used>
>> Jan 28 14:00:54 noeud2 kernel: drbd0: Split-Brain detected, dropping connection!
>>
> SPLIT BRAIN!
>
>
>> Jan 28 14:00:54 noeud2 kernel: drbd0: self 3850AFF7C480B265:D289EE2D946147A1:177216482D86F5DE:332BCB37708E13F1
>> Jan 28 14:00:54 noeud2 kernel: drbd0: peer B3D5FA91510F45DE:D289EE2D946147A1:177216482D86F5DE:332BCB37708E13F1
>> Jan 28 14:00:54 noeud2 kernel: drbd0: conn( WFReportParams -> Disconnecting )
>> Jan 28 14:00:54 noeud2 kernel: drbd0: helper command: /sbin/drbdadm split-brain
>> Jan 28 14:00:54 noeud2 kernel: drbd0: error receiving ReportState, l: 4!
>> Jan 28 14:00:54 noeud2 kernel: drbd0: asender terminated
>> Jan 28 14:00:54 noeud2 kernel: drbd0: tl_clear()
>> Jan 28 14:00:54 noeud2 kernel: drbd0: Connection closed
>> Jan 28 14:00:54 noeud2 kernel: drbd0: conn( Disconnecting -> StandAlone )
>> Jan 28 14:00:54 noeud2 kernel: drbd0: receiver terminated
>>
>> After a DRBD stop on each node I have these DRBD status.
>>
>> (... irrelevant information ...)
>>
>
> You suffer from schizophrenic DRBD, better known as "split brain".
> Basically it means that both nodes became primary on your test resource
> and started writing to it. The result is an inconsistent resource, which
> is then called "split brain". Please read the documentation carefully,
> you will have to discard the data on one of your nodes and let drbd sync
> that node from scratch.
>
> This is the way to do it (as posted by Florian Haas on October 31 2007):
>
>> drbdadm -- --discard-my-data connect <resource> (on node with "bad" data)
>> drbdadm connect <resource> (on node with "good" data)
>>
>
> You might want to take a look at combining DRBD with Heartbeat to
> prevent this kind of situation.
>
> Good luck!
>
> -- Bas
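
For reference, the recovery sequence quoted above can be sketched as a small dry-run shell helper. It only prints the commands for each side and does not run drbdadm, so it can be reviewed before anything touches real data. The function name split_brain_cmds and the role names "victim" and "survivor" are illustrative and do not come from the thread. The preliminary "drbdadm secondary" step is an assumption based on the usual DRBD 8.x procedure; in the situation described above the failed node was already Secondary.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the commands for one side of a
# manual split-brain recovery. Role names are hypothetical labels:
#   victim   = node whose changes are thrown away ("bad" data)
#   survivor = node whose data is kept ("good" data)
split_brain_cmds() {
    role=$1
    res=${2:-all}   # resource name; "all" is what was used in the thread

    case "$role" in
        victim)
            # Assumption: make sure the victim is Secondary first.
            echo "drbdadm secondary $res"
            echo "drbdadm -- --discard-my-data connect $res"
            ;;
        survivor)
            # Only needed if the survivor is also StandAlone.
            echo "drbdadm connect $res"
            ;;
        *)
            echo "usage: split_brain_cmds victim|survivor [resource]" >&2
            return 1
            ;;
    esac
}

# Example: show what would be run on each node for all resources.
split_brain_cmds victim
split_brain_cmds survivor
```

Once the printed commands have been checked, they can be run by hand on the matching node, and "cat /proc/drbd" should then show the connection state leaving StandAlone.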