[DRBD-user] split brain - why does it not work out?

Alireza Nematollahi alirezan at redback.com
Wed Oct 15 19:29:30 CEST 2008


Split brain mainly happens when your primary node has gone down and the secondary node has been brought up as the new "primary". When the two "primaries" are then up at the same time without being connected to each other, both accept writes independently and you get split brain.
To resolve split brain, I usually try the following:

1- Disconnect the resource on both nodes (i.e. drbdadm disconnect resource)
2- Mark both nodes as secondary (i.e. drbdadm secondary resource)
3- On the node you want as your primary do: drbdadm -- --overwrite-data-of-peer primary resource
OR
On the actual secondary do: drbdadm -- --discard-my-data connect resource

This should "most of the time" resolve the split brain.
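
The second variant above (discarding the data of the node you do not want to keep) can be sketched as a small dry-run helper. This is only a sketch under assumptions: it uses the DRBD 8.x command syntax quoted in this thread, the resource name "r0" and the function name split_brain_recovery_commands are made up for illustration, and nothing is executed; the commands are only printed. It also lists a plain "drbdadm connect" for the surviving node, which is needed when that node is StandAlone as well.

```python
import shlex


def split_brain_recovery_commands(resource):
    """Build the drbdadm command lines for a manual split-brain recovery.

    Hypothetical helper: it only returns argument lists and never runs them.
    "victim" is the node whose changes are thrown away, "survivor" the node
    whose data is kept.
    """
    victim = [
        # The victim must be secondary before its data can be discarded.
        ["drbdadm", "secondary", resource],
        # DRBD 8.x syntax as used in this thread.
        ["drbdadm", "--", "--discard-my-data", "connect", resource],
    ]
    survivor = [
        # Only needed if the survivor is StandAlone too.
        ["drbdadm", "connect", resource],
    ]
    return {"victim": victim, "survivor": survivor}


if __name__ == "__main__":
    # "r0" is a placeholder resource name.
    for role, commands in split_brain_recovery_commands("r0").items():
        for argv in commands:
            print(f"[{role}] " + " ".join(shlex.quote(arg) for arg in argv))
```

Review the printed commands and run them by hand on the respective node; picking the wrong victim throws away the data you wanted to keep.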

Hope this helps

-----Original Message-----
From: drbd-user-bounces at lists.linbit.com [mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Raoul Bhatia [IPAX]
Sent: Wednesday, October 15, 2008 9:39 AM
To: drbd-user
Subject: [DRBD-user] split brain - why does it not work out?

hi,

i have two nodes: wc01 and wc02

wc01:
> GIT-hash: 9ce425f51860f1205395bf7127cad3c427070837 build by root at wc01, 2008-09-11 14:44:52
> ...
>  1: cs:StandAlone st:Primary/Unknown ds:UpToDate/Outdated   r---
>     ns:0 nr:0 dw:180 dr:2997 al:9 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:260

wc02:
> GIT-hash: 9ce425f51860f1205395bf7127cad3c427070837 build by root at wc01, 2008-09-11 14:44:52
> ...
>  1: cs:StandAlone st:Secondary/Unknown ds:Consistent/DUnknown   r---
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:14 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:69632

i see that both are standalone, wc01 is primary (and thus "productive")
and wc01 considers wc02 as outdated. so i would think that wc02 should
be overwritten.
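
to make the reading of the two status lines explicit, here is a minimal sketch that splits a /proc/drbd line of the format shown above into connection state, roles and disk states. the regular expression and the key names (cs, st_self, st_peer, ds_self, ds_peer) are my own assumptions for illustration, not part of drbd.

```python
import re

# Matches the "cs:... st:self/peer ds:self/peer" part of a /proc/drbd line
# in the 8.2-era format quoted above.
STATUS_RE = re.compile(
    r"cs:(?P<cs>\w+)\s+"
    r"st:(?P<st_self>\w+)/(?P<st_peer>\w+)\s+"
    r"ds:(?P<ds_self>\w+)/(?P<ds_peer>\w+)"
)


def parse_status(line):
    """Return the connection state, roles and disk states as a dict."""
    match = STATUS_RE.search(line)
    if match is None:
        raise ValueError(f"not a recognised /proc/drbd status line: {line!r}")
    return match.groupdict()


if __name__ == "__main__":
    wc01 = parse_status(" 1: cs:StandAlone st:Primary/Unknown ds:UpToDate/Outdated   r---")
    wc02 = parse_status(" 1: cs:StandAlone st:Secondary/Unknown ds:Consistent/DUnknown   r---")
    for name, status in (("wc01", wc01), ("wc02", wc02)):
        print(name, status)
```

read this way, wc01 sees its own disk as UpToDate and the peer's as Outdated, while wc02 sees its own disk only as Consistent and does not know the peer's state at all.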

if i try to connect both nodes, i get:

wc01:
> Oct 15 18:29:48 wc01 kernel: [522307.113592] drbd1: conn( StandAlone -> Unconnected )
> Oct 15 18:29:48 wc01 kernel: [522307.113699] drbd1: Starting receiver thread (from drbd1_worker [12283])
> Oct 15 18:29:48 wc01 kernel: [522307.113767] drbd1: receiver (re)started
> Oct 15 18:29:48 wc01 kernel: [522307.113800] drbd1: conn( Unconnected -> WFConnection )
> Oct 15 18:29:48 wc01 kernel: [522307.211176] drbd1: Handshake successful: Agreed network protocol version 88
> Oct 15 18:29:48 wc01 kernel: [522307.211229] drbd1: conn( WFConnection -> WFReportParams )
> Oct 15 18:29:48 wc01 kernel: [522307.211295] drbd1: Starting asender thread (from drbd1_receiver [20478])
> Oct 15 18:29:48 wc01 kernel: [522307.211443] drbd1: data-integrity-alg: <not-used>
> Oct 15 18:29:48 wc01 kernel: [522307.211481] drbd1: IO Suspended, no more requests in flight
> Oct 15 18:29:48 wc01 kernel: [522307.211484] drbd1: Resumed IO
> Oct 15 18:29:48 wc01 kernel: [522307.211664] drbd1: Split-Brain detected, dropping connection!
> Oct 15 18:29:48 wc01 kernel: [522307.211712] drbd1: self 8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc01 kernel: [522307.211792] drbd1: peer FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc01 kernel: [522307.211856] drbd1: helper command: /sbin/drbdadm split-brain minor-1
> Oct 15 18:29:48 wc01 kernel: [522307.213476] drbd1: meta connection shut down by peer.
> Oct 15 18:29:48 wc01 kernel: [522307.213519] drbd1: conn( WFReportParams -> NetworkFailure )
> Oct 15 18:29:48 wc01 kernel: [522307.213564] drbd1: asender terminated
> Oct 15 18:29:48 wc01 kernel: [522307.213598] drbd1: Terminating asender thread
> Oct 15 18:29:48 wc01 kernel: [522307.214621] drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
> Oct 15 18:29:48 wc01 kernel: [522307.214704] drbd1: conn( NetworkFailure -> Disconnecting )
> Oct 15 18:29:48 wc01 kernel: [522307.214745] drbd1: error receiving ReportState, l: 4!
> Oct 15 18:29:48 wc01 kernel: [522307.214821] drbd1: tl_clear()
> Oct 15 18:29:48 wc01 kernel: [522307.214861] drbd1: Connection closed
> Oct 15 18:29:48 wc01 kernel: [522307.214922] drbd1: conn( Disconnecting -> StandAlone )
> Oct 15 18:29:48 wc01 kernel: [522307.214972] drbd1: receiver terminated
> Oct 15 18:29:48 wc01 kernel: [522307.215005] drbd1: Terminating receiver thread


wc02:
> Oct 15 18:29:15 wc02 kernel: [32186.455168] drbd1: conn( StandAlone -> Unconnected )
> Oct 15 18:29:15 wc02 kernel: [32186.455210] drbd1: Starting receiver thread (from drbd1_worker [18480])
> Oct 15 18:29:15 wc02 kernel: [32186.455266] drbd1: receiver (re)started
> Oct 15 18:29:15 wc02 kernel: [32186.455293] drbd1: conn( Unconnected -> WFConnection )
> Oct 15 18:29:48 wc02 kernel: [32218.824012] drbd1: Handshake successful: Agreed network protocol version 88
> Oct 15 18:29:48 wc02 kernel: [32218.824054] drbd1: conn( WFConnection -> WFReportParams )
> Oct 15 18:29:48 wc02 kernel: [32218.824108] drbd1: Starting asender thread (from drbd1_receiver [18546])
> Oct 15 18:29:48 wc02 kernel: [32218.824559] drbd1: data-integrity-alg: <not-used>
> Oct 15 18:29:48 wc02 kernel: [32218.824598] drbd1: IO Suspended, no more requests in flight
> Oct 15 18:29:48 wc02 kernel: [32218.824601] drbd1: Resumed IO
> Oct 15 18:29:48 wc02 kernel: [32218.824611] drbd1: Split-Brain detected, dropping connection!
> Oct 15 18:29:48 wc02 kernel: [32218.824639] drbd1: self FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc02 kernel: [32218.824696] drbd1: peer 8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc02 kernel: [32218.824745] drbd1: helper command: /sbin/drbdadm split-brain minor-1
> Oct 15 18:29:48 wc02 kernel: [32218.826115] drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
> Oct 15 18:29:48 wc02 kernel: [32218.826167] drbd1: conn( WFReportParams -> Disconnecting )
> Oct 15 18:29:48 wc02 kernel: [32218.826198] drbd1: error receiving ReportState, l: 4!
> Oct 15 18:29:48 wc02 kernel: [32218.826229] drbd1: asender terminated
> Oct 15 18:29:48 wc02 kernel: [32218.826255] drbd1: Terminating asender thread
> Oct 15 18:29:48 wc02 kernel: [32218.826299] drbd1: tl_clear()
> Oct 15 18:29:48 wc02 kernel: [32218.826324] drbd1: Connection closed
> Oct 15 18:29:48 wc02 kernel: [32218.826358] drbd1: conn( Disconnecting -> StandAlone )
> Oct 15 18:29:48 wc02 kernel: [32218.826390] drbd1: receiver terminated
> Oct 15 18:29:48 wc02 kernel: [32218.826415] drbd1: Terminating receiver thread

ok - i "guess" that the problem lies within the two ids:
> drbd1: self FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32
> drbd1: peer 8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32

so "self" (wc02) considers the setup a problem, as FD074E7CD0616BDE
(self) > 8F1A5669ED08075D (peer).
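
a side-by-side look at the two id sets may help here. the sketch below only checks which slots are equal on both nodes. it rests on assumptions: that the four colon-separated values are, in order, the current, bitmap and two history identifiers, that the lowest bit is a flag rather than part of the identifier, and that the values are compared for equality rather than by size. the slot names and function names are made up, and this is not drbd's actual comparison code.

```python
FIELDS = ("current", "bitmap", "history1", "history2")


def parse_uuids(text):
    """Split a 'self'/'peer' id string from the kernel log into named slots."""
    values = [int(part, 16) for part in text.strip().split(":")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} ids, got {len(values)}")
    return dict(zip(FIELDS, values))


def matching_slots(self_uuids, peer_uuids):
    """Report, per slot, whether both nodes hold the same id.

    The lowest bit is masked out before comparing (assumed to be a flag).
    """
    mask = ~1
    return {
        name: (self_uuids[name] & mask) == (peer_uuids[name] & mask)
        for name in FIELDS
    }


if __name__ == "__main__":
    wc02 = parse_uuids("FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32")
    wc01 = parse_uuids("8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32")
    for name, same in matching_slots(wc02, wc01).items():
        print(f"{name:9s} {'same' if same else 'differs'}")
```

under these assumptions only the first slot differs. that would mean both nodes share the same history and then each started a new data generation on its own, so which of the two first values is numerically larger would not matter.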

this is the result of a couple of hard reboots:
> reboot   system boot  2.6.27-rc5       Wed Oct 15 09:32 - 18:33  (09:00)
> reboot   system boot  2.6.27-rc5       Wed Oct 15 09:30 - 18:33  (09:02)
> reboot   system boot  2.6.27-rc5       Wed Oct 15 09:26 - 18:33  (09:06)

and running heartbeat/pacemaker using the drbd ocf ra on top of that.

what would be the correct way to

a) analyse what led to this split-brain situation and
b) avoid it in the future

dpkg-version: 8.2.7~rc1-0 and, given that i have not messed up:
GIT-hash: 9ce425f51860f1205395bf7127cad3c427070837

cheers,
raoul

ps. i know, that the drbd ocf agent is not considered stable, but i am
trying to improve that now ;)
--
____________________________________________________________________
DI (FH) Raoul Bhatia M.Sc.          email.          r.bhatia at ipax.at
Technischer Leiter

IPAX - Aloy Bhatia Hava OEG         web.          http://www.ipax.at
Barawitzkagasse 10/2/2/11           email.            office at ipax.at
1190 Wien                           tel.               +43 1 3670030
FN 277995t HG Wien                  fax.            +43 1 3670030 15
____________________________________________________________________