[DRBD-user] split brain - why does it not work out?

Raoul Bhatia [IPAX] r.bhatia at ipax.at
Wed Oct 15 18:38:52 CEST 2008


hi,

i have two nodes: wc01 and wc02

wc01:
> GIT-hash: 9ce425f51860f1205395bf7127cad3c427070837 build by root at wc01, 2008-09-11 14:44:52
> ...
>  1: cs:StandAlone st:Primary/Unknown ds:UpToDate/Outdated   r---
>     ns:0 nr:0 dw:180 dr:2997 al:9 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:260

wc02:
> GIT-hash: 9ce425f51860f1205395bf7127cad3c427070837 build by root at wc01, 2008-09-11 14:44:52
> ...
>  1: cs:StandAlone st:Secondary/Unknown ds:Consistent/DUnknown   r---
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:14 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:69632

i see that both are StandAlone, wc01 is Primary (and thus "productive"),
and wc01 considers wc02 Outdated. so i would expect wc02 to be
overwritten once the two reconnect.
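
to make that reading explicit, here is a minimal python sketch that splits
the two status lines above into their fields. the dictionary keys (role,
peer_role, disk, peer_disk) are names i made up for illustration, and the
"local/peer" reading of st: and ds: is my assumption about the format.

```python
import re

# one status line per node, copied from the /proc/drbd output above
STATUS = {
    "wc01": " 1: cs:StandAlone st:Primary/Unknown ds:UpToDate/Outdated   r---",
    "wc02": " 1: cs:StandAlone st:Secondary/Unknown ds:Consistent/DUnknown   r---",
}


def parse_status(line):
    """Split a /proc/drbd device line into its key:value fields."""
    fields = dict(re.findall(r"(\w+):(\S+)", line))
    # assumption: st: and ds: are written as local/peer
    local_role, peer_role = fields["st"].split("/")
    local_disk, peer_disk = fields["ds"].split("/")
    return {
        "cs": fields["cs"],
        "role": local_role,
        "peer_role": peer_role,
        "disk": local_disk,
        "peer_disk": peer_disk,
    }


for node, line in STATUS.items():
    print(node, parse_status(line))
```

one thing this makes visible: wc01 records its peer's disk as Outdated,
while wc02 reports its own disk as Consistent, so the two nodes do not
agree on the state of wc02's data.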

if i try to connect both nodes, i get:

wc01:
> Oct 15 18:29:48 wc01 kernel: [522307.113592] drbd1: conn( StandAlone -> Unconnected ) 
> Oct 15 18:29:48 wc01 kernel: [522307.113699] drbd1: Starting receiver thread (from drbd1_worker [12283])
> Oct 15 18:29:48 wc01 kernel: [522307.113767] drbd1: receiver (re)started
> Oct 15 18:29:48 wc01 kernel: [522307.113800] drbd1: conn( Unconnected -> WFConnection ) 
> Oct 15 18:29:48 wc01 kernel: [522307.211176] drbd1: Handshake successful: Agreed network protocol version 88
> Oct 15 18:29:48 wc01 kernel: [522307.211229] drbd1: conn( WFConnection -> WFReportParams ) 
> Oct 15 18:29:48 wc01 kernel: [522307.211295] drbd1: Starting asender thread (from drbd1_receiver [20478])
> Oct 15 18:29:48 wc01 kernel: [522307.211443] drbd1: data-integrity-alg: <not-used>
> Oct 15 18:29:48 wc01 kernel: [522307.211481] drbd1: IO Suspended, no more requests in flight
> Oct 15 18:29:48 wc01 kernel: [522307.211484] drbd1: Resumed IO
> Oct 15 18:29:48 wc01 kernel: [522307.211664] drbd1: Split-Brain detected, dropping connection!
> Oct 15 18:29:48 wc01 kernel: [522307.211712] drbd1: self 8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc01 kernel: [522307.211792] drbd1: peer FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc01 kernel: [522307.211856] drbd1: helper command: /sbin/drbdadm split-brain minor-1
> Oct 15 18:29:48 wc01 kernel: [522307.213476] drbd1: meta connection shut down by peer.
> Oct 15 18:29:48 wc01 kernel: [522307.213519] drbd1: conn( WFReportParams -> NetworkFailure ) 
> Oct 15 18:29:48 wc01 kernel: [522307.213564] drbd1: asender terminated
> Oct 15 18:29:48 wc01 kernel: [522307.213598] drbd1: Terminating asender thread
> Oct 15 18:29:48 wc01 kernel: [522307.214621] drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
> Oct 15 18:29:48 wc01 kernel: [522307.214704] drbd1: conn( NetworkFailure -> Disconnecting ) 
> Oct 15 18:29:48 wc01 kernel: [522307.214745] drbd1: error receiving ReportState, l: 4!
> Oct 15 18:29:48 wc01 kernel: [522307.214821] drbd1: tl_clear()
> Oct 15 18:29:48 wc01 kernel: [522307.214861] drbd1: Connection closed
> Oct 15 18:29:48 wc01 kernel: [522307.214922] drbd1: conn( Disconnecting -> StandAlone ) 
> Oct 15 18:29:48 wc01 kernel: [522307.214972] drbd1: receiver terminated
> Oct 15 18:29:48 wc01 kernel: [522307.215005] drbd1: Terminating receiver thread


wc02:
> Oct 15 18:29:15 wc02 kernel: [32186.455168] drbd1: conn( StandAlone -> Unconnected ) 
> Oct 15 18:29:15 wc02 kernel: [32186.455210] drbd1: Starting receiver thread (from drbd1_worker [18480])
> Oct 15 18:29:15 wc02 kernel: [32186.455266] drbd1: receiver (re)started
> Oct 15 18:29:15 wc02 kernel: [32186.455293] drbd1: conn( Unconnected -> WFConnection ) 
> Oct 15 18:29:48 wc02 kernel: [32218.824012] drbd1: Handshake successful: Agreed network protocol version 88
> Oct 15 18:29:48 wc02 kernel: [32218.824054] drbd1: conn( WFConnection -> WFReportParams ) 
> Oct 15 18:29:48 wc02 kernel: [32218.824108] drbd1: Starting asender thread (from drbd1_receiver [18546])
> Oct 15 18:29:48 wc02 kernel: [32218.824559] drbd1: data-integrity-alg: <not-used>
> Oct 15 18:29:48 wc02 kernel: [32218.824598] drbd1: IO Suspended, no more requests in flight
> Oct 15 18:29:48 wc02 kernel: [32218.824601] drbd1: Resumed IO
> Oct 15 18:29:48 wc02 kernel: [32218.824611] drbd1: Split-Brain detected, dropping connection!
> Oct 15 18:29:48 wc02 kernel: [32218.824639] drbd1: self FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc02 kernel: [32218.824696] drbd1: peer 8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc02 kernel: [32218.824745] drbd1: helper command: /sbin/drbdadm split-brain minor-1
> Oct 15 18:29:48 wc02 kernel: [32218.826115] drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
> Oct 15 18:29:48 wc02 kernel: [32218.826167] drbd1: conn( WFReportParams -> Disconnecting ) 
> Oct 15 18:29:48 wc02 kernel: [32218.826198] drbd1: error receiving ReportState, l: 4!
> Oct 15 18:29:48 wc02 kernel: [32218.826229] drbd1: asender terminated
> Oct 15 18:29:48 wc02 kernel: [32218.826255] drbd1: Terminating asender thread
> Oct 15 18:29:48 wc02 kernel: [32218.826299] drbd1: tl_clear()
> Oct 15 18:29:48 wc02 kernel: [32218.826324] drbd1: Connection closed
> Oct 15 18:29:48 wc02 kernel: [32218.826358] drbd1: conn( Disconnecting -> StandAlone ) 
> Oct 15 18:29:48 wc02 kernel: [32218.826390] drbd1: receiver terminated
> Oct 15 18:29:48 wc02 kernel: [32218.826415] drbd1: Terminating receiver thread
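
for easier reading, the connection state changes can be pulled out of such
a log with a few lines of python. this is only a sketch: the LOG string
holds the message part of selected wc02 lines from above, and the function
name state_path is made up for illustration.

```python
import re

# message part of selected wc02 kernel log lines, copied from above
LOG = """\
drbd1: conn( StandAlone -> Unconnected )
drbd1: conn( Unconnected -> WFConnection )
drbd1: conn( WFConnection -> WFReportParams )
drbd1: Split-Brain detected, dropping connection!
drbd1: conn( WFReportParams -> Disconnecting )
drbd1: conn( Disconnecting -> StandAlone )
"""

# matches lines of the form "conn( Old -> New )"
TRANSITION = re.compile(r"conn\( (\w+) -> (\w+) \)")


def state_path(log_text):
    """Return the ordered list of connection states seen in the log."""
    path = []
    for old, new in TRANSITION.findall(log_text):
        if not path:
            path.append(old)  # the very first state we started from
        path.append(new)
    return path


print(" -> ".join(state_path(LOG)))
```

on wc02 the path ends in StandAlone again right after the split-brain
message, which matches what /proc/drbd shows.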

ok - i "guess" that the problem lies within the two ids:
> drbd1: self FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32
> drbd1: peer 8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32

so "self" (wc02) considers the setup a problem, as FD074E7CD0616BDE
(self) > 8F1A5669ED08075D (peer).
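
i am not sure the numeric ordering is what matters, though. as far as i
understand, drbd checks which of these ids are equal rather than which is
larger. the sketch below compares the two lines field by field. it assumes
the four fields are current:bitmap:history:history and that the lowest bit
of each id is a flag that should be ignored; neither assumption is stated
in the log itself, and the field names are mine.

```python
# the two lines logged on wc02, copied from above
SELF = "FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32"
PEER = "8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32"

# assumed field order, not stated in the log itself
NAMES = ["current", "bitmap", "history1", "history2"]


def compare(self_ids, peer_ids):
    """Return {field name: True/False} for equality, ignoring the lowest bit."""
    result = {}
    for name, s, p in zip(NAMES, self_ids.split(":"), peer_ids.split(":")):
        # assumption: bit 0 is a flag, so mask it out before comparing
        result[name] = (int(s, 16) & ~1) == (int(p, 16) & ~1)
    return result


matches = compare(SELF, PEER)
for name in NAMES:
    print(f"{name:9s} {'same' if matches[name] else 'DIFFERENT'}")
```

under these assumptions only the first field differs, while the second
field is identical on both nodes. that would fit both nodes having moved
on independently from the same earlier generation, but i would like
someone to confirm this reading.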

this is the result of a couple of hard reboots:
> reboot   system boot  2.6.27-rc5       Wed Oct 15 09:32 - 18:33  (09:00)    
> reboot   system boot  2.6.27-rc5       Wed Oct 15 09:30 - 18:33  (09:02)    
> reboot   system boot  2.6.27-rc5       Wed Oct 15 09:26 - 18:33  (09:06)    

while running heartbeat/pacemaker with the drbd ocf resource agent (ra)
on top of that.

what would be the correct way to

a) analyse what led to this split-brain situation and
b) avoid it in the future

dpkg-version: 8.2.7~rc1-0 and, given that i have not messed up:
GIT-hash: 9ce425f51860f1205395bf7127cad3c427070837

cheers,
raoul

ps. i know that the drbd ocf agent is not considered stable, but i am
trying to improve that now ;)
-- 
____________________________________________________________________
DI (FH) Raoul Bhatia M.Sc.          email.          r.bhatia at ipax.at
Technischer Leiter

IPAX - Aloy Bhatia Hava OEG         web.          http://www.ipax.at
Barawitzkagasse 10/2/2/11           email.            office at ipax.at
1190 Wien                           tel.               +43 1 3670030
FN 277995t HG Wien                  fax.            +43 1 3670030 15
____________________________________________________________________


