Split brain mainly happens when your primary node has gone down and the
secondary node has been brought up as the new "primary". When the two
"primaries" are then up at the same time, you have a split brain.

To resolve a split brain, I usually try the following:

1- Disconnect the resource on both nodes (i.e. drbdadm disconnect resource)
2- Mark both nodes as secondary
3- On the node you want as your primary do:
     drbdadm -- --overwrite-data-of-peer primary resource
   OR on the actual secondary do:
     drbdadm -- --discard-my-data connect resource

This should resolve the split brain "most of the time".

Hope this helps

-----Original Message-----
From: drbd-user-bounces at lists.linbit.com [mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Raoul Bhatia [IPAX]
Sent: Wednesday, October 15, 2008 9:39 AM
To: drbd-user
Subject: [DRBD-user] split brain - why does it not work out?

hi,

i have two nodes: wc01 and wc02

wc01:
> GIT-hash: 9ce425f51860f1205395bf7127cad3c427070837 build by root at wc01, 2008-09-11 14:44:52
> ...
> 1: cs:StandAlone st:Primary/Unknown ds:UpToDate/Outdated r---
> ns:0 nr:0 dw:180 dr:2997 al:9 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:260

wc02:
> GIT-hash: 9ce425f51860f1205395bf7127cad3c427070837 build by root at wc01, 2008-09-11 14:44:52
> ...
> 1: cs:StandAlone st:Secondary/Unknown ds:Consistent/DUnknown r---
> ns:0 nr:0 dw:0 dr:0 al:0 bm:14 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:69632

i see that both are standalone, wc01 is primary (and thus "productive")
and wc01 considers wc02 as outdated. so i would think that wc02 should
be overwritten.

if i try to connect both nodes, i get:

wc01:
> Oct 15 18:29:48 wc01 kernel: [522307.113592] drbd1: conn( StandAlone -> Unconnected )
> Oct 15 18:29:48 wc01 kernel: [522307.113699] drbd1: Starting receiver thread (from drbd1_worker [12283])
> Oct 15 18:29:48 wc01 kernel: [522307.113767] drbd1: receiver (re)started
> Oct 15 18:29:48 wc01 kernel: [522307.113800] drbd1: conn( Unconnected -> WFConnection )
> Oct 15 18:29:48 wc01 kernel: [522307.211176] drbd1: Handshake successful: Agreed network protocol version 88
> Oct 15 18:29:48 wc01 kernel: [522307.211229] drbd1: conn( WFConnection -> WFReportParams )
> Oct 15 18:29:48 wc01 kernel: [522307.211295] drbd1: Starting asender thread (from drbd1_receiver [20478])
> Oct 15 18:29:48 wc01 kernel: [522307.211443] drbd1: data-integrity-alg: <not-used>
> Oct 15 18:29:48 wc01 kernel: [522307.211481] drbd1: IO Suspended, no more requests in flight
> Oct 15 18:29:48 wc01 kernel: [522307.211484] drbd1: Resumed IO
> Oct 15 18:29:48 wc01 kernel: [522307.211664] drbd1: Split-Brain detected, dropping connection!
> Oct 15 18:29:48 wc01 kernel: [522307.211712] drbd1: self 8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc01 kernel: [522307.211792] drbd1: peer FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc01 kernel: [522307.211856] drbd1: helper command: /sbin/drbdadm split-brain minor-1
> Oct 15 18:29:48 wc01 kernel: [522307.213476] drbd1: meta connection shut down by peer.
> Oct 15 18:29:48 wc01 kernel: [522307.213519] drbd1: conn( WFReportParams -> NetworkFailure )
> Oct 15 18:29:48 wc01 kernel: [522307.213564] drbd1: asender terminated
> Oct 15 18:29:48 wc01 kernel: [522307.213598] drbd1: Terminating asender thread
> Oct 15 18:29:48 wc01 kernel: [522307.214621] drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
> Oct 15 18:29:48 wc01 kernel: [522307.214704] drbd1: conn( NetworkFailure -> Disconnecting )
> Oct 15 18:29:48 wc01 kernel: [522307.214745] drbd1: error receiving ReportState, l: 4!
> Oct 15 18:29:48 wc01 kernel: [522307.214821] drbd1: tl_clear()
> Oct 15 18:29:48 wc01 kernel: [522307.214861] drbd1: Connection closed
> Oct 15 18:29:48 wc01 kernel: [522307.214922] drbd1: conn( Disconnecting -> StandAlone )
> Oct 15 18:29:48 wc01 kernel: [522307.214972] drbd1: receiver terminated
> Oct 15 18:29:48 wc01 kernel: [522307.215005] drbd1: Terminating receiver thread

wc02:
> Oct 15 18:29:15 wc02 kernel: [32186.455168] drbd1: conn( StandAlone -> Unconnected )
> Oct 15 18:29:15 wc02 kernel: [32186.455210] drbd1: Starting receiver thread (from drbd1_worker [18480])
> Oct 15 18:29:15 wc02 kernel: [32186.455266] drbd1: receiver (re)started
> Oct 15 18:29:15 wc02 kernel: [32186.455293] drbd1: conn( Unconnected -> WFConnection )
> Oct 15 18:29:48 wc02 kernel: [32218.824012] drbd1: Handshake successful: Agreed network protocol version 88
> Oct 15 18:29:48 wc02 kernel: [32218.824054] drbd1: conn( WFConnection -> WFReportParams )
> Oct 15 18:29:48 wc02 kernel: [32218.824108] drbd1: Starting asender thread (from drbd1_receiver [18546])
> Oct 15 18:29:48 wc02 kernel: [32218.824559] drbd1: data-integrity-alg: <not-used>
> Oct 15 18:29:48 wc02 kernel: [32218.824598] drbd1: IO Suspended, no more requests in flight
> Oct 15 18:29:48 wc02 kernel: [32218.824601] drbd1: Resumed IO
> Oct 15 18:29:48 wc02 kernel: [32218.824611] drbd1: Split-Brain detected, dropping connection!
> Oct 15 18:29:48 wc02 kernel: [32218.824639] drbd1: self FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc02 kernel: [32218.824696] drbd1: peer 8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc02 kernel: [32218.824745] drbd1: helper command: /sbin/drbdadm split-brain minor-1
> Oct 15 18:29:48 wc02 kernel: [32218.826115] drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
> Oct 15 18:29:48 wc02 kernel: [32218.826167] drbd1: conn( WFReportParams -> Disconnecting )
> Oct 15 18:29:48 wc02 kernel: [32218.826198] drbd1: error receiving ReportState, l: 4!
> Oct 15 18:29:48 wc02 kernel: [32218.826229] drbd1: asender terminated
> Oct 15 18:29:48 wc02 kernel: [32218.826255] drbd1: Terminating asender thread
> Oct 15 18:29:48 wc02 kernel: [32218.826299] drbd1: tl_clear()
> Oct 15 18:29:48 wc02 kernel: [32218.826324] drbd1: Connection closed
> Oct 15 18:29:48 wc02 kernel: [32218.826358] drbd1: conn( Disconnecting -> StandAlone )
> Oct 15 18:29:48 wc02 kernel: [32218.826390] drbd1: receiver terminated
> Oct 15 18:29:48 wc02 kernel: [32218.826415] drbd1: Terminating receiver thread

ok - i "guess" that the problem lies within the two ids:
> drbd1: self FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32
> drbd1: peer 8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32

so "self" (wc02) considers the setup a problem, as
FD074E7CD0616BDE (self) > 8F1A5669ED08075D (peer).

this is the result of a couple of hard reboots:
> reboot system boot 2.6.27-rc5 Wed Oct 15 09:32 - 18:33 (09:00)
> reboot system boot 2.6.27-rc5 Wed Oct 15 09:30 - 18:33 (09:02)
> reboot system boot 2.6.27-rc5 Wed Oct 15 09:26 - 18:33 (09:06)

and running heartbeat/pacemaker using the drbd ocf ra on top of that.
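
[editor's sketch] The "self" and "peer" lines in the logs above each carry four colon-separated identifiers. The helper below is a minimal sketch of how such a pair could be compared, under assumptions that are not stated in this thread: that the first field is the current identifier and the second the bitmap identifier, that the lowest bit of each identifier is a flag and is ignored when comparing, and that identifiers are tested for equality rather than for which one is larger. The function names are hypothetical, the rules are heavily simplified, and all-zero identifiers are not handled.

```shell
#!/bin/sh
# Hypothetical helpers; a simplified reading of the "self"/"peer" log
# lines, not DRBD's actual comparison code.

# mask_flag_bit UUID: print UUID with its lowest bit cleared
# (assumption: that bit is a flag, not part of the identifier).
mask_flag_bit() {
    _last=${1#"${1%?}"}                       # last hex digit
    printf '%s%X\n' "${1%?}" $(( 0x$_last & 14 ))
}

# uuid_field TUPLE N: print the Nth colon-separated field (1-based).
uuid_field() {
    printf '%s\n' "$1" | cut -d: -f"$2"
}

# classify_uuids SELF PEER: compare current (field 1) and bitmap
# (field 2) identifiers of both sides.
classify_uuids() {
    _sc=$(mask_flag_bit "$(uuid_field "$1" 1)")   # self current
    _sb=$(mask_flag_bit "$(uuid_field "$1" 2)")   # self bitmap
    _pc=$(mask_flag_bit "$(uuid_field "$2" 1)")   # peer current
    _pb=$(mask_flag_bit "$(uuid_field "$2" 2)")   # peer bitmap
    if [ "$_sc" = "$_pc" ]; then
        echo same-generation
    elif [ "$_sb" = "$_pc" ] || [ "$_pb" = "$_sc" ]; then
        echo one-side-ahead
    elif [ "$_sb" = "$_pb" ]; then
        echo split-brain       # common ancestor, diverged currents
    else
        echo other
    fi
}

# Values copied from the wc01 log above.
classify_uuids \
    8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32 \
    FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32
# prints: split-brain
```

Read this way, the two nodes share the second identifier (E834A02D55F25B6A) but each has a different first identifier, which would mean both sides started a new data generation from the same ancestor while disconnected.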

what would be the correct way to
a) analyse what led to this split-brain situation and
b) avoid it in the future

dpkg-version: 8.2.7~rc1-0 and, given that i have not messed up:
GIT-hash: 9ce425f51860f1205395bf7127cad3c427070837

cheers,
raoul

ps. i know, that the drbd ocf agent is not considered stable, but i am
trying to improve that now ;)
--
____________________________________________________________________
DI (FH) Raoul Bhatia M.Sc.          email.          r.bhatia at ipax.at
Technischer Leiter

IPAX - Aloy Bhatia Hava OEG         web.          http://www.ipax.at
Barawitzkagasse 10/2/2/11           email.            office at ipax.at
1190 Wien                           tel.               +43 1 3670030
FN 277995t HG Wien                  fax.            +43 1 3670030 15
____________________________________________________________________
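
[editor's sketch] The manual recovery described in the reply at the top of this message can be sketched as a pair of shell functions, using the --discard-my-data variant. This is only an illustration under stated assumptions: the resource name r0 is a placeholder, the function names are hypothetical, and the final "drbdadm connect" on the surviving node does not appear in the reply and is assumed here. By default the script only prints the commands; set DRBDADM=drbdadm to actually run them.

```shell
#!/bin/sh
# Dry run by default: commands are echoed, not executed.
DRBDADM=${DRBDADM:-echo drbdadm}

# Run on the node whose changes are to be thrown away (hypothetical name).
resolve_on_victim() {
    # Ignore a failure here, e.g. if the resource is already
    # disconnected (assumption).
    $DRBDADM disconnect "$1" || true
    $DRBDADM secondary "$1" &&
    $DRBDADM -- --discard-my-data connect "$1"
}

# Run on the node whose data is kept (hypothetical name). The plain
# "connect" is an assumption, not part of the reply above.
resolve_on_survivor() {
    $DRBDADM connect "$1"
}

# Example with a placeholder resource name:
resolve_on_victim r0
resolve_on_survivor r0
```

Which node plays the victim is a manual decision. In the situation described above, wc02 (Secondary, reported as Outdated by wc01) would be the natural candidate, but that should be checked against the actual data before anything is discarded.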