hi,

i have two nodes: wc01 and wc02

wc01:
> GIT-hash: 9ce425f51860f1205395bf7127cad3c427070837 build by root at wc01, 2008-09-11 14:44:52
> ...
> 1: cs:StandAlone st:Primary/Unknown ds:UpToDate/Outdated r---
> ns:0 nr:0 dw:180 dr:2997 al:9 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:260

wc02:
> GIT-hash: 9ce425f51860f1205395bf7127cad3c427070837 build by root at wc01, 2008-09-11 14:44:52
> ...
> 1: cs:StandAlone st:Secondary/Unknown ds:Consistent/DUnknown r---
> ns:0 nr:0 dw:0 dr:0 al:0 bm:14 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:69632

i see that both are standalone, wc01 is primary (and thus "productive")
and wc01 considers wc02 as outdated. so i would think that wc02 should
be overwritten.

if i try to connect both nodes, i get:

wc01:
> Oct 15 18:29:48 wc01 kernel: [522307.113592] drbd1: conn( StandAlone -> Unconnected )
> Oct 15 18:29:48 wc01 kernel: [522307.113699] drbd1: Starting receiver thread (from drbd1_worker [12283])
> Oct 15 18:29:48 wc01 kernel: [522307.113767] drbd1: receiver (re)started
> Oct 15 18:29:48 wc01 kernel: [522307.113800] drbd1: conn( Unconnected -> WFConnection )
> Oct 15 18:29:48 wc01 kernel: [522307.211176] drbd1: Handshake successful: Agreed network protocol version 88
> Oct 15 18:29:48 wc01 kernel: [522307.211229] drbd1: conn( WFConnection -> WFReportParams )
> Oct 15 18:29:48 wc01 kernel: [522307.211295] drbd1: Starting asender thread (from drbd1_receiver [20478])
> Oct 15 18:29:48 wc01 kernel: [522307.211443] drbd1: data-integrity-alg: <not-used>
> Oct 15 18:29:48 wc01 kernel: [522307.211481] drbd1: IO Suspended, no more requests in flight
> Oct 15 18:29:48 wc01 kernel: [522307.211484] drbd1: Resumed IO
> Oct 15 18:29:48 wc01 kernel: [522307.211664] drbd1: Split-Brain detected, dropping connection!
> Oct 15 18:29:48 wc01 kernel: [522307.211712] drbd1: self 8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc01 kernel: [522307.211792] drbd1: peer FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc01 kernel: [522307.211856] drbd1: helper command: /sbin/drbdadm split-brain minor-1
> Oct 15 18:29:48 wc01 kernel: [522307.213476] drbd1: meta connection shut down by peer.
> Oct 15 18:29:48 wc01 kernel: [522307.213519] drbd1: conn( WFReportParams -> NetworkFailure )
> Oct 15 18:29:48 wc01 kernel: [522307.213564] drbd1: asender terminated
> Oct 15 18:29:48 wc01 kernel: [522307.213598] drbd1: Terminating asender thread
> Oct 15 18:29:48 wc01 kernel: [522307.214621] drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
> Oct 15 18:29:48 wc01 kernel: [522307.214704] drbd1: conn( NetworkFailure -> Disconnecting )
> Oct 15 18:29:48 wc01 kernel: [522307.214745] drbd1: error receiving ReportState, l: 4!
> Oct 15 18:29:48 wc01 kernel: [522307.214821] drbd1: tl_clear()
> Oct 15 18:29:48 wc01 kernel: [522307.214861] drbd1: Connection closed
> Oct 15 18:29:48 wc01 kernel: [522307.214922] drbd1: conn( Disconnecting -> StandAlone )
> Oct 15 18:29:48 wc01 kernel: [522307.214972] drbd1: receiver terminated
> Oct 15 18:29:48 wc01 kernel: [522307.215005] drbd1: Terminating receiver thread

wc02:
> Oct 15 18:29:15 wc02 kernel: [32186.455168] drbd1: conn( StandAlone -> Unconnected )
> Oct 15 18:29:15 wc02 kernel: [32186.455210] drbd1: Starting receiver thread (from drbd1_worker [18480])
> Oct 15 18:29:15 wc02 kernel: [32186.455266] drbd1: receiver (re)started
> Oct 15 18:29:15 wc02 kernel: [32186.455293] drbd1: conn( Unconnected -> WFConnection )
> Oct 15 18:29:48 wc02 kernel: [32218.824012] drbd1: Handshake successful: Agreed network protocol version 88
> Oct 15 18:29:48 wc02 kernel: [32218.824054] drbd1: conn( WFConnection -> WFReportParams )
> Oct 15 18:29:48 wc02 kernel: [32218.824108] drbd1: Starting asender thread (from drbd1_receiver [18546])
> Oct 15 18:29:48 wc02 kernel: [32218.824559] drbd1: data-integrity-alg: <not-used>
> Oct 15 18:29:48 wc02 kernel: [32218.824598] drbd1: IO Suspended, no more requests in flight
> Oct 15 18:29:48 wc02 kernel: [32218.824601] drbd1: Resumed IO
> Oct 15 18:29:48 wc02 kernel: [32218.824611] drbd1: Split-Brain detected, dropping connection!
> Oct 15 18:29:48 wc02 kernel: [32218.824639] drbd1: self FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc02 kernel: [32218.824696] drbd1: peer 8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32
> Oct 15 18:29:48 wc02 kernel: [32218.824745] drbd1: helper command: /sbin/drbdadm split-brain minor-1
> Oct 15 18:29:48 wc02 kernel: [32218.826115] drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
> Oct 15 18:29:48 wc02 kernel: [32218.826167] drbd1: conn( WFReportParams -> Disconnecting )
> Oct 15 18:29:48 wc02 kernel: [32218.826198] drbd1: error receiving ReportState, l: 4!
> Oct 15 18:29:48 wc02 kernel: [32218.826229] drbd1: asender terminated
> Oct 15 18:29:48 wc02 kernel: [32218.826255] drbd1: Terminating asender thread
> Oct 15 18:29:48 wc02 kernel: [32218.826299] drbd1: tl_clear()
> Oct 15 18:29:48 wc02 kernel: [32218.826324] drbd1: Connection closed
> Oct 15 18:29:48 wc02 kernel: [32218.826358] drbd1: conn( Disconnecting -> StandAlone )
> Oct 15 18:29:48 wc02 kernel: [32218.826390] drbd1: receiver terminated
> Oct 15 18:29:48 wc02 kernel: [32218.826415] drbd1: Terminating receiver thread

ok - i "guess" that the problem lies within the two ids:
> drbd1: self FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32
> drbd1: peer 8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32

so "self" (wc02) considers the setup a problem, as
FD074E7CD0616BDE (self) > 8F1A5669ED08075D (peer).
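
to make the comparison of the two id lines easier to read, here is a small python sketch that splits the "self" and "peer" values from wc02's log into their four colon-separated fields and prints the positions that differ. the field labels (current, bitmap, history1, history2) are an assumption about the order in which drbd prints them, and the helper names are made up for illustration; this only compares strings and says nothing about how drbd itself decides on a split-brain.

```python
# The two lines as logged on wc02 (copied from the kernel log above).
SELF_LINE = "drbd1: self FD074E7CD0616BDE:E834A02D55F25B6A:00A945ED5EE5B422:5C6DB81D0F96EF32"
PEER_LINE = "drbd1: peer 8F1A5669ED08075D:E834A02D55F25B6A:00A945ED5EE5B423:5C6DB81D0F96EF32"

# Assumed meaning of the four fields, in printed order (not confirmed by the log).
LABELS = ["current", "bitmap", "history1", "history2"]


def parse_ids(line):
    """Return the colon-separated id fields from the last token of a log line."""
    return line.split()[-1].split(":")


def differing_fields(self_line, peer_line):
    """List (label, self_value, peer_value) for every field that is not identical."""
    pairs = zip(LABELS, parse_ids(self_line), parse_ids(peer_line))
    return [(label, s, p) for label, s, p in pairs if s != p]


differing = differing_fields(SELF_LINE, PEER_LINE)
for label, s, p in differing:
    print(f"{label}: self={s} peer={p}")
```

with the values above, the first and third fields differ, while the second and fourth are identical on both nodes.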
this is the result of a couple of hard reboots:
> reboot system boot 2.6.27-rc5 Wed Oct 15 09:32 - 18:33 (09:00)
> reboot system boot 2.6.27-rc5 Wed Oct 15 09:30 - 18:33 (09:02)
> reboot system boot 2.6.27-rc5 Wed Oct 15 09:26 - 18:33 (09:06)

and running heartbeat/pacemaker using the drbd ocf ra on top of that.

what would be the correct way to
a) analyse what led to this split-brain situation and
b) avoid it in the future

dpkg-version: 8.2.7~rc1-0 and, given that i have not messed up:
GIT-hash: 9ce425f51860f1205395bf7127cad3c427070837

cheers,
raoul

ps. i know that the drbd ocf agent is not considered stable, but i am
trying to improve that now ;)
--
____________________________________________________________________
DI (FH) Raoul Bhatia M.Sc.          email.     r.bhatia at ipax.at
Technischer Leiter
IPAX - Aloy Bhatia Hava OEG         web.       http://www.ipax.at
Barawitzkagasse 10/2/2/11           email.     office at ipax.at
1190 Wien                           tel.       +43 1 3670030
FN 277995t HG Wien                  fax.       +43 1 3670030 15
____________________________________________________________________