Hello DRBD masters and lovers!

As far as I am aware, I never had any real problems using DRBD, but that changed a couple of days ago. Both nodes suddenly have status "StandAlone", and /var/log/messages shows that I am blessed with a "Split-Brain". I suspect I forgot to switch node foc1 to Secondary before starting Heartbeat. I cannot think of any other explanation for what caused this situation.... (grin)

On both nodes fsck is happy, even with "fsck -n" (read-only) on the physical device after stopping DRBD. I can mount both sides (I did that one at a time) and my data looks about the same (no real comparison made).

The cluster is a test setup in my lab and the data has no real value, but I would like to understand what went wrong. Thanks in advance for your valued answers.

Nico van der Horn

Questions:
----------
1. How can I determine the real cause of the split-brain?
2. How do I correct the situation?

Setup:
------
Two nodes, foc1 and foc2, both running openSUSE 10.2 and DRBD 8.0.3.

/etc/drbd.conf:
---------------
global {
  usage-count yes;
}
common {
  syncer { rate 10M; }
}
resource r0 {
  protocol C;
  net {
    cram-hmac-alg sha1;
    shared-secret "DRBD is a blessing !";
  }
  on foc1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.0.1:7789;
    meta-disk internal;
  }
  on foc2 {
    device    /dev/drbd0;
    disk      /dev/hdc1;
    address   192.168.0.2:7789;
    meta-disk internal;
  }
}

Observations:
-------------
Fresh boot of both nodes.
foc1:~ # rcdrbd status
drbd driver loaded OK; device status:
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by root at mobilin, 2007-05-08 01:30:57
 0: cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown   r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:71 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0

foc2:~ # rcdrbd status
drbd driver loaded OK; device status:
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by root at mobilin, 2007-05-08 01:30:57
 0: cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown   r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0

foc1:~ # drbdadm show-gi all
+--< Current data generation UUID >-
|                +--< Bitmap's base data generation UUID >-
|                |                +--< younger historiy UUID >-
|                |                |                +-< older history >-
V                V                V                V
E6D2483DE53A183D:08C04AA01256AAAD:E482A1C10041241F:0836BE58931AB557:1:1:1:0:0:0
                                                                    ^ ^ ^ ^ ^ ^
                                        -< Data consistancy flag >--+ | | | | |
                               -< Data was/is currently up-to-date >--+ | | | |
                                    -< Node was/is currently primary >--+ | | |
                                    -< Node was/is currently connected >--+ | |
           -< Node was in the progress of setting all bits in the bitmap >--+ |
                          -< The peer's disk was out-dated or inconsistent >--+

foc2:~ # drbdadm get-gi all
30528E971B737575:08C04AA01256AAAC:E482A1C10041241E:0836BE58931AB557:1:1:0:0:0:0

For foc1 this shows "Node was/is currently primary", but "rcdrbd status" (/proc/drbd) reports "Secondary"!

foc1:/var/log/messages
----------------------
Jun 11 19:22:13 foc1 kernel: drbd: initialised. Version: 8.0.3 (api:86/proto:86)
Jun 11 19:22:13 foc1 kernel: drbd: SVN Revision: 2881 build by root at mobilin, 2007-05-08 01:30:57
Jun 11 19:22:13 foc1 kernel: drbd: registered as block device major 147
Jun 11 19:22:13 foc1 kernel: drbd: minor_table @ 0xc539eb40
Jun 11 19:22:13 foc1 kernel: drbd0: disk( Diskless -> Attaching )
Jun 11 19:22:13 foc1 kernel: klogd 1.4.1, ---------- state change ----------
Jun 11 19:22:13 foc1 kernel: drbd0: Found 4 transactions (136 active extents) in activity log.
Jun 11 19:22:13 foc1 kernel: drbd0: max_segment_size ( = BIO size ) = 32768
Jun 11 19:22:13 foc1 kernel: drbd0: drbd_bm_resize called with capacity == 160066632
Jun 11 19:22:13 foc1 kernel: drbd0: resync bitmap: bits=20008329 words=625262
Jun 11 19:22:13 foc1 kernel: drbd0: size = 76 GB (80033316 KB)
Jun 11 19:22:13 foc1 kernel: drbd0: reading of bitmap took 21 jiffies
Jun 11 19:22:13 foc1 kernel: drbd0: recounting of set bits took additional 5 jiffies
Jun 11 19:22:13 foc1 kernel: drbd0: 508 MB marked out-of-sync by on disk bit-map.
Jun 11 19:22:13 foc1 kernel: drbd0: Marked additional 0 KB as out-of-sync based on AL.
Jun 11 19:22:13 foc1 kernel: drbd0: disk( Attaching -> UpToDate )
Jun 11 19:22:13 foc1 kernel: drbd0: Writing meta data super block now.
Jun 11 19:22:13 foc1 kernel: drbd0: conn( StandAlone -> Unconnected )
Jun 11 19:22:13 foc1 kernel: drbd0: receiver (re)started
Jun 11 19:22:13 foc1 kernel: drbd0: conn( Unconnected -> WFConnection )
Jun 11 19:22:26 foc1 kernel: drbd0: conn( WFConnection -> WFReportParams )
Jun 11 19:22:26 foc1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
Jun 11 19:22:26 foc1 kernel: drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Jun 11 19:22:26 foc1 kernel: drbd0: Split-Brain detected, dropping connection!
Jun 11 19:22:26 foc1 kernel: drbd0: self E6D2483DE53A183D:08C04AA01256AAAD:E482A1C10041241F:0836BE58931AB557
Jun 11 19:22:26 foc1 kernel: drbd0: peer 30528E971B737575:08C04AA01256AAAC:E482A1C10041241E:0836BE58931AB557
Jun 11 19:22:26 foc1 kernel: drbd0: conn( WFReportParams -> Disconnecting )
Jun 11 19:22:26 foc1 kernel: drbd0: error receiving ReportState, l: 4!
Jun 11 19:22:26 foc1 kernel: drbd0: asender terminated
Jun 11 19:22:26 foc1 kernel: drbd0: tl_clear()
Jun 11 19:22:26 foc1 kernel: drbd0: Connection closed
Jun 11 19:22:26 foc1 kernel: drbd0: conn( Disconnecting -> StandAlone )
Jun 11 19:22:26 foc1 kernel: drbd0: receiver terminated

foc2:/var/log/messages
----------------------
Jun 11 19:22:26 foc2 kernel: drbd: initialised. Version: 8.0.3 (api:86/proto:86)
Jun 11 19:22:26 foc2 kernel: drbd: SVN Revision: 2881 build by root at mobilin, 2007-05-08 01:30:57
Jun 11 19:22:26 foc2 kernel: drbd: registered as block device major 147
Jun 11 19:22:26 foc2 kernel: drbd: minor_table @ 0xca6ac2a0
Jun 11 19:22:26 foc2 kernel: drbd0: disk( Diskless -> Attaching )
Jun 11 19:22:26 foc2 kernel: klogd 1.4.1, ---------- state change ----------
Jun 11 19:22:26 foc2 kernel: drbd0: Found 4 transactions (66 active extents) in activity log.
Jun 11 19:22:26 foc2 kernel: drbd0: max_segment_size ( = BIO size ) = 32768
Jun 11 19:22:26 foc2 kernel: drbd0: drbd_bm_resize called with capacity == 160066632
Jun 11 19:22:26 foc2 kernel: drbd0: resync bitmap: bits=20008329 words=625262
Jun 11 19:22:26 foc2 kernel: drbd0: size = 76 GB (80033316 KB)
Jun 11 19:22:26 foc2 kernel: drbd0: reading of bitmap took 21 jiffies
Jun 11 19:22:26 foc2 kernel: drbd0: recounting of set bits took additional 7 jiffies
Jun 11 19:22:26 foc2 kernel: drbd0: 220 KB marked out-of-sync by on disk bit-map.
Jun 11 19:22:26 foc2 kernel: drbd0: disk( Attaching -> UpToDate )
Jun 11 19:22:26 foc2 kernel: drbd0: Writing meta data super block now.
Jun 11 19:22:26 foc2 kernel: drbd0: conn( StandAlone -> Unconnected )
Jun 11 19:22:26 foc2 kernel: drbd0: receiver (re)started
Jun 11 19:22:26 foc2 kernel: drbd0: conn( Unconnected -> WFConnection )
Jun 11 19:22:26 foc2 kernel: drbd0: conn( WFConnection -> WFReportParams )
Jun 11 19:22:26 foc2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
Jun 11 19:22:26 foc2 kernel: drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Jun 11 19:22:26 foc2 kernel: drbd0: Split-Brain detected, dropping connection!
Jun 11 19:22:26 foc2 kernel: drbd0: self 30528E971B737575:08C04AA01256AAAC:E482A1C10041241E:0836BE58931AB557
Jun 11 19:22:26 foc2 kernel: drbd0: peer E6D2483DE53A183D:08C04AA01256AAAD:E482A1C10041241F:0836BE58931AB557
Jun 11 19:22:26 foc2 kernel: drbd0: conn( WFReportParams -> Disconnecting )
Jun 11 19:22:26 foc2 kernel: drbd0: meta connection shut down by peer.
Jun 11 19:22:26 foc2 kernel: drbd0: asender terminated
Jun 11 19:22:26 foc2 kernel: drbd0: error receiving ReportState, l: 4!
Jun 11 19:22:26 foc2 kernel: drbd0: tl_clear()
Jun 11 19:22:26 foc2 kernel: drbd0: Connection closed
Jun 11 19:22:26 foc2 kernel: drbd0: conn( Disconnecting -> StandAlone )
Jun 11 19:22:26 foc2 kernel: drbd0: receiver terminated

Kind regards,
Nico

---
Met vriendelijke groeten / Mit freundlichen Grüßen / Kind Regards / Meilleures Salutations / Saludos Cordiales
Parhain terveisin / Med vänlig hälsning / Namashkaar / Wassalam Alaikom / Pollous chairetismous
--
N.J. van der Horn, http://www.vanderhorn.nl, http://www.inet.nl, Vanderhorn IT-Works,
Voorstraat 55, 3135 HW Vlaardingen, The Netherlands, Tel +31 10 2486060, Fax +31 10 2486061
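The generation-identifier tuples above (from `drbdadm get-gi` and from the `self`/`peer` kernel log lines) can be decoded mechanically. Below is a minimal Python sketch, not DRBD's actual code, that splits a tuple into its four UUIDs and the six flags in the order given by the `show-gi` legend, then applies a simplified model of the comparison made during the handshake. The names `parse_gi`, `GenId` and `compare`, the rule set, and the choice to ignore the lowest bit of each UUID are all assumptions of this sketch; the masking is suggested by foc1's and foc2's bitmap and younger-history UUIDs differing only in that bit (`...AAAD` vs `...AAAC`, `...241F` vs `...241E`).

```python
from typing import NamedTuple

# Flag names in the order of the `drbdadm show-gi` legend.
FLAG_NAMES = (
    "consistent",             # Data consistency flag
    "up_to_date",             # Data was/is currently up-to-date
    "primary",                # Node was/is currently primary
    "connected",              # Node was/is currently connected
    "full_sync_in_progress",  # Node was setting all bits in the bitmap
    "peer_outdated",          # The peer's disk was outdated or inconsistent
)


class GenId(NamedTuple):
    current: int              # current data generation UUID
    bitmap: int               # bitmap's base data generation UUID
    history: tuple[int, int]  # younger and older history UUIDs
    flags: dict[str, bool]    # empty when the tuple carries only UUIDs


def parse_gi(line: str) -> GenId:
    """Parse 'UUID:UUID:UUID:UUID[:flag...]' as printed by get-gi or the kernel log."""
    fields = line.strip().split(":")
    uuids = [int(f, 16) for f in fields[:4]]
    flags = {name: f == "1" for name, f in zip(FLAG_NAMES, fields[4:])}
    return GenId(uuids[0], uuids[1], (uuids[2], uuids[3]), flags)


def _mask(uuid: int) -> int:
    # Assumption: the lowest bit is not part of the identity, so clear it.
    return uuid & ~1


def compare(self_gi: GenId, peer_gi: GenId) -> str:
    """Hypothetical, simplified model of the handshake decision."""
    s_cur, p_cur = _mask(self_gi.current), _mask(peer_gi.current)
    s_bm, p_bm = _mask(self_gi.bitmap), _mask(peer_gi.bitmap)
    if s_cur == p_cur:
        return "in sync"
    if p_bm == s_cur:  # peer moved on from our current generation
        return "peer has newer data: become sync target"
    if s_bm == p_cur:  # we moved on from the peer's current generation
        return "self has newer data: become sync source"
    if s_bm == p_bm and s_bm != 0:  # both moved on from the same generation
        return "split brain"
    return "unrelated data"


foc1 = parse_gi("E6D2483DE53A183D:08C04AA01256AAAD:E482A1C10041241F:0836BE58931AB557:1:1:1:0:0:0")
foc2 = parse_gi("30528E971B737575:08C04AA01256AAAC:E482A1C10041241E:0836BE58931AB557:1:1:0:0:0:0")
print(foc1.flags["primary"], foc2.flags["primary"])  # prints: True False
print(compare(foc1, foc2))                            # prints: split brain
```

With the foc1 and foc2 tuples, the current UUIDs differ and neither side's bitmap UUID matches the other's current UUID, but the two bitmap UUIDs are equal once the lowest bit is cleared, so this model also ends in "split brain", consistent with the "Split-Brain detected" lines in both logs. Because the flags are parsed positionally, passing only the four UUIDs from a kernel log line yields an empty `flags` dict.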