[DRBD-user] Split-brain

N.J. van der Horn (Nico) nico at vanderhorn.nl
Mon Jun 11 22:40:23 CEST 2007

Hello DRBD masters and lovers!

As far as I am aware, I never had any real problems using DRBD,
but that changed a couple of days ago.
Both nodes suddenly report connection state "StandAlone", and
/var/log/messages shows that I am blessed with a "Split-Brain".

I suspect that I forgot to switch node foc1 to Secondary
before starting Heartbeat.
Nothing else comes to mind that would explain this situation.

On both nodes fsck is happy when I run "fsck -n" (read-only)
on the physical device after stopping DRBD.
I can mount both sides (one at a time), and the data looks
about the same, although I made no real comparison.

The cluster is a test setup in my lab and the data has no
real value, but I would like to understand what went wrong.

Thanks in advance for your valued answers.

Nico van der Horn


Questions:
----------
1. How can I determine the real cause of the split-brain?
2. How do I correct the situation?

Setup:
------
Two nodes: foc1 and foc2, both running openSUSE-10.2, DRBD-8.0.3
/etc/drbd.conf:
---------------
global { usage-count yes; }
common { syncer { rate 10M ; } }
resource r0
{
         protocol C;
         net
         {
                 cram-hmac-alg sha1;
                 shared-secret "DRBD is a blessing !";
         }
         on foc1
         {
                 device          /dev/drbd0;
                 disk            /dev/sdb1;
                 address         192.168.0.1:7789;
                 meta-disk       internal;
         }
         on foc2
         {
                 device          /dev/drbd0;
                 disk            /dev/hdc1;
                 address         192.168.0.2:7789;
                 meta-disk       internal;
         }
}

Observations:
-------------
Fresh boot of both nodes.

foc1:~ # rcdrbd status
drbd driver loaded OK; device status:
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by root at mobilin, 2007-05-08 01:30:57
  0: cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown   r---
     ns:0 nr:0 dw:0 dr:0 al:0 bm:71 lo:0 pe:0 ua:0 ap:0
         resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
         act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0

foc2:~ # rcdrbd status
drbd driver loaded OK; device status:
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by root at mobilin, 2007-05-08 01:30:57
  0: cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown   r---
     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
         resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
         act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
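
For scripted checks, the one-line device summary in the status output
above can be split into its fields. This is only a minimal Python
sketch, assuming the 8.0-style layout shown in this post; the function
name parse_status is made up for illustration, and the sample line is
copied from foc1's output.

```python
import re

# Device summary line copied from foc1's "rcdrbd status" output above.
LINE = "  0: cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown   r---"

def parse_status(line):
    """Split a DRBD 8.0-style device line into its minor number and key:value fields."""
    m = re.match(r"\s*(\d+):\s+(.*)", line)
    if m is None:
        raise ValueError("not a device status line: %r" % line)
    fields = {"minor": int(m.group(1))}
    for token in m.group(2).split():
        # Tokens without a colon (such as the trailing "r---") are skipped.
        if ":" in token:
            key, value = token.split(":", 1)
            fields[key] = value
    return fields

status = parse_status(LINE)
print(status["cs"], status["st"], status["ds"])
```

Both nodes produce the same cs/st/ds values here, so the status line
alone does not say which side diverged.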

foc1:~ # drbdadm show-gi all
        +--<  Current data generation UUID  >-
        |               +--<  Bitmap's base data generation UUID  >-
        |               |                 +--<  younger historiy UUID  >-
        |               |                 |         +-<  older history  >-
        V               V                 V         V
E6D2483DE53A183D:08C04AA01256AAAD:E482A1C10041241F:0836BE58931AB557:1:1:1:0:0:0
                                                                    ^ ^ ^ ^ ^ ^
                                      -<  Data consistancy flag  >--+ | | | | |
                             -<  Data was/is currently up-to-date  >--+ | | | |
                                  -<  Node was/is currently primary  >--+ | | |
                                  -<  Node was/is currently connected  >--+ | |
         -<  Node was in the progress of setting all bits in the bitmap  >--+ |
                        -<  The peer's disk was out-dated or inconsistent  >--+

foc2:~ # drbdadm get-gi all
30528E971B737575:08C04AA01256AAAC:E482A1C10041241E:0836BE58931AB557:1:1:0:0:0:0

On foc1 this shows the "Node was/is currently primary" flag set (it is
clear on foc2), yet "rcdrbd status" (/proc/drbd) reports "Secondary"
on both nodes!
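
The flag columns are easier to compare between the nodes once they are
decoded by name. The following is a rough Python sketch, not part of
DRBD: the field labels are hypothetical names chosen to follow the
order of the show-gi legend, and the two input lines are copied from
the output above.

```python
# Generation-identifier lines copied from the show-gi / get-gi output above.
GI = {
    "foc1": "E6D2483DE53A183D:08C04AA01256AAAD:E482A1C10041241F:0836BE58931AB557:1:1:1:0:0:0",
    "foc2": "30528E971B737575:08C04AA01256AAAC:E482A1C10041241E:0836BE58931AB557:1:1:0:0:0:0",
}

# Hypothetical labels; their order follows the show-gi legend.
UUID_NAMES = ("current", "bitmap", "history_young", "history_old")
FLAG_NAMES = ("consistent", "up_to_date", "was_primary", "was_connected",
              "full_sync_pending", "peer_outdated")

def decode_gi(line):
    """Return (uuids, flags) dictionaries for one generation-identifier line."""
    parts = line.strip().split(":")
    if len(parts) != len(UUID_NAMES) + len(FLAG_NAMES):
        raise ValueError("unexpected number of fields: %d" % len(parts))
    uuids = dict(zip(UUID_NAMES, parts[:4]))
    flags = {name: part == "1" for name, part in zip(FLAG_NAMES, parts[4:])}
    return uuids, flags

for node in sorted(GI):
    uuids, flags = decode_gi(GI[node])
    print(node, uuids["current"], [name for name in FLAG_NAMES if flags[name]])
```

Decoded this way, the primary flag is set on foc1 only. The "was/is"
wording in the legend suggests that the flag may record a role stored
in the meta data rather than the live role, but that reading is an
assumption.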

foc1:/var/log/messages
----------------------
Jun 11 19:22:13 foc1 kernel: drbd: initialised. Version: 8.0.3 (api:86/proto:86)
Jun 11 19:22:13 foc1 kernel: drbd: SVN Revision: 2881 build by root at mobilin, 2007-05-08 01:30:57
Jun 11 19:22:13 foc1 kernel: drbd: registered as block device major 147
Jun 11 19:22:13 foc1 kernel: drbd: minor_table @ 0xc539eb40
Jun 11 19:22:13 foc1 kernel: drbd0: disk( Diskless -> Attaching )
Jun 11 19:22:13 foc1 kernel: klogd 1.4.1, ---------- state change ----------
Jun 11 19:22:13 foc1 kernel: drbd0: Found 4 transactions (136 active extents) in activity log.
Jun 11 19:22:13 foc1 kernel: drbd0: max_segment_size ( = BIO size ) = 32768
Jun 11 19:22:13 foc1 kernel: drbd0: drbd_bm_resize called with capacity == 160066632
Jun 11 19:22:13 foc1 kernel: drbd0: resync bitmap: bits=20008329 words=625262
Jun 11 19:22:13 foc1 kernel: drbd0: size = 76 GB (80033316 KB)
Jun 11 19:22:13 foc1 kernel: drbd0: reading of bitmap took 21 jiffies
Jun 11 19:22:13 foc1 kernel: drbd0: recounting of set bits took additional 5 jiffies
Jun 11 19:22:13 foc1 kernel: drbd0: 508 MB marked out-of-sync by on disk bit-map.
Jun 11 19:22:13 foc1 kernel: drbd0: Marked additional 0 KB as out-of-sync based on AL.
Jun 11 19:22:13 foc1 kernel: drbd0: disk( Attaching -> UpToDate )
Jun 11 19:22:13 foc1 kernel: drbd0: Writing meta data super block now.
Jun 11 19:22:13 foc1 kernel: drbd0: conn( StandAlone -> Unconnected )
Jun 11 19:22:13 foc1 kernel: drbd0: receiver (re)started
Jun 11 19:22:13 foc1 kernel: drbd0: conn( Unconnected -> WFConnection )
Jun 11 19:22:26 foc1 kernel: drbd0: conn( WFConnection -> WFReportParams )
Jun 11 19:22:26 foc1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
Jun 11 19:22:26 foc1 kernel: drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Jun 11 19:22:26 foc1 kernel: drbd0: Split-Brain detected, dropping connection!
Jun 11 19:22:26 foc1 kernel: drbd0: self E6D2483DE53A183D:08C04AA01256AAAD:E482A1C10041241F:0836BE58931AB557
Jun 11 19:22:26 foc1 kernel: drbd0: peer 30528E971B737575:08C04AA01256AAAC:E482A1C10041241E:0836BE58931AB557
Jun 11 19:22:26 foc1 kernel: drbd0: conn( WFReportParams -> Disconnecting )
Jun 11 19:22:26 foc1 kernel: drbd0: error receiving ReportState, l: 4!
Jun 11 19:22:26 foc1 kernel: drbd0: asender terminated
Jun 11 19:22:26 foc1 kernel: drbd0: tl_clear()
Jun 11 19:22:26 foc1 kernel: drbd0: Connection closed
Jun 11 19:22:26 foc1 kernel: drbd0: conn( Disconnecting -> StandAlone )
Jun 11 19:22:26 foc1 kernel: drbd0: receiver terminated
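
The connection-state changes in the foc1 log above can be pulled out
and chained into one path, which makes it easier to see that the node
went from StandAlone back to StandAlone within one connection attempt.
A small Python sketch, assuming the "conn( A -> B )" message format
seen in this log; the helper names are made up for illustration.

```python
import re

# Lines copied verbatim from the foc1 log above.
LOG = """\
Jun 11 19:22:13 foc1 kernel: drbd0: conn( StandAlone -> Unconnected )
Jun 11 19:22:13 foc1 kernel: drbd0: conn( Unconnected -> WFConnection )
Jun 11 19:22:26 foc1 kernel: drbd0: conn( WFConnection -> WFReportParams )
Jun 11 19:22:26 foc1 kernel: drbd0: Split-Brain detected, dropping connection!
Jun 11 19:22:26 foc1 kernel: drbd0: conn( WFReportParams -> Disconnecting )
Jun 11 19:22:26 foc1 kernel: drbd0: conn( Disconnecting -> StandAlone )
"""

TRANSITION = re.compile(r"(\w+): (conn|disk)\( (\w+) -> (\w+) \)")

def transitions(text, kind="conn"):
    """Return (old, new) state pairs of the given kind, in log order."""
    pairs = []
    for line in text.splitlines():
        m = TRANSITION.search(line)
        if m and m.group(2) == kind:
            pairs.append((m.group(3), m.group(4)))
    return pairs

def state_path(pairs):
    """Chain (old, new) pairs into a single list of states."""
    path = []
    for old, new in pairs:
        if not path:
            path.append(old)
        path.append(new)
    return path

print(" -> ".join(state_path(transitions(LOG))))
```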

foc2:/var/log/messages
----------------------
Jun 11 19:22:26 foc2 kernel: drbd: initialised. Version: 8.0.3 (api:86/proto:86)
Jun 11 19:22:26 foc2 kernel: drbd: SVN Revision: 2881 build by root at mobilin, 2007-05-08 01:30:57
Jun 11 19:22:26 foc2 kernel: drbd: registered as block device major 147
Jun 11 19:22:26 foc2 kernel: drbd: minor_table @ 0xca6ac2a0
Jun 11 19:22:26 foc2 kernel: drbd0: disk( Diskless -> Attaching )
Jun 11 19:22:26 foc2 kernel: klogd 1.4.1, ---------- state change ----------
Jun 11 19:22:26 foc2 kernel: drbd0: Found 4 transactions (66 active extents) in activity log.
Jun 11 19:22:26 foc2 kernel: drbd0: max_segment_size ( = BIO size ) = 32768
Jun 11 19:22:26 foc2 kernel: drbd0: drbd_bm_resize called with capacity == 160066632
Jun 11 19:22:26 foc2 kernel: drbd0: resync bitmap: bits=20008329 words=625262
Jun 11 19:22:26 foc2 kernel: drbd0: size = 76 GB (80033316 KB)
Jun 11 19:22:26 foc2 kernel: drbd0: reading of bitmap took 21 jiffies
Jun 11 19:22:26 foc2 kernel: drbd0: recounting of set bits took additional 7 jiffies
Jun 11 19:22:26 foc2 kernel: drbd0: 220 KB marked out-of-sync by on disk bit-map.
Jun 11 19:22:26 foc2 kernel: drbd0: disk( Attaching -> UpToDate )
Jun 11 19:22:26 foc2 kernel: drbd0: Writing meta data super block now.
Jun 11 19:22:26 foc2 kernel: drbd0: conn( StandAlone -> Unconnected )
Jun 11 19:22:26 foc2 kernel: drbd0: receiver (re)started
Jun 11 19:22:26 foc2 kernel: drbd0: conn( Unconnected -> WFConnection )
Jun 11 19:22:26 foc2 kernel: drbd0: conn( WFConnection -> WFReportParams )
Jun 11 19:22:26 foc2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
Jun 11 19:22:26 foc2 kernel: drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Jun 11 19:22:26 foc2 kernel: drbd0: Split-Brain detected, dropping connection!
Jun 11 19:22:26 foc2 kernel: drbd0: self 30528E971B737575:08C04AA01256AAAC:E482A1C10041241E:0836BE58931AB557
Jun 11 19:22:26 foc2 kernel: drbd0: peer E6D2483DE53A183D:08C04AA01256AAAD:E482A1C10041241F:0836BE58931AB557
Jun 11 19:22:26 foc2 kernel: drbd0: conn( WFReportParams -> Disconnecting )
Jun 11 19:22:26 foc2 kernel: drbd0: meta connection shut down by peer.
Jun 11 19:22:26 foc2 kernel: drbd0: asender terminated
Jun 11 19:22:26 foc2 kernel: drbd0: error receiving ReportState, l: 4!
Jun 11 19:22:26 foc2 kernel: drbd0: tl_clear()
Jun 11 19:22:26 foc2 kernel: drbd0: Connection closed
Jun 11 19:22:26 foc2 kernel: drbd0: conn( Disconnecting -> StandAlone )
Jun 11 19:22:26 foc2 kernel: drbd0: receiver terminated
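
The "self" and "peer" UUID sets logged at the moment of detection can
be compared slot by slot. The sketch below is plain Python, not DRBD
code. The slot labels are hypothetical, and ignoring the lowest bit of
each UUID (treating it as a role marker rather than part of the
identity) is an assumption of this sketch.

```python
# "self" and "peer" UUID sets copied from the foc1 log above.
SELF = "E6D2483DE53A183D:08C04AA01256AAAD:E482A1C10041241F:0836BE58931AB557"
PEER = "30528E971B737575:08C04AA01256AAAC:E482A1C10041241E:0836BE58931AB557"

NAMES = ("current", "bitmap", "history_young", "history_old")  # hypothetical labels

def parse(uuid_line):
    """Turn a colon-separated line of hex UUIDs into a list of integers."""
    return [int(part, 16) for part in uuid_line.split(":")]

def compare(self_line, peer_line, mask_low_bit=True):
    """Report per slot whether the two UUIDs match.

    mask_low_bit clears bit 0 before comparing; that bit being a role
    marker is an assumption, not something the logs state.
    """
    mask = ~1 if mask_low_bit else ~0
    return {name: (a & mask) == (b & mask)
            for name, a, b in zip(NAMES, parse(self_line), parse(peer_line))}

result = compare(SELF, PEER)
for name in NAMES:
    print("%-14s %s" % (name, "same" if result[name] else "DIFFERENT"))
```

Under that assumption only the current UUIDs differ, while the bitmap
and history UUIDs match. That would be consistent with both nodes
having started a new data generation from the same ancestor while they
were not connected. It is a reading of the numbers, not a statement of
DRBD's exact handshake rules.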


Kind regards, Nico
---
Met vriendelijke groeten / Mit freundlichen Grüßen / Kind Regards / Meilleures Salutations / Saludos Cordiales
Parhain terveisin / Med vänlig hälsning / Namashkaar / Wassalam Alaikom / Pollous chairetismous
--
N.J. van der Horn, http://www.vanderhorn.nl, http://www.inet.nl,
Vanderhorn IT-Works, Voorstraat 55, 3135 HW Vlaardingen,
The Netherlands, Tel +31 10 2486060, Fax +31 10 2486061 




