[DRBD-user] Please help... After reboot I'm always getting unresolved split brain (DRBD+OCFS2)

Jacek Osiecki cjosh at silvercube.pl
Tue Jan 22 17:04:15 CET 2013


Hello,

I must be doing something wrong: when both nodes are running as primary 
(dual-primary) and I reboot one of the machines, it comes back with:

root@oscar ~> cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: EBD919353D1D1CCDD0DFBD3
  0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r-----
      ns:0 nr:0 dw:789 dr:263507 al:43 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f
      oos:172036

Then I have to disconnect it, set it to secondary, reconnect while discarding 
its own data, and then set it to primary again.
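
The manual recovery described above can be sketched roughly as follows. This is a minimal Python sketch, not a tested procedure: the helper names `parse_proc_drbd` and `victim_recovery_commands` are made up for illustration, the resource name `home` is taken from the config further down, and the `connect --discard-my-data` form assumes 8.4-style userland (with 8.3 userland the usual form is `drbdadm -- --discard-my-data connect home`). It only prints the commands; it does not run them.

```python
import re


def parse_proc_drbd(text, minor=0):
    """Extract the cs/ro/ds fields of one minor from /proc/drbd-style text."""
    pattern = re.compile(
        r"^\s*%d:\s+cs:(\S+)\s+ro:(\S+)/(\S+)\s+ds:(\S+)/(\S+)" % minor,
        re.MULTILINE,
    )
    m = pattern.search(text)
    if m is None:
        return None
    keys = ("cs", "ro_self", "ro_peer", "ds_self", "ds_peer")
    return dict(zip(keys, m.groups()))


def victim_recovery_commands(resource, drbdadm="drbdadm"):
    """Commands for the node whose changes are to be thrown away.

    Assumption: 8.4-style option placement for --discard-my-data.
    """
    return [
        [drbdadm, "disconnect", resource],
        [drbdadm, "secondary", resource],
        [drbdadm, "connect", "--discard-my-data", resource],
        [drbdadm, "primary", resource],
    ]


# Two lines copied from the /proc/drbd output above.
SAMPLE = """\
version: 8.3.11 (api:88/proto:86-96)
  0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r-----
"""

state = parse_proc_drbd(SAMPLE)
if state and state["cs"] == "StandAlone":
    # Dry run: print the commands instead of executing them.
    for argv in victim_recovery_commands("home"):
        print(" ".join(argv))
```

Two caveats: a filesystem mounted on the device (OCFS2 here) would presumably have to be unmounted before the node can be demoted to secondary, and if the other node is also in StandAlone it needs a plain `drbdadm connect home` before the two will talk again.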

This is what it looks like in dmesg of the machine that wasn't rebooted:

[41706.085879] block drbd0: PingAck did not arrive in time.
[41706.085888] block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
[41706.086007] block drbd0: new current UUID 62770026DDB5FC9D:1AD40906305F01A9:E24FA72FCFB3A8FD:E24EA72FCFB3A8FD
[41706.136868] block drbd0: asender terminated
[41706.136874] block drbd0: Terminating drbd0_asender
[41706.145263] block drbd0: Connection closed
[41706.145270] block drbd0: conn( NetworkFailure -> Unconnected )
[41706.145275] block drbd0: receiver terminated
[41706.145278] block drbd0: Restarting drbd0_receiver
[41706.145281] block drbd0: receiver (re)started
[41706.145285] block drbd0: conn( Unconnected -> WFConnection )
[41790.980795] block drbd0: Handshake successful: Agreed network protocol version 96
[41790.980807] block drbd0: conn( WFConnection -> WFReportParams )
[41790.980927] block drbd0: Starting asender thread (from drbd0_receiver [32089])
[41790.981155] block drbd0: data-integrity-alg: <not-used>
[41790.981170] block drbd0: drbd_sync_handshake:
[41790.981178] block drbd0: self 62770026DDB5FC9D:1AD40906305F01A9:E24FA72FCFB3A8FD:E24EA72FCFB3A8FD bits:87 flags:0
[41790.981183] block drbd0: peer F2FB2C2B205B3041:1AD40906305F01A9:E24FA72FCFB3A8FC:E24EA72FCFB3A8FD bits:43008 flags:2
[41790.981187] block drbd0: uuid_compare()=100 by rule 90
[41790.981191] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
[41790.983528] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
[41790.983531] block drbd0: Split-Brain detected but unresolved, dropping connection!
[41790.983534] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
[41790.985235] block drbd0: meta connection shut down by peer.
[41790.985238] block drbd0: conn( WFReportParams -> NetworkFailure )
[41790.985242] block drbd0: asender terminated
[41790.985243] block drbd0: Terminating drbd0_asender
[41790.985274] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
[41790.985277] block drbd0: conn( NetworkFailure -> Disconnecting )
[41790.985279] block drbd0: error receiving ReportState, l: 4!
[41790.985310] block drbd0: Connection closed
[41790.985315] block drbd0: conn( Disconnecting -> StandAlone )
[41790.985325] block drbd0: receiver terminated
[41790.985327] block drbd0: Terminating drbd0_receiver

And this is what it looks like in dmesg of the rebooted machine:

[    7.709704] drbd: initialized. Version: 8.3.11 (api:88/proto:86-96)
[    7.709707] drbd: srcversion: EBD919353D1D1CCDD0DFBD3
[    7.709708] drbd: registered as block device major 147
[    7.709709] drbd: minor_table @ 0xffff8805fab93000
[    7.747400]  md2: unknown partition table
[    7.747804] block drbd0: Starting worker thread (from drbdsetup-83 [1792])
[    7.747919] block drbd0: disk( Diskless -> Attaching )
[    7.816449] block drbd0: Found 4 transactions (49 active extents) in activity log.
[    7.816453] block drbd0: Method to ensure write ordering: flush
[    7.816457] block drbd0: max BIO size = 131072
[    7.816461] block drbd0: drbd_bm_resize called with capacity == 838832808
[    7.818314] block drbd0: resync bitmap: bits=104854101 words=1638346 pages=3200
[    7.818318] block drbd0: size = 400 GB (419416404 KB)
[    8.394005] block drbd0: bitmap READ of 3200 pages took 144 jiffies
[    8.395579] block drbd0: recounting of set bits took additional 1 jiffies
[    8.395582] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
[    8.395593] block drbd0: Marked additional 168 MB as out-of-sync based on AL.
[    8.395622] block drbd0: bitmap WRITE of 0 pages took 0 jiffies
[    8.425658] block drbd0: 168 MB (43008 bits) marked out-of-sync by on disk bit-map.
[    8.425667] block drbd0: disk( Attaching -> UpToDate )
[    8.425671] block drbd0: attached to UUIDs F2FB2C2B205B3041:1AD40906305F01A9:E24FA72FCFB3A8FC:E24EA72FCFB3A8FD
[    8.458239] block drbd0: conn( StandAlone -> Unconnected )
[    8.458246] block drbd0: Starting receiver thread (from drbd0_worker [1793])
[    8.458301] block drbd0: receiver (re)started
[    8.458306] block drbd0: conn( Unconnected -> WFConnection )
[    8.635026] block drbd0: role( Secondary -> Primary )
[    8.701089] OCFS2 Node Manager 1.5.0
[    8.707547] OCFS2 DLM 1.5.0
[    8.713672] OCFS2 DLMFS 1.5.0
[    8.713730] OCFS2 User DLM kernel interface loaded
[    9.185831] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx
[   13.436595] block drbd0: Handshake successful: Agreed network protocol version 96
[   13.436605] block drbd0: conn( WFConnection -> WFReportParams )
[   13.436733] block drbd0: Starting asender thread (from drbd0_receiver [1799])
[   13.437262] block drbd0: data-integrity-alg: <not-used>
[   13.437460] block drbd0: drbd_sync_handshake:
[   13.437466] block drbd0: self F2FB2C2B205B3041:1AD40906305F01A9:E24FA72FCFB3A8FC:E24EA72FCFB3A8FD bits:43008 flags:0
[   13.437471] block drbd0: peer 62770026DDB5FC9D:1AD40906305F01A9:E24FA72FCFB3A8FD:E24EA72FCFB3A8FD bits:87 flags:0
[   13.437475] block drbd0: uuid_compare()=100 by rule 90
[   13.437480] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
[   13.439492] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
[   13.439495] block drbd0: Split-Brain detected but unresolved, dropping connection!
[   13.439498] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
[   13.441096] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
[   13.441099] block drbd0: conn( WFReportParams -> Disconnecting )
[   13.441103] block drbd0: error receiving ReportState, l: 4!
[   13.441111] block drbd0: asender terminated
[   13.441115] block drbd0: Terminating drbd0_asender
[   13.441133] block drbd0: Connection closed
[   13.441137] block drbd0: conn( Disconnecting -> StandAlone )
[   13.441151] block drbd0: receiver terminated
[   13.441153] block drbd0: Terminating drbd0_receiver

So somehow it notices that there has been a split-brain situation and shuts 
the receiver down.
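
To make the handshake lines easier to read, here is a small Python sketch that parses the `self` and `peer` UUID sets and checks for the pattern visible in both logs: different current UUIDs, identical non-zero bitmap UUIDs. Treating that pattern as what "rule 90" tests is an assumption based on my recollection of the 8.3 `drbd_uuid_compare()` code, as is masking out the lowest bit before comparing. The function names are invented for this example, and the 4 KiB per bitmap bit figure is assumed from the "168 MB (43008 bits)" line.

```python
import re

# Matches e.g. "self AAAA...:BBBB...:CCCC...:DDDD... bits:87"
UUID_LINE = re.compile(
    r"\b(self|peer)\s+([0-9A-F]{16}(?::[0-9A-F]{16}){3})\s+bits:(\d+)"
)


def parse_handshake(log_text):
    """Return {'self': {...}, 'peer': {...}} from drbd_sync_handshake lines."""
    result = {}
    for who, uuids, bits in UUID_LINE.findall(log_text):
        current, bitmap, _hist1, _hist2 = (int(u, 16) for u in uuids.split(":"))
        result[who] = {"current": current, "bitmap": bitmap, "bits": int(bits)}
    return result


def looks_like_split_brain(hs):
    """Assumed reading of rule 90: same non-zero bitmap UUID on both sides
    while the current UUIDs differ (lowest bit ignored)."""
    mask = ~1
    s, p = hs["self"], hs["peer"]
    same_bitmap = (s["bitmap"] & mask) == (p["bitmap"] & mask)
    nonzero_bitmap = (s["bitmap"] & mask) != 0
    different_current = (s["current"] & mask) != (p["current"] & mask)
    return same_bitmap and nonzero_bitmap and different_current


def out_of_sync_kib(bits, kib_per_bit=4):
    """Out-of-sync data in KiB, assuming each bitmap bit covers 4 KiB."""
    return bits * kib_per_bit


# Handshake lines from the node that was not rebooted, with line wraps removed.
LOG = """\
block drbd0: self 62770026DDB5FC9D:1AD40906305F01A9:E24FA72FCFB3A8FD:E24EA72FCFB3A8FD bits:87 flags:0
block drbd0: peer F2FB2C2B205B3041:1AD40906305F01A9:E24FA72FCFB3A8FC:E24EA72FCFB3A8FD bits:43008 flags:2
"""

hs = parse_handshake(LOG)
print("split brain pattern:", looks_like_split_brain(hs))
print("peer out-of-sync MiB:", out_of_sync_kib(hs["peer"]["bits"]) // 1024)
```

If that reading is right, both nodes generated a new current UUID while disconnected from each other, which means each side believes it has changes the other has not seen.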

I'm running DRBD on two identical machines, both running kernel 3.4.6 with 
the linux-vserver patch vs2.3.3.6. DRBD is set up on a software RAID1 device, 
/dev/md2. I don't know whether it matters, but when I start DRBD manually, I 
get:

Starting DRBD resources: DRBD module version: 8.3.11
    userland version: 8.4.1
preferably kernel and userland versions should match.
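
The warning above boils down to comparing the two version strings. A minimal Python sketch of that comparison follows; the assumption that only the major.minor part needs to agree is mine, and the function names are invented for the example.

```python
import re


def major_minor(version):
    """Return (major, minor) from a version string such as '8.3.11'."""
    m = re.match(r"(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError("unrecognised version string: %r" % version)
    return int(m.group(1)), int(m.group(2))


def versions_match(module_version, userland_version):
    """True if kernel module and userland share the same major.minor."""
    return major_minor(module_version) == major_minor(userland_version)


# Values taken from the startup message above.
print(versions_match("8.3.11", "8.4.1"))
```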

Configs are identical:

resource home
{
    protocol  C;
    meta-disk internal;
    device    /dev/drbd0;
    disk      /dev/md2;
    net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    startup { become-primary-on both; }
    on oscar { address 1.2.3.4:7789; }
    on papa { address 1.2.3.5:7789; }
}
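
As a rough illustration of how the three after-sb settings in this resource relate to the logs, the Python sketch below picks the policy by counting how many nodes are Primary at the moment split brain is detected. That selection rule is my understanding of the DRBD documentation rather than something shown in the logs, and `policy_for` is a made-up helper. In the rebooted node's dmesg, `role( Secondary -> Primary )` at 8.635026 comes before the handshake at 13.436595, so both nodes appear to be Primary when they reconnect.

```python
# Policies copied from the net section of the resource config above.
AFTER_SB = {
    0: ("after-sb-0pri", "discard-zero-changes"),
    1: ("after-sb-1pri", "discard-secondary"),
    2: ("after-sb-2pri", "disconnect"),
}


def policy_for(roles):
    """Pick the after-sb policy from the roles of both nodes at detection time."""
    primaries = sum(1 for role in roles if role == "Primary")
    return AFTER_SB[primaries]


# Both nodes were Primary at the handshake, judging by the logs.
print(policy_for(["Primary", "Primary"]))
```

With two primaries the configured policy is `disconnect`, which would match the "Split-Brain detected but unresolved, dropping connection!" message on both sides.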

DRBD main configuration:

global {
    usage-count yes;
}

common {
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    }

    disk { on-io-error detach; } # Drop the disk on io error

    syncer
    {
        rate 50M;            # Limit sync speed to 10 MByte/s for FastEthernet
    }

    net {
    }
}

-- 
Jacek Osiecki
josiecki at silvercube.pl

Silvercube s.c.
ul. Makuszynskiego 4
31-752 Kraków
+48 (12) 684 21 00

