[DRBD-user] DRBD split brain(first time reboot only)

wang xuchen ben.wxc at gmail.com
Mon Apr 11 23:01:25 CEST 2011


Hi all,

I am in a very weird situation: I have four DRBD devices set up in a
primary/primary configuration, with a custom upper layer (similar to
Pacemaker) that takes care of failover.

In my test environment, the first time after the OS and DRBD binaries are
installed, I start some I/O from one of my servers and reboot that machine
after 4 or 5 minutes while the I/O is still running. While the machine
reboots, all I/O fails over to the partner server without any problem. When
the rebooted machine comes back, split brain happens, but only on the device
that had I/O in progress. If I repeat the same experiment after the split
brain has been manually resolved, DRBD somehow works out the sync direction
correctly according to my configuration. Can someone help me interpret the
syslog and give me a hint as to why split brain happens only the first time
and not after it has been resolved manually?

Here is my drbd.conf file:
common {
  protocol C;
}

#=#= 1
resource drbd1 {
  on f33 {
    device /dev/drbd1;
    #meta sd /dev/sdb
    disk /dev/disk/by-id/scsi-360030480003ae2e0151e54b20c1f82e0;
    address 192.168.250.1:7790;
    meta-disk internal;
  }
  on f34 {
    device /dev/drbd1;
    #meta sd /dev/sdb
    disk /dev/disk/by-id/scsi-360030480003ae32015095d1f11bf902b;
    address 192.168.250.2:7790;
    meta-disk internal;
  }
  net {
    allow-two-primaries;
    after-sb-0pri discard-least-changes;
    after-sb-1pri consensus;
    after-sb-2pri violently-as0p;
    rr-conflict violently;
    max-buffers 8000;
    max-epoch-size 8000;
    unplug-watermark 16;
    sndbuf-size 0;
  }
  syncer {
    rate 300M;
    verify-alg crc32c;
    al-extents 3800;
  }
  startup {
    become-primary-on both;
  }
  handlers {
    before-resync-target "/sbin/before_resync_target.sh";
    after-resync-target "/sbin/after_resync_target.sh";
  }
}
# some other drbd resource

Here is the related log from syslog:

Apr 11 16:16:07 f33 kernel: block drbd1: Starting worker thread (from cqueue/9 [356])
Apr 11 16:16:07 f33 kernel: block drbd1: disk( Diskless -> Attaching )
Apr 11 16:16:07 f33 kernel: block drbd1: Found 39 transactions (39 active extents) in activity log.
Apr 11 16:16:07 f33 kernel: block drbd1: Method to ensure write ordering: barrier
Apr 11 16:16:07 f33 kernel: block drbd1: max_segment_size ( = BIO size ) = 65536
Apr 11 16:16:07 f33 kernel: block drbd1: drbd_bm_resize called with capacity == 25164984
Apr 11 16:16:07 f33 kernel: block drbd1: resync bitmap: bits=3145623 words=49151
Apr 11 16:16:07 f33 kernel: block drbd1: size = 12 GB (12582492 KB)
Apr 11 16:16:07 f33 kernel: block drbd1: recounting of set bits took additional 0 jiffies
Apr 11 16:16:07 f33 kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Apr 11 16:16:07 f33 kernel: block drbd1: Marked additional 156 MB as out-of-sync based on AL.
Apr 11 16:16:07 f33 kernel: block drbd1: disk( Attaching -> UpToDate )
Apr 11 16:16:08 f33 kernel: block drbd1: conn( StandAlone -> Unconnected )
Apr 11 16:16:08 f33 kernel: block drbd1: Starting receiver thread (from drbd1_worker [4269])
Apr 11 16:16:08 f33 kernel: block drbd1: receiver (re)started
Apr 11 16:16:08 f33 kernel: block drbd1: conn( Unconnected -> WFConnection )
Apr 11 16:16:08 f33 kernel: block drbd1: Handshake successful: Agreed network protocol version 95
Apr 11 16:16:08 f33 kernel: block drbd1: conn( WFConnection -> WFReportParams )
Apr 11 16:16:08 f33 kernel: block drbd1: Starting asender thread (from drbd1_receiver [4359])
Apr 11 16:16:08 f33 kernel: block drbd1: data-integrity-alg: <not-used>
Apr 11 16:16:08 f33 kernel: block drbd1: max_segment_size ( = BIO size ) = 65536
Apr 11 16:16:08 f33 kernel: block drbd1: drbd_sync_handshake:
Apr 11 16:16:08 f33 kernel: block drbd1: self 264D92621CE57A74:CBC54463A29032C9:72FCD7718269F032:0000000000000004 bits:39936 flags:0
Apr 11 16:16:08 f33 kernel: block drbd1: peer 1CB5F82C62D88D81:CBC54463A29032C9:72FCD7718269F033:0000000000000004 bits:1 flags:0
Apr 11 16:16:08 f33 kernel: block drbd1: uuid_compare()=100 by rule 90
Apr 11 16:16:08 f33 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
Apr 11 16:16:08 f33 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
Apr 11 16:16:08 f33 kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
Apr 11 16:16:08 f33 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
Apr 11 16:16:08 f33 kernel: block drbd1: meta connection shut down by peer.
Apr 11 16:16:08 f33 kernel: block drbd1: conn( WFReportParams -> NetworkFailure )
Apr 11 16:16:08 f33 kernel: block drbd1: asender terminated
Apr 11 16:16:08 f33 kernel: block drbd1: Terminating asender thread
Apr 11 16:16:08 f33 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
Apr 11 16:16:08 f33 kernel: block drbd1: conn( NetworkFailure -> Disconnecting )
Apr 11 16:16:08 f33 kernel: block drbd1: error receiving ReportState, l: 4!
Apr 11 16:16:08 f33 kernel: block drbd1: Connection closed
Apr 11 16:16:08 f33 kernel: block drbd1: conn( Disconnecting -> StandAlone )
Apr 11 16:16:08 f33 kernel: block drbd1: receiver terminated
Apr 11 16:16:08 f33 kernel: block drbd1: Terminating receiver thread
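
To make the drbd_sync_handshake lines in the log easier to read, here is a
minimal Python sketch that pulls the "self" and "peer" UUID sets apart and
compares them. It is only an illustration, not DRBD code. It rests on three
assumptions: the four fields are printed in the order current, bitmap,
history, history; the lowest bit of each UUID is a flag and is masked before
comparing; and each bitmap bit covers 4 KB (which agrees with the log, where
12582492 KB maps to 3145623 bits). The names parse_handshake and
looks_like_split_brain are made up for this example. Under those
assumptions, two nodes that share a bitmap UUID but have different current
UUIDs have both written data since they last agreed, which appears to be
what "uuid_compare()=100 by rule 90" reports.

```python
import re

# Handshake lines copied from the syslog in this message (wrapped lines rejoined).
LOG = """\
block drbd1: self 264D92621CE57A74:CBC54463A29032C9:72FCD7718269F032:0000000000000004 bits:39936 flags:0
block drbd1: peer 1CB5F82C62D88D81:CBC54463A29032C9:72FCD7718269F033:0000000000000004 bits:1 flags:0
"""

# Assumed field order: current UUID, bitmap UUID, then two history UUIDs.
FIELDS = ("current", "bitmap", "history1", "history2")
PATTERN = re.compile(
    r"(self|peer)\s+((?:[0-9A-F]{16}:){3}[0-9A-F]{16})\s+bits:(\d+)"
)


def parse_handshake(text):
    """Return {'self': {...}, 'peer': {...}} with UUIDs as integers."""
    nodes = {}
    for who, uuids, bits in PATTERN.findall(text):
        # Assumption: the lowest bit is a flag, so clear it before comparing.
        values = [int(u, 16) & ~1 for u in uuids.split(":")]
        nodes[who] = dict(zip(FIELDS, values), bits=int(bits))
    return nodes


def looks_like_split_brain(nodes):
    """Same non-zero bitmap UUID but different current UUIDs on both sides."""
    me, peer = nodes["self"], nodes["peer"]
    return (
        me["current"] != peer["current"]
        and me["bitmap"] == peer["bitmap"]
        and me["bitmap"] != 0
    )


nodes = parse_handshake(LOG)
diverged = looks_like_split_brain(nodes)

# Assumption: one bitmap bit covers 4 KB, so bits * 4 / 1024 gives MB.
self_out_of_sync_mb = nodes["self"]["bits"] * 4 // 1024

print("split brain pattern:", diverged)
print("out-of-sync on self (MB):", self_out_of_sync_mb)
```

The 39936 bits reported for "self" work out to 156 MB, the same amount the
log says was marked out-of-sync based on the activity log, so on the rebooted
node the dirty bits seem to come from the activity log rather than from the
on-disk bitmap.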


Commit yourself to constant self-improvement