[DRBD-user] DRBD + OCFS2 - Split-Brain detected but unresolved

Jacek Osiecki cjosh at silvercube.pl
Tue Apr 17 21:11:50 CEST 2012

On Tue, 17 Apr 2012 18:31:09 +0200, Felix Frank wrote:

> On 04/17/2012 05:06 PM, Jacek Osiecki wrote:
>> automatic recovery sometimes works and sometimes does not.

> we seem to be lacking your drbd config.

Right, my bad :)

> How is automatic split brain recovery configured?

Probably it isn't - here's the config:

global {usage-count yes;}

common {
	handlers {
	pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
	pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
	local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
	}
	disk { on-io-error detach; }
	syncer {rate 100M;}
}

and the resource config:

resource home
{
   protocol C;
   meta-disk internal;
   device    /dev/drbd0;
   disk      /dev/md4;
   net {
     allow-two-primaries;
     after-sb-0pri discard-zero-changes;
     after-sb-1pri discard-secondary;
     after-sb-2pri disconnect;
   }
   startup { become-primary-on both; }
   on mike { address 176.xx.xx.xx:7789; }
   on november { address 176.yy.yy.yy:7789; }
}

> I get the feeling it's not. What split-brain situations have you
> perceived as being automatically solved?

Something like this:

[287856.619503] block drbd0: Handshake successful: Agreed network protocol version 96
[287856.619512] block drbd0: conn( WFConnection -> WFReportParams )
[287856.619682] block drbd0: Starting asender thread (from drbd0_receiver [24712])
[287856.619885] block drbd0: data-integrity-alg: <not-used>
[287856.619967] block drbd0: max BIO size = 130560
[287856.619978] block drbd0: drbd_sync_handshake:
[287856.619982] block drbd0: self 18D97D7348BC1031:232CE4A32F2915DB:B873B3F48F57A893:B872B3F48F57A893 bits:50 flags:0
[287856.619987] block drbd0: peer 8359D2DF4D7761E0:232CE4A32F2915DB:B873B3F48F57A893:B872B3F48F57A893 bits:3072 flags:2
[287856.619992] block drbd0: uuid_compare()=100 by rule 90
[287856.619995] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
[287856.622133] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
[287856.622136] block drbd0: Split-Brain detected, 1 primaries, automatically solved. Sync from this node
[287856.622141] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
[287856.639285] block drbd0: peer( Secondary -> Primary )
[287856.986857] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
[287856.988873] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
[287856.988879] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
[287856.988884] block drbd0: Began resync as SyncSource (will sync 12484 KB [3121 bits set]).
[287856.988895] block drbd0: updated sync UUID 18D97D7348BC1031:232DE4A32F2915DB:232CE4A32F2915DB:B873B3F48F57A893
[287857.202264] block drbd0: Resync done (total 1 sec; paused 0 sec; 12484 K/sec)
[287857.202268] block drbd0: updated UUIDs 18D97D7348BC1031:0000000000000000:232DE4A32F2915DB:232CE4A32F2915DB
[287857.202272] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
[287857.347396] block drbd0: bitmap WRITE of 4793 pages took 29 jiffies
[287857.419057] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.

But now I see that those were probably split-brains caused by the
secondary node being rebooted, back when I was doing a lot of testing
of bringing DRBD up automatically after a reboot. Am I right?
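
(For the record, this is roughly how I've been checking for such
events - the log path is just where kernel messages happen to land on
my system:)

   # current connection/replication state of the resource
   cat /proc/drbd
   drbdadm cstate home

   # look for earlier split-brain detections in the kernel log
   grep -i "split-brain" /var/log/kern.log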


>> 73E08E8E06754D97:A9BC3587FC5AA879:0BF8587A4ABA37B5:0BF7587A4ABA37B5
>> bits:0 flags:0
> This looks fine - the peer has set 0 bits, so it's probably indeed
> unchanged.
>> why the case isn't solved, since second server doesn't write to drbd0,
>> sometimes even partition wasn't mounted (I can't be 100% sure, but it
>> seems so).
> A policy of discard-zero-changes could solve this for you, but only if
> configured thus.

Seems that my config is lacking this.
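
(For completeness, the piece I apparently don't have at all is the
split-brain notification handler; the DRBD user's guide shows it
roughly like this - untested here, and "root" is just an example
recipient:)

   handlers {
     # run whenever DRBD detects a split brain after connecting
     split-brain "/usr/lib/drbd/notify-split-brain.sh root";
   }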

My plan is to use DRBD+OCFS2 for an HA configuration, with two
machines behind a hardware load-balancer. So far I've only been
modifying the filesystem on one machine. I'm wondering how to handle
the situation where the nodes can't see each other but are still
reachable from the internet (which is possible for distant locations).
Are there any mechanisms that could synchronize the nodes at the
filesystem level once node-to-node communication is up again? I mean
that sometimes, even though both filesystems have been modified, the
changes don't actually conflict...

Is anyone using such a configuration? What policies are you using?
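
For the "nodes can't see each other" case, the direction I'm currently
considering is DRBD's fencing hooks, roughly along these lines
(untested sketch based on the drbd.conf man page; the crm-fence-peer.sh
helper assumes a Pacemaker cluster, which I don't run yet):

   disk {
     # freeze I/O and call the fence-peer handler when the peer is unreachable
     fencing resource-and-stonith;
   }
   handlers {
     # called when the peer becomes unreachable while this node is Primary
     fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
     after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
   }

I realize this prevents the split rather than merging diverged
filesystems afterwards.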

>> P.S. Any suggestions how to measure real performance (read/write/copy)
>> of DRBD+OCFS2? UnixBench gives crazy results (read performance about
>> 10% of local filesystem)...
> Is this crazy? I wouldn't know. But bear in mind that stat can be an
> expensive operation on a cluster file system vs. a regular old fs.

Here are the results from UnixBench, where I compared:
  - local ext3 filesystem
  - drbd+ocfs2 in master-master cluster :)
  - NFS from a NAS provided by OVH hosting

Results are in KBps, for copy/read/write. I didn't dig into the exact
meaning of the UnixBench parameters or its methodology; I just wanted
to compare raw values under similar circumstances:

+-----------------------+-----------+----------------+------------------+
|X bufsize,Y maxblocks  |ext3(local)| (drbd+ocfs2)   | NFS (ovh-nas)    |
+-----------------------+-----------+----------------+------------------+
| CP 1024 buf 2000 mxbl |  1001513.5|  329691.5 (33%)|     8439.9 (0.8%)|
| CP 256 buf 500 mxbl   |   289354.4|   83344.5 (29%)|     7545.5 (2.6%)|
| RD 1024 buf 2000 mxbl | 16683047.3| 1627301.6 (10%)| 16026036.4 ( 96%)|
| RD 256 buf 500 mxbl   |  4737836.5|  413126.7 ( 9%)|  4509106.6 ( 95%)|
| RD 4096 buf 8000 mxbl | 35705631.9| 6872806.6 (19%)| 34967996.7 ( 97%)|
| WR 256 buf 500 mxbl   |   315172.2|   87545.4 (28%)|     8711.3 (2.8%)|
| WR 4096 buf 8000 mxbl |  3522086.9| 1290255.5 (37%)|    10991.6 (0.3%)|
+-----------------------+-----------+----------------+------------------+

I wrote "crazy" since 10% seems to be quite a low value, especially
when compared to copy/write, which seem to be running at about 33% of
local fs speed. Now I realize that the read speed is still much higher
than the write/copy speed. However, could someone verify those values?
I've just realized that the UnixBench results are hard to believe and
seem much too high :)
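
If anyone wants to cross-check with something simpler than UnixBench,
the kind of raw sequential test I could run on both setups would be
along these lines (just a sketch - the mount point and sizes are
examples):

   # sequential write, bypassing the page cache
   dd if=/dev/zero of=/mnt/ocfs2/ddtest bs=1M count=1024 oflag=direct
   # sequential read of the same file, also with O_DIRECT
   dd if=/mnt/ocfs2/ddtest of=/dev/null bs=1M iflag=direct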

Greetings,
-- 
Jacek Osiecki


