[DRBD-user] Dual-primary split-brain recovery after reboot

Sat Jun 25 00:31:38 CEST 2011

On 06/24/2011 02:40 PM, Lars Ellenberg wrote:
>> /etc/drbd.conf:
> You cannot configure away users of DRBD.
> You need to stop whatever is using DRBD first.
>
> Maybe start by checking whether "bad things"
> happen on shutdown, on reboot, or on both?
>
> If you do reboot into single user mode,
> then manually configure DRBD,
> does it still "detect split brain"?
>
> If not, shutdown is ok,
> and reboot is the problem.
>

Single user mode was very revealing; thank you for the suggestion, Lars.
The problem is in the reboot.  Once I got into single user mode, I
brought up the networking, loaded the drbd module, ran "drbdadm up r0" and
"drbdadm primary all", and everything came up flawlessly.  Ubuntu doesn't
verify that packets are actually being passed when it brings up
networking, and I'm using 802.3ad bonding, which takes some time to
establish, so the network is still down when the drbd init.d script
runs.  I fixed the problem rather sloppily by inserting the following
into the drbd "start" section:

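    # Block until the peer answers one ping (peer-ip-address is a placeholder).
    until ping -c 1 -W 1 peer-ip-address >/dev/null 2>&1; do
       echo "peer-ip-address not up"
       sleep 1
    done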

It sticks here until the ethernet ports bond and networking actually comes
up.  When it does, drbd initializes and goes dual-primary without issue.

Is there a better, more elegant way to do this?  Does the drbd init.d
script verify that networking is up before it attempts to bring up
resources?  I realize my method is problematic if the other peer is down
when a reboot happens, since the loop would then never exit and the boot
would hang.
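
One way to soften that last problem would be to bound the wait.  The
following is only a minimal sketch, not something taken from the drbd
init.d script: wait_for_peer, PROBE and WAIT_INTERVAL are names made up
for illustration, and the default of 60 tries is an arbitrary assumption.

```shell
#!/bin/sh
# Hypothetical helper: probe a peer until it answers or the tries run out.
# PROBE defaults to a single one-second ping; it can be overridden for testing.
wait_for_peer() {
    peer=$1
    max_tries=${2:-60}          # assumed default, tune to your bonding delay
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        # PROBE is deliberately unquoted so "ping -c 1 -W 1" splits into words.
        if ${PROBE:-ping -c 1 -W 1} "$peer" >/dev/null 2>&1; then
            return 0            # peer answered, network is usable
        fi
        tries=$((tries + 1))
        sleep "${WAIT_INTERVAL:-1}"
    done
    return 1                    # gave up; the caller decides what to do next
}
```

In the "start" section this could be called as
"wait_for_peer peer-ip-address 60 || echo 'peer not reachable, continuing'",
so that a dead peer delays the boot by about a minute or two instead of
hanging it forever.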

>> resource r0 {
>>         protocol C;
>>         startup {
>>                 wfc-timeout  15;
>>                 degr-wfc-timeout 60;
> This is very unusual, and likely not what you meant.
> typically wfc-timeout is (much) larger than degr-wfc-timeout.
>
Thanks.  I have fixed this.

>>         }
>>         net {
>>                 cram-hmac-alg sha1;
>>                 shared-secret "secret";
>>                 allow-two-primaries;
>>                 after-sb-0pri discard-younger-primary;
>>                 after-sb-1pri discard-secondary;
>>                 after-sb-2pri call-pri-lost-after-sb;
> You are aware that you configured automatic data loss there, right?
> Just because something is "younger" does not mean a thing.
> Just because something is "secondary" *at the point in time of the
> connection handshake* does not mean it has bad data.
>
If the rebooting peer shuts down properly, there shouldn't be any data
written between drbd going down and coming up, right?  Nothing written,
nothing lost is what I'm presuming.

> Jun 24 12:30:39 serverb kernel: [  221.439427] block drbd0: incompatible after-sb-0pri settings
> Jun 24 12:30:39 serverb kernel: [  221.443142] block drbd0: conn( WFReportParams -> Disconnecting ) 
>
> How come the after-sb settings are "incompatible"?
> They should be the same on both peers.
Fixed this.  I think I changed the other peer without properly restarting
drbd.  Thanks.
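
For anyone else who hits the "incompatible after-sb-0pri settings"
message, a quick way to compare the policies in two copies of drbd.conf
might look like the sketch below.  The function names are made up for
illustration, and it only compares the files: it cannot tell whether a
running drbd has actually picked up the file, which was my mistake here.

```shell
#!/bin/sh
# Print the after-sb-* policy lines of a DRBD config file,
# with whitespace normalized and lines sorted.
after_sb_settings() {
    grep -E '^[[:space:]]*after-sb-[012]pri[[:space:]]' "$1" |
        sed -e 's/^[[:space:]]*//' \
            -e 's/[[:space:]]\{1,\}/ /g' \
            -e 's/[[:space:]]*$//' |
        sort
}

# Succeed only if both files carry identical after-sb-* policies.
same_after_sb() {
    [ "$(after_sb_settings "$1")" = "$(after_sb_settings "$2")" ]
}
```

With the peer's config copied to a local path of your choosing (say
/tmp/drbd.conf.peer, a hypothetical name), this would be run as
"same_after_sb /etc/drbd.conf /tmp/drbd.conf.peer || echo 'after-sb settings differ'".
After editing the file, "drbdadm adjust r0" on each node should apply the
change to the running resource.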
