[DRBD-user] Help with recovering from network failure (failover and back again)

Lars Ellenberg Lars.Ellenberg at linbit.com
Tue Sep 13 11:28:15 CEST 2005



> I have one drbd resource, r1, everything started and ok (connected and
> consistent), fs1 = primary, fs2 = secondary.
> At that point I shut down the switchport of the drbd nic of fs1. A few
> seconds later both sides notice they can't connect to the other side and
> change the status of that side to Unknown.
> Now I want fs2 to become
> primary (apparently something is wrong with fs1, so I want application
> servers on the location of fs2 to take over with fs2 as fileserver), so
> I do a drbdadm primary all on fs2 and a drbdadm secondary all on fs1
> (just to be sure, can't have 2 primaries when I re-enable the
> switchport). Both sides update their status accordingly.

that way, you force a split brain: both nodes have now been primary
while disconnected from each other.

> If I then re-enable the switchport, both sides "see" each other again,
> but won't reconnect, because fs1 wants to sync as source with fs2 as
> target. That seems totally wrong to me. I expect fs1 to become a
> secondary with fs2 primary.

well, recovery from a split brain is counterintuitive at times.

it is not "you expect", but "you want to",
because you know more than drbd does.

> Fs2 does refuse the sync (as it should) and aborts.
> The strange part is that if I stop the drbd device on fs2 and
> restart it, it comes up as secondary (correct) and syncs back to fs1
> with fs2 source and fs1 target, just as it should!

well, this may be a bug, then.
or a feature. I need to think about what I want to call it :)

though, drbd should be able to identify the situation
as "connect after split brain", and require explicit
configuration or operator interaction,
as will be the case with drbd 0.8.

> I'm running 0.7.13 on a 2.4 kernel. 
> I hope someone can help me out with this!
> 
> Here are some parts from syslog:
> 
> Interface has been shutdown, make fs1 secondary:
> fs1 kernel: drbd0: Primary/Unknown --> Secondary/Unknown
> Idem make fs2 primary:
> fs2 kernel: drbd0: Secondary/Unknown --> Primary/Unknown
> 
> Re-enabled the switchport:
> 
> On fs1:
> 
> fs1 drbd0: Handshake successful: DRBD Network Protocol version 74
> fs1 drbd0: Connection established.
> fs1 drbd0: I am(S): 1:00000002:00000001:0000001c:00000010:00
> fs1 drbd0: Peer(P): 1:00000002:00000001:0000001b:00000011:10

> On fs2:
> 
> fs2 drbd0: Handshake successful: DRBD Network Protocol version 74
> fs2 drbd0: Connection established.
> fs2 drbd0: I am(P): 1:00000002:00000001:0000001b:00000011:10
> fs2 drbd0: Peer(S): 1:00000002:00000001:0000001c:00000010:00
> fs2 drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.
> fs2 drbd0: drbd0_receiver [17971]: cstate WFReportParams --> StandAlone
> 
> After stopping and starting the drbd device on fs2:
> 
> On fs1:
> 
> fs1 drbd0: Handshake successful: DRBD Network Protocol version 74
> fs1 drbd0: Connection established.
> fs1 drbd0: I am(S): 1:00000002:00000001:0000001c:00000010:00
> fs1 drbd0: Peer(S): 1:00000002:00000001:0000001c:00000011:00
> fs1 drbd0: drbd0_receiver [4357]: cstate WFReportParams --> WFBitMapT
> fs1 drbd0: Secondary/Unknown --> Secondary/Secondary
> fs1 drbd0: drbd0_receiver [4357]: cstate WFBitMapT --> SyncTarget
> fs1 drbd0: Resync started as SyncTarget (need to sync 0 KB [0 bits set]).
> fs1 drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
> fs1 drbd0: drbd0_receiver [4357]: cstate SyncTarget --> Connected
> 
> On fs2:
> 
> fs2 drbd0: Connection established.
> fs2 drbd0: I am(S): 1:00000002:00000001:0000001c:00000011:00
                                                ^^
I have to think about whether this should have been increased in
the previous step, or not.
hm...
yes, I think it is ok the way it is,
and I am tempted to call it a feature :)

> fs2 drbd0: Peer(S): 1:00000002:00000001:0000001c:00000010:00
> fs2 drbd0: drbd0_receiver [18061]: cstate WFReportParams --> WFBitMapS
> fs2 drbd0: Secondary/Unknown --> Secondary/Secondary
> fs2 drbd0: drbd0_receiver [18061]: cstate WFBitMapS --> SyncSource
> fs2 drbd0: Resync started as SyncSource (need to sync 0 KB [0 bits set]).
> fs2 drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
> fs2 drbd0: drbd0_receiver [18061]: cstate SyncSource --> Connected
> 
> I hope someone can help me debug this or tell me what I did wrong. TIA!


from the point of view of fs1, fs2 (the secondary) died,
and fs1 stayed primary for a while.
so, when fs2 comes back, fs1 expects it to be out of date,
and wants to sync fs1 -> fs2.

to avoid this, you could have passed the --human flag
to the primary command on fs2.

to better understand how drbd decides sync direction:
http://www.drbd.org/publications.html
 drbd_paper_for_NLUUG_2001.pdf
(yes, it is listed under "0.5 or earlier")
section 6.3 meta-data, generation counters...
it has been modified slightly since, but that should
improve your understanding of what is going on.
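
as a rough illustration of the above (a sketch only, not drbd's actual
code): assume the colon-separated counters printed in the "I am"/"Peer"
log lines are compared field by field from left to right, the side with
the first larger field becomes sync source, and the trailing flags field
is ignored. the names parse_counters and sync_source are made up for
this example. fed with the values from the quoted syslog, it reproduces
both observed outcomes:

```python
from typing import List


def parse_counters(line: str) -> List[int]:
    """Parse '1:00000002:...:00' into integer counters.

    Assumption: every field except the last is a hexadecimal counter;
    the last field (flags) is dropped and not used for the comparison.
    """
    parts = line.strip().split(":")
    return [int(p, 16) for p in parts[:-1]]


def sync_source(me: str, peer: str) -> str:
    """Return 'me', 'peer' or 'tie' for who would become sync source.

    Assumption: the first differing counter, read left to right, decides.
    """
    for mine, theirs in zip(parse_counters(me), parse_counters(peer)):
        if mine > theirs:
            return "me"
        if mine < theirs:
            return "peer"
    return "tie"


# values as logged on fs1, i.e. "me" is fs1 and "peer" is fs2
first_connect = sync_source(
    "1:00000002:00000001:0000001c:00000010:00",
    "1:00000002:00000001:0000001b:00000011:10",
)
after_restart = sync_source(
    "1:00000002:00000001:0000001c:00000010:00",
    "1:00000002:00000001:0000001c:00000011:00",
)
print(first_connect)   # me   (fs1 wants to be sync source)
print(after_restart)   # peer (fs2 becomes sync source)
```

in the first connect the fourth field decides (1c on fs1 vs 1b on fs2);
after the restart of fs2 that field is equal and the fifth field decides
(11 on fs2 vs 10 on fs1).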

drbd 0.8 will drop the whole generation counter thing
and switch to uuids, which makes it possible upon connect to
reliably detect previous split brain situations,
regardless of the sequence of events on the nodes during
split brain...  and, in effect, cry out loud.

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :
__
please use the "List-Reply" function of your email client.


