[DRBD-user] Default Split Brain Behaviour

Lew ls at redgrid.net
Fri Jan 28 02:12:07 CET 2011

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi Felix,
 
> > From the logs, I'm curious about the lines...
> > Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.044910] block drbd9: 0 KB
> > (0 bits) marked out-of-sync by on disk bit-map.
> > Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.044929] block drbd9: Marked
> > additional 508 MB as out-of-sync based on AL.
> >
> > ...then a little further down
> > Jan 23 15:53:01 emlsurit-v4 kernel: [ 2756.121108] block drbd9:
> > role( Secondary -> Primary )
> 
> A little further down as in "45 minutes later" ;-)
The first was immediately on start-up; I was doing other things in the interim, then issued individual commands to set the required resources primary,
while the VLAN network connection between the nodes was still down.
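
(For what it's worth, the promotion itself was nothing exotic; per resource it was essentially
the following, with "r9" standing in for the real resource name:

    drbdadm primary r9    # promote this node; the peer was unreachable at the time

and it went through without complaint, just as on the other nodes.)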

> > On both nodes, I also noticed ... block drbd9: helper command:
> > /sbin/drbdadm split-brain minor-9 exit code 127 (0x7f00)
> 
> Not sure about this one, but if you *had* split brain, there would
> have been no sync-back (or any standing DRBD connection).

Still baffled as to how you can get split brain with only one node ever having been primary.
I had no problem setting the resource primary, just as with the other nodes.
What was different in this instance (not sure if it is relevant) was that I did not make any writes to the primary node until after the VLAN between the nodes was restored,
i.e. the KVM instance was not started until some 14 hours after the resource had been set primary following the reboot.
I do recall that I was unable to connect or attach prior to this.

At no time did I issue any command to set the secondary node as primary; all I did, long after the primary had been promoted, was try to connect & attach the secondary, and that attempt failed.
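
(Side note on that helper line: if I understand it correctly, exit code 127 is just the shell's
"command not found", so it would suggest that no split-brain handler is configured or installed,
rather than that a handler actually fired. A handler would normally be declared in drbd.conf
along these lines; the script path below is only the stock example from the docs:

    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";   # notify root on detected split brain
    }

so that log line on its own may not mean much.)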

> The thing about the AL (activity log) is: It is only active when your
> nodes are in sync. Your primary will only sync back hot extents from
> its peer if the peer's data is known to be up to date! It should not
> be possible to lose any data because of sync-back of hot AL extents.
>
> To make this more clear: Whenever your nodes are in sync, DRBD keeps
> track of the last ~500 MB (in your case, it depends on the al_extents
> setting) that were written. This information is stored permanently in
> the metadata of the Primary. When it goes down and comes up again, it
> marks those 500 MB as "out of date". This is helpful: When coming up
> after a hard crash with possible data loss, the Primary can restore
> any lost writes from the Secondary. Note that this will not destroy
> data: The Secondary will become SyncSource only if it's UpToDate.
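
(If I'm reading the documentation right, the numbers line up exactly with the log above: each AL
extent covers 4 MiB, and the default al-extents of 127 gives 127 * 4 MiB = 508 MiB, which is
precisely the "Marked additional 508 MB as out-of-sync based on AL" message. In 8.3-style
configuration that would be the syncer section, e.g.:

    syncer {
        al-extents 127;   # default; 127 extents * 4 MiB each = 508 MiB tracked as "hot"
    }

so the 508 MB marked out-of-sync would just be the hot-extent window, not a measure of how much
data actually differed.)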

That's the thing: the secondary node had been out of action & detached for eight days or so.
The secondary node never became SyncSource, and it is still unsynced and disconnected, which inclines me to suspect that the AL on the primary node has been rolled back by some aberrant means.
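
(The "unsynced and disconnected" status above is the sort of thing visible from the usual checks,
along the lines of:

    cat /proc/drbd       # overall view: connection state, roles, disk states
    drbdadm cstate r9    # connection state of the resource ("r9" again a stand-in name)
    drbdadm dstate r9    # local/peer disk state

on both nodes.)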

> Are you absolutely certain that you lost data? Because from here, it
> sure doesn't look like it. (As a matter of fact, it doesn't even look
> like split brain - has any of your notify scripts fired?)

You can bet the world on it: data was lost.
Although I can't be 100% certain of the point the data rolled back to, from every angle it appears to be the point where the secondary node was manually detached from the resource.

Hope this clarifies things a little.

Thanks again,

Lew


