[DRBD-user] Default Split Brain Behaviour

Lewis Shobbrook lew at redgrid.net
Sat Jan 29 02:43:14 CET 2011


Hi Lars,

> Maybe the logs you posted do not match the incident described.

There are no other logs available for this resource, and no additional information in the system logs that I'm able to find.

> Or you attached to stale data, thinking a rollback had taken place,
> but actually it is just stale data and the more recent data is still
> on the other node.
> But the logs you posted do not show any sync taking place; they even
> clearly show that DRBD refuses to do a sync because it detected data
> divergence.
> There cannot have been a rollback, because there has been no sync,
> again according to the logs you posted.

That's correct: no sync has taken place, and the resource is still un-synced.

> Go back to your logs, and find the logs that match the incident
> described.
> What is the status of that pair of DRBD now?
> Is it actually "cs:Connected, UpToDate/UpToDate" ?
> Find out when it became so, and how. Because, again, the logs you
> showed previously, state, that DRBD refused to connect.
> If it finally synced up and connected anyway, likely someone told it
> to "--discard-my-data" on one of the nodes (or "invalidate" or
> something to that regard).
> And if that has been the side with the data you lost,
> well, then that someone told DRBD to throw it away.
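
For the record, the override you describe would have looked something like the following on the "victim" node (a sketch of the standard DRBD 8.x manual split-brain procedure; with DRY_RUN=1 it only prints the commands, and, to be explicit, nothing like this was ever issued here):

```shell
#!/bin/sh
# Sketch of the manual split-brain override Lars describes, shown only
# to be explicit about what was never run on these nodes.
# RES is the resource name from this thread; DRY_RUN=1 only prints.
RES=x2
DRY_RUN=1

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# On the node whose data is to be discarded (the split-brain victim):
run drbdadm secondary "$RES"
run drbdadm -- --discard-my-data connect "$RES"  # 8.4+: connect --discard-my-data
# On the surviving node, if it went StandAlone, simply reconnect:
# run drbdadm connect "$RES"
```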

The resource nodes are still disconnected and no override has been used to force the situation.
The only commands issued have been drbdadm connect all, drbdadm connect x2, drbdadm primary x2 (on the only node that has ever been primary), and drbdadm attach.
I'm the only one with access to these machines, I can assure you sync has not been forced at any time.

The only log record against this resource in all archived messages prior to the system restart is:
Jan 11 11:30:31 emlsurit-v4 kernel: [7745016.672246] block drbd9: disk( UpToDate -> Diskless )
I expect this is the point at which the drbdadm detach was issued, while the node was primary and active. 
From the command history I can't determine which node the detach was issued from.
Does it matter which node a drbdadm detach is issued from?
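
If I understand DRBD 8.x logging correctly (an assumption on my part), the detaching node logs the local transition "disk( UpToDate -> Diskless )" while its peer logs the peer-disk transition "pdsk( UpToDate -> Diskless )", so grepping each node's archived messages should show where the detach ran:

```shell
#!/bin/sh
# Guess at which node a detach ran, from each node's archived syslog.
# Assumption: the detaching node logs disk(...), its peer logs pdsk(...).
LOG=${LOG:-/var/log/messages}   # placeholder; point at the archived file

if [ -r "$LOG" ]; then
    echo "local detach events on this node:"
    grep 'block drbd9: disk( UpToDate -> Diskless )' "$LOG" || true
    echo "detach events seen from the peer:"
    grep 'block drbd9: pdsk( UpToDate -> Diskless )' "$LOG" || true
fi
```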

I've attached the logs from each node since the point of system restart (both created using grep drbd /var/log/messages).
The resource in question is drbd9.

On node A, the log shows the system starting at Jan 23 15:07:16.
The resource was later promoted to primary before network connectivity between the nodes was restored...
Jan 23 15:53:01 emlsurit-v4 kernel: [ 2756.121108] block drbd9: role( Secondary -> Primary ) 
Jan 23 15:53:01 emlsurit-v4 kernel: [ 2756.122546] block drbd9: Creating new current UUID
A minute later we can see the KVM instance start up and libvirt access the resource...

Jan 23 15:55:06 emlsurit-v4 kernel: [ 2880.172752] type=1503 audit(1295758506.227:17):  operation="open" pid=8340 parent=1787 profile="/usr/lib/libvirt/virt-aa-helper" requested_mask="r::" denied_mask="r::" fsuid=0 ouid=0 name="/dev/drbd9"

Later that evening VLAN connectivity is restored and I issue a drbdadm connect all...
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.806263] block drbd9: conn( StandAlone -> Unconnected ) 
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.806312] block drbd9: Starting receiver thread (from drbd9_worker [2126])
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.806353] block drbd9: receiver (re)started
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.806359] block drbd9: conn( Unconnected -> WFConnection )

The handshake proceeds and split-brains...
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905967] block drbd9: self 49615ABF1622FC55:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:143432 flags:0
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905971] block drbd9: peer 6116B0558277E470:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:336381 flags:0
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905975] block drbd9: uuid_compare()=100 by rule 90
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.906273] block drbd9: helper command: /sbin/drbdadm split-brain minor-9
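
Decoding those two handshake lines (my reading of the DRBD 8.x generation identifiers, so treat the field layout as an assumption): each tuple is current:bitmap:history1:history2. The nodes share the same bitmap and history UUIDs but each has a different current UUID, i.e. both sides wrote independently after their common ancestor, which is the condition uuid_compare() flags as split brain. A much-simplified sketch of that check, using the values from the log:

```shell
#!/bin/sh
# Compare the generation-identifier tuples from the two handshake lines.
# Assumed layout: current:bitmap:history1:history2. This is only a
# simplification of DRBD's real uuid_compare() rule set.
SELF=49615ABF1622FC55:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807
PEER=6116B0558277E470:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807

self_cur=${SELF%%:*}    # current UUID of this node
peer_cur=${PEER%%:*}    # current UUID of the peer
self_hist=${SELF#*:}    # bitmap + history UUIDs of this node
peer_hist=${PEER#*:}    # bitmap + history UUIDs of the peer

if [ "$self_cur" != "$peer_cur" ] && [ "$self_hist" = "$peer_hist" ]; then
    # prints: diverged: common history, different current UUIDs -> split brain
    echo "diverged: common history, different current UUIDs -> split brain"
fi
```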

Perhaps what has been confusing the matter is my initial post associating the split-brain with the data loss.
The node was primary and active prior to any split-brain, and it seems to me that the rollback/loss of data occurred before the split-brain.
The only conceivable possibility to me is still that node A rolled back or discarded changes in its activity log following the restart.
As far as I can determine this occurred prior to the split-brain, while the resource nodes were still disconnected (prior to restoration of network connectivity).

Just to be thorough, I'll export the KVM instance XML and start it up to investigate the other node, but I do not expect to find the missing data there.

Thanks for all the efforts so far. 



-------------- next part --------------
A non-text attachment was scrubbed...
Name: drbd_nodeA.gz
Type: application/x-gzip
Size: 6293 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20110129/86bf8a68/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: drbd_nodeB.gz
Type: application/x-gzip
Size: 3740 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20110129/86bf8a68/attachment-0001.bin>
