[DRBD-user] strange split-brain problem

Tue Dec 7 19:45:30 CET 2010

On Mon, Dec 06, 2010 at 06:08:19PM +0100, Klaus Darilion wrote:
> Hi all!
> 
> Today I had a strange experience.
> 
> node A: 192.168.100.100, cc1-vie
>   /dev/drbd1: primary
>   /dev/drbd5: primary
> 
> node B: 192.168.100.101, cc1-sbg
>   /dev/drbd1: secondary
>   /dev/drbd5: secondary
> 
> The /dev/drbdX devices are used by a xen domU.
> 
> resource manager-ha {
>   startup {
>     become-primary-on cc1-vie;
>   }
>   on cc1-vie {
>     device    /dev/drbd1;
>     disk      /dev/mapper/cc1--vienna-manager--disk--drbd;
>     address   192.168.100.100:7789;
>     meta-disk internal;
>   }
>   on cc1-sbg {
>     device    /dev/drbd1;
>     disk      /dev/mapper/cc1--sbg-manager--disk--drbd;
>     address   192.168.100.101:7789;
>     meta-disk internal;
>   }
> }
> 
> resource cc-manager-templates-ha {
>   startup {
>     become-primary-on cc1-vie;
>   }
>   on cc1-vie {
>     device    /dev/drbd5;
>     disk      /dev/mapper/cc1--vienna-cc--manager--templates--drbd;
>     address   192.168.100.100:7793;
>     meta-disk internal;
>   }
>   on cc1-sbg {
>     device    /dev/drbd5;
>     disk      /dev/mapper/cc1--sbg-cc--manager--templates--drbd;
>     address   192.168.100.101:7793;
>     meta-disk internal;
>   }
> }
> 
> Everything was running fine. Then I rebooted both servers. Then I spotted:
> 
> block drbd5: Starting worker thread (from cqueue [1573])
> block drbd5: disk( Diskless -> Attaching )
> block drbd5: Found 4 transactions (192 active extents) in activity log.
> block drbd5: Method to ensure write ordering: barrier
> block drbd5: Backing device's merge_bvec_fn() = ffffffff81431b10
> block drbd5: max_segment_size ( = BIO size ) = 4096
> block drbd5: drbd_bm_resize called with capacity == 41941688
> block drbd5: resync bitmap: bits=5242711 words=81918
> block drbd5: size = 20 GB (20970844 KB)
> block drbd5: recounting of set bits took additional 0 jiffies
> block drbd5: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> block drbd5: Marked additional 508 MB as out-of-sync based on AL.
> block drbd5: disk( Attaching -> UpToDate )
> 
> 
> This is the first thing which makes me nervous: There were 500MB to
> synchronize although the server was idle and everything was
> synchronized before rebooting.

As was pointed out already,
read up on what we call the activity log.
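
For a sense of scale, here is a rough sketch of the arithmetic. It assumes
the 4 MiB activity log extent size and the al-extents default of 127 used
by DRBD 8.3; neither figure is stated in this thread. Extents that were
still listed in the activity log when a Primary went away without a clean
demote are marked out-of-sync as a whole on the next attach, whether or not
they were actually written to. The function name is made up for
illustration.

```python
# Assumption: one activity log extent covers 4 MiB of the backing device.
AL_EXTENT_MIB = 4


def al_out_of_sync_mib(hot_extents: int) -> int:
    """Amount of data (MiB) marked out-of-sync for a given number of AL extents."""
    return hot_extents * AL_EXTENT_MIB


if __name__ == "__main__":
    # Assumption: al-extents left at 127, the DRBD 8.3 default.
    print(al_out_of_sync_mib(127), "MiB")
```

With those assumptions the result is 508, which is consistent with the
"Marked additional 508 MB as out-of-sync based on AL" line quoted above.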

> Then some more reboots on node A and suddenly:
> 
> block drbd5: State change failed: Refusing to be Primary without at
> least one UpToDate disk
> block drbd5:   state = { cs:WFConnection ro:Secondary/Unknown
> ds:Diskless/DUnknown r--- }
     ^^^^^^^^

You failed to attach, you have not yet connected,
so DRBD refuses to become Primary: which data should it be Primary with?

> Then the status on node A was:
> 
> cc-manager-templates-ha  Connected Primary/Secondary
> Diskless/UpToDate A r----

It was able to establish the connection,
and was going Primary with the data of the peer.
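
The two states quoted so far can be compared with a small sketch. It checks
only the one rule named in the log message ("Refusing to be Primary without
at least one UpToDate disk"); real DRBD state validation has more rules
than this, and the function names are hypothetical.

```python
def parse_state(line: str) -> dict:
    """Collect the key:value tokens (cs, ro, ds) from a DRBD state line."""
    fields = {}
    for token in line.split():
        if ":" in token:
            key, value = token.split(":", 1)
            fields[key] = value
    return fields


def has_uptodate_disk(line: str) -> bool:
    """True if the local or the peer disk is UpToDate (simplified check)."""
    local_disk, peer_disk = parse_state(line)["ds"].split("/")
    return "UpToDate" in (local_disk, peer_disk)


if __name__ == "__main__":
    # Not attached, not connected: no data to be Primary with.
    print(has_uptodate_disk("cs:WFConnection ro:Secondary/Unknown ds:Diskless/DUnknown r---"))
    # Still diskless locally, but connected to an UpToDate peer.
    print(has_uptodate_disk("cs:Connected ro:Primary/Secondary ds:Diskless/UpToDate A r----"))
```

The first state has no UpToDate disk on either side, so promotion is
refused; the second has the peer's UpToDate disk available over the
connection.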

> When I tried to manually attach the device I got error messages:
> "Split-Brain detected, dropping connection".

Hm.  Ugly.
It should refuse the attach instead.
Did it just get the error message wrong,
or did it actually disconnect there?
What DRBD version would that be?

> After some googling without finding any hint suddenly the status changed:
> 
> cc-manager-templates-ha  StandAlone Primary/Unknown
> UpToDate/DUnknown r---- xen-vbd: _cc-manager
> 
> 
> So, suddenly this one device is not connected anymore. All the other
> drbd devices are still connected and working fine - only this single
> device is making problems, although it has identical configuration.
> 
> 
> What could cause such an issue? Everything was working fine, I just
> rebooted the servers.
> 
> Any hints what to do now to solve this issue?

Your setup is broken.
Apparently something in your boot process, at least "sometimes",
claims the lower level devices so DRBD fails to attach.
 Fix that.

Your shutdown process is apparently broken enough that it does
not really shut down everything and demote/down DRBD,
so it stays Primary. That makes an "orderly" shutdown/reboot
look like a Primary crash to DRBD.
 Fix that.

Are you sure that you have been the only one tampering with DRBD at the
time, or would heartbeat/pacemaker/whatever try to do something at the
same time?

And, BTW, no.
Your /etc/hosts file has zero to do with how DRBD behaves.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed