Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all!
Today I had a strange experience.
node A: 192.168.100.100, cc1-vie
/dev/drbd1: primary
/dev/drbd5: primary
node B: 192.168.100.101, cc1-sbg
/dev/drbd1: secondary
/dev/drbd5: secondary
The /dev/drbdX devices are used by a xen domU.
resource manager-ha {
  startup {
    become-primary-on cc1-vie;
  }
  on cc1-vie {
    device /dev/drbd1;
    disk /dev/mapper/cc1--vienna-manager--disk--drbd;
    address 192.168.100.100:7789;
    meta-disk internal;
  }
  on cc1-sbg {
    device /dev/drbd1;
    disk /dev/mapper/cc1--sbg-manager--disk--drbd;
    address 192.168.100.101:7789;
    meta-disk internal;
  }
}
resource cc-manager-templates-ha {
  startup {
    become-primary-on cc1-vie;
  }
  on cc1-vie {
    device /dev/drbd5;
    disk /dev/mapper/cc1--vienna-cc--manager--templates--drbd;
    address 192.168.100.100:7793;
    meta-disk internal;
  }
  on cc1-sbg {
    device /dev/drbd5;
    disk /dev/mapper/cc1--sbg-cc--manager--templates--drbd;
    address 192.168.100.101:7793;
    meta-disk internal;
  }
}
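
For reference, the two resources only differ in minor number, backing disk and TCP port. To double-check how DRBD actually parses them and what state they are in, something like the following should do (plain drbdadm calls, nothing exotic):

  drbdadm dump manager-ha
  drbdadm dump cc-manager-templates-ha
  cat /proc/drbd                            # connection/role/disk state of all minors
  drbdadm cstate cc-manager-templates-ha    # connection state
  drbdadm dstate cc-manager-templates-ha    # local/peer disk state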
Everything was running fine. Then I rebooted both servers, and afterwards I spotted this in the kernel log:
block drbd5: Starting worker thread (from cqueue [1573])
block drbd5: disk( Diskless -> Attaching )
block drbd5: Found 4 transactions (192 active extents) in activity log.
block drbd5: Method to ensure write ordering: barrier
block drbd5: Backing device's merge_bvec_fn() = ffffffff81431b10
block drbd5: max_segment_size ( = BIO size ) = 4096
block drbd5: drbd_bm_resize called with capacity == 41941688
block drbd5: resync bitmap: bits=5242711 words=81918
block drbd5: size = 20 GB (20970844 KB)
block drbd5: recounting of set bits took additional 0 jiffies
block drbd5: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd5: Marked additional 508 MB as out-of-sync based on AL.
block drbd5: disk( Attaching -> UpToDate )
This is the first thing that makes me nervous: there were about 500 MB to synchronize, although the server was idle and everything was fully synchronized before the reboot.
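
If I understand the activity-log handling correctly, that is where the ~500 MB comes from: on attach, DRBD marks every extent that was still hot in the AL as out-of-sync, each AL extent covers 4 MiB, and the default of 127 extents (if I remember the default al-extents value correctly) is exactly 508 MiB. So it looks as if the whole default-sized AL window gets resynced after every reboot. Roughly, to check the counter and whether al-extents is set at all:

  cat /proc/drbd                                          # the oos: field shows out-of-sync KiB per minor
  drbdadm dump cc-manager-templates-ha | grep al-extents  # empty if al-extents is not set explicitly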
Then I did some more reboots on node A and spotted again:
block drbd5: Starting worker thread (from cqueue [1630])
block drbd5: disk( Diskless -> Attaching )
block drbd5: Found 4 transactions (126 active extents) in activity log.
block drbd5: Method to ensure write ordering: barrier
block drbd5: Backing device's merge_bvec_fn() = ffffffff81431b10
block drbd5: max_segment_size ( = BIO size ) = 4096
block drbd5: drbd_bm_resize called with capacity == 41941688
block drbd5: resync bitmap: bits=5242711 words=81918
block drbd5: size = 20 GB (20970844 KB)
block drbd5: recounting of set bits took additional 0 jiffies
block drbd5: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd5: Marked additional 488 MB as out-of-sync based on AL.
block drbd5: disk( Attaching -> UpToDate )
So why is it resynchronizing almost 500 MB again, although the partition is not used at all (it is only mounted in a domU)?
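
To rule out that the "idle" domU is quietly writing after all (journal commits, atime updates and the like would keep AL extents hot), one could watch the per-device counters for a while, e.g.:

  watch -n 5 cat /proc/drbd   # dw: (KiB written to the backing disk) and al: (activity-log updates)
                              # should stay constant on a really idle device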
After some more reboots on node A, suddenly this appeared:
block drbd5: State change failed: Refusing to be Primary without at least one UpToDate disk
block drbd5: state = { cs:WFConnection ro:Secondary/Unknown ds:Diskless/DUnknown r--- }
block drbd5: wanted = { cs:WFConnection ro:Primary/Unknown ds:Diskless/DUnknown r--- }
block drbd5: State change failed: Refusing to be Primary without at least one UpToDate disk
block drbd5: state = { cs:WFConnection ro:Secondary/Unknown ds:Diskless/DUnknown r--- }
block drbd5: wanted = { cs:WFConnection ro:Primary/Unknown ds:Diskless/DUnknown r--- }
Then the status on node A was:
cc-manager-templates-ha Connected Primary/Secondary Diskless/UpToDate A r----
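
If I read the Diskless/UpToDate part correctly, the local backing device never got attached at boot; the attach can be retried by hand with essentially:

  drbdadm attach cc-manager-templates-ha
  dmesg | tail -n 30   # check the kernel log for details if the attach fails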
When I tried to attach the device manually, I only got error messages: "Split-Brain detected, dropping connection".
After some googling without finding any hint, the status suddenly changed:
cc-manager-templates-ha StandAlone Primary/Unknown UpToDate/DUnknown r---- xen-vbd: _cc-manager
So suddenly this one device is not connected anymore. All the other DRBD devices are still connected and working fine; only this single one is causing problems, although its configuration is identical.
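
I assume the immediate split brain could be resolved with the usual procedure from the user's guide, roughly like this (assuming cc1-sbg is the side whose changes get thrown away; I have not run this yet):

  # on the node to be overwritten (assumption: cc1-sbg / node B)
  drbdadm secondary cc-manager-templates-ha
  drbdadm -- --discard-my-data connect cc-manager-templates-ha

  # on the surviving node (cc1-vie), if it is still StandAlone
  drbdadm connect cc-manager-templates-ha

But that would only cure the symptom, not explain it.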
What could cause such an issue? Everything was working fine; I just rebooted the servers.
Any hints on what to do now to solve this?
thanks
Klaus