Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Fri, 03 Dec 2010 09:43:11 +0100, Lars Ellenberg wrote:

> See if my post
> [DRBD-user] DRBD Failover Not Working after Cold Shutdown of Primary
> dated Tue Jan 8 11:56:00 CET 2008 helps.
> http://lists.linbit.com/pipermail/drbd-user/2008-January/008223.html
> and other archives

Perhaps I'm still not grasping this, but - based on that URL - I thought
the situation below would make use of degr-wfc-timeout:

I had two nodes, both primary. Using iptables, I "broke" the connection
between them. Both nodes were still up, but reporting:

 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

and

 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

I then crashed ("xm destroy") one node and then booted it. As I
understand the above-cited post, this should have made use of the
degr-wfc-timeout value but - apparently - it did not:

 Starting drbd: Starting DRBD resources: [ drbd1
 Found valid meta data in the expected location, 16105058304 bytes into
 /dev/xvdb1.
 d(drbd1) drbd: bd_claim(cfe1ad00,cc00c800); failed [d108e4d0;c0478e79;1]
 1: Failure: (114) Lower device is already claimed. This usually means
 it is mounted.
 [drbd1] cmd /sbin/drbdsetup 1 disk /dev/xvdb1 /dev/xvdb1 internal
 --set-defaults --create-device failed - continuing!
 s(drbd1) n(drbd1) ]..........
 ***************************************************************
  DRBD's startup script waits for the peer node(s) to appear.
  - In case this node was already a degraded cluster before the
    reboot the timeout is 60 seconds. [degr-wfc-timeout]
  - If the peer was available before the reboot the timeout will
    expire after 0 seconds. [wfc-timeout]
  (These values are for resource 'drbd1'; 0 sec -> wait forever)
  To abort waiting enter 'yes' [ 208]:

What am I doing/understanding wrong?

> BTW, that setting only affects drbdadm/drbdsetup wait-connect, as used
> for example by the init script, if used without an explicit timeout. It
> does not affect anything else.
>
> What is it you are trying to prove/trying to achieve?

At this point, I'm trying to understand DRBD. Specifically, in this case
I'm trying to understand the startup process and how it deals with
various partition/split-brain cases. I come from a Cluster Suite world,
where "majority voting" is the answer to these issues, so I'm working to
come up to speed on how DRBD addresses them.

The idea of waiting forever seems like a problem if only one node is
available to go back into production. I know that the wait can be
overridden manually, but is there a way to not wait forever? This is the
context in which I started looking at degr-wfc-timeout.

FWIW, I've also posted in the thread "RedHat Clustering Services does
not fence when DRBD breaks" trying to understand the fencing process. I
think I managed to suspend all I/O in the case of a fence failure (the
handler returning a value of 6), but I'm not sure. Does:

 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s----
    ns:0 nr:0 dw:4096 dr:28 al:1 bm:3 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b
    oos:0

indicate suspension? Is that what "s----" means? I've failed to find
documentation for that bit of string in /proc/drbd.

Thanks...

Andrew
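
P.S. For anyone finding this in the archives later: the two timeouts I'm
asking about live in the startup section of drbd.conf. A minimal sketch
matching the values the init script reported above (illustrative only,
not a copy of my actual config):

  resource drbd1 {
    startup {
      # Normal boot, peer was healthy before the reboot: how long
      # the init script's wait-connect blocks. 0 means wait forever.
      wfc-timeout      0;

      # Boot of a node that was already part of a degraded cluster
      # before the reboot: give up waiting after 60 seconds.
      degr-wfc-timeout 60;
    }
    # disk, net and on <host> sections omitted
  }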
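
The way I "broke" the connection was dropping DRBD's replication
traffic with iptables on one node, something like the following (7788
here is just the customary default port; substitute whatever port your
resource actually uses):

  # Drop DRBD traffic in both directions on this node.
  iptables -A INPUT  -p tcp --dport 7788 -j DROP
  iptables -A OUTPUT -p tcp --dport 7788 -j DROP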
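
And the fencing setup I mention near the end is the usual policy plus
handler pair. Again only a sketch, with a placeholder handler path:

  resource drbd1 {
    disk {
      # resource-and-stonith freezes I/O on the resource while the
      # fence-peer handler runs - the suspension I'm asking about.
      fencing resource-and-stonith;
    }
    handlers {
      # Placeholder path; DRBD interprets this script's exit code
      # to decide whether the peer was successfully fenced/outdated.
      fence-peer "/usr/local/sbin/my-fence-peer.sh";
    }
  }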