Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Fri, 03 Dec 2010 09:43:11 +0100, Lars Ellenberg wrote:
> See if my post
> [DRBD-user] DRBD Failover Not Working after Cold Shutdown of Primary
> dated Tue Jan 8 11:56:00 CET 2008 helps.
> http://lists.linbit.com/pipermail/drbd-user/2008-January/008223.html and
> other archives
Perhaps I'm still not grasping this, but - based on that URL - I thought
the situation below would make use of degr-wfc-timeout:
I had two nodes, both Primary.
Using iptables (rules sketched below), I "broke" the connection. Both
nodes were still up, but reporting:
1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
and
1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
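For reference, the rules I used to break the link were along these lines
(a sketch; 7788 is the port from my resource's "address" line, so adjust
if yours differs):

    # Drop DRBD replication traffic in both directions (run on one node).
    iptables -A INPUT  -p tcp --dport 7788 -j DROP
    iptables -A OUTPUT -p tcp --dport 7788 -j DROP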
I then crashed ("xm destroy") one node and then booted it. As I
understand the above-cited post, this should have made use of the degr-
wfc-timeout value, but apparently it did not:
Starting drbd: Starting DRBD resources: [
drbd1
Found valid meta data in the expected location, 16105058304 bytes into /dev/xvdb1.
d(drbd1) drbd: bd_claim(cfe1ad00,cc00c800); failed [d108e4d0;c0478e79;1]
1: Failure: (114) Lower device is already claimed. This usually means it is mounted.
[drbd1] cmd /sbin/drbdsetup 1 disk /dev/xvdb1 /dev/xvdb1 internal --set-defaults --create-device failed - continuing!
s(drbd1) n(drbd1) ]..........
***************************************************************
DRBD's startup script waits for the peer node(s) to appear.
- In case this node was already a degraded cluster before the
reboot the timeout is 60 seconds. [degr-wfc-timeout]
- If the peer was available before the reboot the timeout will
expire after 0 seconds. [wfc-timeout]
(These values are for resource 'drbd1'; 0 sec -> wait forever)
To abort waiting enter 'yes' [ 208]:
What am I doing/understanding wrong?
> BTW, that setting only affects drbdadm/drbdsetup wait-connect, as used
> for example by the init script, if used without an explicit timeout. It
> does not affect anything else.
>
> What is it you are trying to prove/trying to achieve?
At this point, I'm trying to understand DRBD. Specifically, in this case
I'm trying to understand the startup process and how it deals with
various partition/split-brain cases. I come from a Cluster Suite world,
where "majority voting" is the answer to these issues, so I'm working to
come up to speed on how DRBD addresses them.
The idea of waiting forever seems like a problem if only one node is
available to go back into production. I know that the wait can be
overridden manually, but is there a way to not wait forever?
This is the context in which I started looking at degr-wfc-timeout.
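What I was hoping would work is something along these lines in the
resource's startup section (just a sketch; the timeout values here are
arbitrary):

    startup {
        # Give up waiting for the peer after 120s if the cluster
        # was healthy before the restart (0 means wait forever).
        wfc-timeout      120;
        # Give up after 60s if this node was already degraded
        # before the restart.
        degr-wfc-timeout  60;
    }

As I read Lars's comment above, these only matter for wait-connect as
invoked by the init script, so presumably this is the right knob for the
boot-time wait.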
FWIW, I've also posted in the thread "RedHat Clustering Services does not
fence when DRBD breaks" trying to understand the fencing process. I
think I managed to suspend all I/O in the case of a fence failure (the
handler returning a value of 6), but I'm not sure. Does:
1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s----
ns:0 nr:0 dw:4096 dr:28 al:1 bm:3 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
indicate suspension? Is that what "s----" means? I've been unable to
find documentation for that part of the /proc/drbd status line.
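For reference, the dummy fence-peer handler I used to force the failure
case is essentially this (a sketch; the script path is my own choice,
and as I understand it DRBD exports the resource name in
$DRBD_RESOURCE):

    #!/bin/sh
    # Wired up via: handlers { fence-peer "/usr/local/sbin/drbd-fence-dummy.sh"; }
    # Always report failure (exit 6) so DRBD keeps I/O suspended.
    logger "dummy fence-peer handler called for resource $DRBD_RESOURCE"
    exit 6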
Thanks...
Andrew