Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Fri, 03 Dec 2010 09:43:11 +0100, Lars Ellenberg wrote:

> See if my post
> [DRBD-user] DRBD Failover Not Working after Cold Shutdown of Primary
> dated Tue Jan 8 11:56:00 CET 2008 helps.
> http://lists.linbit.com/pipermail/drbd-user/2008-January/008223.html
> and other archives

Perhaps I'm still not grasping this, but - based on that URL - I thought
the situation below would make use of degr-wfc-timeout:

I had two nodes, both primary. Using iptables, I "broke" the connection
between them. Both nodes were still up, but reporting:

 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

and

 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

I then crashed ("xm destroy") one node and then booted it. As I
understand the above-cited post, this should have made use of the
degr-wfc-timeout value but - apparently - it did not:

 Starting drbd: Starting DRBD resources: [ drbd1
 Found valid meta data in the expected location, 16105058304 bytes into
 /dev/xvdb1.
 d(drbd1) drbd: bd_claim(cfe1ad00,cc00c800); failed [d108e4d0;c0478e79;1]
 1: Failure: (114) Lower device is already claimed. This usually means
 it is mounted.
 [drbd1] cmd /sbin/drbdsetup 1 disk /dev/xvdb1 /dev/xvdb1 internal
 --set-defaults --create-device failed - continuing!
 s(drbd1) n(drbd1) ]..........
 ***************************************************************
  DRBD's startup script waits for the peer node(s) to appear.
  - In case this node was already a degraded cluster before the
    reboot the timeout is 60 seconds. [degr-wfc-timeout]
  - If the peer was available before the reboot the timeout will
    expire after 0 seconds. [wfc-timeout]
  (These values are for resource 'drbd1'; 0 sec -> wait forever)
  To abort waiting enter 'yes' [ 208]:

What am I doing/understanding wrong?

> BTW, that setting only affects drbdadm/drbdsetup wait-connect, as used
> for example by the init script, if used without an explicit timeout. It
> does not affect anything else.
>
> What is it you are trying to prove/trying to achieve?

At this point, I'm trying to understand DRBD. Specifically, in this case
I'm trying to understand the startup process and how it deals with
various partition/split-brain cases. I come from a Cluster Suite world,
where "majority voting" is the answer to these issues, so I'm working to
come up to speed on how DRBD addresses them.

The idea of waiting forever seems like a problem if only one node is
available to go back into production. I know that the wait can be
overridden manually, but is there a way to not wait forever? This is the
context in which I started looking at degr-wfc-timeout.

FWIW, I've also posted in the thread "RedHat Clustering Services does
not fence when DRBD breaks" trying to understand the fencing process. I
think I managed to suspend all I/O in the case of a fence failure (the
handler returning a value of 6), but I'm not sure. Does:

 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s----
    ns:0 nr:0 dw:4096 dr:28 al:1 bm:3 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b
    oos:0

indicate suspension? Is that what "s----" means? I've failed to find
documentation for that bit of string in /proc/drbd.

Thanks...

Andrew
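
P.S. For anyone finding this in the archives later: the two timeouts I'm
asking about live in the startup section of drbd.conf. A minimal sketch
matching the values the init script reported above (illustrative only,
not a copy of my actual config):

  resource drbd1 {
    startup {
      # Normal boot, peer was healthy before the reboot: how long
      # the init script's wait-connect blocks. 0 means wait forever.
      wfc-timeout      0;

      # Boot of a node that was already part of a degraded cluster
      # before the reboot: give up waiting after 60 seconds.
      degr-wfc-timeout 60;
    }
    # disk, net and on <host> sections omitted
  }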
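
The way I "broke" the connection was dropping DRBD's replication
traffic with iptables on one node, something like the following (7788
here is just the customary default port; substitute whatever port your
resource actually uses):

  # Drop DRBD traffic in both directions on this node.
  iptables -A INPUT  -p tcp --dport 7788 -j DROP
  iptables -A OUTPUT -p tcp --dport 7788 -j DROP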
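
And the fencing setup I mention near the end is the usual policy plus
handler pair. Again only a sketch, with a placeholder handler path:

  resource drbd1 {
    disk {
      # resource-and-stonith freezes I/O on the resource while the
      # fence-peer handler runs - the suspension I'm asking about.
      fencing resource-and-stonith;
    }
    handlers {
      # Placeholder path; DRBD interprets this script's exit code
      # to decide whether the peer was successfully fenced/outdated.
      fence-peer "/usr/local/sbin/my-fence-peer.sh";
    }
  }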