I think this is a problem with DRBD and not cman+pacemaker, so I'm posting here first.

I'm trying to set up an active/active HA cluster as explained in:

<http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08.html>

I'll give versions and config files below, but I'll start with what happens.

I start with an active/active cman+pacemaker+drbd+gfs2 cluster, with fencing enabled. My fencing mechanism cuts power to a node by turning the load off in its UPS. The two nodes are hypatia-tb and orestes-tb.

I want to test fencing and recovery. I start with both nodes running, and resources properly running on both nodes. Then I simulate failure on one node, e.g., orestes-tb. I've done this with "crm node standby", "service pacemaker off", or by pulling the plug. The results are the same: all the resources move to hypatia-tb, with the drbd resource as Primary.

When I try to bring orestes-tb back into the cluster with "crm node online" or "service pacemaker on" (the inverse of how I removed it), orestes-tb is fenced. OK, that makes sense, I guess; there's a potential split-brain situation.

I bring orestes-tb back up, with the intent of adding it back into the cluster. I make sure the cman, pacemaker, and drbd services are off at system start. On orestes-tb, I type "service drbd start".

What I expect to happen is that the drbd resource on orestes-tb is marked "Outdated" or something like that. Then I'd fix it with "drbdadm --discard-my-data connect admin" or whatever is appropriate, as in

<http://www.drbd.org/users-guide/s-resolve-split-brain.html>

What actually happens is that hypatia-tb is fenced. Since this is the node running all the resources, this is bad behavior. It's even more puzzling when I consider that, at the time, there isn't any fencing resource actually running on orestes-tb; my guess is that DRBD on hypatia-tb is fencing itself.

Eventually hypatia-tb reboots, I fiddle with things, and the cluster is back to normal.
But as a fencing/stability/HA test, this is a failure. Any ideas?

Versions:

Scientific Linux 6.2
kernel 2.6.32
cman-3.0.12
corosync-1.4.1
pacemaker-1.1.6
drbd-8.4.1

/etc/drbd.d/global-common.conf:

global {
        usage-count yes;
}

common {
        startup {
                wfc-timeout 60;
                degr-wfc-timeout 60;
                outdated-wfc-timeout 60;
        }
        net {
                ping-timeout 11;
        }
}

/etc/drbd.d/admin.res:

resource admin {
        protocol C;

        on hypatia-tb.nevis.columbia.edu {
                volume 0 {
                        device /dev/drbd0;
                        disk /dev/md2;
                        flexible-meta-disk internal;
                }
                address 192.168.100.7:7788;
        }
        on orestes-tb.nevis.columbia.edu {
                volume 0 {
                        device /dev/drbd0;
                        disk /dev/md2;
                        flexible-meta-disk internal;
                }
                address 192.168.100.6:7788;
        }

        startup {
        }

        net {
                allow-two-primaries yes;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
                sndbuf-size 0;
        }

        disk {
                resync-rate 100M;
                c-max-rate 100M;
                al-extents 3389;
                fencing resource-only;
        }
}

An edited output of "crm configure show":

node hypatia-tb.nevis.columbia.edu
node orestes-tb.nevis.columbia.edu
primitive StonithHypatia stonith:fence_nut \
        params pcmk_host_check="static-list" \
        pcmk_host_list="hypatia-tb.nevis.columbia.edu" \
        ups="sofia-ups" username="admin" password="XXX"
primitive StonithOrestes stonith:fence_nut \
        params pcmk_host_check="static-list" \
        pcmk_host_list="orestes-tb.nevis.columbia.edu" \
        ups="dc-test-stand-ups" username="admin" password="XXX"
location StonithHypatiaLocation StonithHypatia \
        -inf: hypatia-tb.nevis.columbia.edu
location StonithOrestesLocation StonithOrestes \
        -inf: orestes-tb.nevis.columbia.edu

/etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster config_version="17" name="Nevis_HA">
  <logging debug="off"/>
  <cman expected_votes="1" two_node="1" />
  <clusternodes>
    <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
      <altname name="hypatia-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
      <altname name="orestes-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
  <fence_daemon post_join_delay="30" />
  <rm disabled="1" />
</cluster>

The log messages on orestes-tb, just before hypatia-tb is fenced (there are no messages in the hypatia-tb log for this time):

Feb 15 16:52:27 orestes-tb kernel: drbd: initialized. Version: 8.4.1 (api:1/proto:86-100)
Feb 15 16:52:27 orestes-tb kernel: drbd: GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by root@orestes-tb.nevis.columbia.edu, 2012-02-14 17:05:32
Feb 15 16:52:27 orestes-tb kernel: drbd: registered as block device major 147
Feb 15 16:52:27 orestes-tb kernel: d-con admin: Starting worker thread (from drbdsetup [2570])
Feb 15 16:52:27 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
Feb 15 16:52:27 orestes-tb kernel: d-con admin: Method to ensure write ordering: barrier
Feb 15 16:52:27 orestes-tb kernel: block drbd0: max BIO size = 130560
Feb 15 16:52:27 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing device's (32 -> 768)
Feb 15 16:52:27 orestes-tb kernel: block drbd0: drbd_bm_resize called with capacity == 5611549368
Feb 15 16:52:27 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671 words=10960058 pages=21407
Feb 15 16:52:27 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
Feb 15 16:52:28 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took 634 jiffies
Feb 15 16:52:28 orestes-tb kernel: block drbd0: recounting of set bits took additional 92 jiffies
Feb 15 16:52:28 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Feb 15 16:52:28 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
Feb 15 16:52:28 orestes-tb kernel: block drbd0: attached to UUIDs F5355FCF6114F218:0000000000000000:8A5519C7090D6BD6:8A5419C7090D6BD6
Feb 15 16:52:28 orestes-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
Feb 15 16:52:28 orestes-tb kernel: d-con admin: Starting receiver thread (from drbd_w_admin [2572])
Feb 15 16:52:28 orestes-tb kernel: d-con admin: receiver (re)started
Feb 15 16:52:28 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
Feb 15 16:52:29 orestes-tb kernel: d-con admin: Handshake successful: Agreed network protocol version 100
Feb 15 16:52:29 orestes-tb kernel: d-con admin: conn( WFConnection -> WFReportParams )
Feb 15 16:52:29 orestes-tb kernel: d-con admin: Starting asender thread (from drbd_r_admin [2579])
Feb 15 16:52:29 orestes-tb kernel: block drbd0: drbd_sync_handshake:
Feb 15 16:52:29 orestes-tb kernel: block drbd0: self F5355FCF6114F218:0000000000000000:8A5519C7090D6BD6:8A5419C7090D6BD6 bits:0 flags:0
Feb 15 16:52:29 orestes-tb kernel: block drbd0: peer 06B93A6C54D6D631:F5355FCF6114F219:8A5519C7090D6BD6:8A5419C7090D6BD6 bits:615 flags:0
Feb 15 16:52:29 orestes-tb kernel: block drbd0: uuid_compare()=-1 by rule 50
Feb 15 16:52:29 orestes-tb kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Feb 15 16:52:29 orestes-tb kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 39(1), total 39; compression: 100.0%
Feb 15 16:52:29 orestes-tb kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 39(1), total 39; compression: 100.0%
Feb 15 16:52:29 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
Feb 15 16:52:50 orestes-tb kernel: d-con admin: PingAck did not arrive in time.
Feb 15 16:52:50 orestes-tb kernel: d-con admin: peer( Primary -> Unknown ) conn( WFSyncUUID -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Feb 15 16:52:50 orestes-tb kernel: d-con admin: asender terminated
Feb 15 16:52:50 orestes-tb kernel: d-con admin: Terminating asender thread
Feb 15 16:52:51 orestes-tb kernel: block drbd0: bitmap WRITE of 3 pages took 247 jiffies
Feb 15 16:52:51 orestes-tb kernel: block drbd0: 2460 KB (615 bits) marked out-of-sync by on disk bit-map.
Feb 15 16:52:51 orestes-tb kernel: d-con admin: Connection closed
Feb 15 16:52:51 orestes-tb kernel: d-con admin: conn( NetworkFailure -> Unconnected )
Feb 15 16:52:51 orestes-tb kernel: d-con admin: receiver terminated
Feb 15 16:52:51 orestes-tb kernel: d-con admin: Restarting receiver thread
Feb 15 16:52:51 orestes-tb kernel: d-con admin: receiver (re)started
Feb 15 16:52:51 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
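
For anyone reading a log like the one in this message, it can help to reduce it to just the DRBD state transitions. The following is a minimal sketch in Python, assuming the "name( Old -> New )" format seen in the kernel messages here; the function name drbd_transitions and the sample list are illustrative only and are not part of DRBD or its tools.

```python
import re

# Matches DRBD state changes such as "conn( WFSyncUUID -> NetworkFailure )".
TRANSITION = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")
# Matches the syslog prefix, e.g. "Feb 15 16:52:50 orestes-tb kernel: ".
PREFIX = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) kernel: ")


def drbd_transitions(lines):
    """Yield (timestamp, field, old, new) for each state change found.

    Lines without the expected syslog prefix are skipped (hypothetical
    helper; the format is assumed from the log in this message).
    """
    for line in lines:
        m = PREFIX.match(line)
        if not m:
            continue
        for field, old, new in TRANSITION.findall(line):
            yield (m.group(1), field, old, new)


# Three lines copied from the orestes-tb log.
sample = [
    "Feb 15 16:52:29 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )",
    "Feb 15 16:52:50 orestes-tb kernel: d-con admin: PingAck did not arrive in time.",
    "Feb 15 16:52:50 orestes-tb kernel: d-con admin: peer( Primary -> Unknown ) "
    "conn( WFSyncUUID -> NetworkFailure ) pdsk( UpToDate -> DUnknown )",
]

for row in drbd_transitions(sample):
    print(*row)
```

Feeding it the whole log (for example, the lines of /var/log/messages filtered for "drbd0" or "d-con") gives a compact timeline of the connection and disk states on that node.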