[DRBD-user] DRBD+Pacemaker: Won't promote with only one node

Fri Jan 6 17:54:20 CET 2012

> On Thu, 5 Jan 2012 23:01:24 +0100, Florian Haas <florian at hastexo.com> wrote:
>
> On Thu, Jan 5, 2012 at 6:36 PM, William Seligman
> <seligman at nevis.columbia.edu> wrote:
>> Sure. I didn't do this before, since the configuration is complex. I also don't
>> know which would be more comprehensible, so I've attached both cib.xml and the
>> result of "crm configure show". I should mention that I'm a lazy bum, so I use
>> crm-gui to configure corosync; that's why these files look more baroque than usual.

> In the CIB you posted, both nodes are in the "UNCLEAN (offline)"
> state. In that state, nothing gets promoted, nothing gets started. Are
> you sure you posted the right CIB dump?

That made me panic! I rushed to check that my cluster was running.
It's OK; a panic once in a while is good.

I assume you're talking about the following lines:

    <nodes>
      <node id="orestes.nevis.columbia.edu" uname="orestes.nevis.columbia.edu" type="normal">
        <instance_attributes id="nodes-orestes.nevis.columbia.edu">
          <nvpair id="nodes-orestes.nevis.columbia.edu-standby" name="standby" value="off"/>
        </instance_attributes>
      </node>
      <node id="hypatia.nevis.columbia.edu" uname="hypatia.nevis.columbia.edu" type="normal">
        <instance_attributes id="nodes-hypatia.nevis.columbia.edu">
          <nvpair id="nodes-hypatia.nevis.columbia.edu-standby" name="standby" value="off"/>
        </instance_attributes>
      </node>
    </nodes>

For both nodes, the attribute "standby" is set to "off", which evidently means
that neither node is in standby mode, so both are online. According to a web
page I found at home last night, but can't find at work today, "standby:off"
gets added to a node's attributes if the node was ever marked "UNCLEAN
(offline)" and then brought online again; this has indeed happened with both
my nodes.
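
For anyone who wants to check the same attribute in their own CIB dump, here
is a minimal sketch in Python. It assumes the <nodes> section looks like the
excerpt above; the function name standby_by_node is made up for illustration.
Note that it only reads the configured "standby" attribute; by itself it does
not show whether a node is currently online.

```python
import xml.etree.ElementTree as ET

# Sample <nodes> section, copied from the CIB excerpt in this message.
NODES_XML = """
<nodes>
  <node id="orestes.nevis.columbia.edu" uname="orestes.nevis.columbia.edu" type="normal">
    <instance_attributes id="nodes-orestes.nevis.columbia.edu">
      <nvpair id="nodes-orestes.nevis.columbia.edu-standby" name="standby" value="off"/>
    </instance_attributes>
  </node>
  <node id="hypatia.nevis.columbia.edu" uname="hypatia.nevis.columbia.edu" type="normal">
    <instance_attributes id="nodes-hypatia.nevis.columbia.edu">
      <nvpair id="nodes-hypatia.nevis.columbia.edu-standby" name="standby" value="off"/>
    </instance_attributes>
  </node>
</nodes>
"""


def standby_by_node(nodes_xml):
    """Return {uname: standby value}, with None if no standby attribute is set.

    Hypothetical helper: it inspects only the <nodes> configuration section.
    """
    root = ET.fromstring(nodes_xml)
    result = {}
    for node in root.iter("node"):
        value = None
        for nvpair in node.iter("nvpair"):
            if nvpair.get("name") == "standby":
                value = nvpair.get("value")
        result[node.get("uname")] = value
    return result


if __name__ == "__main__":
    for uname, standby in standby_by_node(NODES_XML).items():
        print(f"{uname}: standby={standby}")
```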

To confirm, the first few lines from running "crm status" are:

============
Last updated: Fri Jan  6 11:29:23 2012
Stack: openais
Current DC: hypatia.nevis.columbia.edu - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, 2 expected votes
25 Resources configured.
============

Online: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]

 Master/Slave Set: Admin
     Masters: [ hypatia.nevis.columbia.edu ]
     Slaves: [ orestes.nevis.columbia.edu ]
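
To check this from a script rather than by eye, something like the following
Python sketch could pull the master/slave sets out of text in the format
shown above. It assumes the layout printed by this version of "crm status";
the function names are hypothetical, and other Pacemaker versions may format
the output differently.

```python
import re

# Sample text in the format of the "crm status" output in this message.
STATUS = """\
Online: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]

 Master/Slave Set: Admin
     Masters: [ hypatia.nevis.columbia.edu ]
     Slaves: [ orestes.nevis.columbia.edu ]
"""


def master_slave_sets(status_text):
    """Return {set name: {"Masters"/"Slaves"/"Stopped": [names]}}."""
    sets = {}
    current = None
    for line in status_text.splitlines():
        header = re.match(r"\s*Master/Slave Set:\s*(\S+)", line)
        if header:
            current = sets.setdefault(header.group(1), {})
            continue
        role = re.match(r"\s*(Masters|Slaves|Stopped):\s*\[(.*)\]", line)
        if role and current is not None:
            current[role.group(1)] = role.group(2).split()
    return sets


def sets_without_master(status_text):
    """Return the names of master/slave sets that list no master."""
    parsed = master_slave_sets(status_text)
    return [name for name, roles in parsed.items() if not roles.get("Masters")]


if __name__ == "__main__":
    print(master_slave_sets(STATUS))
    print("Sets without a master:", sets_without_master(STATUS))
```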

Getting back to my question, I hunted through my logs and saw that during the
"one node; no promotion" period, crm reported the above status as:

 Master/Slave Set: Admin
     Slaves: [ hypatia.nevis.columbia.edu ]
     Stopped: [ AdminDrbd:1 ]

... and it simply stayed like that. Since all the other cluster resources depend
on Admin:promote, nothing else would happen. The relevant drbd messages in the
log file appear to be:

Dec 30 19:24:26 hypatia kernel: events: mcg drbd: 1
Dec 30 19:24:26 hypatia kernel: drbd: initialized. Version: 8.4.1 (api:1/proto:86-100)
Dec 30 19:24:26 hypatia kernel: drbd: GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by root at hypatia.nevis.columbia.edu, 2011-12-21 13:42:51
Dec 30 19:24:26 hypatia kernel: drbd: registered as block device major 147
Dec 30 19:24:26 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stdout)
Dec 30 19:24:26 hypatia kernel: igb: eth1 NIC Link is Down
Dec 30 19:24:27 hypatia kernel: d-con admin: Starting worker thread (from drbdsetup [7213])
Dec 30 19:24:27 hypatia kernel: block drbd1: disk( Diskless -> Attaching )
Dec 30 19:24:27 hypatia kernel: d-con admin: Method to ensure write ordering: barrier
Dec 30 19:24:27 hypatia kernel: block drbd1: max BIO size = 130560
Dec 30 19:24:27 hypatia kernel: block drbd1: drbd_bm_resize called with capacity == 1953460176
Dec 30 19:24:27 hypatia kernel: block drbd1: resync bitmap: bits=244182522 words=3815352 pages=7452
Dec 30 19:24:27 hypatia kernel: block drbd1: size = 931 GB (976730088 KB)
Dec 30 19:24:27 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stderr) Marked additional 1028 MB as out-of-sync based on AL.
Dec 30 19:24:27 hypatia crmd: [5803]: info: process_lrm_event: LRM operation WorkDrbd:0_start_0 (call=49, rc=0, cib-update=56, confirmed=true) ok
Dec 30 19:24:28 hypatia kernel: block drbd1: bitmap READ of 7452 pages took 131 jiffies
Dec 30 19:24:28 hypatia kernel: block drbd1: recounting of set bits took additional 21 jiffies
Dec 30 19:24:28 hypatia kernel: block drbd1: 1028 MB (263168 bits) marked out-of-sync by on disk bit-map.
Dec 30 19:24:28 hypatia kernel: block drbd1: disk( Attaching -> Consistent )
Dec 30 19:24:28 hypatia kernel: block drbd1: attached to UUIDs A82875A514F576EB:0000000000000000:5283A3879DE4DED1:5282A3879DE4DED1
Dec 30 19:24:28 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stdout)
Dec 30 19:24:28 hypatia kernel: d-con admin: conn( StandAlone -> Unconnected )
Dec 30 19:24:28 hypatia kernel: d-con admin: Starting receiver thread (from drbd_w_admin [7214])
Dec 30 19:24:28 hypatia kernel: d-con admin: receiver (re)started
Dec 30 19:24:28 hypatia kernel: d-con admin: conn( Unconnected -> WFConnection )
Dec 30 19:24:28 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stdout)
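
To make the state changes in that excerpt easier to see, here is a small
Python sketch that extracts the "field( Old -> New )" transitions from the
kernel lines and keeps the last value of each field. The sample lines are
copied from the log above; the function name final_states is made up, and the
regular expression is only an assumption about the message format as it
appears in this excerpt. On these lines the disk ends up "Consistent" and the
connection ends up "WFConnection" (waiting for the peer), and no role
transition appears at all.

```python
import re

# State-transition lines copied from the log excerpt in this message.
LOG_LINES = [
    "Dec 30 19:24:27 hypatia kernel: block drbd1: disk( Diskless -> Attaching )",
    "Dec 30 19:24:28 hypatia kernel: block drbd1: disk( Attaching -> Consistent )",
    "Dec 30 19:24:28 hypatia kernel: d-con admin: conn( StandAlone -> Unconnected )",
    "Dec 30 19:24:28 hypatia kernel: d-con admin: conn( Unconnected -> WFConnection )",
]

# Matches e.g. "disk( Attaching -> Consistent )" as (field, old, new).
TRANSITION = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")


def final_states(lines):
    """Return {field: last state seen}, e.g. {"disk": ..., "conn": ...}."""
    states = {}
    for line in lines:
        for field, _old, new in TRANSITION.findall(line):
            states[field] = new
    return states


if __name__ == "__main__":
    for field, state in final_states(LOG_LINES).items():
        print(f"{field}: {state}")
```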

I noticed today a post on Linux-HA that appears to be the same problem as mine:

<http://www.gossamer-threads.com/lists/drbd/users/19943>

Unfortunately, no resolution to the problem was posted in that thread.

Any thoughts?
-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
