[DRBD-user] DRBD+Pacemaker: Won't promote with only one node

Wed Jan 11 20:34:46 CET 2012

I didn't get a response up-thread, so please forgive my sin of top-posting and
let me ask a more restricted question:

I've re-posted a section of my log file below. What I suspect happened is this:

- I have two servers running DRBD+Pacemaker, hypatia (master) and orestes (slave).
- Master crashes due to a power outage. It's STONITHed.
- Slave becomes master, then it too crashes due to power outage.
- I bring up both systems, but things are confused and there are multiple
STONITHs, even a case when both systems STONITH each other.
- I go more slowly and just bring up hypatia.
- hypatia starts Pacemaker which starts up DRBD.
- What I think the log says is that hypatia DRBD sees that some of its sectors
are out-of-sync. It waits for the slave to come back to sync them before it will
allow the resource to be promoted to master.

Is this scenario consistent with these log entries, or am I way off course?

Dec 30 19:24:26 hypatia kernel: events: mcg drbd: 1
Dec 30 19:24:26 hypatia kernel: drbd: initialized. Version: 8.4.1
(api:1/proto:86-100)
Dec 30 19:24:26 hypatia kernel: drbd: GIT-hash:
91b4c048c1a0e06777b5f65d312b38d47abaea80 build by
root at hypatia.nevis.columbia.edu, 2011-12-21 13:42:51
Dec 30 19:24:26 hypatia kernel: drbd: registered as block device major 147
Dec 30 19:24:26 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stdout)
Dec 30 19:24:26 hypatia kernel: igb: eth1 NIC Link is Down
Dec 30 19:24:27 hypatia kernel: d-con admin: Starting worker thread (from
drbdsetup [7213])
Dec 30 19:24:27 hypatia kernel: block drbd1: disk( Diskless -> Attaching )
Dec 30 19:24:27 hypatia kernel: d-con admin: Method to ensure write ordering:
barrier
Dec 30 19:24:27 hypatia kernel: block drbd1: max BIO size = 130560
Dec 30 19:24:27 hypatia kernel: block drbd1: drbd_bm_resize called with capacity
== 1953460176
Dec 30 19:24:27 hypatia kernel: block drbd1: resync bitmap: bits=244182522
words=3815352 pages=7452
Dec 30 19:24:27 hypatia kernel: block drbd1: size = 931 GB (976730088 KB)
Dec 30 19:24:27 hypatia lrmd: [5800]: info: RA output:
(AdminDrbd:0:start:stderr) Marked additional 1028 MB as out-of-sync based on AL.
Dec 30 19:24:27 hypatia crmd: [5803]: info: process_lrm_event: LRM operation
WorkDrbd:0_start_0 (call=49, rc=0, cib-update=56, confirmed=true) ok
Dec 30 19:24:28 hypatia kernel: block drbd1: bitmap READ of 7452 pages took 131
jiffies
Dec 30 19:24:28 hypatia kernel: block drbd1: recounting of set bits took
additional 21 jiffies
Dec 30 19:24:28 hypatia kernel: block drbd1: 1028 MB (263168 bits) marked
out-of-sync by on disk bit-map.
Dec 30 19:24:28 hypatia kernel: block drbd1: disk( Attaching -> Consistent )
Dec 30 19:24:28 hypatia kernel: block drbd1: attached to UUIDs
A82875A514F576EB:0000000000000000:5283A3879DE4DED1:5282A3879DE4DED1
Dec 30 19:24:28 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stdout)
Dec 30 19:24:28 hypatia kernel: d-con admin: conn( StandAlone -> Unconnected )
Dec 30 19:24:28 hypatia kernel: d-con admin: Starting receiver thread (from
drbd_w_admin [7214])
Dec 30 19:24:28 hypatia kernel: d-con admin: receiver (re)started
Dec 30 19:24:28 hypatia kernel: d-con admin: conn( Unconnected -> WFConnection )
Dec 30 19:24:28 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stdout)
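
One way to read these entries mechanically is to pull out the DRBD state
transitions and cross-check the out-of-sync figure. The sketch below is only an
illustration: the helper names (final_states, out_of_sync_mib) are made up for
this example, the log lines are copied from above with wrapped lines rejoined,
and the 4 KiB-per-bit granularity is inferred from the log's own numbers
(976730088 KB / 244182522 bits = 4 KB per bit), not taken from DRBD
documentation.

```python
import re

# Lines copied from the log above (hard-wrapped lines rejoined by hand).
LOG = """\
Dec 30 19:24:27 hypatia kernel: block drbd1: disk( Diskless -> Attaching )
Dec 30 19:24:28 hypatia kernel: block drbd1: 1028 MB (263168 bits) marked out-of-sync by on disk bit-map.
Dec 30 19:24:28 hypatia kernel: block drbd1: disk( Attaching -> Consistent )
Dec 30 19:24:28 hypatia kernel: d-con admin: conn( StandAlone -> Unconnected )
Dec 30 19:24:28 hypatia kernel: d-con admin: conn( Unconnected -> WFConnection )
"""

# Matches entries such as "disk( Attaching -> Consistent )".
TRANSITION = re.compile(r"(disk|conn)\( (\w+) -> (\w+) \)")
# Matches "1028 MB (263168 bits) marked out-of-sync".
OUT_OF_SYNC = re.compile(r"(\d+) MB \((\d+) bits\) marked out-of-sync")


def final_states(log):
    """Return the last state reached for each transition kind (disk, conn)."""
    states = {}
    for kind, _old, new in TRANSITION.findall(log):
        states[kind] = new  # later entries overwrite earlier ones
    return states


def out_of_sync_mib(log, kib_per_bit=4):
    """Return (size reported in the log, size recomputed from the bit count).

    kib_per_bit=4 is an assumption inferred from the log itself.
    """
    match = OUT_OF_SYNC.search(log)
    reported_mb, bits = int(match.group(1)), int(match.group(2))
    return reported_mb, bits * kib_per_bit // 1024


states = final_states(LOG)
reported, computed = out_of_sync_mib(LOG)
print(states)              # last disk and connection state seen in the log
print(reported, computed)  # reported vs. recomputed out-of-sync size in MB
```

On this excerpt the last states are disk=Consistent and conn=WFConnection, and
the recomputed size agrees with the reported 1028 MB. That at least matches the
picture of a node that has attached its disk, never reaches UpToDate, and is
still waiting for its peer to connect.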

On 1/6/12 11:54 AM, William Seligman wrote:
>> Message: 1
>> Date: Thu, 5 Jan 2012 23:01:24 +0100
>> From: Florian Haas <florian at hastexo.com>
>> Subject: Re: [DRBD-user] DRBD+Pacemaker: Won't promote with only one
>> 	node
>> To: drbd-user at lists.linbit.com
>> Message-ID:
>> 	<CAPUexz9jgpq0V49eZmwDH3cThrGEJ7VopXCh1_27yjX-6JC+fQ at mail.gmail.com>
>> Content-Type: text/plain; charset=UTF-8
>>
>> On Thu, Jan 5, 2012 at 6:36 PM, William Seligman
>> <seligman at nevis.columbia.edu> wrote:
>>> Sure. I didn't do this before, since the configuration is complex. I also don't
>>> know which would be more comprehensible, so I've attached both cib.xml and the
>>> result of "crm configure show". I should mention that I'm a lazy bum, so I use
>>> crm-gui to configure corosync; that's why these files look more baroque than usual.
> 
>> In the CIB you posted, both nodes are in the "UNCLEAN (offline)"
>> state. In that state, nothing gets promoted, nothing gets started. Are
>> you sure you posted the right CIB dump?
> 
> That caused me to panic! I rushed to check that my cluster was running.
> It's OK; an occasional panic is good once in a while.
> 
> I assume you're talking about the following lines:
> 
>     <nodes>
>       <node id="orestes.nevis.columbia.edu" uname="orestes.nevis.columbia.edu"
> type="normal">
>         <instance_attributes id="nodes-orestes.nevis.columbia.edu">
>           <nvpair id="nodes-orestes.nevis.columbia.edu-standby" name="standby"
> value="off"/>
>         </instance_attributes>
>       </node>
>       <node id="hypatia.nevis.columbia.edu" uname="hypatia.nevis.columbia.edu"
> type="normal">
>         <instance_attributes id="nodes-hypatia.nevis.columbia.edu">
>           <nvpair id="nodes-hypatia.nevis.columbia.edu-standby" name="standby"
> value="off"/>
>         </instance_attributes>
>       </node>
>     </nodes>
> 
> For both nodes, the attribute "standby" is set to "off", which evidently means
> it's not in standby mode, so it's online. According to a web page I found at
> home last night, but can't find at work today, "standby:off" gets added to a
> node's attributes if it was ever marked "UNCLEAN (offline)" and brought online
> again; this has indeed happened with both my nodes.
> 
> To confirm, the first few lines from running "crm status" are:
> 
> ============
> Last updated: Fri Jan  6 11:29:23 2012
> Stack: openais
> Current DC: hypatia.nevis.columbia.edu - partition with quorum
> Version: 1.0.12-unknown
> 2 Nodes configured, 2 expected votes
> 25 Resources configured.
> ============
> 
> Online: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]
> 
>  Master/Slave Set: Admin
>      Masters: [ hypatia.nevis.columbia.edu ]
>      Slaves: [ orestes.nevis.columbia.edu ]
> 
> 
> Getting back to my question, I hunted through my logs and saw that during the
> time of "one-node; no promotion" crm reported the above status as:
> 
>  Master/Slave Set: Admin
>      Slaves: [ hypatia.nevis.columbia.edu ]
>      Stopped: [ AdminDrbd:1 ]
> 
> ... and it simply stayed like that. Since all the other cluster resources depend
> on Admin:promote, nothing else would happen. The relevant drbd messages in the
> log file appear to be:
> 
> Dec 30 19:24:26 hypatia kernel: events: mcg drbd: 1
> Dec 30 19:24:26 hypatia kernel: drbd: initialized. Version: 8.4.1
> (api:1/proto:86-100)
> Dec 30 19:24:26 hypatia kernel: drbd: GIT-hash:
> 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by
> root at hypatia.nevis.columbia.edu, 2011-12-21 13:42:51
> Dec 30 19:24:26 hypatia kernel: drbd: registered as block device major 147
> Dec 30 19:24:26 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stdout)
> Dec 30 19:24:26 hypatia kernel: igb: eth1 NIC Link is Down
> Dec 30 19:24:27 hypatia kernel: d-con admin: Starting worker thread (from
> drbdsetup [7213])
> Dec 30 19:24:27 hypatia kernel: block drbd1: disk( Diskless -> Attaching )
> Dec 30 19:24:27 hypatia kernel: d-con admin: Method to ensure write ordering:
> barrier
> Dec 30 19:24:27 hypatia kernel: block drbd1: max BIO size = 130560
> Dec 30 19:24:27 hypatia kernel: block drbd1: drbd_bm_resize called with capacity
> == 1953460176
> Dec 30 19:24:27 hypatia kernel: block drbd1: resync bitmap: bits=244182522
> words=3815352 pages=7452
> Dec 30 19:24:27 hypatia kernel: block drbd1: size = 931 GB (976730088 KB)
> Dec 30 19:24:27 hypatia lrmd: [5800]: info: RA output:
> (AdminDrbd:0:start:stderr) Marked additional 1028 MB as out-of-sync based on AL.
> Dec 30 19:24:27 hypatia crmd: [5803]: info: process_lrm_event: LRM operation
> WorkDrbd:0_start_0 (call=49, rc=0, cib-update=56, confirmed=true) ok
> Dec 30 19:24:28 hypatia kernel: block drbd1: bitmap READ of 7452 pages took 131
> jiffies
> Dec 30 19:24:28 hypatia kernel: block drbd1: recounting of set bits took
> additional 21 jiffies
> Dec 30 19:24:28 hypatia kernel: block drbd1: 1028 MB (263168 bits) marked
> out-of-sync by on disk bit-map.
> Dec 30 19:24:28 hypatia kernel: block drbd1: disk( Attaching -> Consistent )
> Dec 30 19:24:28 hypatia kernel: block drbd1: attached to UUIDs
> A82875A514F576EB:0000000000000000:5283A3879DE4DED1:5282A3879DE4DED1
> Dec 30 19:24:28 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stdout)
> Dec 30 19:24:28 hypatia kernel: d-con admin: conn( StandAlone -> Unconnected )
> Dec 30 19:24:28 hypatia kernel: d-con admin: Starting receiver thread (from
> drbd_w_admin [7214])
> Dec 30 19:24:28 hypatia kernel: d-con admin: receiver (re)started
> Dec 30 19:24:28 hypatia kernel: d-con admin: conn( Unconnected -> WFConnection )
> Dec 30 19:24:28 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stdout)
> 
> 
> I noticed today a post on Linux-HA that appears to be the same problem as mine:
> 
> <http://www.gossamer-threads.com/lists/drbd/users/19943>
> 
> Unfortunately no resolution to the problem was posted then.
> 
> Any thoughts?

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
