I didn't get a response up-thread, so please forgive my sin of top-posting as I
ask a more restricted question: I've re-posted a section of my log file below.

What I suspect happened is this:

- I have two servers running DRBD+Pacemaker, hypatia (master) and orestes (slave).
- Master crashes due to a power outage. It's STONITHed.
- Slave becomes master, then it too crashes due to power outage.
- I bring up both systems, but things are confused and there are multiple
  STONITHs, even a case when both systems STONITH each other.
- I go more slowly and just bring up hypatia.
- hypatia starts Pacemaker which starts up DRBD.
- What I think the log says is that hypatia DRBD sees that some of its sectors
  are out-of-sync. It waits for the slave to come back to sync them before it
  will allow the resource to be promoted to master.

Is this scenario consistent with these log entries, or am I way off course?

Dec 30 19:24:26 hypatia kernel: events: mcg drbd: 1
Dec 30 19:24:26 hypatia kernel: drbd: initialized. Version: 8.4.1 (api:1/proto:86-100)
Dec 30 19:24:26 hypatia kernel: drbd: GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by root at hypatia.nevis.columbia.edu, 2011-12-21 13:42:51
Dec 30 19:24:26 hypatia kernel: drbd: registered as block device major 147
Dec 30 19:24:26 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stdout)
Dec 30 19:24:26 hypatia kernel: igb: eth1 NIC Link is Down
Dec 30 19:24:27 hypatia kernel: d-con admin: Starting worker thread (from drbdsetup [7213])
Dec 30 19:24:27 hypatia kernel: block drbd1: disk( Diskless -> Attaching )
Dec 30 19:24:27 hypatia kernel: d-con admin: Method to ensure write ordering: barrier
Dec 30 19:24:27 hypatia kernel: block drbd1: max BIO size = 130560
Dec 30 19:24:27 hypatia kernel: block drbd1: drbd_bm_resize called with capacity == 1953460176
Dec 30 19:24:27 hypatia kernel: block drbd1: resync bitmap: bits=244182522 words=3815352 pages=7452
Dec 30 19:24:27 hypatia kernel: block drbd1: size = 931 GB (976730088 KB)
Dec 30 19:24:27 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stderr) Marked additional 1028 MB as out-of-sync based on AL.
Dec 30 19:24:27 hypatia crmd: [5803]: info: process_lrm_event: LRM operation WorkDrbd:0_start_0 (call=49, rc=0, cib-update=56, confirmed=true) ok
Dec 30 19:24:28 hypatia kernel: block drbd1: bitmap READ of 7452 pages took 131 jiffies
Dec 30 19:24:28 hypatia kernel: block drbd1: recounting of set bits took additional 21 jiffies
Dec 30 19:24:28 hypatia kernel: block drbd1: 1028 MB (263168 bits) marked out-of-sync by on disk bit-map.
Dec 30 19:24:28 hypatia kernel: block drbd1: disk( Attaching -> Consistent )
Dec 30 19:24:28 hypatia kernel: block drbd1: attached to UUIDs A82875A514F576EB:0000000000000000:5283A3879DE4DED1:5282A3879DE4DED1
Dec 30 19:24:28 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stdout)
Dec 30 19:24:28 hypatia kernel: d-con admin: conn( StandAlone -> Unconnected )
Dec 30 19:24:28 hypatia kernel: d-con admin: Starting receiver thread (from drbd_w_admin [7214])
Dec 30 19:24:28 hypatia kernel: d-con admin: receiver (re)started
Dec 30 19:24:28 hypatia kernel: d-con admin: conn( Unconnected -> WFConnection )
Dec 30 19:24:28 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stdout)

On 1/6/12 11:54 AM, William Seligman wrote:
>> Message: 1
>> Date: Thu, 5 Jan 2012 23:01:24 +0100
>> From: Florian Haas <florian at hastexo.com>
>> Subject: Re: [DRBD-user] DRBD+Pacemaker: Won't promote with only one
>> node
>> To: drbd-user at lists.linbit.com
>> Message-ID:
>> <CAPUexz9jgpq0V49eZmwDH3cThrGEJ7VopXCh1_27yjX-6JC+fQ at mail.gmail.com>
>> Content-Type: text/plain; charset=UTF-8
>>
>> On Thu, Jan 5, 2012 at 6:36 PM, William Seligman
>> <seligman at nevis.columbia.edu> wrote:
>>> Sure. I didn't do this before, since the configuration is complex. I also don't
>>> know which would be more comprehensible, so I've attached both cib.xml and the
>>> result of "crm configure show". I should mention that I'm a lazy bum, so I use
>>> crm-gui to configure corosync; that's why these files look more baroque than usual.
>
>> In the CIB you posted, both nodes are in the "UNCLEAN (offline)"
>> state. In that state, nothing gets promoted, nothing gets started. Are
>> you sure you posted the right CIB dump?
>
> That caused me to panic! I rushed to check that my cluster was running.
> It's OK; an occasional panic is good once in a while.
>
> I assume you're talking about the following lines:
>
> <nodes>
>   <node id="orestes.nevis.columbia.edu" uname="orestes.nevis.columbia.edu"
> type="normal">
>     <instance_attributes id="nodes-orestes.nevis.columbia.edu">
>       <nvpair id="nodes-orestes.nevis.columbia.edu-standby" name="standby"
> value="off"/>
>     </instance_attributes>
>   </node>
>   <node id="hypatia.nevis.columbia.edu" uname="hypatia.nevis.columbia.edu"
> type="normal">
>     <instance_attributes id="nodes-hypatia.nevis.columbia.edu">
>       <nvpair id="nodes-hypatia.nevis.columbia.edu-standby" name="standby"
> value="off"/>
>     </instance_attributes>
>   </node>
> </nodes>
>
> For both nodes, the attribute "standby" is set to "off", which evidently means
> it's not in standby mode, so it's online. According to a web page I found at
> home last night, but can't find at work today, "standby:off" gets added to a
> node's attributes if it was ever marked "UNCLEAN (offline)" and brought online
> again; this has indeed happened with both my nodes.
>
> To confirm, the first few lines from running "crm status" are:
>
> ============
> Last updated: Fri Jan 6 11:29:23 2012
> Stack: openais
> Current DC: hypatia.nevis.columbia.edu - partition with quorum
> Version: 1.0.12-unknown
> 2 Nodes configured, 2 expected votes
> 25 Resources configured.
> ============
>
> Online: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]
>
> Master/Slave Set: Admin
>     Masters: [ hypatia.nevis.columbia.edu ]
>     Slaves: [ orestes.nevis.columbia.edu ]
>
>
> Getting back to my question, I hunted through my logs and saw that during time
> of "one-node; no promotion" crm reported the above status as:
>
> Master/Slave Set: Admin
>     Slaves: [ hypatia.nevis.columbia.edu ]
>     Stopped: [ AdminDrbd:1 ]
>
> ... and it simply stayed like that. Since all the other cluster resources depend
> on Admin:promote, nothing else would happen. The relevant drbd messages in the
> log file appear to be:
>
> Dec 30 19:24:26 hypatia kernel: events: mcg drbd: 1
> Dec 30 19:24:26 hypatia kernel: drbd: initialized. Version: 8.4.1
> (api:1/proto:86-100)
> Dec 30 19:24:26 hypatia kernel: drbd: GIT-hash:
> 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by
> root at hypatia.nevis.columbia.edu, 2011-12-21 13:42:51
> Dec 30 19:24:26 hypatia kernel: drbd: registered as block device major 147
> Dec 30 19:24:26 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stdout)
> Dec 30 19:24:26 hypatia kernel: igb: eth1 NIC Link is Down
> Dec 30 19:24:27 hypatia kernel: d-con admin: Starting worker thread (from
> drbdsetup [7213])
> Dec 30 19:24:27 hypatia kernel: block drbd1: disk( Diskless -> Attaching )
> Dec 30 19:24:27 hypatia kernel: d-con admin: Method to ensure write ordering:
> barrier
> Dec 30 19:24:27 hypatia kernel: block drbd1: max BIO size = 130560
> Dec 30 19:24:27 hypatia kernel: block drbd1: drbd_bm_resize called with capacity
> == 1953460176
> Dec 30 19:24:27 hypatia kernel: block drbd1: resync bitmap: bits=244182522
> words=3815352 pages=7452
> Dec 30 19:24:27 hypatia kernel: block drbd1: size = 931 GB (976730088 KB)
> Dec 30 19:24:27 hypatia lrmd: [5800]: info: RA output:
> (AdminDrbd:0:start:stderr) Marked additional 1028 MB as out-of-sync based on AL.
> Dec 30 19:24:27 hypatia crmd: [5803]: info: process_lrm_event: LRM operation
> WorkDrbd:0_start_0 (call=49, rc=0, cib-update=56, confirmed=true) ok
> Dec 30 19:24:28 hypatia kernel: block drbd1: bitmap READ of 7452 pages took 131
> jiffies
> Dec 30 19:24:28 hypatia kernel: block drbd1: recounting of set bits took
> additional 21 jiffies
> Dec 30 19:24:28 hypatia kernel: block drbd1: 1028 MB (263168 bits) marked
> out-of-sync by on disk bit-map.
> Dec 30 19:24:28 hypatia kernel: block drbd1: disk( Attaching -> Consistent )
> Dec 30 19:24:28 hypatia kernel: block drbd1: attached to UUIDs
> A82875A514F576EB:0000000000000000:5283A3879DE4DED1:5282A3879DE4DED1
> Dec 30 19:24:28 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stdout)
> Dec 30 19:24:28 hypatia kernel: d-con admin: conn( StandAlone -> Unconnected )
> Dec 30 19:24:28 hypatia kernel: d-con admin: Starting receiver thread (from
> drbd_w_admin [7214])
> Dec 30 19:24:28 hypatia kernel: d-con admin: receiver (re)started
> Dec 30 19:24:28 hypatia kernel: d-con admin: conn( Unconnected -> WFConnection )
> Dec 30 19:24:28 hypatia lrmd: [5800]: info: RA output: (AdminDrbd:0:start:stdout)
>
>
> I noticed today a post on Linux-HA that appears to be the same problem as mine:
>
> <http://www.gossamer-threads.com/lists/drbd/users/19943>
>
> Unfortunately no resolution to the problem was posted then.
>
> Any thoughts?

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
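
As a sanity check on the log excerpt in this message, here is a minimal Python sketch that pulls the disk/conn state transitions out of a few of the quoted lines and cross-checks the sizes they report. It rests on two assumptions that the log itself does not state: that each resync-bitmap bit covers 4 KiB, and that the "capacity" figure is counted in 512-byte sectors. The names LOG, transitions and final_states are made up for illustration and are not part of DRBD or Pacemaker.

```python
import re

# Lines copied from the log excerpt in this message (timestamps and host trimmed).
LOG = """\
block drbd1: disk( Diskless -> Attaching )
block drbd1: drbd_bm_resize called with capacity == 1953460176
block drbd1: resync bitmap: bits=244182522 words=3815352 pages=7452
block drbd1: size = 931 GB (976730088 KB)
block drbd1: 1028 MB (263168 bits) marked out-of-sync by on disk bit-map.
block drbd1: disk( Attaching -> Consistent )
d-con admin: conn( StandAlone -> Unconnected )
d-con admin: conn( Unconnected -> WFConnection )
"""

KIB_PER_BIT = 4      # assumption: one bitmap bit tracks 4 KiB of the device
SECTOR_BYTES = 512   # assumption: capacity is given in 512-byte sectors


def transitions(log):
    """Return (kind, old, new) for each 'disk( A -> B )' or 'conn( A -> B )' entry."""
    return re.findall(r"\b(disk|conn)\( (\w+) -> (\w+) \)", log)


def final_states(log):
    """Return the last state reported for each kind of transition."""
    states = {}
    for kind, _old, new in transitions(log):
        states[kind] = new
    return states


capacity_sectors = int(re.search(r"capacity == (\d+)", LOG).group(1))
bitmap_bits = int(re.search(r"bits=(\d+)", LOG).group(1))
oos_bits = int(re.search(r"\((\d+) bits\) marked out-of-sync", LOG).group(1))

size_kib = capacity_sectors * SECTOR_BYTES // 1024   # device size in KiB
oos_mib = oos_bits * KIB_PER_BIT // 1024             # out-of-sync amount in MiB

if __name__ == "__main__":
    print("final states:", final_states(LOG))
    print("device size (KiB):", size_kib)
    print("bitmap coverage (KiB):", bitmap_bits * KIB_PER_BIT)
    print("out-of-sync (MiB):", oos_mib)
```

Under those two assumptions the figures agree with each other: the capacity and the bitmap both work out to the 976730088 KB that the log reports, and 263168 bits correspond to the 1028 MB marked out-of-sync. The last states reported are disk Consistent and conn WFConnection, which is at least compatible with a node waiting for its peer; the sketch does not by itself show why the resource was not promoted.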