Hi everybody,
Today I had the problem that after a reboot, a node would not come back
into the Connected state. It kept showing WFConnection, Disconnected and
similar states. The secondary node did not reconnect, so it was not
syncing. I thought I needed to recreate the device and do a manual
split-brain recovery, but nothing worked: the DRBD device stayed Outdated
or Inconsistent.
I was able to resolve the issue, and I hope the following explanation is
correct (I wrote it from memory) and helps other admins who have
struggled with this issue for days.
Some system info (Debian Stable with backport packages):
--- Cluster Config & Status Dump ---
Created: Thu 30 Sep 13:25:21 CEST 2010 on pilot01-node1 by uid=0(root) gid=0(root) groups=0(root)
System info: Linux pilot01-node1 2.6.28-1-amd64 #1 SMP Wed Feb 18 17:16:12 UTC 2009 x86_64 GNU/Linux
#####################
### 1. DRBD State ###
#####################
drbd driver loaded OK; device status:
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@prolog01-pilot1, 2010-06-07 17:34:47
m:res     cs         ro                 ds                 p  mounted       fstype
0:pilot0  Connected  Primary/Secondary  UpToDate/UpToDate  C  /mnt/cluster  xfs
------------------------------------------------------------------------
That was the initial drbd state of the secondary node:
root@pilot01-node2:/home/nwadmin# cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@prolog01-pilot1, 2010-06-07 17:34:47
 0: cs:WFConnection ro:Secondary/Unknown ds:Inconsistent/DUnknown C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:1951768
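To check the same fields without reading the whole dump, a small shell function can pull the connection state (cs), the roles (ro) or the disk states (ds) for one minor out of /proc/drbd. This is only a sketch that assumes the DRBD 8.3 layout shown above, where each device's status line starts with its minor number (here "0:"); the name drbd_field is made up for illustration.

```shell
# Hypothetical helper: print the value of one status field (cs, ro or ds)
# for a DRBD minor, read from /proc/drbd-style text in the DRBD 8.3 format.
# Usage: drbd_field FIELD MINOR [FILE]   (FILE defaults to /proc/drbd, "-" = stdin)
drbd_field() {
    field=$1
    minor=$2
    file=${3:-/proc/drbd}
    # The minor's status line starts with "<minor>:"; scan its tokens for
    # "<field>:<value>" and print only the value part.
    awk -v m="$minor:" -v f="$field:" '
        $1 == m {
            for (i = 2; i <= NF; i++)
                if (index($i, f) == 1) { print substr($i, length(f) + 1); exit }
        }' "$file"
}
```

For the dump above, drbd_field cs 0 would print WFConnection and drbd_field ds 0 would print Inconsistent/DUnknown.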
------------------------------------------------------------------------
Resolution:
Then I looked at the logs (maybe a little too late) and saw that there
were errors concerning drbd.conf. The ocf:linbit:drbd resource agent uses
/etc/drbd.conf as OCF_RESKEY_drbdconf by default, while my drbdadm tool
always wanted to use /usr/local/etc/drbd.conf (maybe this path is compiled
into drbd-utils; I wasn't able to figure that out). Therefore Pacemaker
always refused to let the secondary node connect to the DRBD device.
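A quick way to spot this kind of mismatch is to compare the two files directly. The sketch below is hypothetical (check_drbd_conf is not part of DRBD or Pacemaker): it takes the path the resource agent reads (/etc/drbd.conf unless drbdconf is set) and the path drbdadm reads on this system (/usr/local/etc/drbd.conf), and reports whether either one is missing or their contents differ. It compares the files byte for byte and does not follow include statements, so two identical top-level files could still pull in different resource definitions.

```shell
# Hypothetical helper: compare the config file the resource agent reads with
# the one drbdadm reads. Defaults are the two paths from this post.
# Returns 0 = same content, 1 = different content, 2 = a file is missing.
# Usage: check_drbd_conf [RA_CONF] [DRBDADM_CONF]
check_drbd_conf() {
    ra_conf=${1:-/etc/drbd.conf}
    adm_conf=${2:-/usr/local/etc/drbd.conf}
    for f in "$ra_conf" "$adm_conf"; do
        if [ ! -r "$f" ]; then
            echo "missing: $f"
            return 2
        fi
    done
    # cmp -s prints nothing and exits non-zero when the files differ
    if cmp -s "$ra_conf" "$adm_conf"; then
        echo "same: $ra_conf $adm_conf"
        return 0
    fi
    echo "differ: $ra_conf $adm_conf"
    return 1
}
```

Running check_drbd_conf without arguments checks the two paths from this post; a result of "differ" or "missing" corresponds to the situation described above.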
What I did to resolve it was:
1. Changed my resource to something like this:
    primitive drbd_pilot0 ocf:linbit:drbd \
        params drbd_resource="pilot0" drbdconf="/usr/local/etc/drbd.conf" \
        operations $id="drbd_pilot0-operations" \
        op monitor interval="15s"
2. Cleaned up all the errors on the resource:
    crm resource cleanup ms_drbd_pilot0
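After the cleanup it can take a while until the resource is connected again and the resync has finished. The following is a hypothetical helper (wait_drbd_connected is not a DRBD or Pacemaker tool) that polls /proc/drbd once per second until the given minor reports cs:Connected, assuming the DRBD 8.3 /proc/drbd format shown in this post. During a resync the state is SyncSource or SyncTarget, so it only returns once the resync is done.

```shell
# Hypothetical helper: poll /proc/drbd-style text until MINOR reports
# cs:Connected; give up after TIMEOUT seconds (default 60).
# Usage: wait_drbd_connected MINOR [TIMEOUT] [FILE]
wait_drbd_connected() {
    minor=$1
    timeout=${2:-60}
    file=${3:-/proc/drbd}
    waited=0
    while :; do
        # Extract the value of the cs: token from the minor's status line
        cs=$(awk -v m="$minor:" '
            $1 == m {
                for (i = 2; i <= NF; i++)
                    if ($i ~ /^cs:/) { print substr($i, 4); exit }
            }' "$file")
        if [ "$cs" = "Connected" ]; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            echo "minor $minor still in ${cs:-unknown state} after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

For example, crm resource cleanup ms_drbd_pilot0 && wait_drbd_connected 0 300 would wait up to five minutes for minor 0.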
------------------------------------------------------------------------
Here is the state of the resync:
root@pilot01-node1:/home/nwadmin# crm resource cleanup res_MySQL
Cleaning up res_MySQL on pilot01-node1
Cleaning up res_MySQL on pilot01-node2
root@pilot01-node1:/home/nwadmin# cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@prolog01-pilot1, 2010-06-07 17:34:47
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
    ns:879617 nr:0 dw:2121 dr:890151 al:5 bm:53 lo:1 pe:39 ua:189 ap:0 ep:1 wo:b oos:1073972
        [========>...........] sync'ed: 45.1% (1073972/1951768)K
        finish: 0:00:22 speed: 47,920 (48,764) K/sec
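The finish estimate can be sanity-checked by hand: dividing the out-of-sync amount by the sync rate gives 1073972 K / 47920 K/s ≈ 22.4 s, which matches "finish: 0:00:22". The sketch below does that division, assuming the first speed figure is the current rate; DRBD computes its own estimate internally from recent progress, so this is only an approximation. The function name resync_eta is hypothetical.

```shell
# Hypothetical helper: rough resync ETA from the out-of-sync amount (KiB)
# and a sync speed (KiB/s), printed as H:MM:SS like the "finish:" field.
# Usage: resync_eta OOS_KIB SPEED_KIB_PER_S
resync_eta() {
    oos=$1      # out-of-sync amount in KiB (the oos: field)
    speed=$2    # sync speed in KiB/s, without thousands separators
    if [ "$speed" -le 0 ]; then
        echo "unknown"
        return 1
    fi
    secs=$((oos / speed))   # integer division: fractional seconds are dropped
    printf '%d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}
```

resync_eta 1073972 47920 prints 0:00:22. The speed has to be passed without the thousands separator (47920, not 47,920).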
------------------------------------------------------------------------
Here is the relevant log output:
crmd: [2326]: info: do_lrm_rsc_op: Performing key=81:445:0:106b9e8c-1ea2-475f-b2c9-ddb3088ea7aa op=drbd_pilot0:1_notify_0 )
Sep 30 11:29:24 s_all@pilot01-node2 lrmd: [2323]: info: RA output: (drbd_pilot0:1:notify:stderr) Warning: resource pilot0 last used config file: /etc/drbd.conf current config file: /usr/local/etc/drbd.conf
Sep 30 11:29:24 s_all@pilot01-node2 lrmd: [2323]: info: RA output: (drbd_pilot0:1:notify:stderr) /usr/lib/ocf/resource.d//linbit/drbd: line 762: [: too many arguments
Sep 30 11:29:24 s_all@pilot01-node2 crmd: [2326]: info: process_lrm_event: LRM operation drbd_pilot0:1_notify_0 (call=17, rc=0, cib-update=26, confirmed=true) ok
Sep 30 11:29:26 s_all@pilot01-node2 lrmd: [2323]: info: rsc:drbd_pilot0:1:18: notify
Sep 30 11:29:26 s_all@pilot01-node2 crmd: [2326]: info: do_lrm_rsc_op: Performing key=79:448:0:106b9e8c-1ea2-475f-b2c9-ddb3088ea7aa op=drbd_pilot0:1_notify_0 )
Sep 30 11:29:26 s_all@pilot01-node2 crmd: [2326]: info: process_lrm_event: LRM operation drbd_pilot0:1_notify_0 (call=18, rc=0, cib-update=27, confirmed=true) ok
Sep 30 11:29:28 s_all@pilot01-node2 lrmd: [2323]: info: rsc:drbd_pilot0:1:19: notify
Sep 30 11:29:28 s_all@pilot01-node2 crmd: [2326]: info: do_lrm_rsc_op: Performing key=79:451:0:106b9e8c-1ea2-475f-b2c9-ddb3088ea7aa op=drbd_pilot0:1_notify_0 )
Sep 30 11:29:28 s_all@pilot01-node2 crmd: [2326]: info: process_lrm_event: LRM operation drbd_pilot0:1_notify_0 (call=19, rc=0, cib-update=28, confirmed=true) ok
Sep 30 11:29:29 s_all@pilot01-node2 lrmd: [2323]: info: rsc:drbd_pilot0:1:20: notify
Sep 30 11:29:29 s_all@pilot01-node2 crmd: [2326]: info: do_lrm_rsc_op: Performing key=79:454:0:106b9e8c-1ea2-475f-b2c9-ddb3088ea7aa op=drbd_pilot0:1_notify_0 )
Sep 30 11:29:29 s_all@pilot01-node2 crmd: [2326]: info: process_lrm_event: LRM operation drbd_pilot0:1_notify_0 (call=20, rc=0, cib-update=29, confirmed=true) ok
Sep 30 11:29:36 s_all@pilot01-node2 lrmd: [2323]: info: RA output: (drbd_pilot0:1:monitor:stderr) Warning: resource pilot0 last used config file: /usr/local/etc/drbd.conf current config file: /etc/drbd.conf
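Warnings like the two "last used config file" lines above are easy to miss in a busy log. The hypothetical function below (drbd_conf_warnings is not an existing tool) filters log text for that warning and prints the resource together with the old and new config paths, assuming each warning sits on a single line as in this excerpt. Paths that flip back and forth, as they do here between 11:29:24 and 11:29:36, suggest that drbdadm and the resource agent are reading different files.

```shell
# Hypothetical helper: print "resource: last_used -> current" for every
# DRBD config-file warning found in log text read from stdin.
# Assumes the whole warning is on one line, as in the excerpt above.
drbd_conf_warnings() {
    sed -n 's/.*resource \([^ ]*\) last used config file: \([^ ]*\) current config file: \([^ ]*\).*/\1: \2 -> \3/p'
}
```

For example, drbd_conf_warnings < /var/log/syslog, assuming syslog is written to /var/log/syslog as on a default Debian install.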
Kind Regards,
Sebastian