Dear HA savants!
I set up my home cluster with two nodes running Heartbeat/Pacemaker on
Ubuntu 12.04 and have been using it for a year and a half now without
any major problems. It uses LVM over DRBD 8.3.11 and is exclusively
managed via the excellent LCMC (Java GUI). Before saying anything else I
would like to mention that I am by no means a cluster specialist but
have mostly been following instructions found here and there. That is
certainly why I am stuck with the following problem. I do not even know
where to start troubleshooting and would therefore highly appreciate any
pointers!
Each of my resources starts an LXC container that depends on a
filesystem, an LVM volume and a DRBD volume (in that order). They
distribute well over the two nodes and can be manually stopped and
started. The only hiccup happens when I need to reboot one of the nodes.
When I put it into Standby/Switchover state, the resources shut down
properly one by one but they do not migrate over to the other node. The
DRBD resource does not start up on the other node but gets stuck. The
same happens when I manually migrate a DRBD resource to the other node.
After making that node available again everything continues working
perfectly, no split brain or anything else. In the log I find cycles of
entries similar to the following.
Aug 23 23:04:29 server101 drbd[3013]: [12402]: ERROR: lxcDNSmasq: Called drbdadm -c /etc/drbd.conf primary lxcDNSmasq
Aug 23 23:04:29 server101 lrmd: [23951]: info: RA output: (res_drbd_1:0:promote:stderr) 1: State change failed: (-7) Refusing to be Primary while peer is not outdated#012Command 'drbdsetup 1 primary' terminated with exit code 11
Aug 23 23:04:30 server101 kernel: [ 5212.397945] block drbd4: helper command: /sbin/drbdadm fence-peer minor-4 exit code 20 (0x1400)
Aug 23 23:04:30 server101 kernel: [ 5212.397951] block drbd4: fence-peer helper broken, returned 20
Aug 23 23:04:30 server101 kernel: [ 5212.399749] block drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 20 (0x1400)
Aug 23 23:04:30 server101 kernel: [ 5212.399751] block drbd1: fence-peer helper broken, returned 20
Aug 23 23:04:30 server101 kernel: [ 5212.399758] block drbd1: State change failed: Refusing to be Primary while peer is not outdated
Aug 23 23:04:30 server101 kernel: [ 5212.399761] block drbd1: state = { cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown r--- }
Aug 23 23:04:30 server101 kernel: [ 5212.399764] block drbd1: wanted = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown r--- }
Aug 23 23:04:30 server101 kernel: [ 5212.553056] block drbd4: State change failed: Refusing to be Primary while peer is not outdated
Aug 23 23:04:30 server101 lrmd: [23951]: WARN: res_drbd_1:0:promote process (PID 3013) timed out (try 1). Killing with signal SIGTERM (15).
Aug 23 23:04:30 server101 crmd: [23954]: ERROR: process_lrm_event: LRM operation res_drbd_1:0_promote_0 (336) Timed Out (timeout=90000ms)
Aug 23 23:04:30 server101 lrmd: [23951]: WARN: operation promote[336] on res_drbd_1:0 for client 23954: pid 3013 timed out
Aug 23 23:04:30 server101 crmd: [23954]: WARN: status_from_rc: Action 50 (res_drbd_1:0_promote_0) on server101 failed (target: 0 vs. rc: -2): Error
Aug 23 23:04:30 server101 lrmd: [23951]: info: RA output: (res_drbd_6:1:promote:stderr) 4: State change failed: (-7) Refusing to be Primary while peer is not outdated#012Command 'drbdsetup 4 primary' terminated with exit code 11
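To make the excerpt easier to read, here is a minimal Python sketch that compares the "state" and "wanted" lines for drbd1 and decodes the value shown next to the fence-peer exit code. The function names are made up for illustration, and reading 0x1400 as a raw wait status whose high byte carries the exit code is my assumption, not something the log states.

```python
import re

# Two kernel log lines copied from the excerpt above (drbd1).
STATE = "block drbd1: state = { cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown r--- }"
WANTED = "block drbd1: wanted = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown r--- }"


def parse_drbd_state(line):
    """Return the key:value fields found between the braces of a state line."""
    body = re.search(r"\{(.*)\}", line).group(1)
    # Tokens without a colon (such as the flags "r---") are skipped.
    return dict(tok.split(":", 1) for tok in body.split() if ":" in tok)


def diff_states(state_line, wanted_line):
    """Return {field: (current, wanted)} for every field that differs."""
    current = parse_drbd_state(state_line)
    wanted = parse_drbd_state(wanted_line)
    return {k: (current[k], wanted[k]) for k in current if current[k] != wanted.get(k)}


def exit_code_from_wait_status(status):
    """Assumption: the exit code sits in the high byte of the raw status."""
    return (status >> 8) & 0xFF


if __name__ == "__main__":
    print(diff_states(STATE, WANTED))
    print(exit_code_from_wait_status(0x1400))
```

On these two lines only the role field (ro) differs: the node is asked to go from Secondary to Primary while the connection state is WFConnection and the peer's disk state is DUnknown. Under the assumption above, 0x1400 corresponds to the same exit code 20 that the kernel reports for the fence-peer helper.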
It seems to me that the switchover process does not properly free the
DRBD resource. However, I am so completely at a loss that I do not even
know which part of the log is relevant or which configuration file I
should inspect, so please tell me what information could shed some light
on this behavior.
Let me conclude with my sincere respect to the community making HA
available to everybody (even me)!
Cheers!
Stefan Mueller, Switzerland