[DRBD-user] DRBD stuck with exit code 11 after switchover

Sat Aug 23 23:16:25 CEST 2014

Dear HA savants!

I set up my home cluster with two nodes running Heartbeat/Pacemaker on 
Ubuntu 12.04 and have been using it for a year and a half now without 
any major problems. It uses LVM over DRBD 8.3.11 and is exclusively 
managed via the excellent LCMC (Java GUI). Before saying anything else I 
would like to mention that I am by no means a cluster specialist but 
have mostly been following instructions found here and there. That is 
certainly the reason I am stuck with the following problem. I do not 
even know where to start troubleshooting and would therefore highly 
appreciate any pointers!

All of my resources start LXC containers, each of which depends on a 
filesystem, LVM and a DRBD volume (in that order). They distribute well 
over the two nodes and can be manually stopped and started. The only 
hiccup happens when I need to reboot one of the nodes. When I put it 
into Standby/Switchover state, the resources shut down properly one by 
one but do not migrate to the other node. The DRBD resource does not 
start on the other node but gets stuck. The same happens when I manually 
migrate a DRBD resource to the other node.

After making that node available again everything continues working 
perfectly, no split brain or anything else. In the log I find cycles of 
entries similar to the following.

Aug 23 23:04:29 server101 drbd[3013]: [12402]: ERROR: lxcDNSmasq: Called drbdadm -c /etc/drbd.conf primary lxcDNSmasq
Aug 23 23:04:29 server101 lrmd: [23951]: info: RA output: (res_drbd_1:0:promote:stderr) 1: State change failed: (-7) Refusing to be Primary while peer is not outdated#012Command 'drbdsetup 1 primary' terminated with exit code 11
Aug 23 23:04:30 server101 kernel: [ 5212.397945] block drbd4: helper command: /sbin/drbdadm fence-peer minor-4 exit code 20 (0x1400)
Aug 23 23:04:30 server101 kernel: [ 5212.397951] block drbd4: fence-peer helper broken, returned 20
Aug 23 23:04:30 server101 kernel: [ 5212.399749] block drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 20 (0x1400)
Aug 23 23:04:30 server101 kernel: [ 5212.399751] block drbd1: fence-peer helper broken, returned 20
Aug 23 23:04:30 server101 kernel: [ 5212.399758] block drbd1: State change failed: Refusing to be Primary while peer is not outdated
Aug 23 23:04:30 server101 kernel: [ 5212.399761] block drbd1:   state = { cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown r--- }
Aug 23 23:04:30 server101 kernel: [ 5212.399764] block drbd1:  wanted = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown r--- }
Aug 23 23:04:30 server101 kernel: [ 5212.553056] block drbd4: State change failed: Refusing to be Primary while peer is not outdated
Aug 23 23:04:30 server101 lrmd: [23951]: WARN: res_drbd_1:0:promote process (PID 3013) timed out (try 1).  Killing with signal SIGTERM (15).
Aug 23 23:04:30 server101 crmd: [23954]: ERROR: process_lrm_event: LRM operation res_drbd_1:0_promote_0 (336) Timed Out (timeout=90000ms)
Aug 23 23:04:30 server101 lrmd: [23951]: WARN: operation promote[336] on res_drbd_1:0 for client 23954: pid 3013 timed out
Aug 23 23:04:30 server101 crmd: [23954]: WARN: status_from_rc: Action 50 (res_drbd_1:0_promote_0) on server101 failed (target: 0 vs. rc: -2): Error
Aug 23 23:04:30 server101 lrmd: [23951]: info: RA output: (res_drbd_6:1:promote:stderr) 4: State change failed: (-7) Refusing to be Primary while peer is not outdated#012Command 'drbdsetup 4 primary' terminated with exit code 11

It seems to me that the switchover process does not properly free the 
DRBD resource. However, I am so completely at a loss that I do not know 
which part of the log is relevant or which configuration file I should 
inspect, so please tell me what information could shed some light on 
this behavior.
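
Since the cycles repeat, one way to condense the log before posting more 
of it might be a small tally script. The following is only a rough sketch 
in Python, not part of the cluster setup: the function name summarize and 
the three patterns are my own assumptions, derived solely from the 
wording of the entries quoted above, and it assumes each syslog entry sits 
on a single line as it does in the log file itself. Other DRBD or 
Pacemaker versions may phrase these messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the log excerpt in this post (assumption: the
# wording is specific to this DRBD 8.3 / lrmd / crmd combination).
FENCE_RE = re.compile(
    r"block (drbd\d+): helper command: \S+ fence-peer \S+ exit code (\d+)"
)
STATE_FAIL_RE = re.compile(r"block (drbd\d+): State change failed: (.+)$")
TIMEOUT_RE = re.compile(
    r"LRM operation (\S+) \(\d+\) Timed Out \(timeout=(\d+)ms\)"
)


def summarize(lines):
    """Tally fence-peer exit codes, failed state changes and LRM timeouts.

    `lines` is any iterable of log lines, for example an open file.
    Returns a dict of Counters keyed by (device or operation, detail).
    """
    summary = {
        "fence_peer": Counter(),
        "state_change_failed": Counter(),
        "timeouts": Counter(),
    }
    for line in lines:
        line = line.rstrip("\n")
        m = FENCE_RE.search(line)
        if m:
            # Key: (device, exit code of the fence-peer helper)
            summary["fence_peer"][(m.group(1), int(m.group(2)))] += 1
            continue
        m = STATE_FAIL_RE.search(line)
        if m:
            # Key: (device, reason given by the kernel)
            summary["state_change_failed"][(m.group(1), m.group(2).strip())] += 1
            continue
        m = TIMEOUT_RE.search(line)
        if m:
            # Key: (operation name, timeout in milliseconds)
            summary["timeouts"][(m.group(1), int(m.group(2)))] += 1
    return summary
```

It could be called with an open log file, for instance 
summarize(open(path)) where path points at the syslog file on the node in 
question (the exact location depends on the system), and the resulting 
counters show at a glance which devices report a fence-peer helper exit 
code and which promote operations time out.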

Let me conclude with my sincere respect to the community making HA 
available to everybody (even me)!
Cheers!
Stefan Mueller, Switzerland