Dear HA savants!

I set up my home cluster with two nodes running Heartbeat/Pacemaker on Ubuntu 12.04 and have been using it for a year and a half now without any major problems. It uses LVM over DRBD 8.3.11 and is managed exclusively via the excellent LCMC (Java GUI).

Before saying anything else, I would like to mention that I am by no means a cluster specialist but have mostly been following instructions found here and there. That is certainly the reason I am stuck with the following problem. I do not even know where to start troubleshooting and would therefore highly appreciate any pointers!

All of my resources start LXC containers, each of which depends on a filesystem, an LVM volume and a DRBD volume (in that order). They distribute well over the two nodes and can be stopped and started manually.

The only hiccup happens when I need to reboot one of the nodes. When I put it into Standby/Switchover state, the resources shut down properly one by one, but they do not migrate over to the other node. The DRBD resource does not start on the other node but is stuck. The same happens when I manually migrate a DRBD resource to the other node. After making that node available again, everything continues working perfectly, with no split brain or anything else.

In the log I find cycles of entries similar to the following:

Aug 23 23:04:29 server101 drbd[3013]: [12402]: ERROR: lxcDNSmasq: Called drbdadm -c /etc/drbd.conf primary lxcDNSmasq
Aug 23 23:04:29 server101 lrmd: [23951]: info: RA output: (res_drbd_1:0:promote:stderr) 1: State change failed: (-7) Refusing to be Primary while peer is not outdated#012Command 'drbdsetup 1 primary' terminated with exit code 11
Aug 23 23:04:30 server101 kernel: [ 5212.397945] block drbd4: helper command: /sbin/drbdadm fence-peer minor-4 exit code 20 (0x1400)
Aug 23 23:04:30 server101 kernel: [ 5212.397951] block drbd4: fence-peer helper broken, returned 20
Aug 23 23:04:30 server101 kernel: [ 5212.399749] block drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 20 (0x1400)
Aug 23 23:04:30 server101 kernel: [ 5212.399751] block drbd1: fence-peer helper broken, returned 20
Aug 23 23:04:30 server101 kernel: [ 5212.399758] block drbd1: State change failed: Refusing to be Primary while peer is not outdated
Aug 23 23:04:30 server101 kernel: [ 5212.399761] block drbd1: state = { cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown r--- }
Aug 23 23:04:30 server101 kernel: [ 5212.399764] block drbd1: wanted = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown r--- }
Aug 23 23:04:30 server101 kernel: [ 5212.553056] block drbd4: State change failed: Refusing to be Primary while peer is not outdated
Aug 23 23:04:30 server101 lrmd: [23951]: WARN: res_drbd_1:0:promote process (PID 3013) timed out (try 1). Killing with signal SIGTERM (15).
Aug 23 23:04:30 server101 crmd: [23954]: ERROR: process_lrm_event: LRM operation res_drbd_1:0_promote_0 (336) Timed Out (timeout=90000ms)
Aug 23 23:04:30 server101 lrmd: [23951]: WARN: operation promote[336] on res_drbd_1:0 for client 23954: pid 3013 timed out
Aug 23 23:04:30 server101 crmd: [23954]: WARN: status_from_rc: Action 50 (res_drbd_1:0_promote_0) on server101 failed (target: 0 vs. rc: -2): Error
Aug 23 23:04:30 server101 lrmd: [23951]: info: RA output: (res_drbd_6:1:promote:stderr) 4: State change failed: (-7) Refusing to be Primary while peer is not outdated#012Command 'drbdsetup 4 primary' terminated with exit code 11

It seems to me that the switchover process does not properly free the DRBD resource. However, I am so totally at a loss that I do not even know which part of the log is relevant or which configuration file one should inspect, so please tell me what information could shed some light on this behavior.

Let me conclude with my sincere respect to the community for making HA available to everybody (even me)!

Cheers!
Stefan Mueller, Switzerland