Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

I'm doing some failover tests of an active/passive MySQL over DRBD configuration. The current setup uses drbd 8.0.8 with heartbeat 2.1.3 (V2 crm style, drbddisk RA).

One of my failover scenarios involves pulling the AC power on the current active node and checking that the passive node gets promoted. Unfortunately for me, it fails when the soon-to-be-active node tries to promote drbd to primary. My other failover tests (unplugging Ethernet cables, rebooting, and so on) all worked fine.

Here is a log extract of the issue. First I unplug the power from db2:

Feb 26 17:13:52 db1 kernel: tg3: eth2: Link is down.

and drbd notices the peer disappeared:

Feb 26 17:14:01 db1 kernel: drbd0: PingAck did not arrive in time.
Feb 26 17:14:01 db1 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Feb 26 17:14:01 db1 kernel: drbd0: asender terminated
Feb 26 17:14:01 db1 kernel: drbd0: short read expecting header on sock: r=-512
Feb 26 17:14:01 db1 kernel: drbd0: tl_clear()
Feb 26 17:14:01 db1 kernel: drbd0: Connection closed
Feb 26 17:14:01 db1 kernel: drbd0: Writing meta data super block now.
Feb 26 17:14:01 db1 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Feb 26 17:14:01 db1 kernel: drbd0: receiver terminated
Feb 26 17:14:01 db1 kernel: drbd0: receiver (re)started
Feb 26 17:14:01 db1 kernel: drbd0: conn( Unconnected -> WFConnection )

heartbeat notices it too:

Feb 26 17:14:02 db1 heartbeat: [15844]: WARN: node db2: is dead
Feb 26 17:14:02 db1 heartbeat: [15844]: info: Link db2:eth0 dead.

...[snipped] here heartbeat starts the other resources (namely an IPaddr2 RA)...

Then heartbeat tries to start drbd:

Feb 26 17:14:02 db1 tengine: [16091]: info: send_rsc_command: Initiating action 6: drbd_start_0 on db1
Feb 26 17:14:02 db1 lrmd: [15859]: info: rsc:drbd: start

and drbd-peer-outdater is started to outdate db2:

Feb 26 17:14:02 db1 drbd-peer-outdater: [19930]: debug: drbd peer: db2
Feb 26 17:14:02 db1 drbd-peer-outdater: [19930]: debug: drbd resource: mysql
Feb 26 17:14:02 db1 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
Feb 26 17:14:02 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Connecting channel
Feb 26 17:14:02 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Client outdater (0x611050) connected
Feb 26 17:14:02 db1 /usr/lib/heartbeat/dopd: [15865]: debug: invoked: outdater
Feb 26 17:14:02 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Processing msg from outdater
Feb 26 17:14:02 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Got message from (drbd-peer-outdater). (peer: db2, res :mysql)
Feb 26 17:14:02 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Starting node walk
Feb 26 17:14:03 db1 drbd-peer-outdater: [19930]: debug: message: outdater_rc, db1
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: WARN: Cluster node: db2: status: dead

and then, puzzlingly, spits an error:

Feb 26 17:14:03 db1 kernel: drbd0: outdate-peer helper broken, returned 20
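I suppose I could reproduce the handler call by hand on db1 while db2 is down, to see where that 20 comes from. This is only a guess at the invocation: I'm assuming drbd-peer-outdater picks up DRBD_RESOURCE and DRBD_PEER from the environment, the way drbdadm-spawned handlers normally do (if your build takes command-line flags instead, adjust accordingly):

  # run as root on db1, with db2 powered off
  DRBD_RESOURCE=mysql DRBD_PEER=db2 /usr/lib/heartbeat/drbd-peer-outdater
  echo "exit code: $?"    # 20 here would match the kernel's complaint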
The kernel then retries:

Feb 26 17:14:03 db1 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: outdater: no message this time
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Processed 1 messages
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: destroying connection: (null)
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Deleting outdater (0x611050) from mainloop
Feb 26 17:14:03 db1 drbd-peer-outdater: [19955]: debug: drbd peer: db2
Feb 26 17:14:03 db1 drbd-peer-outdater: [19955]: debug: drbd resource: mysql
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Connecting channel
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Client outdater (0x611190) connected
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: invoked: outdater
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Processing msg from outdater
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Got message from (drbd-peer-outdater). (peer: db2, res :mysql)
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Starting node walk
Feb 26 17:14:03 db1 drbd-peer-outdater: [19955]: debug: message: outdater_rc, db1
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: WARN: Cluster node: db2: status: dead
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Processed 1 messages
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: invoked: outdater
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Processed 0 messages
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: destroying connection: (null)
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Deleting outdater (0x611190) from mainloop
Feb 26 17:14:03 db1 kernel: drbd0: outdate-peer helper broken, returned 20
Feb 26 17:14:03 db1 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer

... a few seconds pass with dopd trying to talk to the dead db2 ...

Feb 26 17:14:04 db1 kernel: drbd0: outdate-peer helper broken, returned 20
Feb 26 17:14:04 db1 kernel: drbd0: Creating new current UUID
Feb 26 17:14:04 db1 kernel: drbd0: Writing meta data super block now.

and the drbddisk RA returns _the_ error:

Feb 26 17:14:04 db1 lrmd: [15859]: info: RA output: (drbd:start:stderr) State change failed: (-7) Refusing to be Primary while peer is not outdated
Feb 26 17:14:04 db1 lrmd: [15859]: info: RA output: (drbd:start:stderr) Command '/sbin/drbdsetup
Feb 26 17:14:04 db1 lrmd: [15859]: info: RA output: (drbd:start:stderr) /dev/drbd0 primary' terminated with exit code 11

Then dopd tries again:

Feb 26 17:14:05 db1 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
Feb 26 17:14:05 db1 drbd-peer-outdater: [19964]: debug: drbd peer: db2
Feb 26 17:14:05 db1 drbd-peer-outdater: [19964]: debug: drbd resource: mysql
Feb 26 17:14:05 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Connecting channel
Feb 26 17:14:05 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Client outdater (0x611190) connected
Feb 26 17:14:05 db1 /usr/lib/heartbeat/dopd: [15865]: debug: invoked: outdater
Feb 26 17:14:05 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Processing msg from outdater
Feb 26 17:14:05 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Got message from (drbd-peer-outdater). (peer: db2, res :mysql)
Feb 26 17:14:05 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Starting node walk
Feb 26 17:14:05 db1 /usr/lib/heartbeat/dopd: [15865]: WARN: Cluster node: db2: status: dead
Feb 26 17:14:05 db1 drbd-peer-outdater: [19964]: debug: message: outdater_rc, db1
Feb 26 17:14:05 db1 kernel: drbd0: outdate-peer helper broken, returned 20

with the same outcome. It continues like that for about 10 seconds (5 or 6 more attempts), and then the resource is declared in error.

Now, the question is: how can I solve this particular issue? The peer can never be outdated, since it is completely dead. Is this an issue with drbddisk or with drbd-peer-outdater/dopd? Is it a timeout issue somewhere (i.e. heartbeat doesn't try hard enough to restart drbd)?
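One workaround I'm considering is to wrap the handler so that "peer declared dead by heartbeat" gets reported with an exit code drbd accepts, instead of 20. This is only a sketch: it assumes the drbd 8 outdate-peer exit-code convention (4 = peer outdated, 5 = peer down/unreachable; the drbd.conf man page for your version is authoritative) and uses heartbeat's cl_status to ask whether the peer is dead:

  #!/bin/sh
  # Hypothetical wrapper for /usr/lib/heartbeat/drbd-peer-outdater.
  # If dopd cannot outdate the peer but heartbeat already considers the
  # node dead, report "peer is down" (exit 5) so drbd can go Primary.
  PEER=${DRBD_PEER:-db2}

  /usr/lib/heartbeat/drbd-peer-outdater "$@"
  rc=$?

  case $rc in
      4|5) exit $rc ;;    # recognised outcomes, pass them through
  esac

  # cl_status ships with heartbeat; I expect "nodestatus" to print
  # "dead" for a node heartbeat has given up on.
  if [ "$(cl_status nodestatus "$PEER")" = "dead" ]; then
      exit 5              # peer unreachable and presumed dead
  fi
  exit $rc

The handlers section of drbd.conf would then point at this wrapper instead of drbd-peer-outdater directly.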
If needed, here is the drbd configuration on both nodes:

global {
    usage-count no;
}
common {
    syncer { rate 10M; }
}
resource "mysql" {
    protocol C;
    handlers {
        outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
    }
    startup {
        wfc-timeout      0;
        degr-wfc-timeout 120;    # 2 minutes.
    }
    disk {
        on-io-error pass_on;
        fencing     resource-only;
    }
    net {
        cram-hmac-alg "sha1";
        shared-secret "Replicants";
        after-sb-0pri disconnect;
        after-sb-1pri disconnect;
        after-sb-2pri disconnect;
        rr-conflict   disconnect;
    }
    syncer {
        rate 50M;
        al-extents 257;
    }
    on db1 {
        device /dev/drbd0;
        disk "/dev/disk/by-id/scsi-36001c23000d71d9d0000213b47b57e44";
        address 192.168.25.1:7788;
        flexible-meta-disk internal;
    }
    on db2 {
        device /dev/drbd0;
        disk "/dev/disk/by-id/scsi-36001c23000d71d9d0000213d47b57e4a";
        address 192.168.25.2:7788;
        flexible-meta-disk internal;
    }
}

And the heartbeat ha.cf:

logfacility local0
keepalive 500ms
deadtime 10
warntime 5
initdead 30
mcast eth0 225.0.0.1 694 2 0
bcast eth2
crm on
use_logd on
apiauth cibmon uid=hacluster
respawn hacluster /usr/lib/heartbeat/cibmon -d
apiauth dopd gid=haclient uid=hacluster
respawn hacluster /usr/lib/heartbeat/dopd
ping 172.16.10.188
respawn root /usr/lib/heartbeat/pingd -m 100 -d 5s
respawn root /usr/lib/heartbeat/hbagent
deadping 5
node db1
node db2

The two machines are connected to each other by an Ethernet crossover cable on eth2, and to a switch (whose IP address is the pingd target) on eth0.

Any help would be appreciated :-)

Many thanks,
-- 
Brice Figureau <brice+drbd at daysofwonder.com>
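PS: for the next test run, I suppose I can confirm that only the outdate step is blocking by forcing the promotion by hand on db1 once db2 is powered off. If I read the 8.0 drbdadm syntax correctly, that would be the following (diagnostic only: it bypasses the fencing check, so it would risk diverging data if db2 were in fact alive with newer writes):

  cat /proc/drbd    # expect cs:WFConnection st:Secondary/Unknown
  drbdadm -- --overwrite-data-of-peer primary mysql
  cat /proc/drbd    # should now show st:Primary/Unknown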