Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

I'm doing some failover tests of an active/passive MySQL over DRBD configuration. The current setup uses drbd 8.0.8 with heartbeat 2.1.3 (V2 crm style, drbddisk RA).

One of my failover scenarios involves pulling the AC power on the current active node and checking that the passive node gets promoted. Unfortunately for me, it fails when the soon-to-be-active node tries to promote drbd to primary. My other failover tests (unplugging Ethernet cables, rebooting, and so on) all worked fine.

Here is a log extract of the issue. First I unplug the power from db2:

Feb 26 17:13:52 db1 kernel: tg3: eth2: Link is down.

and drbd notices the peer disappeared:

Feb 26 17:14:01 db1 kernel: drbd0: PingAck did not arrive in time.
Feb 26 17:14:01 db1 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Feb 26 17:14:01 db1 kernel: drbd0: asender terminated
Feb 26 17:14:01 db1 kernel: drbd0: short read expecting header on sock: r=-512
Feb 26 17:14:01 db1 kernel: drbd0: tl_clear()
Feb 26 17:14:01 db1 kernel: drbd0: Connection closed
Feb 26 17:14:01 db1 kernel: drbd0: Writing meta data super block now.
Feb 26 17:14:01 db1 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Feb 26 17:14:01 db1 kernel: drbd0: receiver terminated
Feb 26 17:14:01 db1 kernel: drbd0: receiver (re)started
Feb 26 17:14:01 db1 kernel: drbd0: conn( Unconnected -> WFConnection )

heartbeat notices it too:

Feb 26 17:14:02 db1 heartbeat: [15844]: WARN: node db2: is dead
Feb 26 17:14:02 db1 heartbeat: [15844]: info: Link db2:eth0 dead.

...[snipped] here heartbeat starts the other resources (namely an IPaddr2 RA)...

Then heartbeat tries to start drbd:

Feb 26 17:14:02 db1 tengine: [16091]: info: send_rsc_command: Initiating action 6: drbd_start_0 on db1
Feb 26 17:14:02 db1 lrmd: [15859]: info: rsc:drbd: start

and drbd-peer-outdater is started to outdate db2:

Feb 26 17:14:02 db1 drbd-peer-outdater: [19930]: debug: drbd peer: db2
Feb 26 17:14:02 db1 drbd-peer-outdater: [19930]: debug: drbd resource: mysql
Feb 26 17:14:02 db1 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
Feb 26 17:14:02 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Connecting channel
Feb 26 17:14:02 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Client outdater (0x611050) connected
Feb 26 17:14:02 db1 /usr/lib/heartbeat/dopd: [15865]: debug: invoked: outdater
Feb 26 17:14:02 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Processing msg from outdater
Feb 26 17:14:02 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Got message from (drbd-peer-outdater). (peer: db2, res :mysql)
Feb 26 17:14:02 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Starting node walk
Feb 26 17:14:03 db1 drbd-peer-outdater: [19930]: debug: message: outdater_rc, db1
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: WARN: Cluster node: db2: status: dead

and then, puzzlingly, spits an error:

Feb 26 17:14:03 db1 kernel: drbd0: outdate-peer helper broken, returned 20
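I suppose I could reproduce the handler call by hand on db1 while db2 is down, to see where that 20 comes from. This is only a guess at the invocation: I'm assuming drbd-peer-outdater picks up DRBD_RESOURCE and DRBD_PEER from the environment, the way drbdadm-spawned handlers normally do (if your build takes command-line flags instead, adjust accordingly):

  # run as root on db1, with db2 powered off
  DRBD_RESOURCE=mysql DRBD_PEER=db2 /usr/lib/heartbeat/drbd-peer-outdater
  echo "exit code: $?"    # 20 here would match the kernel's complaint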
The kernel then retries:

Feb 26 17:14:03 db1 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: outdater: no message this time
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Processed 1 messages
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: destroying connection: (null)
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Deleting outdater (0x611050) from mainloop
Feb 26 17:14:03 db1 drbd-peer-outdater: [19955]: debug: drbd peer: db2
Feb 26 17:14:03 db1 drbd-peer-outdater: [19955]: debug: drbd resource: mysql
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Connecting channel
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Client outdater (0x611190) connected
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: invoked: outdater
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Processing msg from outdater
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Got message from (drbd-peer-outdater). (peer: db2, res :mysql)
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Starting node walk
Feb 26 17:14:03 db1 drbd-peer-outdater: [19955]: debug: message: outdater_rc, db1
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: WARN: Cluster node: db2: status: dead
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Processed 1 messages
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: invoked: outdater
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Processed 0 messages
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: destroying connection: (null)
Feb 26 17:14:03 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Deleting outdater (0x611190) from mainloop
Feb 26 17:14:03 db1 kernel: drbd0: outdate-peer helper broken, returned 20
Feb 26 17:14:03 db1 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer

... a few seconds pass with dopd trying to talk to the dead db2 ...

Feb 26 17:14:04 db1 kernel: drbd0: outdate-peer helper broken, returned 20
Feb 26 17:14:04 db1 kernel: drbd0: Creating new current UUID
Feb 26 17:14:04 db1 kernel: drbd0: Writing meta data super block now.

and the drbddisk RA returns _the_ error:

Feb 26 17:14:04 db1 lrmd: [15859]: info: RA output: (drbd:start:stderr) State change failed: (-7) Refusing to be Primary while peer is not outdated
Feb 26 17:14:04 db1 lrmd: [15859]: info: RA output: (drbd:start:stderr) Command '/sbin/drbdsetup
Feb 26 17:14:04 db1 lrmd: [15859]: info: RA output: (drbd:start:stderr) /dev/drbd0 primary' terminated with exit code 11

Then dopd tries again:

Feb 26 17:14:05 db1 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
Feb 26 17:14:05 db1 drbd-peer-outdater: [19964]: debug: drbd peer: db2
Feb 26 17:14:05 db1 drbd-peer-outdater: [19964]: debug: drbd resource: mysql
Feb 26 17:14:05 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Connecting channel
Feb 26 17:14:05 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Client outdater (0x611190) connected
Feb 26 17:14:05 db1 /usr/lib/heartbeat/dopd: [15865]: debug: invoked: outdater
Feb 26 17:14:05 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Processing msg from outdater
Feb 26 17:14:05 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Got message from (drbd-peer-outdater). (peer: db2, res :mysql)
Feb 26 17:14:05 db1 /usr/lib/heartbeat/dopd: [15865]: debug: Starting node walk
Feb 26 17:14:05 db1 /usr/lib/heartbeat/dopd: [15865]: WARN: Cluster node: db2: status: dead
Feb 26 17:14:05 db1 drbd-peer-outdater: [19964]: debug: message: outdater_rc, db1
Feb 26 17:14:05 db1 kernel: drbd0: outdate-peer helper broken, returned 20

with the same outcome. It continues like that for about 10 seconds (5 or 6 more attempts), and then the resource is declared in error.

Now, the question is: how can I solve this particular issue? The peer can never be outdated, since it is completely dead. Is this an issue with drbddisk or with drbd-peer-outdater/dopd? Is it a timeout issue somewhere (i.e. heartbeat doesn't try hard enough to restart drbd)?
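One workaround I'm considering is to wrap the handler so that "peer declared dead by heartbeat" gets reported with an exit code drbd accepts, instead of 20. This is only a sketch: it assumes the drbd 8 outdate-peer exit-code convention (4 = peer outdated, 5 = peer down/unreachable; the drbd.conf man page for your version is authoritative) and uses heartbeat's cl_status to ask whether the peer is dead:

  #!/bin/sh
  # Hypothetical wrapper for /usr/lib/heartbeat/drbd-peer-outdater.
  # If dopd cannot outdate the peer but heartbeat already considers the
  # node dead, report "peer is down" (exit 5) so drbd can go Primary.
  PEER=${DRBD_PEER:-db2}

  /usr/lib/heartbeat/drbd-peer-outdater "$@"
  rc=$?

  case $rc in
      4|5) exit $rc ;;    # recognised outcomes, pass them through
  esac

  # cl_status ships with heartbeat; I expect "nodestatus" to print
  # "dead" for a node heartbeat has given up on.
  if [ "$(cl_status nodestatus "$PEER")" = "dead" ]; then
      exit 5              # peer unreachable and presumed dead
  fi
  exit $rc

The handlers section of drbd.conf would then point at this wrapper instead of drbd-peer-outdater directly.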
If needed, here is the drbd configuration on both nodes:

global {
    usage-count no;
}
common {
    syncer { rate 10M; }
}
resource "mysql" {
    protocol C;
    handlers {
        outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
    }
    startup {
        wfc-timeout      0;
        degr-wfc-timeout 120;    # 2 minutes.
    }
    disk {
        on-io-error pass_on;
        fencing     resource-only;
    }
    net {
        cram-hmac-alg "sha1";
        shared-secret "Replicants";
        after-sb-0pri disconnect;
        after-sb-1pri disconnect;
        after-sb-2pri disconnect;
        rr-conflict   disconnect;
    }
    syncer {
        rate 50M;
        al-extents 257;
    }
    on db1 {
        device /dev/drbd0;
        disk "/dev/disk/by-id/scsi-36001c23000d71d9d0000213b47b57e44";
        address 192.168.25.1:7788;
        flexible-meta-disk internal;
    }
    on db2 {
        device /dev/drbd0;
        disk "/dev/disk/by-id/scsi-36001c23000d71d9d0000213d47b57e4a";
        address 192.168.25.2:7788;
        flexible-meta-disk internal;
    }
}

And the heartbeat ha.cf:

logfacility local0
keepalive 500ms
deadtime 10
warntime 5
initdead 30
mcast eth0 225.0.0.1 694 2 0
bcast eth2
crm on
use_logd on
apiauth cibmon uid=hacluster
respawn hacluster /usr/lib/heartbeat/cibmon -d
apiauth dopd gid=haclient uid=hacluster
respawn hacluster /usr/lib/heartbeat/dopd
ping 172.16.10.188
respawn root /usr/lib/heartbeat/pingd -m 100 -d 5s
respawn root /usr/lib/heartbeat/hbagent
deadping 5
node db1
node db2

The two machines are connected to each other by an Ethernet crossover cable on eth2, and to a switch (whose IP address is the pingd target) on eth0.

Any help would be appreciated :-)

Many thanks,
-- 
Brice Figureau <brice+drbd at daysofwonder.com>
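PS: for the next test run, I suppose I can confirm that only the outdate step is blocking by forcing the promotion by hand on db1 once db2 is powered off. If I read the 8.0 drbdadm syntax correctly, that would be the following (diagnostic only: it bypasses the fencing check, so it would risk diverging data if db2 were in fact alive with newer writes):

  cat /proc/drbd    # expect cs:WFConnection st:Secondary/Unknown
  drbdadm -- --overwrite-data-of-peer primary mysql
  cat /proc/drbd    # should now show st:Primary/Unknown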