[DRBD-user] FAQ: Reconnecting after a temporary primary node failure

Tue Mar 8 00:39:09 CET 2011

Newbie question here: I created a MySQL + DRBD test setup with two
nodes (VMs), db1 and db2.

root@db1:~# drbdadm role r0
Primary/Secondary
root@db1:~#
root@db2:~# drbdadm role r0
Secondary/Primary
root@db2:~#

I would like to see how the setup behaves during a failure and
recovery of the primary node. Let's start with a minor fault: I pull
the Ethernet cable on db1:

root@db2:~# tail /var/log/kern.log
Mar  7 18:24:19 db2 kernel: [33722.071609] block drbd0: PingAck did
not arrive in time.
Mar  7 18:24:19 db2 kernel: [33722.081223] block drbd0: peer( Primary
-> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate ->
DUnknown )
Mar  7 18:24:19 db2 kernel: [33722.081249] block drbd0: asender terminated
Mar  7 18:24:19 db2 kernel: [33722.081256] block drbd0: Terminating
asender thread
Mar  7 18:24:19 db2 kernel: [33722.081492] block drbd0: short read
expecting header on sock: r=-512
Mar  7 18:24:19 db2 kernel: [33722.096824] block drbd0: Connection closed
Mar  7 18:24:19 db2 kernel: [33722.096836] block drbd0: conn(
NetworkFailure -> Unconnected )
Mar  7 18:24:19 db2 kernel: [33722.096851] block drbd0: receiver terminated
Mar  7 18:24:19 db2 kernel: [33722.096857] block drbd0: Restarting
receiver thread
Mar  7 18:24:19 db2 kernel: [33722.096862] block drbd0: receiver (re)started
Mar  7 18:24:19 db2 kernel: [33722.096871] block drbd0: conn(
Unconnected -> WFConnection )

As described in http://www.drbd.org/users-guide/s-node-failure.html,
db2 did not promote itself to primary:

root@db2:~# drbdadm role r0
Secondary/Unknown
root@db2:~# !cat
cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@db2,
2011-03-07 10:22:02
 0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----
    ns:32 nr:79448 dw:364538 dr:270864 al:8 bm:49 lo:0 pe:0 ua:0 ap:0
ep:1 wo:b oos:0
root@db2:~#
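
For anyone who wants to check these fields from a script rather than
by eye, here is a minimal sketch that splits the device line of
/proc/drbd into connection state (cs), roles (ro) and disk states
(ds). The function name parse_drbd_status and the returned key names
are my own invention, not part of DRBD, and the parsing only assumes
the 8.3-style line format shown above.

```python
import re

def parse_drbd_status(line):
    """Split one /proc/drbd device line into cs, ro and ds fields.

    Hypothetical helper: assumes the 8.3-style format
    ' 0: cs:<state> ro:<local>/<peer> ds:<local>/<peer> ...'.
    """
    fields = dict(re.findall(r"\b(cs|ro|ds):(\S+)", line))
    # Roles and disk states are reported as local/peer pairs.
    local_role, _, peer_role = fields.get("ro", "").partition("/")
    local_disk, _, peer_disk = fields.get("ds", "").partition("/")
    return {
        "cs": fields.get("cs"),
        "local_role": local_role,
        "peer_role": peer_role,
        "local_disk": local_disk,
        "peer_disk": peer_disk,
    }

# The device line from db2 after the cable was pulled.
line = " 0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----"
status = parse_drbd_status(line)
print(status["cs"], status["local_role"], status["peer_role"])
```

Here db2 reports WFConnection, i.e. it is still waiting for the peer
and would accept a connection if db1 tried one.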

I then reconnect the Ethernet cable on db1, which still thinks it is
the primary node:

root@db1:~# tail /var/log/kern.log
Mar  7 18:24:19 db1 kernel: [  744.695406] block drbd0: receiver (re)started
Mar  7 18:24:19 db1 kernel: [  744.695414] block drbd0: conn(
Unconnected -> WFConnection )
Mar  7 18:24:19 db1 kernel: [  744.696261] block drbd0: bind before
connect failed, err = -99
Mar  7 18:24:19 db1 kernel: [  744.696271] block drbd0: conn(
WFConnection -> Disconnecting )
Mar  7 18:24:19 db1 kernel: [  744.696345] block drbd0: Discarding
network configuration.
Mar  7 18:24:19 db1 kernel: [  744.696702] block drbd0: Connection closed
Mar  7 18:24:19 db1 kernel: [  744.696719] block drbd0: conn(
Disconnecting -> StandAlone )
Mar  7 18:24:19 db1 kernel: [  744.697261] block drbd0: receiver terminated
Mar  7 18:24:19 db1 kernel: [  744.697280] block drbd0: Terminating
receiver thread
Mar  7 18:31:41 db1 kernel: [ 1186.440928] e1000: eth0 NIC Link is Up
1000 Mbps Full Duplex, Flow Control: RX
root@db1:~#

Shouldn't the two nodes re-establish connectivity?
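
To make the sequence on db1 easier to follow, here is a rough sketch
that pulls the conn( old -> new ) transitions out of kernel log text,
including lines wrapped by the mail client. The function name
conn_transitions is made up for illustration. Run against the db1
excerpt, it shows the node ending in StandAlone at 18:24:19, several
minutes before the link came back at 18:31:41, after the "bind before
connect failed, err = -99" message (on Linux, errno 99 is normally
EADDRNOTAVAIL). As far as I understand it, a StandAlone resource has
dropped its network configuration and does not retry on its own;
"drbdadm connect r0" on db1 is the usual way to ask it to try again,
but treat that as an assumption to verify, not a confirmed fix.

```python
import re

# Matches 'conn( Old -> New )', tolerating line wraps inside the parentheses.
CONN_RE = re.compile(r"conn\(\s*(\w+)\s*->\s*(\w+)\s*\)")

def conn_transitions(log_text):
    """Return a list of (old_state, new_state) tuples in log order."""
    return CONN_RE.findall(log_text)

# Excerpt of the db1 log from this post, with the original line wraps.
db1_log = """\
Mar  7 18:24:19 db1 kernel: [  744.695414] block drbd0: conn(
Unconnected -> WFConnection )
Mar  7 18:24:19 db1 kernel: [  744.696261] block drbd0: bind before
connect failed, err = -99
Mar  7 18:24:19 db1 kernel: [  744.696271] block drbd0: conn(
WFConnection -> Disconnecting )
Mar  7 18:24:19 db1 kernel: [  744.696719] block drbd0: conn(
Disconnecting -> StandAlone )
"""

transitions = conn_transitions(db1_log)
final_state = transitions[-1][1] if transitions else None
for old, new in transitions:
    print(old, "->", new)
print("final connection state:", final_state)
```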