Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all,
It seems DRBD 0.7.4 cannot recover from a network failure if the roles are
switched while the link is down. I can reliably reproduce the following
scenario on two RedHat 9 / kernel 2.4.27 servers; the primary is p4test1 and
the secondary is p4test2.
I have previously set up the drbd devices (i.e. ran mkfs.ext3 on the
primary after configuring and loading drbd). Note that the role switching
below is exactly what the 'drbddisk' plugin for heartbeat does, so the same
problem exists for heartbeat failovers...
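For completeness, the devices were originally prepared by bringing drbd up
on both nodes, promoting p4test1 and running mkfs there; roughly the
following, assuming a single resource on /dev/drbd0 (a brand-new device may
also need 'drbdadm -- --do-what-I-say primary all' for the very first
promotion):
root at p4test1:~ # drbdadm primary all
root at p4test1:~ # mkfs.ext3 /dev/drbd0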
1. I start the drbd service on both nodes using the init script:
root at p4test1:~ # /etc/init.d/drbd start
root at p4test2:~ # /etc/init.d/drbd start
At this point, both servers consider themselves secondary.
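This can be checked on either node:
root at p4test1:~ # cat /proc/drbd
which, if I read the 0.7 format correctly, should show cs:Connected and
st:Secondary/Secondary for drbd0 on both machines.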
2. I set p4test1 up as the primary drbd server (p4test2 is still a
secondary):
root at p4test1:~ # drbdadm primary all
syslog, p4test1:
Sep 20 19:41:04 p4test1 syslogd 1.4.1: restart.
Sep 20 19:41:19 p4test1 kernel: drbd0: Secondary/Secondary -->
Primary/Secondary
syslog, p4test2:
Sep 20 19:41:26 p4test2 syslogd 1.4.1: restart.
Sep 20 19:41:37 p4test2 kernel: drbd0: Secondary/Secondary -->
Secondary/Primary
3. I unplug the network cable of p4test1 to simulate a network failure (a
software-only way to do the same is sketched after the logs below).
syslog, p4test1:
Sep 20 19:42:37 p4test1 kernel: eth0: link down
Sep 20 19:42:49 p4test1 kernel: drbd0: PingAck did not arrive in time.
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_asender [20684]: cstate
Connected --> NetworkFailure
Sep 20 19:42:49 p4test1 kernel: drbd0: asender terminated
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate
NetworkFailure --> BrokenPipe
Sep 20 19:42:49 p4test1 kernel: drbd0: short read expecting header on
sock: r=-512
Sep 20 19:42:49 p4test1 kernel: drbd0: worker terminated
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate
BrokenPipe --> Unconnected
Sep 20 19:42:49 p4test1 kernel: drbd0: Connection lost.
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate
Unconnected --> WFConnection
syslog, p4test2:
Sep 20 19:42:49 p4test1 kernel: drbd0: PingAck did not arrive in time.
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_asender [20684]: cstate
Connected --> NetworkFailure
Sep 20 19:42:49 p4test1 kernel: drbd0: asender terminated
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate
NetworkFailure --> BrokenPipe
Sep 20 19:42:49 p4test1 kernel: drbd0: short read expecting header on
sock: r=-512
Sep 20 19:42:49 p4test1 kernel: drbd0: worker terminated
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate
BrokenPipe --> Unconnected
Sep 20 19:42:49 p4test1 kernel: drbd0: Connection lost.
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate
Unconnected --> WFConnection
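For what it's worth, the cable pull in step 3 can probably also be done in
software, which makes the test easier to script; a sketch, assuming the
resource uses TCP port 7788 (check your drbd.conf):
root at p4test1:~ # ifconfig eth0 down
or, to cut only the drbd traffic:
root at p4test1:~ # iptables -A INPUT -p tcp --dport 7788 -j DROP
root at p4test1:~ # iptables -A OUTPUT -p tcp --dport 7788 -j DROP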
4. I ask p4test2 to become the primary, and p4test1 to become a
secondary:
root at p4test2:~ # drbdadm primary all
root at p4test1:~ # drbdadm secondary all
syslog, p4test1:
Sep 20 19:43:15 p4test1 kernel: drbd0: Primary/Unknown -->
Secondary/Unknown
syslog, p4test2:
Sep 20 19:43:49 p4test2 kernel: drbd0: Secondary/Unknown -->
Primary/Unknown
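The manual switch in step 4 is essentially what heartbeat does on failover
through drbddisk, which as far as I can tell boils down to something like
this (simplified sketch, status and error handling omitted):
#!/bin/sh
# rough sketch of what heartbeat's drbddisk resource script does:
# "drbddisk <resource> start" promotes, "drbddisk <resource> stop" demotes
RES="$1"
case "$2" in
  start) drbdadm primary "$RES" ;;
  stop)  drbdadm secondary "$RES" ;;
esac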
5. I plug the network cable of p4test1 back in.
syslog, p4test1:
Sep 20 19:44:01 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate
WFConnection --> WFReportParams
Sep 20 19:44:01 p4test1 kernel: drbd0: Handshake successful: DRBD
Network Protocol version 74
Sep 20 19:44:01 p4test1 kernel: drbd0: Connection established.
syslog, p4test2:
Sep 20 19:44:19 p4test2 kernel: drbd0: drbd0_receiver [15823]: cstate
WFConnection --> WFReportParams
Sep 20 19:44:19 p4test2 kernel: drbd0: Handshake successful: DRBD
Network Protocol version 74
Sep 20 19:44:19 p4test2 kernel: drbd0: Connection established.
6. DRBD fails to reconnect.
* drbd seems to think p4test2 is the node that is out of sync, even though
p4test2 is the current primary and p4test1 is the one that suffered the
network failure and is now a secondary; p4test2 therefore aborts the
connection rather than let the current primary become a sync target.
syslog, p4test1:
Sep 20 19:44:01 p4test1 kernel: drbd0: I am(S):
1:00000002:00000001:0000001e:0000000e:00
Sep 20 19:44:01 p4test1 kernel: drbd0: Peer(P):
1:00000002:00000001:0000001d:0000000f:10
Sep 20 19:44:01 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate
WFReportParams --> WFBitMapS
Sep 20 19:44:01 p4test1 kernel: drbd0: Secondary/Unknown -->
Secondary/Primary
Sep 20 19:44:01 p4test1 kernel: drbd0: sock was shut down by peer
Sep 20 19:44:01 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate
WFBitMapS --> BrokenPipe
Sep 20 19:44:01 p4test1 kernel: drbd0: short read expecting header on
sock: r=0
Sep 20 19:44:01 p4test1 kernel: drbd0: meta connection shut down by
peer.
Sep 20 19:44:01 p4test1 kernel: drbd0: asender terminated
Sep 20 19:44:01 p4test1 kernel: drbd0: worker terminated
Sep 20 19:44:01 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate
BrokenPipe --> Unconnected
Sep 20 19:44:01 p4test1 kernel: drbd0: Connection lost.
Sep 20 19:44:01 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate
Unconnected --> WFConnection
syslog, p4test2:
Sep 20 19:44:19 p4test2 kernel: drbd0: I am(P):
1:00000002:00000001:0000001d:0000000f:10
Sep 20 19:44:19 p4test2 kernel: drbd0: Peer(S):
1:00000002:00000001:0000001e:0000000e:00
Sep 20 19:44:19 p4test2 kernel: drbd0: Current Primary shall become
sync TARGET! Aborting to prevent data corruption.
Sep 20 19:44:19 p4test2 kernel: drbd0: drbd0_receiver [15823]: cstate
WFReportParams --> StandAlone
Sep 20 19:44:19 p4test2 kernel: drbd0: error receiving ReportParams,
l: 72!
Sep 20 19:44:19 p4test2 kernel: drbd0: asender terminated
Sep 20 19:44:19 p4test2 kernel: drbd0: worker terminated
Sep 20 19:44:19 p4test2 kernel: drbd0: drbd0_receiver [15823]: cstate
StandAlone --> StandAlone
Sep 20 19:44:19 p4test2 kernel: drbd0: Connection lost.
Sep 20 19:44:19 p4test2 kernel: drbd0: receiver terminated
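A manual way out of this state would presumably be to tell p4test1 to throw
away its copy and resync from p4test2, then let the two sides reconnect; I
have not actually tested this ('all' assumes a single resource):
root at p4test1:~ # drbdadm invalidate all
root at p4test2:~ # drbdadm connect all
(p4test1 is still in WFConnection at this point, so it should accept the new
connection and become the sync target.)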
Proposed solutions:
(a) Assume out of sync after a network failure
* If a primary becomes a secondary after a network failure, assume it is
out of sync when it reconnects and the peer reports that it is now primary.
(b) Recognize network failures on the machine they happen on.
* This may help DRBD react correctly in scenarios like the one above.
* This is also the problem heartbeat tries to solve.
* It's more difficult than it sounds!
* Perhaps recognize local failures (e.g. eth0 link down) where possible; a
rough sketch follows.
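As a crude illustration of that last point, something outside the kernel
could watch the replication NIC and demote the node when the carrier goes
away. A rough sketch, assuming the NIC answers MII queries and that dropping
to secondary is the reaction we want:
#!/bin/sh
# Watch eth0's link state and demote this node when the link is lost, so an
# isolated ex-primary cannot keep diverging. Note that the demotion will
# fail if the drbd device is still mounted, so a real script would have to
# handle unmounting (or whatever heartbeat would normally do) as well.
while sleep 5; do
    if ! mii-tool eth0 2>/dev/null | grep -q 'link ok'; then
        drbdadm secondary all
    fi
done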
hth,
-Steve