Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all,

It seems DRBD 0.7.4 cannot recover from a network failure. I can reliably reproduce the following scenario on two RedHat 9 / kernel 2.4.27 servers; the primary is p4test1 and the secondary is p4test2. I have previously set up the drbd devices (i.e., mkfs.ext3'd after configuring and loading drbd on the primary).

Note that this is the same scenario the 'drbddisk' plugin for heartbeat relies on, so the same problem exists for heartbeat failovers...

1. I start the drbd services using the init script:

root@p4test1:~ # /etc/init.d/drbd start
root@p4test2:~ # /etc/init.d/drbd start

At this point, both servers consider themselves secondary.

2. I set p4test1 up as the primary drbd server (p4test2 is still a secondary):

root@p4test1:~ # drbdadm primary all

syslog, p4test1:

Sep 20 19:41:04 p4test1 syslogd 1.4.1: restart.
Sep 20 19:41:19 p4test1 kernel: drbd0: Secondary/Secondary --> Primary/Secondary

syslog, p4test2:

Sep 20 19:41:26 p4test2 syslogd 1.4.1: restart.
Sep 20 19:41:37 p4test2 kernel: drbd0: Secondary/Secondary --> Secondary/Primary

3. I unplug the network cable of p4test1 to simulate a network failure.

syslog, p4test1:

Sep 20 19:42:37 p4test1 kernel: eth0: link down
Sep 20 19:42:49 p4test1 kernel: drbd0: PingAck did not arrive in time.
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_asender [20684]: cstate Connected --> NetworkFailure
Sep 20 19:42:49 p4test1 kernel: drbd0: asender terminated
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate NetworkFailure --> BrokenPipe
Sep 20 19:42:49 p4test1 kernel: drbd0: short read expecting header on sock: r=-512
Sep 20 19:42:49 p4test1 kernel: drbd0: worker terminated
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate BrokenPipe --> Unconnected
Sep 20 19:42:49 p4test1 kernel: drbd0: Connection lost.
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate Unconnected --> WFConnection

syslog, p4test2:

Sep 20 19:42:49 p4test1 kernel: drbd0: PingAck did not arrive in time.
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_asender [20684]: cstate Connected --> NetworkFailure
Sep 20 19:42:49 p4test1 kernel: drbd0: asender terminated
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate NetworkFailure --> BrokenPipe
Sep 20 19:42:49 p4test1 kernel: drbd0: short read expecting header on sock: r=-512
Sep 20 19:42:49 p4test1 kernel: drbd0: worker terminated
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate BrokenPipe --> Unconnected
Sep 20 19:42:49 p4test1 kernel: drbd0: Connection lost.
Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate Unconnected --> WFConnection

4. I ask p4test2 to become the primary, and p4test1 to become a secondary:

root@p4test2:~ # drbdadm primary all
root@p4test1:~ # drbdadm secondary all

syslog, p4test1:

Sep 20 19:43:15 p4test1 kernel: drbd0: Primary/Unknown --> Secondary/Unknown

syslog, p4test2:

Sep 20 19:43:49 p4test2 kernel: drbd0: Secondary/Unknown --> Primary/Unknown

5. I plug the network cable of p4test1 back in.

syslog, p4test1:

Sep 20 19:44:01 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate WFConnection --> WFReportParams
Sep 20 19:44:01 p4test1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Sep 20 19:44:01 p4test1 kernel: drbd0: Connection established.

syslog, p4test2:

Sep 20 19:44:19 p4test2 kernel: drbd0: drbd0_receiver [15823]: cstate WFConnection --> WFReportParams
Sep 20 19:44:19 p4test2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Sep 20 19:44:19 p4test2 kernel: drbd0: Connection established.

6. DRBD fails to reconnect.

* drbd seems to think p4test2 is out of sync, even though p4test2 is the primary and it is p4test1 that suffered the network failure and is now listed as secondary (a possible manual way out of this state is sketched after the logs below).

syslog, p4test1:

Sep 20 19:44:01 p4test1 kernel: drbd0: I am(S): 1:00000002:00000001:0000001e:0000000e:00
Sep 20 19:44:01 p4test1 kernel: drbd0: Peer(P): 1:00000002:00000001:0000001d:0000000f:10
Sep 20 19:44:01 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate WFReportParams --> WFBitMapS
Sep 20 19:44:01 p4test1 kernel: drbd0: Secondary/Unknown --> Secondary/Primary
Sep 20 19:44:01 p4test1 kernel: drbd0: sock was shut down by peer
Sep 20 19:44:01 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate WFBitMapS --> BrokenPipe
Sep 20 19:44:01 p4test1 kernel: drbd0: short read expecting header on sock: r=0
Sep 20 19:44:01 p4test1 kernel: drbd0: meta connection shut down by peer.
Sep 20 19:44:01 p4test1 kernel: drbd0: asender terminated
Sep 20 19:44:01 p4test1 kernel: drbd0: worker terminated
Sep 20 19:44:01 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate BrokenPipe --> Unconnected
Sep 20 19:44:01 p4test1 kernel: drbd0: Connection lost.
Sep 20 19:44:01 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate Unconnected --> WFConnection

syslog, p4test2:

Sep 20 19:44:19 p4test2 kernel: drbd0: I am(P): 1:00000002:00000001:0000001d:0000000f:10
Sep 20 19:44:19 p4test2 kernel: drbd0: Peer(S): 1:00000002:00000001:0000001e:0000000e:00
Sep 20 19:44:19 p4test2 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.
Sep 20 19:44:19 p4test2 kernel: drbd0: drbd0_receiver [15823]: cstate WFReportParams --> StandAlone
Sep 20 19:44:19 p4test2 kernel: drbd0: error receiving ReportParams, l: 72!
Sep 20 19:44:19 p4test2 kernel: drbd0: asender terminated
Sep 20 19:44:19 p4test2 kernel: drbd0: worker terminated
Sep 20 19:44:19 p4test2 kernel: drbd0: drbd0_receiver [15823]: cstate StandAlone --> StandAlone
Sep 20 19:44:19 p4test2 kernel: drbd0: Connection lost.
Sep 20 19:44:19 p4test2 kernel: drbd0: receiver terminated
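A possible manual way out of this state would be to discard p4test1's stale data and force a full resync from p4test2. This is only a sketch: it assumes drbdadm's 'invalidate' and 'connect' subcommands behave as the 0.7 man page describes, and it throws away whatever p4test1 wrote while it was cut off.

On p4test1 (the stale ex-primary, now Secondary, cstate WFConnection), mark the local data out of date:

root@p4test1:~ # drbdadm invalidate all

On p4test2 (the current Primary, left StandAlone after the aborted handshake), re-enable the connection:

root@p4test2:~ # drbdadm connect all

p4test2 should then become the sync source and push a full resync to p4test1; progress can be watched on either node with:

root@p4test1:~ # cat /proc/drbd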
Proposed solution:

(a) Assume out of sync after a network failure.
* If a primary becomes a secondary after a network failure, assume it is out of sync when it reconnects to a peer that reports it is now primary.

(b) Recognize network failures on the machines they happen on.
* This may help trigger the behaviour proposed in (a) for scenarios like the one above
* This is also the problem heartbeat tries to solve
* It's more difficult than it sounds!
* Perhaps recognize local failures (i.e., eth0 down), where possible

hth,
-Steve