Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
/ 2004-09-21 08:48:51 +0100 \ Steve Purkis:
> Hi all,
>
> It seems DRBD 0.7.4 cannot recover from a network failure.

nonsense. see below.

> I can reliably reproduce the following scenario on two RedHat 9 /
> kernel 2.4.27 servers; the primary is p4test1 & the secondary is p4test2.
> I have previously set up drbd devices (i.e., mkfs.ext3'd after
> configuring and loading drbd on the primary). Note that this is the
> same scenario the 'drbddisk' plugin for heartbeat relies on, so the
> same problem exists for heartbeat failovers...
>
> 1. I start the drbd services using the init script:
>
>    root at p4test1:~ # /etc/init.d/drbd start
>    root at p4test2:~ # /etc/init.d/drbd start
>
>    At this point, both servers consider themselves secondary.
>
> 2. I set p4test1 up as the primary drbd server (p4test2 is still a
>    secondary):
>
>    root at p4test1:~ # drbdadm primary all
>
>    syslog, p4test1:
>    Sep 20 19:41:04 p4test1 syslogd 1.4.1: restart.
>    Sep 20 19:41:19 p4test1 kernel: drbd0: Secondary/Secondary --> Primary/Secondary
>
>    syslog, p4test2:
>    Sep 20 19:41:26 p4test2 syslogd 1.4.1: restart.
>    Sep 20 19:41:37 p4test2 kernel: drbd0: Secondary/Secondary --> Secondary/Primary
>
> 3. I unplug the network cable of p4test1 to simulate a network failure.
>
>    syslog, p4test1:
>    Sep 20 19:42:37 p4test1 kernel: eth0: link down
>    Sep 20 19:42:49 p4test1 kernel: drbd0: PingAck did not arrive in time.
>    Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_asender [20684]: cstate Connected --> NetworkFailure
>    Sep 20 19:42:49 p4test1 kernel: drbd0: asender terminated
>    Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate NetworkFailure --> BrokenPipe
>    Sep 20 19:42:49 p4test1 kernel: drbd0: short read expecting header on sock: r=-512
>    Sep 20 19:42:49 p4test1 kernel: drbd0: worker terminated
>    Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate BrokenPipe --> Unconnected
>    Sep 20 19:42:49 p4test1 kernel: drbd0: Connection lost.
>    Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate Unconnected --> WFConnection
>
>    syslog, p4test2:
>    Sep 20 19:42:49 p4test1 kernel: drbd0: PingAck did not arrive in time.
>    Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_asender [20684]: cstate Connected --> NetworkFailure
>    Sep 20 19:42:49 p4test1 kernel: drbd0: asender terminated
>    Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate NetworkFailure --> BrokenPipe
>    Sep 20 19:42:49 p4test1 kernel: drbd0: short read expecting header on sock: r=-512
>    Sep 20 19:42:49 p4test1 kernel: drbd0: worker terminated
>    Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate BrokenPipe --> Unconnected
>    Sep 20 19:42:49 p4test1 kernel: drbd0: Connection lost.
>    Sep 20 19:42:49 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate Unconnected --> WFConnection
>
> 4. I ask p4test2 to become the primary, and p4test1 to become a
>    secondary:
>
>    root at p4test2:~ # drbdadm primary all
>    root at p4test1:~ # drbdadm secondary all

there you have a split brain as far as drbd is concerned. and you still
issue independent administrative requests on both nodes. this does not
work.

intended (from heartbeat's point of view):
node gone? stonith it. you ALWAYS MUST FENCE a node before you do
administrative requests on the remaining one. you did not. your fault.

or, node not gone, only link failure: do nothing [*]
(heartbeat has redundant communication links).
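a minimal sketch of that fence-first ordering, assuming a two-node pair
like the one above; the fencing command is only a placeholder for
whatever STONITH device you actually have, and only the drbdadm call is
taken from this thread:

    # on the surviving node (here p4test2), ONLY after p4test1 has been
    # fenced, i.e. confirmed reset / powered off:
    root at p4test2:~ # your-stonith-tool reset p4test1   # placeholder for your actual fencing command
    root at p4test2:~ # drbdadm primary all               # only now is it safe to promote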
[*] not exactly nothing: we know of certain failure scenarios and
configuration combinations where we currently could lose some of the
most recent confirmed committed transactions. there is discussion going
on about how to make this bulletproof on drbd-dev and linux-ha-dev, if
you are interested.

>    syslog, p4test1:
>    Sep 20 19:43:15 p4test1 kernel: drbd0: Primary/Unknown --> Secondary/Unknown
>
>    syslog, p4test2:
>    Sep 20 19:43:49 p4test2 kernel: drbd0: Secondary/Unknown --> Primary/Unknown
>
> 5. I plug the network cable of p4test1 back in.
>
>    syslog, p4test1:
>    Sep 20 19:44:01 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate WFConnection --> WFReportParams
>    Sep 20 19:44:01 p4test1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
>    Sep 20 19:44:01 p4test1 kernel: drbd0: Connection established.
>
>    syslog, p4test2:
>    Sep 20 19:44:19 p4test2 kernel: drbd0: drbd0_receiver [15823]: cstate WFConnection --> WFReportParams
>    Sep 20 19:44:19 p4test2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
>    Sep 20 19:44:19 p4test2 kernel: drbd0: Connection established.
>
> 6. DRBD fails to reconnect
>    * drbd seems to think p4test2 is out of sync even though it is
>      primary, and p4test1 has suffered a network failure and is
>      listed as secondary.

well, the nodes ARE out of sync. your events look like this:

    T1     link     T2
    Pri    ok       Sec
    Pri    broke    Sec    here T1 can (and will, even if it is only
                           the umount) change blocks, which are not
                           mirrored on T2.
    Sec    broke    Pri    now T2 changes blocks, which are not
                           mirrored on T1.
    Sec    ok       Pri    they recognise ...

>    syslog, p4test1:
>    Sep 20 19:44:01 p4test1 kernel: drbd0: I am(S): 1:00000002:00000001:0000001e:0000000e:00
>    Sep 20 19:44:01 p4test1 kernel: drbd0: Peer(P): 1:00000002:00000001:0000001d:0000000f:10
>    Sep 20 19:44:01 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate WFReportParams --> WFBitMapS
>    Sep 20 19:44:01 p4test1 kernel: drbd0: Secondary/Unknown --> Secondary/Primary
>    Sep 20 19:44:01 p4test1 kernel: drbd0: sock was shut down by peer
>    Sep 20 19:44:01 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate WFBitMapS --> BrokenPipe
>    Sep 20 19:44:01 p4test1 kernel: drbd0: short read expecting header on sock: r=0
>    Sep 20 19:44:01 p4test1 kernel: drbd0: meta connection shut down by peer.
>    Sep 20 19:44:01 p4test1 kernel: drbd0: asender terminated
>    Sep 20 19:44:01 p4test1 kernel: drbd0: worker terminated
>    Sep 20 19:44:01 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate BrokenPipe --> Unconnected
>    Sep 20 19:44:01 p4test1 kernel: drbd0: Connection lost.
>    Sep 20 19:44:01 p4test1 kernel: drbd0: drbd0_receiver [20653]: cstate Unconnected --> WFConnection
>
>    syslog, p4test2:
>    Sep 20 19:44:19 p4test2 kernel: drbd0: I am(P): 1:00000002:00000001:0000001d:0000000f:10
>    Sep 20 19:44:19 p4test2 kernel: drbd0: Peer(S): 1:00000002:00000001:0000001e:0000000e:00
>    Sep 20 19:44:19 p4test2 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.

HERE they recognise that both nodes have been modified independently of
each other.

to make it clear: they have been identical once in the past, then both
made independent modifications to the data. DRBD does NOT simply choose
to throw away one changeset automatically.

we are going to provide a config mechanism at some point, where one can
configure that the node with fewer modifications will be chosen, or the
current primary will be chosen, or that ... there are many possible
ways.

currently, however, it will refuse to reconnect and do an automatic
sync, because we have divergent data on both nodes, and we think it
requires a human to decide whether to merge or throw away or ... you
get the idea.
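a minimal sketch of what that human decision could look like with the
0.7 userland tools, assuming (and this is only an assumption for the
example, not something drbd decides for you) that you want to keep
p4test2's data and discard the changes made independently on p4test1:

    # on p4test1, the node whose changes are to be thrown away:
    root at p4test1:~ # drbdadm secondary all     # make sure it is not primary
    root at p4test1:~ # drbdadm invalidate all    # mark the local data inconsistent, so it becomes the sync target
    root at p4test1:~ # drbdadm connect all       # (re)try the connection

    # on p4test2, which dropped to StandAlone above:
    root at p4test2:~ # drbdadm connect all       # reconnect; a full sync p4test2 -> p4test1 should follow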
>    Sep 20 19:44:19 p4test2 kernel: drbd0: drbd0_receiver [15823]: cstate WFReportParams --> StandAlone
>    Sep 20 19:44:19 p4test2 kernel: drbd0: error receiving ReportParams, l: 72!
>    Sep 20 19:44:19 p4test2 kernel: drbd0: asender terminated
>    Sep 20 19:44:19 p4test2 kernel: drbd0: worker terminated
>    Sep 20 19:44:19 p4test2 kernel: drbd0: drbd0_receiver [15823]: cstate StandAlone --> StandAlone
>    Sep 20 19:44:19 p4test2 kernel: drbd0: Connection lost.
>    Sep 20 19:44:19 p4test2 kernel: drbd0: receiver terminated
>
> Proposed solution:
>
> (a) Assume out of sync after network failure
>     * If the primary becomes a secondary after a network failure,
>       assume it is out of sync when it reconnects to a server that
>       tells us it is now primary.

nope. won't work. but see the discussion thread mentioned above...

> (b) Recognize network failures on the machines that they happen on.
>     * This may help trigger the above scenarios
>     * This is also the problem heartbeat tries to solve
>     * It's more difficult than it sounds!
>     * Perhaps recognize local failures (i.e. eth0 down), where possible

we do recognize... we map that to connection loss, and to the peer
state of "Unknown". whether this failure is local, or some switch, or
the cable, or remote, does not matter at all. the effect is what
counts. and the effect is that we no longer see our peer, so we do not
know which state it has.

the interesting point is what we _do_ now. and I think we are not too
bad currently. internal split brain is not detectable internally *by
definition*, but you probably want to have a look at the mentioned
thread on drbd-dev and linux-ha-dev...

	Lars Ellenberg

--
please use the "List-Reply" function of your email client.