Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
just some preliminary comments... / 2006-04-18 18:19:04 +0200 \ Cyril Bouthors: > Hi, > > I've experienced a major hardware failure on an ethernet switch last > night and DRBD very badly reacted on 6 nodes out of the 26 that were > connected to that switch. > > Those 6 nodes switched to state Primary/Unknown and Secondary/Unknown. > > The read accesses like 'ls' on the filesystem mounted on the 3 > primaries worked fine but I have the impression that write accesses > failed according to the fact that all processes accessing the > partitions (like MySQL) were stuck. > > Unfortunately, I had no time to verify this carefully because of the > pressure of the downtime and my will to fix it _asap_. > > I was unable to stop the user-land processes. The load was about 250. > > SSH worked fine. > > I was able to unmount the partition with "umount -l /mypartition". > > I was unable to stop DRBD on both primaries and secondaries, it just > stuck when heartbeat launched drbdadm. > > I was forced to electrically reboot the servers. > > IIRC, fsck didn't have to parse the whole FS after the reboot on > primaries that I've "umount -l" but not on the others. > > > I'm running EXT3 on top of DRBD 0.7.15 on top of Linux 2.4.27-2-686. > Debian flavor, no home-made patches. > > The exact same thing happened three times in the last two months. > > I'm not sure what the network failure looked exactly like to DRBD but > I guess it was not as easy as if the cable was suddenly disconnected, > I think DRBD got corrupted packets and such. > > I think DRBD should not react this way to network failures and I think > it's a bug in DRBD that should be fixed. > > Do you think DRBD 0.7.16 or .17 already fixed this issue? I can > upgrade to .16 or .17 if (and only if) needed. > > Here are the revelant parts of the kernel logs on one of primary that > have failed: > > Apr 17 22:26:26 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295 > Apr 17 22:26:29 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967294 > Apr 17 22:26:32 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967293 > Apr 17 22:26:35 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967292 > Apr 17 22:26:38 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967291 > Apr 17 22:26:41 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967290 > > Apr 17 22:26:44 nfsa4 kernel: drbd0: PingAck did not arrive in time. > Apr 17 22:26:44 nfsa4 kernel: drbd0: drbd0_asender [15156]: cstate Connected --> NetworkFailure > Apr 17 22:26:44 nfsa4 kernel: drbd0: asender terminated > Apr 17 22:26:44 nfsa4 kernel: drbd0: kupdated [6]: cstate NetworkFailure --> Timeout > Apr 17 22:26:44 nfsa4 kernel: drbd0: drbd0_receiver [831]: cstate Timeout --> BrokenPipe > Apr 17 22:26:44 nfsa4 kernel: drbd0: short read expecting header on sock: r=-512 > Apr 17 22:26:44 nfsa4 kernel: drbd0: short sent UnplugRemote size=8 sent=-1001 > Apr 17 22:26:44 nfsa4 kernel: drbd0: worker terminated > Apr 17 22:26:44 nfsa4 kernel: drbd0: drbd0_receiver [831]: cstate BrokenPipe --> Unconnected > Apr 17 22:26:44 nfsa4 kernel: drbd0: Connection lost. > Apr 17 22:26:44 nfsa4 kernel: drbd0: drbd0_receiver [831]: cstate Unconnected --> WFConnection > [NETWORK WENT BACK HERE] > Apr 17 22:27:20 nfsa4 kernel: drbd0: drbd0_receiver [831]: cstate WFConnection --> WFReportParams no further log messages? so it seems to be stuck in WFReportParams ... doh. that _should_ timeout pretty quick... couple of seconds, iirc. any hanging drbdsetup processes? output of /proc/drbd at that time, in case you have a session/console log? did you try to "drbdadm disconnect"? > Apr 17 22:56:15 nfsa4 kernel: drbd0: Primary/Unknown --> Secondary/Unknown > > > Same thing on a secondary: > > Apr 17 22:26:44 nfsb4 kernel: drbd0: meta connection shut down by peer. > Apr 17 22:26:44 nfsb4 kernel: drbd0: drbd0_asender [12837]: cstate Connected --> NetworkFailure > Apr 17 22:26:44 nfsb4 kernel: drbd0: asender terminated > > Apr 17 22:27:17 nfsb4 kernel: drbd0: short sent BarrierAck size=16 sent=-1001 > Apr 17 22:27:17 nfsb4 kernel: drbd0: error receiving Barrier, l: 8! > Apr 17 22:27:17 nfsb4 kernel: drbd0: worker terminated > Apr 17 22:27:17 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate NetworkFailure --> Unconnected > Apr 17 22:27:17 nfsb4 kernel: drbd0: Connection lost. > Apr 17 22:27:17 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate Unconnected --> WFConnection > [NETWORK WENT BACK HERE] > Apr 17 22:27:20 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate WFConnection --> WFReportParams > > Apr 17 22:27:22 nfsb4 kernel: drbd0: sock_recvmsg returned -11 > Apr 17 22:27:22 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate WFReportParams --> BrokenPipe > Apr 17 22:27:22 nfsb4 kernel: drbd0: short read expecting header on sock: r=-11 > Apr 17 22:27:22 nfsb4 kernel: drbd0: worker terminated > Apr 17 22:27:22 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate BrokenPipe --> Unconnected > Apr 17 22:27:22 nfsb4 kernel: drbd0: Connection lost. > Apr 17 22:27:22 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate Unconnected --> WFConnection ok, here at least the first WFReportParams indeed timed out. does not help on the primary, unfortunately... > Apr 17 22:53:30 nfsb4 kernel: drbd0: drbdsetup [31901]: cstate WFConnection --> Unconnected > Apr 17 22:53:30 nfsb4 kernel: drbd0: worker terminated > Apr 17 22:53:30 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate Unconnected --> StandAlone > Apr 17 22:53:30 nfsb4 kernel: drbd0: Connection lost. > Apr 17 22:53:30 nfsb4 kernel: drbd0: Discarding network configuration. > Apr 17 22:53:30 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate StandAlone --> StandAlone > Apr 17 22:53:30 nfsb4 kernel: drbd0: receiver terminated > Apr 17 22:53:30 nfsb4 kernel: drbd0: drbdsetup [31901]: cstate StandAlone --> StandAlone > Apr 17 22:53:30 nfsb4 kernel: drbd0: drbdsetup [31901]: cstate StandAlone --> Unconfigured > Apr 17 22:53:30 nfsb4 kernel: drbd0: worker terminated > Apr 17 22:53:30 nfsb4 kernel: drbd: module cleanup done. > > > The others 20 DRBD nodes that were hit by the same network failure > reacted normally, the FS was nicely working with read/write accesses. > > Here are the revelant parts of the kernel logs on one of primary where > DRBD *DID NOT FAIL*: > > Apr 17 22:26:14 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967295 > Apr 17 22:26:17 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967294 > Apr 17 22:26:20 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967293 > Apr 17 22:26:23 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967292 > Apr 17 22:26:29 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967295 > Apr 17 22:26:32 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967294 > Apr 17 22:26:49 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967295 > Apr 17 22:26:52 mail1 kernel: drbd0: PingAck did not arrive in time. > Apr 17 22:26:52 mail1 kernel: drbd0: drbd0_asender [23008]: cstate Connected --> NetworkFailure > Apr 17 22:26:52 mail1 kernel: drbd0: asender terminated > Apr 17 22:26:52 mail1 kernel: drbd0: kjournald [1238]: cstate NetworkFailure --> Timeout > Apr 17 22:26:52 mail1 kernel: drbd0: drbd0_receiver [858]: cstate Timeout --> BrokenPipe > Apr 17 22:26:52 mail1 kernel: drbd0: short read expecting header on sock: r=-512 > Apr 17 22:26:52 mail1 kernel: drbd0: short sent Barrier size=16 sent=-1000 > Apr 17 22:26:52 mail1 kernel: drbd0: worker terminated > Apr 17 22:26:52 mail1 kernel: drbd0: drbd0_receiver [858]: cstate BrokenPipe --> Unconnected > Apr 17 22:26:52 mail1 kernel: drbd0: Connection lost. > Apr 17 22:26:52 mail1 kernel: drbd0: drbd0_receiver [858]: cstate Unconnected --> WFConnection > [NETWORK WENT BACK HERE] > Apr 17 22:26:55 mail1 kernel: drbd0: drbd0_receiver [858]: cstate WFConnection --> WFReportParams > Apr 17 22:26:55 mail1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74 > Apr 17 22:26:55 mail1 kernel: drbd0: Connection established. > Apr 17 22:26:55 mail1 kernel: drbd0: I am(P): 1:00000002:00000005:00000029:00000004:10 > Apr 17 22:26:55 mail1 kernel: drbd0: Peer(S): 1:00000002:00000005:00000028:00000004:01 > Apr 17 22:26:55 mail1 kernel: drbd0: drbd0_receiver [858]: cstate WFReportParams --> WFBitMapS > Apr 17 22:26:59 mail1 kernel: drbd0: Primary/Unknown --> Primary/Secondary > Apr 17 22:27:30 mail1 kernel: drbd0: drbd0_receiver [858]: cstate WFBitMapS --> SyncSource > Apr 17 22:27:30 mail1 kernel: drbd0: Resync started as SyncSource (need to sync 572 KB [143 bits set]). > Apr 17 22:27:31 mail1 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 572 K/sec) > Apr 17 22:27:31 mail1 kernel: drbd0: drbd0_worker [20786]: cstate SyncSource --> Connected well, that looks as expected. the switch seems to still have been able to let small packets through for a while, when larger packets already went nowhere ... (thats this ko count down) -- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Schoenbrunner Str. 244, A-1120 Vienna/Europe http://www.linbit.com : __ please use the "List-Reply" function of your email client.