Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, I've experienced a major hardware failure on an ethernet switch last night and DRBD very badly reacted on 6 nodes out of the 26 that were connected to that switch. Those 6 nodes switched to state Primary/Unknown and Secondary/Unknown. The read accesses like 'ls' on the filesystem mounted on the 3 primaries worked fine but I have the impression that write accesses failed according to the fact that all processes accessing the partitions (like MySQL) were stuck. Unfortunately, I had no time to verify this carefully because of the pressure of the downtime and my will to fix it _asap_. I was unable to stop the user-land processes. The load was about 250. SSH worked fine. I was able to unmount the partition with "umount -l /mypartition". I was unable to stop DRBD on both primaries and secondaries, it just stuck when heartbeat launched drbdadm. I was forced to electrically reboot the servers. IIRC, fsck didn't have to parse the whole FS after the reboot on primaries that I've "umount -l" but not on the others. I'm running EXT3 on top of DRBD 0.7.15 on top of Linux 2.4.27-2-686. Debian flavor, no home-made patches. The exact same thing happened three times in the last two months. I'm not sure what the network failure looked exactly like to DRBD but I guess it was not as easy as if the cable was suddenly disconnected, I think DRBD got corrupted packets and such. I think DRBD should not react this way to network failures and I think it's a bug in DRBD that should be fixed. Do you think DRBD 0.7.16 or .17 already fixed this issue? I can upgrade to .16 or .17 if (and only if) needed. Here are the revelant parts of the kernel logs on one of primary that have failed: Apr 17 22:26:26 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295 Apr 17 22:26:29 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967294 Apr 17 22:26:32 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967293 Apr 17 22:26:35 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967292 Apr 17 22:26:38 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967291 Apr 17 22:26:41 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967290 Apr 17 22:26:44 nfsa4 kernel: drbd0: PingAck did not arrive in time. Apr 17 22:26:44 nfsa4 kernel: drbd0: drbd0_asender [15156]: cstate Connected --> NetworkFailure Apr 17 22:26:44 nfsa4 kernel: drbd0: asender terminated Apr 17 22:26:44 nfsa4 kernel: drbd0: kupdated [6]: cstate NetworkFailure --> Timeout Apr 17 22:26:44 nfsa4 kernel: drbd0: drbd0_receiver [831]: cstate Timeout --> BrokenPipe Apr 17 22:26:44 nfsa4 kernel: drbd0: short read expecting header on sock: r=-512 Apr 17 22:26:44 nfsa4 kernel: drbd0: short sent UnplugRemote size=8 sent=-1001 Apr 17 22:26:44 nfsa4 kernel: drbd0: worker terminated Apr 17 22:26:44 nfsa4 kernel: drbd0: drbd0_receiver [831]: cstate BrokenPipe --> Unconnected Apr 17 22:26:44 nfsa4 kernel: drbd0: Connection lost. Apr 17 22:26:44 nfsa4 kernel: drbd0: drbd0_receiver [831]: cstate Unconnected --> WFConnection [NETWORK WENT BACK HERE] Apr 17 22:27:20 nfsa4 kernel: drbd0: drbd0_receiver [831]: cstate WFConnection --> WFReportParams Apr 17 22:56:15 nfsa4 kernel: drbd0: Primary/Unknown --> Secondary/Unknown Same thing on a secondary: Apr 17 22:26:44 nfsb4 kernel: drbd0: meta connection shut down by peer. Apr 17 22:26:44 nfsb4 kernel: drbd0: drbd0_asender [12837]: cstate Connected --> NetworkFailure Apr 17 22:26:44 nfsb4 kernel: drbd0: asender terminated Apr 17 22:27:17 nfsb4 kernel: drbd0: short sent BarrierAck size=16 sent=-1001 Apr 17 22:27:17 nfsb4 kernel: drbd0: error receiving Barrier, l: 8! Apr 17 22:27:17 nfsb4 kernel: drbd0: worker terminated Apr 17 22:27:17 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate NetworkFailure --> Unconnected Apr 17 22:27:17 nfsb4 kernel: drbd0: Connection lost. Apr 17 22:27:17 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate Unconnected --> WFConnection [NETWORK WENT BACK HERE] Apr 17 22:27:20 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate WFConnection --> WFReportParams Apr 17 22:27:22 nfsb4 kernel: drbd0: sock_recvmsg returned -11 Apr 17 22:27:22 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate WFReportParams --> BrokenPipe Apr 17 22:27:22 nfsb4 kernel: drbd0: short read expecting header on sock: r=-11 Apr 17 22:27:22 nfsb4 kernel: drbd0: worker terminated Apr 17 22:27:22 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate BrokenPipe --> Unconnected Apr 17 22:27:22 nfsb4 kernel: drbd0: Connection lost. Apr 17 22:27:22 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate Unconnected --> WFConnection Apr 17 22:53:30 nfsb4 kernel: drbd0: drbdsetup [31901]: cstate WFConnection --> Unconnected Apr 17 22:53:30 nfsb4 kernel: drbd0: worker terminated Apr 17 22:53:30 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate Unconnected --> StandAlone Apr 17 22:53:30 nfsb4 kernel: drbd0: Connection lost. Apr 17 22:53:30 nfsb4 kernel: drbd0: Discarding network configuration. Apr 17 22:53:30 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate StandAlone --> StandAlone Apr 17 22:53:30 nfsb4 kernel: drbd0: receiver terminated Apr 17 22:53:30 nfsb4 kernel: drbd0: drbdsetup [31901]: cstate StandAlone --> StandAlone Apr 17 22:53:30 nfsb4 kernel: drbd0: drbdsetup [31901]: cstate StandAlone --> Unconfigured Apr 17 22:53:30 nfsb4 kernel: drbd0: worker terminated Apr 17 22:53:30 nfsb4 kernel: drbd: module cleanup done. The others 20 DRBD nodes that were hit by the same network failure reacted normally, the FS was nicely working with read/write accesses. Here are the revelant parts of the kernel logs on one of primary where DRBD *DID NOT FAIL*: Apr 17 22:26:14 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967295 Apr 17 22:26:17 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967294 Apr 17 22:26:20 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967293 Apr 17 22:26:23 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967292 Apr 17 22:26:29 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967295 Apr 17 22:26:32 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967294 Apr 17 22:26:49 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967295 Apr 17 22:26:52 mail1 kernel: drbd0: PingAck did not arrive in time. Apr 17 22:26:52 mail1 kernel: drbd0: drbd0_asender [23008]: cstate Connected --> NetworkFailure Apr 17 22:26:52 mail1 kernel: drbd0: asender terminated Apr 17 22:26:52 mail1 kernel: drbd0: kjournald [1238]: cstate NetworkFailure --> Timeout Apr 17 22:26:52 mail1 kernel: drbd0: drbd0_receiver [858]: cstate Timeout --> BrokenPipe Apr 17 22:26:52 mail1 kernel: drbd0: short read expecting header on sock: r=-512 Apr 17 22:26:52 mail1 kernel: drbd0: short sent Barrier size=16 sent=-1000 Apr 17 22:26:52 mail1 kernel: drbd0: worker terminated Apr 17 22:26:52 mail1 kernel: drbd0: drbd0_receiver [858]: cstate BrokenPipe --> Unconnected Apr 17 22:26:52 mail1 kernel: drbd0: Connection lost. Apr 17 22:26:52 mail1 kernel: drbd0: drbd0_receiver [858]: cstate Unconnected --> WFConnection [NETWORK WENT BACK HERE] Apr 17 22:26:55 mail1 kernel: drbd0: drbd0_receiver [858]: cstate WFConnection --> WFReportParams Apr 17 22:26:55 mail1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74 Apr 17 22:26:55 mail1 kernel: drbd0: Connection established. Apr 17 22:26:55 mail1 kernel: drbd0: I am(P): 1:00000002:00000005:00000029:00000004:10 Apr 17 22:26:55 mail1 kernel: drbd0: Peer(S): 1:00000002:00000005:00000028:00000004:01 Apr 17 22:26:55 mail1 kernel: drbd0: drbd0_receiver [858]: cstate WFReportParams --> WFBitMapS Apr 17 22:26:59 mail1 kernel: drbd0: Primary/Unknown --> Primary/Secondary Apr 17 22:27:30 mail1 kernel: drbd0: drbd0_receiver [858]: cstate WFBitMapS --> SyncSource Apr 17 22:27:30 mail1 kernel: drbd0: Resync started as SyncSource (need to sync 572 KB [143 bits set]). Apr 17 22:27:31 mail1 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 572 K/sec) Apr 17 22:27:31 mail1 kernel: drbd0: drbd0_worker [20786]: cstate SyncSource --> Connected -- Cyril Bouthors -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 188 bytes Desc: not available URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20060418/cb34a807/attachment.pgp>