[DRBD-user] DRBD stuck after a strong network failure

Cyril Bouthors cyril at bouthors.org
Tue Apr 18 18:19:04 CEST 2006

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi,

I've experienced a major hardware failure on an ethernet switch last
night and DRBD very badly reacted on 6 nodes out of the 26 that were
connected to that switch.

Those 6 nodes switched to state Primary/Unknown and Secondary/Unknown.

The read accesses like 'ls' on the filesystem mounted on the 3
primaries worked fine but I have the impression that write accesses
failed according to the fact that all processes accessing the
partitions (like MySQL) were stuck.

Unfortunately, I had no time to verify this carefully because of the
pressure of the downtime and my will to fix it _asap_.

I was unable to stop the user-land processes. The load was about 250.

SSH worked fine.

I was able to unmount the partition with "umount -l /mypartition".

I was unable to stop DRBD on both primaries and secondaries, it just
stuck when heartbeat launched drbdadm.

I was forced to electrically reboot the servers.

IIRC, fsck didn't have to parse the whole FS after the reboot on
primaries that I've "umount -l" but not on the others.


I'm running EXT3 on top of DRBD 0.7.15 on top of Linux 2.4.27-2-686.
Debian flavor, no home-made patches.

The exact same thing happened three times in the last two months.

I'm not sure what the network failure looked exactly like to DRBD but
I guess it was not as easy as if the cable was suddenly disconnected,
I think DRBD got corrupted packets and such.

I think DRBD should not react this way to network failures and I think
it's a bug in DRBD that should be fixed.

Do you think DRBD 0.7.16 or .17 already fixed this issue? I can
upgrade to .16 or .17 if (and only if) needed.

Here are the revelant parts of the kernel logs on one of primary that
have failed:

Apr 17 22:26:26 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295
Apr 17 22:26:29 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967294
Apr 17 22:26:32 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967293
Apr 17 22:26:35 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967292
Apr 17 22:26:38 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967291
Apr 17 22:26:41 nfsa4 kernel: drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967290

Apr 17 22:26:44 nfsa4 kernel: drbd0: PingAck did not arrive in time.
Apr 17 22:26:44 nfsa4 kernel: drbd0: drbd0_asender [15156]: cstate Connected --> NetworkFailure
Apr 17 22:26:44 nfsa4 kernel: drbd0: asender terminated
Apr 17 22:26:44 nfsa4 kernel: drbd0: kupdated [6]: cstate NetworkFailure --> Timeout
Apr 17 22:26:44 nfsa4 kernel: drbd0: drbd0_receiver [831]: cstate Timeout --> BrokenPipe
Apr 17 22:26:44 nfsa4 kernel: drbd0: short read expecting header on sock: r=-512
Apr 17 22:26:44 nfsa4 kernel: drbd0: short sent UnplugRemote size=8 sent=-1001
Apr 17 22:26:44 nfsa4 kernel: drbd0: worker terminated
Apr 17 22:26:44 nfsa4 kernel: drbd0: drbd0_receiver [831]: cstate BrokenPipe --> Unconnected
Apr 17 22:26:44 nfsa4 kernel: drbd0: Connection lost.
Apr 17 22:26:44 nfsa4 kernel: drbd0: drbd0_receiver [831]: cstate Unconnected --> WFConnection
[NETWORK WENT BACK HERE]
Apr 17 22:27:20 nfsa4 kernel: drbd0: drbd0_receiver [831]: cstate WFConnection --> WFReportParams

Apr 17 22:56:15 nfsa4 kernel: drbd0: Primary/Unknown --> Secondary/Unknown


Same thing on a secondary:

Apr 17 22:26:44 nfsb4 kernel: drbd0: meta connection shut down by peer.
Apr 17 22:26:44 nfsb4 kernel: drbd0: drbd0_asender [12837]: cstate Connected --> NetworkFailure
Apr 17 22:26:44 nfsb4 kernel: drbd0: asender terminated

Apr 17 22:27:17 nfsb4 kernel: drbd0: short sent BarrierAck size=16 sent=-1001
Apr 17 22:27:17 nfsb4 kernel: drbd0: error receiving Barrier, l: 8!
Apr 17 22:27:17 nfsb4 kernel: drbd0: worker terminated
Apr 17 22:27:17 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate NetworkFailure --> Unconnected
Apr 17 22:27:17 nfsb4 kernel: drbd0: Connection lost.
Apr 17 22:27:17 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate Unconnected --> WFConnection
[NETWORK WENT BACK HERE]
Apr 17 22:27:20 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate WFConnection --> WFReportParams

Apr 17 22:27:22 nfsb4 kernel: drbd0: sock_recvmsg returned -11
Apr 17 22:27:22 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate WFReportParams --> BrokenPipe
Apr 17 22:27:22 nfsb4 kernel: drbd0: short read expecting header on sock: r=-11
Apr 17 22:27:22 nfsb4 kernel: drbd0: worker terminated
Apr 17 22:27:22 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate BrokenPipe --> Unconnected
Apr 17 22:27:22 nfsb4 kernel: drbd0: Connection lost.
Apr 17 22:27:22 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate Unconnected --> WFConnection

Apr 17 22:53:30 nfsb4 kernel: drbd0: drbdsetup [31901]: cstate WFConnection --> Unconnected
Apr 17 22:53:30 nfsb4 kernel: drbd0: worker terminated
Apr 17 22:53:30 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate Unconnected --> StandAlone
Apr 17 22:53:30 nfsb4 kernel: drbd0: Connection lost.
Apr 17 22:53:30 nfsb4 kernel: drbd0: Discarding network configuration.
Apr 17 22:53:30 nfsb4 kernel: drbd0: drbd0_receiver [12271]: cstate StandAlone --> StandAlone
Apr 17 22:53:30 nfsb4 kernel: drbd0: receiver terminated
Apr 17 22:53:30 nfsb4 kernel: drbd0: drbdsetup [31901]: cstate StandAlone --> StandAlone
Apr 17 22:53:30 nfsb4 kernel: drbd0: drbdsetup [31901]: cstate StandAlone --> Unconfigured
Apr 17 22:53:30 nfsb4 kernel: drbd0: worker terminated
Apr 17 22:53:30 nfsb4 kernel: drbd: module cleanup done.


The others 20 DRBD nodes that were hit by the same network failure
reacted normally, the FS was nicely working with read/write accesses.

Here are the revelant parts of the kernel logs on one of primary where
DRBD *DID NOT FAIL*:

Apr 17 22:26:14 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967295
Apr 17 22:26:17 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967294
Apr 17 22:26:20 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967293
Apr 17 22:26:23 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967292
Apr 17 22:26:29 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967295
Apr 17 22:26:32 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967294
Apr 17 22:26:49 mail1 kernel: drbd0: [kjournald/1238] sock_sendmsg time expired, ko = 4294967295
Apr 17 22:26:52 mail1 kernel: drbd0: PingAck did not arrive in time.
Apr 17 22:26:52 mail1 kernel: drbd0: drbd0_asender [23008]: cstate Connected --> NetworkFailure
Apr 17 22:26:52 mail1 kernel: drbd0: asender terminated
Apr 17 22:26:52 mail1 kernel: drbd0: kjournald [1238]: cstate NetworkFailure --> Timeout
Apr 17 22:26:52 mail1 kernel: drbd0: drbd0_receiver [858]: cstate Timeout --> BrokenPipe
Apr 17 22:26:52 mail1 kernel: drbd0: short read expecting header on sock: r=-512
Apr 17 22:26:52 mail1 kernel: drbd0: short sent Barrier size=16 sent=-1000
Apr 17 22:26:52 mail1 kernel: drbd0: worker terminated
Apr 17 22:26:52 mail1 kernel: drbd0: drbd0_receiver [858]: cstate BrokenPipe --> Unconnected
Apr 17 22:26:52 mail1 kernel: drbd0: Connection lost.
Apr 17 22:26:52 mail1 kernel: drbd0: drbd0_receiver [858]: cstate Unconnected --> WFConnection
[NETWORK WENT BACK HERE]
Apr 17 22:26:55 mail1 kernel: drbd0: drbd0_receiver [858]: cstate WFConnection --> WFReportParams
Apr 17 22:26:55 mail1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Apr 17 22:26:55 mail1 kernel: drbd0: Connection established.
Apr 17 22:26:55 mail1 kernel: drbd0: I am(P): 1:00000002:00000005:00000029:00000004:10
Apr 17 22:26:55 mail1 kernel: drbd0: Peer(S): 1:00000002:00000005:00000028:00000004:01
Apr 17 22:26:55 mail1 kernel: drbd0: drbd0_receiver [858]: cstate WFReportParams --> WFBitMapS
Apr 17 22:26:59 mail1 kernel: drbd0: Primary/Unknown --> Primary/Secondary
Apr 17 22:27:30 mail1 kernel: drbd0: drbd0_receiver [858]: cstate WFBitMapS --> SyncSource
Apr 17 22:27:30 mail1 kernel: drbd0: Resync started as SyncSource (need to sync 572 KB [143 bits set]).
Apr 17 22:27:31 mail1 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 572 K/sec)
Apr 17 22:27:31 mail1 kernel: drbd0: drbd0_worker [20786]: cstate SyncSource --> Connected
-- 
Cyril Bouthors
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 188 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20060418/cb34a807/attachment.pgp>


More information about the drbd-user mailing list