[DRBD-user] DRBD stuck after a strong network failure

Cyril Bouthors cyril at bouthors.org
Mon May 8 14:14:46 CEST 2006



On 26 Apr 2006, Lars Ellenberg wrote:

> if you can "reproduce" this scenario,
> please try with current drbd-0.7 svn,
> which should be released as 0.7.18 soonish.

The same thing still happens with 0.7.18. Both nodes get disconnected
for a reason unknown to me and cannot reconnect. This time it is less
critical for me because the filesystem is still readable and writable.
Here is more information:

Primary:

May  8 13:34:31 mail1 kernel: drbd0: [kjournald/3038] sock_sendmsg time expired, ko = 4294967295
May  8 13:34:34 mail1 kernel: drbd0: [kjournald/3038] sock_sendmsg time expired, ko = 4294967294
May  8 13:34:46 mail1 kernel: drbd0: [kjournald/3038] sock_sendmsg time expired, ko = 4294967295
May  8 13:34:49 mail1 kernel: drbd0: [kjournald/3038] sock_sendmsg time expired, ko = 4294967294
May  8 13:34:52 mail1 kernel: drbd0: PingAck did not arrive in time.
May  8 13:34:52 mail1 kernel: drbd0: drbd0_asender [811]: cstate Connected --> NetworkFailure
May  8 13:34:52 mail1 kernel: drbd0: asender terminated
May  8 13:34:52 mail1 kernel: drbd0: kjournald [3038]: cstate NetworkFailure --> Timeout
May  8 13:34:52 mail1 kernel: drbd0: drbd0_receiver [2493]: cstate Timeout --> BrokenPipe
May  8 13:34:52 mail1 kernel: drbd0: short read expecting header on sock: r=-512
May  8 13:34:52 mail1 kernel: drbd0: short sent UnplugRemote size=8 sent=-1001
May  8 13:34:52 mail1 kernel: drbd0: worker terminated
May  8 13:34:52 mail1 kernel: drbd0: drbd0_receiver [2493]: cstate BrokenPipe --> Unconnected
May  8 13:34:52 mail1 kernel: drbd0: Connection lost.
May  8 13:34:52 mail1 kernel: drbd0: drbd0_receiver [2493]: cstate Unconnected --> WFConnection
May  8 13:34:55 mail1 kernel: drbd0: drbd0_receiver [2493]: cstate WFConnection --> WFReportParams
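
A note on the "ko" values in the log above: 4294967295 and 4294967294
are 2^32 - 1 and 2^32 - 2, which is how -1 and -2 look when a 32-bit
signed integer is printed as unsigned. Whether the driver's counter
really is a signed value that dropped below zero is an assumption here;
the log alone does not confirm it. A minimal Python sketch of the
reinterpretation (the helper name as_signed32 is made up for
illustration):

```python
import struct


def as_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit integer as a signed 32-bit integer."""
    # Pack as unsigned ("I"), then unpack the same four bytes as signed ("i").
    return struct.unpack("<i", struct.pack("<I", value))[0]


# The two "ko" values that appear in the primary's log.
ko_values = [4294967295, 4294967294]
print([as_signed32(v) for v in ko_values])  # prints [-1, -2]
```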

root@mail1:~# cat /proc/drbd
version: 0.7.18 (api:78/proto:74)
SVN Revision: 2176 build by root@ns2, 2006-04-26 20:05:56
 0: cs:WFReportParams st:Primary/Unknown ld:Consistent
    ns:38632 nr:0 dw:109369656 dr:25406553 al:365924 bm:986 lo:2 pe:0 ua:0 ap:0
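
To watch for this stuck state from a script, the /proc/drbd output above
can be parsed. The following Python sketch assumes the 0.7-style layout
shown here (a "N: cs:... st:local/peer ld:..." line followed by an
indented counter line); the function name parse_proc_drbd and the
dictionary keys are hypothetical, and other DRBD versions may format
this file differently.

```python
import re

# Sample taken from the primary's /proc/drbd output in this report.
PROC_DRBD_SAMPLE = """\
version: 0.7.18 (api:78/proto:74)
 0: cs:WFReportParams st:Primary/Unknown ld:Consistent
    ns:38632 nr:0 dw:109369656 dr:25406553 al:365924 bm:986 lo:2 pe:0 ua:0 ap:0
"""

DEVICE_LINE = re.compile(
    r"^\s*(\d+):\s+cs:(\S+)\s+st:([^/\s]+)/(\S+)\s+ld:(\S+)"
)


def parse_proc_drbd(text):
    """Return {minor: {"cs", "local", "peer", "ld", "counters"}}."""
    devices = {}
    current = None
    for line in text.splitlines():
        match = DEVICE_LINE.match(line)
        if match:
            minor, cs, local, peer, ld = match.groups()
            current = {"cs": cs, "local": local, "peer": peer,
                       "ld": ld, "counters": {}}
            devices[int(minor)] = current
        elif current is not None and line.startswith(" "):
            # Indented continuation line: "key:number" pairs.
            for key, value in re.findall(r"(\w+):(\d+)", line):
                current["counters"][key] = int(value)
    return devices


status = parse_proc_drbd(PROC_DRBD_SAMPLE)
print(status[0]["cs"], status[0]["local"], status[0]["peer"])
```

A monitoring script could raise an alert when "cs" stays at
WFReportParams for more than a few seconds, which is the symptom
reported here.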

Secondary:

May  8 13:34:52 mail2 kernel: drbd0: meta connection shut down by peer.
May  8 13:34:52 mail2 kernel: drbd0: drbd0_asender [7479]: cstate Connected --> NetworkFailure
May  8 13:34:52 mail2 kernel: drbd0: asender terminated
May  8 13:34:52 mail2 kernel: drbd0: drbd0_receiver [30218]: cstate NetworkFailure --> BrokenPipe
May  8 13:34:52 mail2 kernel: drbd0: short read receiving data block: read 3984 expected 4096
May  8 13:34:52 mail2 kernel: drbd0: error receiving Data, l: 4112!
May  8 13:34:52 mail2 kernel: drbd0: worker terminated
May  8 13:34:52 mail2 kernel: drbd0: drbd0_receiver [30218]: cstate BrokenPipe --> Unconnected
May  8 13:34:52 mail2 kernel: drbd0: Connection lost.
May  8 13:34:52 mail2 kernel: drbd0: drbd0_receiver [30218]: cstate Unconnected --> WFConnection
May  8 13:34:55 mail2 kernel: drbd0: drbd0_receiver [30218]: cstate WFConnection --> WFReportParams
May  8 13:34:57 mail2 kernel: drbd0: sock_recvmsg returned -11
May  8 13:34:57 mail2 kernel: drbd0: drbd0_receiver [30218]: cstate WFReportParams --> BrokenPipe
May  8 13:34:57 mail2 kernel: drbd0: short read expecting header on sock: r=-11
May  8 13:34:57 mail2 kernel: drbd0: My msock connect got accepted onto peer's sock!
May  8 13:35:03 mail2 kernel: drbd0: worker terminated
May  8 13:35:03 mail2 kernel: drbd0: drbd0_receiver [30218]: cstate BrokenPipe --> Unconnected
May  8 13:35:03 mail2 kernel: drbd0: Connection lost.
May  8 13:35:03 mail2 kernel: drbd0: drbd0_receiver [30218]: cstate Unconnected --> WFConnection
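
To compare the two nodes, it helps to reduce each log to its sequence of
connection state changes. The Python sketch below pulls the "cstate A
--> B" lines out of syslog text in the format shown above; the regular
expression and the function name cstate_transitions are assumptions
made for this illustration and are not part of DRBD.

```python
import re

# Last lines of the secondary's log from this report.
SAMPLE_LOG = """\
May  8 13:34:52 mail2 kernel: drbd0: drbd0_receiver [30218]: cstate Unconnected --> WFConnection
May  8 13:34:55 mail2 kernel: drbd0: drbd0_receiver [30218]: cstate WFConnection --> WFReportParams
May  8 13:34:57 mail2 kernel: drbd0: sock_recvmsg returned -11
May  8 13:34:57 mail2 kernel: drbd0: drbd0_receiver [30218]: cstate WFReportParams --> BrokenPipe
May  8 13:35:03 mail2 kernel: drbd0: drbd0_receiver [30218]: cstate BrokenPipe --> Unconnected
May  8 13:35:03 mail2 kernel: drbd0: drbd0_receiver [30218]: cstate Unconnected --> WFConnection
"""

TRANSITION = re.compile(
    r"^(?P<time>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) kernel: "
    r"(?P<dev>drbd\d+): (?P<thread>\S+) \[(?P<pid>\d+)\]: "
    r"cstate (?P<old>\w+) --> (?P<new>\w+)"
)


def cstate_transitions(log_text):
    """Return a list of (time, old_state, new_state) tuples, in log order."""
    result = []
    for line in log_text.splitlines():
        match = TRANSITION.match(line)
        if match:  # lines without a cstate change are skipped
            result.append((match["time"], match["old"], match["new"]))
    return result


for when, old, new in cstate_transitions(SAMPLE_LOG):
    print(f"{when}  {old} -> {new}")
```

Run against both logs, this makes the difference visible at a glance:
the secondary falls back to WFConnection at 13:35:03, while the primary
remains in WFReportParams.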

When I tried to restart DRBD on the secondary, I got:

May  8 13:53:47 mail2 kernel: drbd: initialised. Version: 0.7.18 (api:78/proto:74)
May  8 13:53:47 mail2 kernel: drbd: SVN Revision: 2176 build by root@ns2, 2006-04-26 20:05:56
May  8 13:53:47 mail2 kernel: drbd: registered as block device major 147
May  8 13:53:47 mail2 kernel: drbd0: resync bitmap: bits=8658397 words=270576
May  8 13:53:47 mail2 kernel: drbd0: size = 33 GB (34633588 KB)
May  8 13:53:47 mail2 kernel: klogd 1.4.1, ---------- state change ----------
May  8 13:53:47 mail2 kernel: Loaded 1352 symbols from 22 modules.
May  8 13:53:48 mail2 kernel: drbd0: 0 KB marked out-of-sync by on disk bit-map.
May  8 13:53:48 mail2 kernel: drbd0: Found 4 transactions (136 active extents) in activity log.
May  8 13:53:48 mail2 kernel: drbd0: drbdsetup [445]: cstate Unconfigured --> StandAlone
May  8 13:53:48 mail2 kernel: drbd0: drbdsetup [461]: cstate StandAlone --> Unconnected
May  8 13:53:48 mail2 kernel: drbd0: drbd0_receiver [462]: cstate Unconnected --> WFConnection
May  8 13:54:08 mail2 kernel: drbd0: drbdsetup [479]: cstate WFConnection --> Unconnected
May  8 13:54:08 mail2 kernel: drbd0: worker terminated
May  8 13:54:08 mail2 kernel: drbd0: drbd0_receiver [462]: cstate Unconnected --> StandAlone
May  8 13:54:08 mail2 kernel: drbd0: Connection lost.
May  8 13:54:08 mail2 kernel: drbd0: Discarding network configuration.
May  8 13:54:08 mail2 kernel: drbd0: drbd0_receiver [462]: cstate StandAlone --> StandAlone
May  8 13:54:08 mail2 kernel: drbd0: receiver terminated
May  8 13:54:08 mail2 kernel: drbd0: drbdsetup [479]: cstate StandAlone --> StandAlone
May  8 13:54:08 mail2 kernel: drbd0: drbdsetup [479]: cstate StandAlone --> Unconfigured
May  8 13:54:08 mail2 kernel: drbd0: worker terminated
May  8 13:54:08 mail2 kernel: drbd: module cleanup done.

and absolutely nothing happened on the primary.

When I tried "drbdadm connect all" on the primary, I got:

root@mail1:~# drbdadm connect all
Child process does not terminate!
Exiting.
root@mail1:~#

May  8 14:01:06 mail1 kernel: drbd0: interrupted during initial handshake
May  8 14:01:06 mail1 kernel: drbd0: My msock connect got accepted onto peer's sock!
May  8 14:01:06 mail1 kernel: drbd0: worker terminated

and absolutely nothing happened on the secondary.

Both nodes run kernel 2.4.27-2-686 (Debian flavor, no patches).
-- 
Cyril Bouthors