[DRBD-user] Syncer aborted. Connection lost.

Tue Mar 23 19:37:06 CET 2004

/ 2004-03-23 20:10:01 +0300
\ Eugene Crosser:
> On Tue, 2004-03-23 at 12:28, Lars Ellenberg wrote:
> 
> > > Please could someone knowledgeble tell me why this may happen?
> > > Under load (local massive quota update by a script plus cpio of a big
> > > tree from an NFS client) synchronization is repeatedly restarted after
> > > ~1% completion:
> [...]
> > 0.6.12 should help for some weird behaviour related to the above problem.
> > Whether it fixes the actual problem I cannot say. Please report back.
> 
> Behavior did not change with 0.6.12.
> When secondary is booted, it synchronizes, works well for some time, and
> then starts to disconnect and reconnect every few minutes.  At this
> time, load average on the master grows two or three times higher than it
> was while "normal" operation.

Well, you then want to increase
    ping-int    = 20   # or some such

and maybe
    sndbuf-size = 512k # or more.

maybe you even need to increase the sndbuf-size of the meta socket,
(there is no config option for this currently)

> I must also note that heartbeat (it uses the same interface) begins to
> produce these messages:
> 
> Mar 23 19:34:49 nfsa1.mail.back heartbeat[639]: WARN: Late heartbeat: Node 10.0.0.247: interval 14970 ms
> Mar 23 19:35:27 nfsa2.mail.back heartbeat[601]: WARN: Late heartbeat: Node nfsa1: interval 12410 ms

you have nearly 15 (*fifteen* !!) *seconds* round trip time?  
doh.

> Ethernet is 10/100/1000 baseT (tg3) on Dell 1750's.
> When I kill the script that updates quota the problem goes away, SyncAll
> completes successfully and continues to work in Connected mode.

please do this experiment:
	make sure you your link is in full duplex!

	collect some ping statistics while network is idle.

	saturate the link with dummy data
	(use netcat </dev/zero or whatever)

	again, collect some ping statistics in both directions.

	should be enlightening :)

> 
> This is an excerpt from drbd.conf:
>   protocol=C
> [...]
>   net {
>     sync-max=128M
>     sync-min=32M
>     tl-size=512
>     sync-nice=0
>   }

this snipplet is irrelevant.
though, tl-size seems to be very small...
I am used to like 5000. but as long as DRBD does not complain about it,
this is ok.

> I am including a log starting from the moment when secondary is first
> connected.  It's rather long, sorry.  Note that from 15:10 to 16:42 the
> system worked normally.
> 
> Mar 23 13:57:34 nfsa2.mail.back drbd: ===> drbd start <===
> Mar 23 13:57:34 nfsa2.mail.back drbd: modprobe -s drbd minor_count=1
> Mar 23 13:57:34 nfsa2.mail.back kernel: drbd: initialised. Version: 0.6.12 (api:64/proto:62)
> Mar 23 13:57:35 nfsa2.mail.back drbd: drbdsetup /dev/nb0 disk /dev/sda1 --do-panic --disk-size=214291003k
> Mar 23 13:57:35 nfsa2.mail.back kernel: drbd0: Creating state file
> Mar 23 13:57:35 nfsa2.mail.back kernel: "/var/lib/drbd/drbd0"
> Mar 23 13:57:35 nfsa2.mail.back drbd: drbdsetup /dev/nb0 net 192.168.1.2:7789 192.168.1.1:7789 C --sync-max=128M --sync-min=32M --tl-size=512 --sync-nice=0
> Mar 23 13:57:35 nfsa2.mail.back drbd: drbdsetup /dev/nb0 wait_connect -t 120
> Mar 23 13:57:35 nfsa1.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 13:57:35 nfsa2.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 13:57:35 nfsa1.mail.back kernel: drbd0: Synchronisation started blks=15
> Mar 23 13:57:35 nfsa2.mail.back drbd: 'drbd0' SyncingAll, waiting for this to finish
> Mar 23 13:57:35 nfsa2.mail.back drbd: drbdsetup /dev/nb0 wait_sync
> Mar 23 15:10:47 nfsa1.mail.back kernel: drbd0: Synchronisation done.
> Mar 23 15:10:51 nfsa2.mail.back drbd: 'drbd0' SyncingAll finished, issue drbdsetup /dev/nb0 secondary_remote
> Mar 23 16:27:48 nfsa1.mail.back kernel: drbd0: [kjournald/885] sock_sendmsg time expired, ko = 4294967295
> Mar 23 16:50:40 nfsa1.mail.back kernel: drbd0: [kjournald/885] sock_sendmsg time expired, ko = 4294967295
> Mar 23 16:50:40 nfsa1.mail.back last message repeated 1 time
> Mar 23 17:05:06 nfsa1.mail.back kernel: drbd0: [kjournald/885] sock_sendmsg time expired, ko = 4294967295

ok, so sometimes the sending process gets stuck,
but DRBD-Pings still get through.

> Mar 23 17:42:25 nfsa2.mail.back kernel: drbd0: Connection lost.
> Mar 23 17:42:26 nfsa1.mail.back kernel: drbd0: Connection lost.
> Mar 23 17:42:26 nfsa2.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 17:42:26 nfsa1.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 17:42:26 nfsa1.mail.back kernel: drbd0: Synchronisation started blks=15
> Mar 23 17:43:08 nfsa1.mail.back kernel: drbd0: Synchronisation done.

this time not even the ping packet got through,
So we drop and reconnect.

> Mar 23 17:43:39 nfsa1.mail.back kernel: drbd0: [kjournald/885] sock_sendmsg time expired, ko = 4294967295
> Mar 23 17:44:13 nfsa2.mail.back kernel: drbd0: Connection lost.

but it did not help.  again some ping was lost.

> Mar 23 17:44:13 nfsa1.mail.back kernel: drbd0: [kjournald/885] sock_sendmsg returned -32

so this pending send is aborted with an error code

> Mar 23 17:44:13 nfsa1.mail.back kernel: drbd0: Connection lost.
> Mar 23 17:44:13 nfsa2.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 17:44:13 nfsa1.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 17:44:13 nfsa1.mail.back kernel: drbd0: Synchronisation started blks=15
> Mar 23 17:45:00 nfsa2.mail.back kernel: drbd0: Connection lost.

same thing.

> Mar 23 17:45:00 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/11349] sock_sendmsg returned -32
> Mar 23 17:45:00 nfsa1.mail.back kernel: drbd0: Syncer send failed.
> Mar 23 17:45:00 nfsa1.mail.back kernel: drbd0: Connection lost.
> Mar 23 17:45:00 nfsa2.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 17:45:00 nfsa1.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 17:45:00 nfsa1.mail.back kernel: drbd0: Synchronisation started blks=15
> Mar 23 17:47:11 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967295
> Mar 23 17:56:40 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967295
> Mar 23 17:56:40 nfsa1.mail.back last message repeated 1 time
> Mar 23 17:58:49 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967294
> Mar 23 18:00:54 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967295
> Mar 23 18:00:54 nfsa1.mail.back last message repeated 1 time
> Mar 23 18:20:17 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967295
> Mar 23 18:20:17 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967294
> Mar 23 18:20:48 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967295

These again mean that some sending process got stuck,
but DRBD-Pings on the DRBD-meta channel still ar answered in time,
and the sending process seems to recover after typically two to three
ping intervalls.

[...]

	Lars Ellenberg