Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
/ 2004-03-23 20:10:01 +0300
\ Eugene Crosser:
> On Tue, 2004-03-23 at 12:28, Lars Ellenberg wrote:
>
> > > Please could someone knowledgeble tell me why this may happen?
> > > Under load (local massive quota update by a script plus cpio of a big
> > > tree from an NFS client) synchronization is repeatedly restarted after
> > > ~1% completion:
> [...]
> > 0.6.12 should help for some weird behaviour related to the above problem.
> > Whether it fixes the actual problem I cannot say. Please report back.
>
> Behavior did not change with 0.6.12.
> When secondary is booted, it synchronizes, works well for some time, and
> then starts to disconnect and reconnect every few minutes. At this
> time, load average on the master grows two or three times higher than it
> was while "normal" operation.
Well, you then want to increase
ping-int = 20 # or some such
and maybe
sndbuf-size = 512k # or more.
maybe you even need to increase the sndbuf-size of the meta socket,
(there is no config option for this currently)
> I must also note that heartbeat (it uses the same interface) begins to
> produce these messages:
>
> Mar 23 19:34:49 nfsa1.mail.back heartbeat[639]: WARN: Late heartbeat: Node 10.0.0.247: interval 14970 ms
> Mar 23 19:35:27 nfsa2.mail.back heartbeat[601]: WARN: Late heartbeat: Node nfsa1: interval 12410 ms
you have nearly 15 (*fifteen* !!) *seconds* round trip time?
doh.
> Ethernet is 10/100/1000 baseT (tg3) on Dell 1750's.
> When I kill the script that updates quota the problem goes away, SyncAll
> completes successfully and continues to work in Connected mode.
please do this experiment:
make sure you your link is in full duplex!
collect some ping statistics while network is idle.
saturate the link with dummy data
(use netcat </dev/zero or whatever)
again, collect some ping statistics in both directions.
should be enlightening :)
>
> This is an excerpt from drbd.conf:
> protocol=C
> [...]
> net {
> sync-max=128M
> sync-min=32M
> tl-size=512
> sync-nice=0
> }
this snipplet is irrelevant.
though, tl-size seems to be very small...
I am used to like 5000. but as long as DRBD does not complain about it,
this is ok.
> I am including a log starting from the moment when secondary is first
> connected. It's rather long, sorry. Note that from 15:10 to 16:42 the
> system worked normally.
>
> Mar 23 13:57:34 nfsa2.mail.back drbd: ===> drbd start <===
> Mar 23 13:57:34 nfsa2.mail.back drbd: modprobe -s drbd minor_count=1
> Mar 23 13:57:34 nfsa2.mail.back kernel: drbd: initialised. Version: 0.6.12 (api:64/proto:62)
> Mar 23 13:57:35 nfsa2.mail.back drbd: drbdsetup /dev/nb0 disk /dev/sda1 --do-panic --disk-size=214291003k
> Mar 23 13:57:35 nfsa2.mail.back kernel: drbd0: Creating state file
> Mar 23 13:57:35 nfsa2.mail.back kernel: "/var/lib/drbd/drbd0"
> Mar 23 13:57:35 nfsa2.mail.back drbd: drbdsetup /dev/nb0 net 192.168.1.2:7789 192.168.1.1:7789 C --sync-max=128M --sync-min=32M --tl-size=512 --sync-nice=0
> Mar 23 13:57:35 nfsa2.mail.back drbd: drbdsetup /dev/nb0 wait_connect -t 120
> Mar 23 13:57:35 nfsa1.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 13:57:35 nfsa2.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 13:57:35 nfsa1.mail.back kernel: drbd0: Synchronisation started blks=15
> Mar 23 13:57:35 nfsa2.mail.back drbd: 'drbd0' SyncingAll, waiting for this to finish
> Mar 23 13:57:35 nfsa2.mail.back drbd: drbdsetup /dev/nb0 wait_sync
> Mar 23 15:10:47 nfsa1.mail.back kernel: drbd0: Synchronisation done.
> Mar 23 15:10:51 nfsa2.mail.back drbd: 'drbd0' SyncingAll finished, issue drbdsetup /dev/nb0 secondary_remote
> Mar 23 16:27:48 nfsa1.mail.back kernel: drbd0: [kjournald/885] sock_sendmsg time expired, ko = 4294967295
> Mar 23 16:50:40 nfsa1.mail.back kernel: drbd0: [kjournald/885] sock_sendmsg time expired, ko = 4294967295
> Mar 23 16:50:40 nfsa1.mail.back last message repeated 1 time
> Mar 23 17:05:06 nfsa1.mail.back kernel: drbd0: [kjournald/885] sock_sendmsg time expired, ko = 4294967295
ok, so sometimes the sending process gets stuck,
but DRBD-Pings still get through.
> Mar 23 17:42:25 nfsa2.mail.back kernel: drbd0: Connection lost.
> Mar 23 17:42:26 nfsa1.mail.back kernel: drbd0: Connection lost.
> Mar 23 17:42:26 nfsa2.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 17:42:26 nfsa1.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 17:42:26 nfsa1.mail.back kernel: drbd0: Synchronisation started blks=15
> Mar 23 17:43:08 nfsa1.mail.back kernel: drbd0: Synchronisation done.
this time not even the ping packet got through,
So we drop and reconnect.
> Mar 23 17:43:39 nfsa1.mail.back kernel: drbd0: [kjournald/885] sock_sendmsg time expired, ko = 4294967295
> Mar 23 17:44:13 nfsa2.mail.back kernel: drbd0: Connection lost.
but it did not help. again some ping was lost.
> Mar 23 17:44:13 nfsa1.mail.back kernel: drbd0: [kjournald/885] sock_sendmsg returned -32
so this pending send is aborted with an error code
> Mar 23 17:44:13 nfsa1.mail.back kernel: drbd0: Connection lost.
> Mar 23 17:44:13 nfsa2.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 17:44:13 nfsa1.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 17:44:13 nfsa1.mail.back kernel: drbd0: Synchronisation started blks=15
> Mar 23 17:45:00 nfsa2.mail.back kernel: drbd0: Connection lost.
same thing.
> Mar 23 17:45:00 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/11349] sock_sendmsg returned -32
> Mar 23 17:45:00 nfsa1.mail.back kernel: drbd0: Syncer send failed.
> Mar 23 17:45:00 nfsa1.mail.back kernel: drbd0: Connection lost.
> Mar 23 17:45:00 nfsa2.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 17:45:00 nfsa1.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B
> Mar 23 17:45:00 nfsa1.mail.back kernel: drbd0: Synchronisation started blks=15
> Mar 23 17:47:11 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967295
> Mar 23 17:56:40 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967295
> Mar 23 17:56:40 nfsa1.mail.back last message repeated 1 time
> Mar 23 17:58:49 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967294
> Mar 23 18:00:54 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967295
> Mar 23 18:00:54 nfsa1.mail.back last message repeated 1 time
> Mar 23 18:20:17 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967295
> Mar 23 18:20:17 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967294
> Mar 23 18:20:48 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967295
These again mean that some sending process got stuck,
but DRBD-Pings on the DRBD-meta channel still ar answered in time,
and the sending process seems to recover after typically two to three
ping intervalls.
[...]
Lars Ellenberg