Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
/ 2004-03-23 20:10:01 +0300 \ Eugene Crosser: > On Tue, 2004-03-23 at 12:28, Lars Ellenberg wrote: > > > > Please could someone knowledgeble tell me why this may happen? > > > Under load (local massive quota update by a script plus cpio of a big > > > tree from an NFS client) synchronization is repeatedly restarted after > > > ~1% completion: > [...] > > 0.6.12 should help for some weird behaviour related to the above problem. > > Whether it fixes the actual problem I cannot say. Please report back. > > Behavior did not change with 0.6.12. > When secondary is booted, it synchronizes, works well for some time, and > then starts to disconnect and reconnect every few minutes. At this > time, load average on the master grows two or three times higher than it > was while "normal" operation. Well, you then want to increase ping-int = 20 # or some such and maybe sndbuf-size = 512k # or more. maybe you even need to increase the sndbuf-size of the meta socket, (there is no config option for this currently) > I must also note that heartbeat (it uses the same interface) begins to > produce these messages: > > Mar 23 19:34:49 nfsa1.mail.back heartbeat[639]: WARN: Late heartbeat: Node 10.0.0.247: interval 14970 ms > Mar 23 19:35:27 nfsa2.mail.back heartbeat[601]: WARN: Late heartbeat: Node nfsa1: interval 12410 ms you have nearly 15 (*fifteen* !!) *seconds* round trip time? doh. > Ethernet is 10/100/1000 baseT (tg3) on Dell 1750's. > When I kill the script that updates quota the problem goes away, SyncAll > completes successfully and continues to work in Connected mode. please do this experiment: make sure you your link is in full duplex! collect some ping statistics while network is idle. saturate the link with dummy data (use netcat </dev/zero or whatever) again, collect some ping statistics in both directions. should be enlightening :) > > This is an excerpt from drbd.conf: > protocol=C > [...] > net { > sync-max=128M > sync-min=32M > tl-size=512 > sync-nice=0 > } this snipplet is irrelevant. though, tl-size seems to be very small... I am used to like 5000. but as long as DRBD does not complain about it, this is ok. > I am including a log starting from the moment when secondary is first > connected. It's rather long, sorry. Note that from 15:10 to 16:42 the > system worked normally. > > Mar 23 13:57:34 nfsa2.mail.back drbd: ===> drbd start <=== > Mar 23 13:57:34 nfsa2.mail.back drbd: modprobe -s drbd minor_count=1 > Mar 23 13:57:34 nfsa2.mail.back kernel: drbd: initialised. Version: 0.6.12 (api:64/proto:62) > Mar 23 13:57:35 nfsa2.mail.back drbd: drbdsetup /dev/nb0 disk /dev/sda1 --do-panic --disk-size=214291003k > Mar 23 13:57:35 nfsa2.mail.back kernel: drbd0: Creating state file > Mar 23 13:57:35 nfsa2.mail.back kernel: "/var/lib/drbd/drbd0" > Mar 23 13:57:35 nfsa2.mail.back drbd: drbdsetup /dev/nb0 net 192.168.1.2:7789 192.168.1.1:7789 C --sync-max=128M --sync-min=32M --tl-size=512 --sync-nice=0 > Mar 23 13:57:35 nfsa2.mail.back drbd: drbdsetup /dev/nb0 wait_connect -t 120 > Mar 23 13:57:35 nfsa1.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B > Mar 23 13:57:35 nfsa2.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B > Mar 23 13:57:35 nfsa1.mail.back kernel: drbd0: Synchronisation started blks=15 > Mar 23 13:57:35 nfsa2.mail.back drbd: 'drbd0' SyncingAll, waiting for this to finish > Mar 23 13:57:35 nfsa2.mail.back drbd: drbdsetup /dev/nb0 wait_sync > Mar 23 15:10:47 nfsa1.mail.back kernel: drbd0: Synchronisation done. > Mar 23 15:10:51 nfsa2.mail.back drbd: 'drbd0' SyncingAll finished, issue drbdsetup /dev/nb0 secondary_remote > Mar 23 16:27:48 nfsa1.mail.back kernel: drbd0: [kjournald/885] sock_sendmsg time expired, ko = 4294967295 > Mar 23 16:50:40 nfsa1.mail.back kernel: drbd0: [kjournald/885] sock_sendmsg time expired, ko = 4294967295 > Mar 23 16:50:40 nfsa1.mail.back last message repeated 1 time > Mar 23 17:05:06 nfsa1.mail.back kernel: drbd0: [kjournald/885] sock_sendmsg time expired, ko = 4294967295 ok, so sometimes the sending process gets stuck, but DRBD-Pings still get through. > Mar 23 17:42:25 nfsa2.mail.back kernel: drbd0: Connection lost. > Mar 23 17:42:26 nfsa1.mail.back kernel: drbd0: Connection lost. > Mar 23 17:42:26 nfsa2.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B > Mar 23 17:42:26 nfsa1.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B > Mar 23 17:42:26 nfsa1.mail.back kernel: drbd0: Synchronisation started blks=15 > Mar 23 17:43:08 nfsa1.mail.back kernel: drbd0: Synchronisation done. this time not even the ping packet got through, So we drop and reconnect. > Mar 23 17:43:39 nfsa1.mail.back kernel: drbd0: [kjournald/885] sock_sendmsg time expired, ko = 4294967295 > Mar 23 17:44:13 nfsa2.mail.back kernel: drbd0: Connection lost. but it did not help. again some ping was lost. > Mar 23 17:44:13 nfsa1.mail.back kernel: drbd0: [kjournald/885] sock_sendmsg returned -32 so this pending send is aborted with an error code > Mar 23 17:44:13 nfsa1.mail.back kernel: drbd0: Connection lost. > Mar 23 17:44:13 nfsa2.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B > Mar 23 17:44:13 nfsa1.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B > Mar 23 17:44:13 nfsa1.mail.back kernel: drbd0: Synchronisation started blks=15 > Mar 23 17:45:00 nfsa2.mail.back kernel: drbd0: Connection lost. same thing. > Mar 23 17:45:00 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/11349] sock_sendmsg returned -32 > Mar 23 17:45:00 nfsa1.mail.back kernel: drbd0: Syncer send failed. > Mar 23 17:45:00 nfsa1.mail.back kernel: drbd0: Connection lost. > Mar 23 17:45:00 nfsa2.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B > Mar 23 17:45:00 nfsa1.mail.back kernel: drbd0: Connection established. size=214291003 KB / blksize=4096 B > Mar 23 17:45:00 nfsa1.mail.back kernel: drbd0: Synchronisation started blks=15 > Mar 23 17:47:11 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967295 > Mar 23 17:56:40 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967295 > Mar 23 17:56:40 nfsa1.mail.back last message repeated 1 time > Mar 23 17:58:49 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967294 > Mar 23 18:00:54 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967295 > Mar 23 18:00:54 nfsa1.mail.back last message repeated 1 time > Mar 23 18:20:17 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967295 > Mar 23 18:20:17 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967294 > Mar 23 18:20:48 nfsa1.mail.back kernel: drbd0: [drbd_syncer_0/18366] sock_sendmsg time expired, ko = 4294967295 These again mean that some sending process got stuck, but DRBD-Pings on the DRBD-meta channel still ar answered in time, and the sending process seems to recover after typically two to three ping intervalls. [...] Lars Ellenberg