[DRBD-user] drbd0: [kjournal/521] sock_sendmsg time expired, ko=42949672..
Lars.Ellenberg at linbit.com
Mon Feb 2 00:29:50 CET 2004
/ 2004-02-01 14:36:47 -0500
\ Ward Horner:
> I have DRBD setup with heartbeat doing an NFS export. When I transfer files
> to the primary DRBD device I get the error (after about 1-2 minutes):
> drbd0: [kjournal/521] sock_sendmsg time expired, ko=42949672
this means that kjournal got stuck trying to send a block.
It is NOT an error, but a hint that you have high latency:
apparently the primary writes faster than the secondary can keep up.
you can configure a "ko-count", which defaults to 0 (and then
wraps around to 2**32-1 on the first countdown, thus the high
value). As soon as ko-count reaches zero (it is decremented every
ping interval while block communication is stalled), the
connection is torn down, and the primary goes "StandAlone".
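To illustrate the wrap-around described above, here is a minimal sketch that models the countdown as unsigned 32-bit arithmetic. It is not DRBD's actual code; the function name `countdown` and the constant `MASK32` are made up for illustration, and the only assumption is that the counter is an unsigned 32-bit value decremented once per stalled ping interval.

```python
MASK32 = 0xFFFFFFFF  # assumption: ko is held in an unsigned 32-bit counter


def countdown(ko, stalled_intervals):
    """Model the ko countdown: one decrement per stalled ping interval.

    Returns the counter value after each decrement and stops early
    once the counter reaches zero (the point at which the connection
    would be torn down).
    """
    values = []
    for _ in range(stalled_intervals):
        ko = (ko - 1) & MASK32  # unsigned wrap-around: 0 - 1 -> 2**32 - 1
        values.append(ko)
        if ko == 0:
            break
    return values


# Default ko-count of 0: the first decrement wraps to 2**32 - 1,
# so zero is practically never reached and the link is never dropped.
print(countdown(0, 3))

# A small non-zero ko-count reaches zero after that many stalled intervals.
print(countdown(4, 10))
```

With a ko-count of 0 the model never gets near zero in any realistic time, which matches the huge value in the log message; with a small positive value it hits zero after that many stalled intervals.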
maybe you have some sort of distributed resource deadlock.
I understand that it is actually the secondary writing to the nfs
share, which is the drbd exported by the primary?
Is the drbd_asender_0 kernel thread still alive?
Does the system recover if you do a sync on the Secondary?
Or on the Primary?
Or an emergency sync via SysRq?
(read about it in /usr/src/linux/Documentation/sysrq.txt)
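One way to check whether a kernel thread such as drbd_asender_0 is still present is to scan /proc for a matching name. The following is only a sketch: the helper name `find_threads` and its `proc_root` parameter are hypothetical, and it assumes a Linux /proc in which each /proc/<pid>/status file contains a "Name:" line.

```python
import os


def find_threads(name, proc_root="/proc"):
    """Return the sorted PIDs whose status file reports the given Name."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        status_path = os.path.join(proc_root, entry, "status")
        try:
            with open(status_path) as status:
                for line in status:
                    if line.startswith("Name:"):
                        if line.split(":", 1)[1].strip() == name:
                            pids.append(int(entry))
                        break
        except OSError:
            continue  # the process went away or is not readable
    return sorted(pids)


if os.path.isdir("/proc"):
    # An empty list would mean no thread with that name was found.
    print(find_threads("drbd_asender_0"))
```

An empty result suggests the thread is gone; a non-empty one only shows that it exists, not that it is making progress.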
> This error then occurs continuously until I reboot the system. Also the
> system has locked out all user control. I can telnet in, but once telnet
> has established the connection, I cannot log in. Funny thing though is that
> heartbeat does not transfer over to the secondary system, until I reboot
> the primary. When I halt the primary and have the secondary takeover, I do
> NOT get the error. I only get the error on the primary.
the above paragraph is confusing. please do not mix up node names,
preferred roles, and actual drbd states!
> I am using drbd-0.6.10+cvs. The latest file is drbd_main 1.88, Dec 22,
> 2003. I looked at the more current release, but did not find any changes
> that addressed this issue.
> I saw an email on Jan 8 from Andreas Schultz where he talks about a similar
> problem during synchronization, but I have not seen a resolution. However,
> there was mention to a "write_hint_dont_wait" patch.
which is in cvs since drbd_main.c 1.88.
> I am running DRBD on a Timesys Real-time Linux 4.0 system. This distro was
> based on Debian and uses the 2.4.18 kernel.
this might well be a distributed deadlock scenario related to your
> Any ideas what is going on here?
> I used drbd-0.6.6 and never saw this problem.
drbd 0.6.6 just ignored this and eventually got stalled completely
in the same situation WITHOUT logging it.