[DRBD-user] drbd0: [kjournal/521] sock_sendmsg time expired, ko=42949672..

Mon Feb 2 00:29:50 CET 2004

/ 2004-02-01 14:36:47 -0500
\ Ward Horner:
> I have DRBD setup with heartbeat doing an NFS export. When I transfer files 
> to the primary DRBD device I get the error (after about 1-2 minutes):
> 
>     drbd0: [kjournal/521] sock_sendmsg time expired, ko=42949672

this means that kjournal got stuck trying to send a block.

It is NOT an error, but a hint that you have high latency:
apparently the primary writes faster than the secondary can cope with.

you can configure a "ko-count", which defaults to 0 (and then
wraps around to 2**32-1 on the first countdown, thus the high
value). As soon as ko-count reaches zero (it counts down once per
ping interval while block communication is stalled), the
connection is torn down, and the primary goes "StandAlone".
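
to illustrate the wrap-around: a minimal sketch in Python, NOT the
actual drbd code. The names tick() and intervals_until_teardown()
are made up for this example; the only assumption is that the
counter is a 32-bit unsigned value that is decremented once per
stalled ping interval, as described above.

```python
U32_MODULUS = 2**32  # assumed: the counter behaves like a 32-bit unsigned int


def tick(ko):
    """One countdown step, with unsigned 32-bit wrap-around."""
    return (ko - 1) % U32_MODULUS


def intervals_until_teardown(ko_count, limit=10):
    """Count stalled ping intervals until the counter reaches zero.

    Returns None if zero is not reached within `limit` intervals.
    """
    ko = ko_count
    for n in range(1, limit + 1):
        ko = tick(ko)
        if ko == 0:
            return n
    return None


print(tick(0))                      # 4294967295, i.e. 2**32-1
print(intervals_until_teardown(3))  # 3
print(intervals_until_teardown(0))  # None
```

so in this model a ko-count of 3 would give up after three stalled
ping intervals, while the default of 0 wraps to 2**32-1 and for all
practical purposes never reaches zero.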

maybe you have some sort of distributed resource deadlock.
I understand that it is actually the secondary writing to the nfs
share, which is the drbd exported by the primary?

Is the drbd_asender_0 kernel thread still alive?

Does the system recover if you do a sync on the Secondary?
Or an emergency sync via SysRq?
(read about it in /usr/src/linux/Documentation/sysrq.txt)
Or on the Primary?

> This error then occurs continuously until I reboot the system. Also the 
> system has locked out all user control. I can telnet in, but once telnet 
> has established the connection, I cannot log in. Funny thing though is that 
> heartbeat does not transfer over to the secondary system, until I reboot 
> the primary. When I halt the primary and have the secondary takeover, I do 
> NOT get the error. I only get the error on the primary.

the above paragraph is confusing. please do not mix up node names,
preferred roles, and actual drbd states!

> I am using drbd-0.6.10+cvs. The latest file is drbd_main 1.88, Dec 22, 
> 2003. I looked at the more current release, but did not find any changes 
> that addressed this issue.
> 
> I saw an email on Jan 8 from Andreas Schultz where he talks about a similar 
> problem during synchronization, but I have not seen a resolution. However, 
> there was mention to a "write_hint_dont_wait" patch.

which is in cvs since drbd_main.c 1.88.

> I am running DRBD on a Timesys Real-time Linux 4.0 system. This distro was 
> based on Debian and uses the 2.4.18 kernel.

this might well be a distributed deadlock scenario related to your
"real-time" kernel.

> Any ideas what is going on here?
> I used drbd-0.6.6 and never saw this problem.

drbd 0.6.6 just ignored this and eventually got stalled completely
in the same situation WITHOUT logging it.

	Lars Ellenberg