Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I looked at the timeout and it is set to the default of 60 = 6 seconds.
That should be long enough.
No, there is something wrong with DRBD and its interaction with Timesys RT
Linux. It appears that the socket just stops working and never comes back.
The error does not occur at random intervals, but continuously, and it is
very repeatable on my system. It happens about 1-2 minutes into the file
transfer and causes the transfer to stop completely. I waited several hours
to make sure that the transfer had indeed stopped (and was not just running
slowly), and it had: after that error message, no more data was transferred
to the drbd disk. It's like the socket has stopped working completely.
I removed drbd-0.6.10 and put drbd-0.6.8 on the system, and everything
worked fine. I loaded up the drbd disk over NFS, brought the primary node
down and up 10 times during the transfer, and then did a diff -r between
the original files and the files on the drbd disk. The transfer worked
fine: there were no file differences. (BTW, this is very impressive, drbd
is an amazing piece of software!) I did see a number of messages like
this: drbd0: pending_cnt <0 !!! on the secondary node while it was doing
a SyncingAll with the primary node (after I reset the primary node without
a graceful shutdown, as a test). But even so, the files were transferred
and stored correctly.
I could set the ko-count to a smaller number, but that would just put the
primary node into StandAlone, and I assume the secondary node would then
take over. However, since this is a repeatable error, it would happen again
after only 1-2 minutes of use.
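If I do try it, I assume it would go in the net section of drbd.conf,
something like the sketch below (the section/parameter syntax is my guess
from the 0.6 sample config, and the values are placeholders; it may also
be that 0.6 only takes ko-count as a drbdsetup option):

--snip--
resource drbd0 {
  net {
    ping-int = 10   # seconds between peer pings (assumed default)
    ko-count = 4    # counts down once per stalled ping interval;
                    # at 0 the primary goes StandAlone
  }
}
--snap--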
Is there some status information that I could print out to get a better
handle on this problem and gain some visibility into why the socket stops
functioning?
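For example, is watching /proc/drbd and the TCP connection itself the
right place to look, or is there more? (The port below is a placeholder
for whatever is configured in drbd.conf.)

--snip--
# connection state (cs:), node states and transfer counters
cat /proc/drbd

# state and send/receive queue sizes of drbd's TCP connection
netstat -tn | grep 7788
--snap--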
Thanks for any assistance you can offer,
Ward
On Monday, 2 February 2004 00:29, Lars Ellenberg wrote:
> / 2004-02-01 14:36:47 -0500
>
> \ Ward Horner:
> > I have DRBD set up with heartbeat doing an NFS export. When I
> > transfer files to the primary DRBD device I get this error (after
> > about 1-2 minutes):
> >
> > drbd0: [kjournal/521] sock_sendmsg time expired, ko=42949672
>
> This means that kjournal got stuck trying to send a block.
>
> It is NOT an error, but a hint that you have high latency: apparently
> the primary writes faster than the secondary can cope with.
>
> You can configure a "ko-count", which defaults to 0 (and then wraps
> around to 2**32-1 on the first countdown, hence the high value). As
> soon as ko-count reaches zero (it counts down every ping interval
> while block communication is stalled), the connection is torn down
> and the primary goes "StandAlone".
>
> Maybe you have some sort of distributed resource deadlock. Do I
> understand correctly that it is actually the secondary writing to the
> NFS share, which is the drbd device exported by the primary?
>
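To illustrate the wrap-around described above: ko-count is apparently an
unsigned 32-bit value, so decrementing it from 0 yields 2**32-1. A minimal
standalone sketch (not the actual drbd code):

--snip--
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    /* ko-count defaults to 0; the first countdown wraps the
       unsigned 32-bit value around to 2**32 - 1, which is why
       the log message shows such a huge ko= value. */
    uint32_t ko = 0;
    ko--;
    printf("ko after first countdown: %" PRIu32 "\n", ko); /* 4294967295 */
    return 0;
}
--snap--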
Maybe it is just a slow secondary. If you want to get rid of the
messages, you could also increase the timeout. See
man drbdsetup
--snip--
-t val
    If the partner node fails to send an expected response packet
    within "val" 10ths of a second, the partner node is considered
    dead and therefore the tcp/ip connection is abandoned. The
    default value is 60 = 6 seconds.
--snap--
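E.g. to double it (a sketch only; the -t option is from the man page
above, but the drbd.conf placement of "timeout" is my assumption):

--snip--
# /etc/drbd.conf
resource drbd0 {
  net {
    timeout = 120   # 10ths of a second = 12 s, double the default
  }
}
--snap--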
-Philipp
--