Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I looked at the timeout and it is set to the default of 60 = 6 seconds.
That should be long enough.
No, there is something wrong with DRBD and its interaction with Timesys RT
Linux. It appears that the socket just stops working and never comes back.
The error does not occur at random intervals, but continuously, and it is
very repeatable on my system. It happens about 1-2 minutes into the file
transfer and causes the transfer to stop completely. I waited several hours
to make sure that the transfer had indeed stopped (and was not just running
slowly), and it had: after that error message, no more data was transferred
to the drbd disk. It's like the socket has stopped working completely.
I removed drbd-0.6.10 and put drbd-0.6.8 on the system, and everything
worked fine. I loaded up the drbd disk over NFS, brought the primary node
down and up 10 times during the transfer, and then did a diff -r between
the original files and the files on the drbd disk. The transfer worked
fine: there were no file differences. (BTW, this is very impressive, drbd
is an amazing piece of software!) I did see a number of messages like
this: drbd0: pending_cnt <0 !!! on the secondary node while it was doing
a SyncingAll with the primary node (after I reset the primary node without
a graceful shutdown, as a test). But even so, the files were transferred
and stored correctly.
I could set the ko-count to a smaller number, but that would just put the
primary node into StandAlone, and I assume the secondary node would then
take over. However, since this is a repeatable error, it would happen again
after only 1-2 minutes of use.
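If I do try it, I assume it would go in the net section of drbd.conf,
something like the sketch below (the section/parameter syntax is my guess
from the 0.6 sample config, and the values are placeholders; it may also
be that 0.6 only takes ko-count as a drbdsetup option):

--snip--
resource drbd0 {
  net {
    ping-int = 10   # seconds between peer pings (assumed default)
    ko-count = 4    # counts down once per stalled ping interval;
                    # at 0 the primary goes StandAlone
  }
}
--snap--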
Is there some status information that I could print out to get a better
handle on this problem and gain some visibility into why the socket stops
functioning?
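For example, is watching /proc/drbd and the TCP connection itself the
right place to look, or is there more? (The port below is a placeholder
for whatever is configured in drbd.conf.)

--snip--
# connection state (cs:), node states and transfer counters
cat /proc/drbd

# state and send/receive queue sizes of drbd's TCP connection
netstat -tn | grep 7788
--snap--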
Thanks for any assistance you can offer,
Ward
On Monday, 2 February 2004 00:29, Lars Ellenberg wrote:
> / 2004-02-01 14:36:47 -0500
>
> \ Ward Horner:
> > I have DRBD set up with heartbeat doing an NFS export. When I
> > transfer files to the primary DRBD device I get this error (after
> > about 1-2 minutes):
> >
> > drbd0: [kjournal/521] sock_sendmsg time expired, ko=42949672
>
> This means that kjournal got stuck trying to send a block.
>
> It is NOT an error, but a hint that you have high latency: apparently
> the primary writes faster than the secondary can cope with.
>
> You can configure a "ko-count", which defaults to 0 (and then wraps
> around to 2**32-1 on the first countdown, hence the high value). As
> soon as ko-count reaches zero (it counts down every ping interval
> while block communication is stalled), the connection is torn down
> and the primary goes "StandAlone".
>
> Maybe you have some sort of distributed resource deadlock. Do I
> understand correctly that it is actually the secondary writing to the
> NFS share, which is the drbd device exported by the primary?
>
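To illustrate the wrap-around described above: ko-count is apparently an
unsigned 32-bit value, so decrementing it from 0 yields 2**32-1. A minimal
standalone sketch (not the actual drbd code):

--snip--
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    /* ko-count defaults to 0; the first countdown wraps the
       unsigned 32-bit value around to 2**32 - 1, which is why
       the log message shows such a huge ko= value. */
    uint32_t ko = 0;
    ko--;
    printf("ko after first countdown: %" PRIu32 "\n", ko); /* 4294967295 */
    return 0;
}
--snap--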
Maybe it is just a slow secondary. If you want to get rid of the
messages, you could also increase the timeout. See
man drbdsetup
--snip--
-t val
    If the partner node fails to send an expected response packet
    within "val" 10ths of a second, the partner node is considered
    dead and therefore the tcp/ip connection is abandoned. The
    default value is 60 = 6 seconds.
--snap--
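E.g. to double it (a sketch only; the -t option is from the man page
above, but the drbd.conf placement of "timeout" is my assumption):

--snip--
# /etc/drbd.conf
resource drbd0 {
  net {
    timeout = 120   # 10ths of a second = 12 s, double the default
  }
}
--snap--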
-Philipp
--