Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I looked at the latency and it is set to the default of 60 = 6 seconds.
That should be long enough. No, there is something wrong with DRBD and
its interaction with TimeSys RT Linux. It appears that the socket just
stops working and never comes back. The error does not occur at random
intervals; it is continuous and very repeatable on my system. It happens
about 1-2 minutes into the file transfer and causes the transfer to stop
completely. I waited several hours to make sure that the transfer had
indeed stopped (and was not just running slowly), and it had. After that
error message no more data was transferred to the drbd disk. It's as if
the socket has stopped working entirely.

I removed drbd-0.6.10, put drbd-0.6.8 on the system, and everything
worked fine. I loaded up the drbd disk over NFS, brought the primary
node down and up 10 times during the transfer, and then did a diff -r
between the original files and the files on the drbd disk. The transfer
worked fine and there were no file differences. (BTW, this is very
impressive; drbd is an amazing piece of SW!) I did see a number of
messages like this on the secondary node while it was doing a SyncingAll
with the primary node (after I reset the primary node without a graceful
shutdown, as a test):

  drbd0: pending_cnt <0 !!!

But even so, the files were transferred and stored correctly.

I could set the ko-count to a smaller number, but that would just put
the primary node into StandAlone, and I assume the secondary node would
then take over. However, since this is a repeatable error, that would
happen after only 1-2 minutes of use.

Is there some status information that I could print out to get a better
handle on this problem and better visibility into why the socket stops
functioning?

Thanks for any assistance you can offer,

Ward

On Monday, 2 February 2004, at 00:29, Lars Ellenberg wrote:
> / 2004-02-01 14:36:47 -0500
> \ Ward Horner:
> > I have DRBD set up with heartbeat doing an NFS export. When I
> > transfer files to the primary DRBD device I get the error (after
> > about 1-2 minutes):
> >
> >   drbd0: [kjournal/521] sock_sendmsg time expired, ko=42949672
>
> this means that kjournal got stuck trying to send a block.
>
> It is NOT an error, but a hint that you have high latency, and
> obviously the primary writes faster than the secondary can cope with.
>
> You can configure a "ko-count", which defaults to 0 (and then wraps
> around to 2**32-1 on the first count down, thus the high value). As
> soon as ko-count reaches zero (counting down every ping interval
> while block communication is stalled), the connection is torn down,
> and the primary goes "StandAlone".
>
> Maybe you have some sort of distributed resource deadlock.
> I understand that it is actually the secondary writing to the NFS
> share, which is the drbd exported by the primary?
> Maybe it is just a slow secondary.

--

If you want to get rid of the messages you could also increase the
timeout. See man drbdsetup:

--snip--
-t val
    If the partner node fails to send an expected response packet
    within val 10ths of a second, the partner node is considered dead
    and therefore the tcp/ip connection is abandoned. The default
    value is 60 = 6 seconds.
--snap--

-Philipp

--
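On Ward's status question: /proc/drbd is the main window into the
connection state. The exact fields vary between 0.6 releases, but
watching the connection state (cs:) while reproducing the hang should
at least show whether the link actually drops or merely stalls. A
minimal polling loop (a sketch; device 0 is assumed to be the affected
one):

  # Poll DRBD status once per second while the transfer runs.
  # cs: should stay Connected (or SyncingAll during a resync); if it
  # changes, or the transfer counters stop moving, note the timestamp
  # and compare it against the sock_sendmsg messages in the kernel log.
  while true; do
      date
      cat /proc/drbd
      sleep 1
  done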
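And for completeness, the two knobs discussed above in one place. The
-t/timeout value is straight from the man page excerpt; the ko-count
spelling and the key=value layout are my best guess at the 0.6 drbd.conf
format, so check them against drbd.conf(5) on your system before relying
on this:

  resource drbd0 {
    net {
      # in 10ths of a second: 120 = 12 s instead of the default 6 s
      timeout=120
      # tear down the connection (primary goes StandAlone) after this
      # many missed ping intervals instead of wrapping to 2**32-1
      ko-count=4
    }
  }

Assuming a 10 s ping interval, ko-count=4 would drop a stalled
connection after roughly 40 s rather than letting the primary hang
indefinitely.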