[DRBD-user] Re: drbd0: [kjournal/521] sock_sendmsg time expired, ko=42949672..

Thu Feb 5 01:49:16 CET 2004

/ 2004-02-04 11:38:24 -0500
\ Ward Horner:
> I looked at the latency and it is set to the default of 60 = 6 seconds. 
> That should be long enough.
> 
> No, there is something wrong with DRBD and its interaction with Timesys RT 
> linux. It appears that the socket just stops working and never comes back. 
> The error is not occurring at random intervals, but continuously. The error 
> is very repeatable on my system. It happens about 1-2 minutes into the file 
> transfer and it causes the file transfer to completely stop. I waited 
> several hours to make sure that the file transfer had indeed stopped (and 
> was not running, but slowly) and it had. After that error message no more 
> data was transferred to the drbd disk. It's like the socket has just stopped 
> working completely.

> I removed drbd-0.6.10 and put drbd-0.6.8 on the system and everything 
> worked fine. I loaded up the drbd disk over nfs. I brought the primary node 
> down and up 10 times during the xfer and then did a diff -r between the 
> original files and the files on the drbd disk and the transfer worked fine. 
> There were no file differences (BTW, this is very impressive, drbd is an 
> amazing piece of SW!). I did see a number of messages like this: drbd0: 
> pending_cnt <0 !!! on the secondary node while it was doing a syncingAll 
> with the primary node (after I reset the primary node without doing a 
> graceful shutdown, as a test). But even still, the files were transferred 
> and stored correctly.

It is quite possible that, since I introduced non-blocking
"write hints", the secondary no longer sees any reason to
flush its data to disk, because the write hints no longer get through.
So it may have to kick its lower-level device when it receives a ping
and has pending count > 0...
I'll have a look into that.
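
To make the suspected stall concrete, here is a toy model in Python. It is not DRBD code: the class, its methods, and the idea that a "kick" completes all outstanding writes at once are assumptions made only for illustration. It shows how a node that relies solely on write hints can sit on pending writes forever, and how kicking on ping would drain them.

```python
class ToySecondary:
    """Toy model of a secondary node (hypothetical, not DRBD internals)."""

    def __init__(self, kick_on_ping):
        self.kick_on_ping = kick_on_ping
        self.pending_cnt = 0  # writes handed to the lower-level device, not yet done
        self.flushed = 0      # writes that reached the disk

    def receive_write(self):
        self.pending_cnt += 1

    def receive_write_hint(self):
        # A write hint is the normal trigger for flushing.
        self._kick_lower_device()

    def receive_ping(self):
        # Proposed fallback: a ping with work outstanding also kicks.
        if self.kick_on_ping and self.pending_cnt > 0:
            self._kick_lower_device()

    def _kick_lower_device(self):
        self.flushed += self.pending_cnt
        self.pending_cnt = 0


def simulate(kick_on_ping, writes=5, pings=3):
    """Send writes, lose every write hint, then send pings.

    Returns the number of writes still pending at the end.
    """
    node = ToySecondary(kick_on_ping)
    for _ in range(writes):
        node.receive_write()
    # The write hints never arrive, so receive_write_hint() is never called.
    for _ in range(pings):
        node.receive_ping()
    return node.pending_cnt
```

With `simulate(False)` the writes stay pending no matter how many pings arrive; with `simulate(True)` the first ping drains them.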

> Is there some status information that I could print out to get a better 
> handle on this problem and get better visibility into why the socket stops 
> functioning?

You could check whether the cluster recovers if you run "sync" a few times,
or even better trigger an emergency sync via sysrq, on the drbd-Secondary.
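
A minimal sketch of both checks in Python, to be run as root on the Secondary. It assumes Python 3.3 or later for `os.sync()` and a kernel that provides /proc/sysrq-trigger; where that file is missing, Alt+SysRq+S on the console does the same thing. The function names and the `trigger_path` parameter are made up for this example.

```python
import os

# Assumed location of the sysrq trigger file; not present on every kernel.
SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def plain_sync():
    """Ask the kernel to write out dirty buffers, like the sync(1) command."""
    os.sync()


def emergency_sync(trigger_path=SYSRQ_TRIGGER):
    """Request an emergency sync by writing the sysrq key 's' to the trigger file."""
    with open(trigger_path, "w") as trigger:
        trigger.write("s")


# Example use on the Secondary (as root):
#   plain_sync()
#   emergency_sync()
```

If the stalled transfer resumes after either call, that would point at data sitting unflushed on the Secondary rather than at a dead socket.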

	Lars Ellenberg