[DRBD-user] Re: drbd0: [kjournal/521] sock_sendmsg time expired, ko=42949672..

Thu Feb 5 10:20:40 CET 2004

On Thursday, 5 February 2004 01:49, Lars Ellenberg wrote:
> / 2004-02-04 11:38:24 -0500
>
> \ Ward Horner:
> > I looked at the latency and it is set to the default of 60 = 6 seconds.
> > That should be long enough.
> >
> > No, there is something wrong with DRBD and its interaction with Timesys
> > RT linux. It appears that the socket just stops working and never comes
> > back. The error is not occurring at random intervals, but continuously.
> > The error is very repeatable on my system. It happens about 1-2 minutes
> > into the file transfer and it causes the file transfer to completely
> > stop. I waited several hours to make sure that the file transfer had
> > indeed stopped (and was not running, but slowly) and it had. After that
> > error message no more data was transferred to the drbd disk. It's like the
> > socket has just stopped working completely.
> >
> > I removed drbd-0.6.10 and put drbd-0.6.8 on the system and everything
> > worked fine. I loaded up the drbd disk over nfs. I brought the primary
> > node down and up 10 times during the xfer and then did a diff -r between
> > the original files and the files on the drbd disk and the transfer worked
> > fine. There were no file differences (BTW, this is very impressive, drbd
> > is an amazing piece of SW!). I did see a number of messages like this:
> > drbd0: pending_cnt <0 !!! on the secondary node while it was doing a
> > syncingAll with the primary node (after I reset the primary node without
> > doing a graceful shutdown, as a test). But even still, the files were
> > transferred and stored correctly.
>
> It is very well possible that since I introduced non-blocking
> "write hints", the secondary now just does not see any reason to
> flush its data to disk, because the write hints no longer come through.
> So maybe it has to kick its lower level device when it receives a ping
> and has pending count > 0...
> I'll have a look into that.
>

I do not think so. WRITE_HINTS are just a performance thing. In case the
WRITE_HINTS (btw, they should be renamed to IO_HINTS) do not come through,
run_task_queue(tq_disk) is run by some other means sooner or later
(think bdflush etc.).

I guess it is some strange interaction with RT Linux. 

I guess the quickest method to find the source of the problem would be
to split the patches from 0.6.8 to 0.6.10 into logical units and apply
them one by one, until you find the patch that breaks the beast.
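
A rough sketch of that one-by-one search, in Python. Everything named here
is an assumption for illustration: the patch file names are made up, and
works_with() stands in for whatever "build, load the module, run the file
transfer" check you use on the test machine. It only shows the search
logic, not how to build or load drbd.

```python
from typing import Callable, List, Optional, Sequence


def first_breaking_patch(
    patches: Sequence[str],
    works_with: Callable[[List[str]], bool],
) -> Optional[str]:
    """Add patches one at a time on top of the known-good base.

    Returns the first patch after which works_with() reports a failure,
    or None if the test still passes with every patch applied.
    """
    applied: List[str] = []
    for patch in patches:
        applied.append(patch)
        # works_with() sees the cumulative list of patches applied so far.
        if not works_with(list(applied)):
            return patch
    return None


# Hypothetical patch names; the real units would come from splitting
# the 0.6.8 -> 0.6.10 diff by hand.
patches = ["01-a.diff", "02-b.diff", "03-c.diff", "04-d.diff"]


def fake_test(applied: List[str]) -> bool:
    # Stand-in for the real test: pretend 03-c.diff is the culprit.
    return "03-c.diff" not in applied


print(first_breaking_patch(patches, fake_test))
```

With many patches a binary search over the list would need fewer test
runs, but the linear walk matches the "one by one" approach and still
works if a later patch depends on an earlier one.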

-Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :