On Thursday, 5 February 2004 01:49, Lars Ellenberg wrote:
> / 2004-02-04 11:38:24 -0500
> \ Ward Horner:
> > I looked at the latency and it is set to the default of 60 = 6 seconds.
> > That should be long enough.
> >
> > No, there is something wrong with DRBD and its interaction with Timesys
> > RT linux. It appears that the socket just stops working and never comes
> > back. The error is not occurring at random intervals, but continuously.
> > The error is very repeatable on my system. It happens about 1-2 minutes
> > into the file transfer and it causes the file transfer to completely
> > stop. I waited several hours to make sure that the file transfer had
> > indeed stopped (and was not running, but slowly) and it had. After that
> > error message no more data was transferred to the drbd disk. It's like the
> > socket has just stopped working completely.
> >
> > I removed drbd-0.6.10 and put drbd-0.6.8 on the system and everything
> > worked fine. I loaded up the drbd disk over nfs. I brought the primary
> > node down and up 10 times during the xfer and then did a diff -r between
> > the original files and the files on the drbd disk and the transfer worked
> > fine. There were no file differences (BTW, this is very impressive, drbd
> > is an amazing piece of SW!). I did see a number of messages like this:
> > drbd0: pending_cnt <0 !!! on the secondary node while it was doing a
> > syncingAll with the primary node (after I reset the primary node without
> > doing a graceful shutdown, as a test). But even still, the files were
> > transferred and stored correctly.
>
> It is very well possible that since I introduced non-blocking
> "write hints", the secondary now just does not see any reason to
> flush its data to disk, because the write hints no longer come through.
> So it maybe has to kick its lower level device when it receives a ping
> and has pending count > 0...
> I'll have a look into that.
>

I do not think so. WRITE_HINTS are just a performance thing.
In case the WRITE_HINTS (btw, they should be renamed to IO_HINTS) do not
come through, run_task_queue(tq_disk) is run by some other means sooner
or later. Think bdflush, etc.

I guess it is some strange interaction with RT Linux.

I guess the quickest method to find the source of the problem would be
to split the patches from 0.6.8 to 0.6.10 into logical units and apply
them one by one.

=> And find the patch that breaks the beast.

-Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :
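
The one-by-one approach suggested above could be sketched roughly as
follows. This is only an illustration: the unit names are hypothetical
(the real 0.6.8 to 0.6.10 diff would have to be split by hand), and the
works_with callback is an assumed stand-in for "build the module with
these units applied, load it, and run the NFS transfer test".

```python
from typing import Callable, Optional, Sequence


def first_breaking_patch(
    patches: Sequence[str],
    works_with: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Apply patches cumulatively, in order.

    Returns the first patch after which works_with() reports a failure,
    or None if the test still passes with every patch applied.
    """
    for count in range(1, len(patches) + 1):
        applied = patches[:count]
        if not works_with(applied):
            # The most recently added unit is the one that broke the test.
            return applied[-1]
    return None


# Hypothetical logical units, in the order they would be applied.
units = ["unit-a", "unit-b", "unit-c"]


def simulated_test(applied: Sequence[str]) -> bool:
    # Assumption for the demo only: pretend "unit-b" breaks the transfer.
    return "unit-b" not in applied


print(first_breaking_patch(units, simulated_test))
```

With only a handful of logical units a linear walk like this is
sufficient; with many units, a binary search over the same ordered list
would need fewer test runs.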