Hello Lars,

Lars Ellenberg wrote:
> / 2004-10-30 14:50:01 +0200
> \ Andreas Hartmann:
>> Hello Lars,
>>
>> Lars Ellenberg wrote:
>> > / 2004-10-29 22:42:00 +0200
>> > \ Andreas Hartmann:
>> >> Hello!
>> >>
>> >> I wrote the same problem to the lkml. This is what Marcelo said:
>> >
>> > well, he said nothing.
>>
>> He asked the same, which I want to know, too:
>> Oct 29 05:30:20 FAGINTSC kernel: drbd16: Epoch set size wrong!!found=1061
>> reported=1060
>>
>> What does this mean?
>
> you should have seen some "tl messed up, ... too small", too.

Couldn't see any.

> it means that drbd 0.6 has a static (configurable) amount of slots
> which it uses to guarantee strict write ordering on both nodes.
>
> and that currently you cause throughput that fast that this log is
> becoming too small, and thus drbd cannot guarantee this anymore.
>
> this usually is no problem, and it is most likely not the cause of the
> problem here, but just an other symptom.

Could it help to increase the epoch-size?

I can see a lot of latency in our network (ping). Could this be the
problem? On the other hand, I'm getting high throughput: an average of
more than 8 MB (which I configured in drbd.conf as max) with 100 MB
ethernet isn't that bad.

>> A few seconds later, kswapd crashed the machine!
>
> yes.
> you cause high memory and io pressure, drbd notices first,
> and a few seconds later kswapd notices too.

8 MB throughput shouldn't be a problem! The raid cache policy is
configured to write back, and 512 MB or 1 GB RAM should handle this
"pressure" without any problem.
BTW: the machine didn't really swap. I never saw more than 1 or 2 MB!

[...]

>> I could see very strange things:
>> -> mount -o remount,rw, even to a non drbd-Device on the backup-machine
>> suddenly segfaults, after the primary died during datareceiving. The
>> second remount killed the machine.
>> -> After copying datas to the drbd-device, compiling of the bcm5700-driver
>> segfaults. After rebooting, compiling run without any problem.
>> -> more things at tuesday, after the died primary is rebooted again.
>>
>> Problems always happened during or after receiving datas or when creating
>> new logical volumes onto the primary or secondary. During copying of datas
>> to the primary, suddenly the datastream stopped, because the primary died
>> (sometimes, it couldn't even be pinged after it died).
>
> bad ram?

I had no problem with an UDB database at load 10-20.

> dma flipping bits on transfer?

Is there any chance to see them? I doubt, this isn't possible!

> dma/interrupt conflicts between your various drivers only showing when
> drbd starts stressing them?

The vanilla kernel activated only the NIC and the RAID, nothing more.
OK, I didn't switch off USB in the BIOS.

[...]

> you could try 0.7 series next time.
> it does many things better, or at least different.

Surely!

Kind regards,
Andreas Hartmann
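
PS: a rough check of the throughput figure mentioned above. This is only a sketch, and it assumes that "100 MB ethernet" means 100 Mbit/s Fast Ethernet and that "8 MB" means 8 MByte/s; neither unit is spelled out in the mail. The function name is made up for illustration, and protocol overhead (Ethernet, IP, TCP, DRBD headers) is ignored.

```python
# Assumption: "100 MB ethernet" is 100 Mbit/s Fast Ethernet.
LINK_MBIT_PER_S = 100
# Assumption: the rate configured as max in drbd.conf is 8 MByte/s.
RATE_MBYTE_PER_S = 8


def link_utilisation(rate_mbyte_per_s, link_mbit_per_s):
    """Return the fraction of the raw link bandwidth used by a payload rate."""
    return rate_mbyte_per_s * 8 / link_mbit_per_s


if __name__ == "__main__":
    raw_limit_mbyte = LINK_MBIT_PER_S / 8  # raw line rate in MByte/s
    share = link_utilisation(RATE_MBYTE_PER_S, LINK_MBIT_PER_S)
    print(f"raw link limit: {raw_limit_mbyte} MByte/s")
    print(f"{RATE_MBYTE_PER_S} MByte/s uses {share:.0%} of the raw link")
```

Under these assumptions the configured rate is roughly two thirds of what the wire can carry before overhead, so it is a substantial but not saturating load on the link.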