/ 2004-10-30 20:57:29 +0200
\ Andreas Hartmann:
> Hello Lars,
>
> Lars Ellenberg wrote:
> > / 2004-10-30 14:50:01 +0200
> > \ Andreas Hartmann:
> >> Hello Lars,
> >>
> >> Lars Ellenberg wrote:
> >> > / 2004-10-29 22:42:00 +0200
> >> > \ Andreas Hartmann:
> >> >> Hello!
> >> >>
> >> >> I wrote the same problem to the lkml. This is what Marcelo said:
> >> >
> >> > well, he said nothing.
> >>
> >> He asked the same thing that I want to know, too:
> >> Oct 29 05:30:20 FAGINTSC kernel: drbd16: Epoch set size wrong!!found=1061
> >> reported=1060
> >>
> >> What does this mean?
> >
> > you should have seen some "tl messed up, ... too small", too.
>
> Couldn't see any.

then this rather is a sign of memory corruption.
this message without the other does not make sense.

> > it means that drbd 0.6 has a static (configurable) amount of slots
> > which it uses to guarantee strict write ordering on both nodes.
> >
> > and that currently you cause throughput that fast that this log is
> > becoming too small, and thus drbd cannot guarantee this anymore.
> >
> > this usually is no problem, and it is most likely not the cause of the
> > problem here, but just another symptom.
>
> Could it help to increase the epoch-size?

well, it was suggested already to increase the tl-size.
but as you say you did not see the "tl messed up" or
"transferlog too small!!", the problem seems to be elsewhere.

> I can see a lot of latencies in our network (ping). Could this be the
> problem? On the other hand, I'm getting high throughputs (>8 MB (which I
> configured in drbd.conf as max) average with 100 MB ethernet isn't that bad).
>
> >> A few seconds later, kswapd crashed the machine!
> >
> > yes.
> > you cause high memory and io pressure, drbd notices first,
> > and a few seconds later kswapd notices too.
>
> 8 MB throughput shouldn't be a problem! The raid cache policy is
> configured to write back and 512 MB or 1 GB RAM should handle this
> "pressure" without any problem.
> BTW: the machine didn't really swap. I
> never saw more than 1 or 2 MB!

it does not need to "swap": kswapd also reclaims memory from
caches and slabs, and does not necessarily page out to swap space.

> [...]
>
> >> I could see very strange things:
> >> -> mount -o remount,rw, even to a non drbd-Device on the backup-machine
> >> suddenly segfaults, after the primary died during data receiving. The
> >> second remount killed the machine.
> >> -> After copying data to the drbd-device, compiling of the bcm5700-driver
> >> segfaults. After rebooting, compiling ran without any problem.
> >> -> more things on tuesday, after the dead primary is rebooted again.
> >>
> >> Problems always happened during or after receiving data or when creating
> >> new logical volumes on the primary or secondary. During copying of data
> >> to the primary, suddenly the datastream stopped, because the primary died
> >> (sometimes, it couldn't even be pinged after it died).
> >
> > bad ram?
>
> I had no problem with an UDB database at load 10-20.

well. the "load" number is not necessarily a sign of high stress.
and that your ram probably was fine for some time does not
necessarily mean it still is.

> > dma flipping bits on transfer?
>
> Is there any chance to see them? I doubt, this isn't possible!

I have seen this... "only" about one bit flip per transferred 100MB,
but that would still bring down the box in very weird ways...

I still wonder how all these electronics possibly can work at all :)

sure, it still could be drbd's fault. but there is no real sign of this.
and the symptom of "gcc compile fails always the same; after reboot, it
works fine" for me clearly suggests bad ram.

	Lars Ellenberg

-- 
please use the "List-Reply" function of your email client.
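
On the question of whether flipped bits can be made visible at all: a minimal sketch in Python, assuming you have a known-good source file and a copy of it that went through the suspect path (disk, network, or both). The names count_bit_flips and compare_files are made up for this illustration and are not part of drbd or any other tool. It only counts differing bits; it cannot tell whether RAM, DMA, or something else caused them. Note that a copy read back from the page cache never touches the disk again, so the comparison is only meaningful if the copy is read on the other node or after the cache has been dropped.

```python
def count_bit_flips(a: bytes, b: bytes) -> int:
    """Return the number of differing bits between two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    # XOR leaves a 1 bit wherever the two bytes disagree.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def compare_files(src_path, dst_path, chunk_size=1 << 20):
    """Compare two files chunk by chunk.

    Returns (bytes_compared, flipped_bits). Raises ValueError if the
    files differ in length, since that is not a simple bit flip.
    """
    total = 0
    flips = 0
    with open(src_path, "rb") as src, open(dst_path, "rb") as dst:
        while True:
            a = src.read(chunk_size)
            b = dst.read(chunk_size)
            if not a and not b:
                break
            if len(a) != len(b):
                raise ValueError("files differ in length")
            total += len(a)
            if a != b:  # only do the slow bit count on chunks that differ
                flips += count_bit_flips(a, b)
    return total, flips
```

Running compare_files several times over the same pair of files is part of the point: if the count changes from run to run, the data on disk is probably fine and the corruption happens on the way into memory, which again points at hardware.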