[DRBD-user] Machine crashed repeatedly: drbd16: Epoch set size wrong!!found=1061 reported=1060

Sat Oct 30 23:40:31 CEST 2004

/ 2004-10-30 20:57:29 +0200
\ Andreas Hartmann:
> Hello Lars,
> 
> Lars Ellenberg wrote:
> > / 2004-10-30 14:50:01 +0200
> > \ Andreas Hartmann:
> >> Hello Lars,
> >> 
> >> Lars Ellenberg wrote:
> >> > / 2004-10-29 22:42:00 +0200
> >> > \ Andreas Hartmann:
> >> >> Hello!
> >> >> 
> >> >> I wrote the same problem to the lkml. This is what Marcelo said:
> >> > 
> >> > well, he said nothing.
> >> 
> >> He asked the same thing that I want to know, too:
> >> Oct 29 05:30:20 FAGINTSC kernel: drbd16: Epoch set size wrong!!found=1061
> >> reported=1060
> >> 
> >> What does this mean?
> > 
> > you should have seen some "tl messed up, ... too small", too.
> 
> Couldn't see any.

then this is more likely a sign of memory corruption.
this message without the other does not make sense.

> > it means that drbd 0.6 has a static (configurable) amount of slots
> > which it uses to guarantee strict write ordering on both nodes.
> > 
> > and that currently you cause throughput so fast that this log is
> > becoming too small, and thus drbd cannot guarantee this anymore.
> > 
> > this usually is no problem, and it is most likely not the cause of the
> > problem here, but just another symptom.
> 
> Could it help to increase the epoch-size?

well, it was already suggested to increase the tl-size.
but as you say you did not see the "tl messed up" or
"transferlog too small!!", the problem seems to be elsewhere.
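
to illustrate why the one message should not show up without the other, here
is a toy model of a fixed-size log of in-flight writes grouped into epochs.
this is only a sketch and not drbd's actual code: the class TransferLog, its
methods and the slot counts are made up for illustration. the point is that
in such a model "found" and "reported" can only disagree once the log has
overflowed.

```python
class TransferLog:
    """Toy fixed-size log of in-flight writes (hypothetical, not drbd code)."""

    def __init__(self, slots):
        self.slots = slots        # static number of slots, like a tl-size setting
        self.entries = []         # (epoch, sector) pairs still in flight
        self.epoch = 0
        self.overflowed = False   # would correspond to a "too small" warning

    def add_write(self, sector):
        if len(self.entries) >= self.slots:
            # No free slot: the oldest entry is dropped and ordering info is lost.
            self.entries.pop(0)
            self.overflowed = True
        self.entries.append((self.epoch, sector))

    def close_epoch(self, reported):
        """Close the current epoch and compare the entries found in the log
        with the number of writes the peer reported for that epoch."""
        found = sum(1 for epoch, _ in self.entries if epoch == self.epoch)
        self.entries = [(e, s) for e, s in self.entries if e != self.epoch]
        self.epoch += 1
        return found, reported, found == reported


if __name__ == "__main__":
    for slots in (8, 4):
        log = TransferLog(slots)
        for sector in range(6):
            log.add_write(sector)
        found, reported, ok = log.close_epoch(reported=6)
        print(f"slots={slots} found={found} reported={reported} "
              f"ok={ok} overflowed={log.overflowed}")
```

with 8 slots the six writes fit and the counts agree; with 4 slots the log
overflows and the counts differ. a mismatch without any overflow cannot
happen in this model, so it would have to come from somewhere else.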

> I can see a lot of latency in our network (ping). Could this be the
> problem? On the other hand, I'm getting high throughput (>8 MB/s on average,
> which I configured in drbd.conf as max, isn't that bad for 100 Mbit ethernet).
> 
> >> A few seconds later, kswapd crashed the machine!
> > 
> > yes.
> > you cause high memory and io pressure, drbd notices first,
> > and a few seconds later kswapd notices too.
> 
> 8 MB throughput shouldn't be a problem! The raid cache policy is
> configured to write back and 512 MB or 1 GB RAM should handle this
> "pressure" without any problem. BTW: the machine didn't really swap. I
> never saw more than 1 or 2 MB!

it does not need to "swap"; kswapd also reclaims memory from caches and
slabs, and does not necessarily page out to swap space.

> [...]
> 
> >> I could see very strange things:
> >> -> mount -o remount,rw, even on a non-drbd device on the backup machine,
> >> suddenly segfaults after the primary died while receiving data. The
> >> second remount killed the machine.
> >> -> After copying data to the drbd device, compiling the bcm5700 driver
> >> segfaults. After rebooting, compiling ran without any problem.
> >> -> more things on Tuesday, after the dead primary is rebooted again.
> >>
> >> Problems always happened during or after receiving data, or when creating
> >> new logical volumes on the primary or secondary. While copying data
> >> to the primary, the data stream suddenly stopped because the primary died
> >> (sometimes it couldn't even be pinged after it died).
> > 
> > bad ram?
> 
> I had no problem with a UDB database at load 10-20.

well. the "load" number is not necessarily a sign of high stress.
and that your ram was probably fine for some time does not necessarily
mean it still is.

> > dma flipping bits on transfer?
> 
> Is there any chance to see them? I doubt, this isn't possible!

I have seen this... "only" about one bit flip per 100 MB transferred,
but that would still bring down the box in very weird ways...
I still wonder how all these electronics possibly can work at all :)
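
one rough way to make such flips visible is to push a known pattern through
the suspect path, read it back, and compare. the sketch below only shows the
comparison step, with the flip simulated in memory; the helper names digest
and flipped_bits are made up for this example, and in a real test the second
buffer would be what you read back from the device or the network.

```python
import hashlib


def digest(data: bytes) -> str:
    """Checksum of a buffer; any single flipped bit changes it."""
    return hashlib.sha256(data).hexdigest()


def flipped_bits(expected: bytes, actual: bytes) -> int:
    """Count the bits that differ between two buffers of equal length."""
    if len(expected) != len(actual):
        raise ValueError("buffers differ in length")
    return sum(bin(a ^ b).count("1") for a, b in zip(expected, actual))


# 1 MiB of a known, repeating test pattern.
pattern = bytes(range(256)) * 4096

# Stand-in for the data read back after the transfer; one bit is
# flipped here on purpose to simulate the corruption.
readback = bytearray(pattern)
readback[123456] ^= 0x10

if __name__ == "__main__":
    print("checksums match:", digest(pattern) == digest(bytes(readback)))
    print("flipped bits:", flipped_bits(pattern, bytes(readback)))
```

at one flip per 100 MB or so, a test like this would need to move a lot of
data, repeatedly, before it shows anything.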

sure, it still could be drbd's fault. but there is no real sign of this.
and the symptom of "gcc compile always fails the same way;
after reboot, it works fine" to me clearly suggests bad ram.

	Lars Ellenberg

-- 
please use the "List-Reply" function of your email client.