[DRBD-user] Machine crashed repeatedly: drbd16: Epoch set size wrong!!found=1061 reported=1060

Sat Oct 30 20:57:29 CEST 2004

Hello Lars,

Lars Ellenberg wrote:
> / 2004-10-30 14:50:01 +0200
> \ Andreas Hartmann:
>> Hello Lars,
>> 
>> Lars Ellenberg wrote:
>> > / 2004-10-29 22:42:00 +0200
>> > \ Andreas Hartmann:
>> >> Hello!
>> >> 
>> >> I reported the same problem to the LKML. This is what Marcelo said:
>> > 
>> > well, he said nothing.
>> 
>> He asked the same thing that I want to know, too:
>> Oct 29 05:30:20 FAGINTSC kernel: drbd16: Epoch set size wrong!!found=1061
>> reported=1060
>> 
>> What does this mean?
> 
> you should have seen some "tl messed up, ... too small", too.

Couldn't see any.

> it means that drbd 0.6 has a static (configurable) amount of slots
> which it uses to guarantee strict write ordering on both nodes.
> 
> and that currently you cause throughput so high that this log is
> becoming too small, and thus drbd cannot guarantee this anymore.
> 
> this usually is no problem, and it is most likely not the cause of the
> problem here, but just another symptom.
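
To illustrate the explanation above, here is a minimal toy model of a log with a fixed number of slots whose per-epoch count can disagree with what the peer reports once the log overflows. This is only a sketch under assumptions, not DRBD's code: `ToyTransferLog`, `add_write`, `close_epoch` and `run_epoch` are hypothetical names, and the model does not try to reproduce the exact found/reported numbers from the kernel log.

```python
from collections import deque


class ToyTransferLog:
    """Hypothetical fixed-size log of in-flight writes, grouped into epochs."""

    def __init__(self, slots):
        self.slots = slots          # static, configurable number of slots
        self.entries = deque()      # (epoch, sector) pairs still in flight
        self.epoch = 0
        self.overflowed = False

    def add_write(self, sector):
        # If every slot is in use, the oldest entry is lost and strict
        # write ordering can no longer be guaranteed for this epoch.
        if len(self.entries) == self.slots:
            self.entries.popleft()
            self.overflowed = True
        self.entries.append((self.epoch, sector))

    def close_epoch(self, reported):
        # Count the entries still found for the current epoch and compare
        # that with the number of writes the peer reports for it.
        found = sum(1 for epoch, _ in self.entries if epoch == self.epoch)
        self.entries = deque(e for e in self.entries if e[0] != self.epoch)
        self.epoch += 1
        return found, found == reported


def run_epoch(slots, writes):
    """Send `writes` writes in one epoch through a log with `slots` slots."""
    log = ToyTransferLog(slots)
    for sector in range(writes):
        log.add_write(sector)
    found, ok = log.close_epoch(reported=writes)
    return found, ok, log.overflowed
```

With enough slots the counts agree; with too few, the log overflows and the check fails, which in this toy model is a symptom of the write rate rather than a cause of anything else.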

Could it help to increase the epoch-size?
I see a lot of latency in our network (ping). Could this be the
problem? On the other hand, I'm getting high throughput: an average of
more than 8 MB/s (the maximum I configured in drbd.conf) over 100 Mbit/s
Ethernet isn't that bad.
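
As a rough check of that throughput figure, the arithmetic can be sketched as follows. It assumes the link is 100 Mbit/s Fast Ethernet, that "8 MB" means 8 MB per second, and that 1 MB is 10^6 bytes; the function name `link_utilisation` is made up for illustration, and protocol overhead is ignored.

```python
def link_utilisation(throughput_mb_s, link_mbit_s):
    """Fraction of the raw link rate used by the given throughput."""
    link_mb_s = link_mbit_s / 8      # 8 bits per byte: 100 Mbit/s -> 12.5 MB/s
    return throughput_mb_s / link_mb_s


# 8 MB/s over a 100 Mbit/s link
utilisation = link_utilisation(8, 100)
print(f"{utilisation:.0%} of the raw link rate")
```

Under these assumptions, 8 MB/s is roughly two thirds of what the wire can carry at most.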

>> A few seconds later, kswapd crashed the machine!
> 
> yes.
> you cause high memory and io pressure, drbd notices first,
> and a few seconds later kswapd notices too.

8 MB/s of throughput shouldn't be a problem! The RAID cache policy is
configured as write-back, and 512 MB or 1 GB of RAM should handle this
"pressure" without any problem. BTW: the machine didn't really swap. I
never saw more than 1 or 2 MB!

[...]

>> I could see very strange things:
>> -> mount -o remount,rw, even on a non-DRBD device on the backup machine,
>> suddenly segfaulted after the primary died while data was being received.
>> The second remount killed the machine.
>> -> After copying data to the DRBD device, compiling the bcm5700 driver
>> segfaulted. After rebooting, the compile ran without any problem.
>> -> More on Tuesday, after the dead primary has been rebooted.
>>
>> Problems always happened during or after receiving data, or when creating
>> new logical volumes on the primary or secondary. While copying data
>> to the primary, the data stream suddenly stopped because the primary died
>> (sometimes it couldn't even be pinged after it died).
> 
> bad ram?

I had no problems running a UDB database at a load of 10-20.

> dma flipping bits on transfer?

Is there any way to detect that? I doubt this is possible!

> dma/interrupt conflicts between your various drivers only showing when
> drbd starts stressing them?

The vanilla kernel activated only the NIC and the RAID, nothing more. OK,
I didn't switch off USB in the BIOS.

[...]

> you could try 0.7 series next time.
> it does many things better, or at least different.

Surely!

Kind regards,
Andreas Hartmann