[DRBD-user] Machine crashed repeatedly: drbd16: Epoch set size wrong!!found=1061 reported=1060

Andreas Hartmann andihartmann at freenet.de
Sat Oct 30 20:31:46 CEST 2004

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Andreas Huck wrote:
> Hi,
> 
> On Friday 29 October 2004 17:55, you wrote:
>> Hello Andreas,
> thanks for the personal mail but keeping the problem discussion on the
> list makes it easier for all. Maybe it was by accident, so I hope you don't
> mind the CC.

I hit reply and the address was your personal one. You're right to post
it to the list. Thank you!

>> Andreas Huck wrote:
>> > Hi,
>> >
>> > On Friday 29 October 2004 11:41, Andreas Hartmann wrote:
>> >> Hello all,
>> >>
>> >> I'm wondering about this message, which occurred with drbd 0.6.13 running
>> >> with an original kernel 2.4.27 on an xSeries 235 machine with ServeRAID 5
>> >> and Broadcom gigabit ethernet (bcm5700), while copying data to /dev/nb16.
>> >> The machine has 512 MB RAM and 1024 MB cache.
>> >>
>> >> What does this mean:
>> >> Oct 29 05:30:20 FAGINTSC kernel: drbd16: Epoch set size
>> >> wrong!!found=1061 reported=1060
>> >
>> > I recall having read something similar here,
>> > google(drbd "Epoch set size wrong") comes
>> > up with
>> > http://lists.linbit.com/pipermail/drbd-user/2004-May/000796.html
>> > http://lists.linbit.com/pipermail/drbd-user/2004-May/000797.html
>>
>> I found this too, but I couldn't find any answer concerning the epoch size
>> itself. The response talks about performance problems and the tl-size or
>> snd-buf size (the drbd documentation says these options are for protocol A -
>> but I'm using C). Additionally, I cannot detect any performance problem on
>> either of my machines.
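
(As far as I understand it, the message comes from a sanity check on the
secondary: the primary closes each "epoch" - the set of writes between two
barriers - with a barrier packet that carries the number of writes it sent,
and the receiver compares that with the number of writes it actually
collected. A hypothetical simplification, not the actual 0.6 source; the
function and field names here are made up for illustration:

    /* secondary side: on receiving a barrier, count the write requests
     * collected for the current epoch and compare with the count the
     * primary reported in the barrier packet */
    static void receive_barrier(struct drbd_dev *mdev, int reported)
    {
            struct list_head *le;
            int found = 0;

            /* walk the list of writes belonging to the current epoch */
            list_for_each(le, &mdev->active_epoch)
                    found++;

            if (found != reported)
                    printk(KERN_ERR "drbd%d: Epoch set size wrong!!"
                           "found=%d reported=%d\n",
                           mdev->minor, found, reported);
    }

So found=1061 reported=1060 would mean the secondary counted one write
more in that epoch than the primary reported.)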
> 
> Lars gave some hints in the meantime. Claiming that DRBD is too strong
> for your hardware sounds nice :-) 
> - Did the system die at low load?

Both. Sometimes the initial SyncAll (11 devices syncing at the same time
gives a load of about 11) crashed the machine; sometimes the SyncAll
worked fine and the machine crashed later under low or zero load, e.g.
when accessing the filesystem or when starting lvcreate or lvresize
(after deactivating drbd for the accessed device on both nodes - roughly
the procedure sketched below). Even remounting a different device (one
not served by drbd) crashed the machines: mount segfaulted, and the
second try killed the machine.
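
For the record, the lvresize procedure went roughly like this (a sketch;
the device and volume names are placeholders, and I'm assuming that 0.6's
drbdsetup accepts "down" the way later versions do):

    # on both nodes: take the affected drbd device out of service
    drbdsetup /dev/nb16 down

    # grow the backing logical volume (placeholder names)
    lvresize -L +10G /dev/vg0/lv16

    # afterwards bring the device back up and let it resync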

To decrease the load, I put each device into its own sync group, so that
the resyncs run one after another instead of all at once. But the problem
didn't change.

BTW: I do not have any performance problem: I configured a maximum of
8 MB/s bandwidth for synchronization (on 100 Mbit ethernet), and I really
get this value on average!
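
The relevant fragment of my drbd.conf looks roughly like this (a sketch;
the option names are how I read the 0.6 documentation, and the addresses,
ports and device names are placeholders, not my real values):

    resource drbd16 {
      protocol = C

      net {
        sync-max   = 8M    # cap resync bandwidth at ~8 MB/s
        sync-group = 16    # one group per device, so the groups resync
                           # one after another instead of in parallel
      }

      on node1 {
        device  = /dev/nb16
        disk    = /dev/vg0/lv16
        address = 10.0.0.1
        port    = 7816
      }
      on node2 {
        device  = /dev/nb16
        disk    = /dev/vg0/lv16
        address = 10.0.0.2
        port    = 7816
      }
    }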

> - Can you _trigger_ the failure by increasing system load (bonnie etc.), 

No - it's independent of system load. The death sometimes came suddenly;
sometimes I could feel it coming (gcc always crashed at the same point
while compiling bcm5700; after a reboot, the compile went fine).

>   parallel resync or anything?

I also wrote data to the devices during the SyncAll. It had no effect.
Yes, I tried to really stress the machine (along the lines of the sketch
below) - it has to be stable in production, too.
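
To give an idea of the kind of stress (just a sketch; the mount points
are placeholders):

    # parallel sequential writes onto several drbd-backed filesystems,
    # started while the SyncAll was still running
    for i in 1 2 3 4; do
        dd if=/dev/zero of=/mnt/drbd$i/testfile bs=1M count=2048 &
    done
    wait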

> Try to crash it not by coincidence once a day 
>   but 3 minutes after some action. Well, easy to talk about ...

It was really easy to crash it (no, that wasn't the goal), but there was
no other way! After every crash, 70 GB had to be resynced. Afterwards,
the machines were more or less dead; sometimes they died during the
resync!

> - Is the failure caused by both systems regardless of which one was primary?

Yes.

> - "original" kernel means vanilla? Can you reproduce it with a different
>   kernel? E.g. RH or SLES8-SP3 (just to verify).

Originally, I had SLES8 kernels. I tested two of them, then I took
vanilla 2.4.27. All kernels show the same behaviour.

> - You define "minor_count=40", but the error occurs as well with one
>   device only, I guess.

I never had just one device; I never had fewer than 11.

> - If you have 2x256MB memory each, try with kernel parameter 
>   "mem=256M" (and avoid swapping) to detect faulty memory. 

One machine runs 1 GB of RAM, the other 512 MB. The machine with 1 GB RAM
previously ran 3 UDB databases, one of them 25 GB in size, about 35 GB
all in all. All three databases resided in one drbd device, which lay on
top of LVM. There were no problems. When I ran that setup, I had an "old"
SLES8 kernel (< 2.4.21-231). Before I set up the machines for the new
job, I switched to kernel -241, and finally to -251.
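
I can still try the mem= test; a sketch of what I'd add, assuming LILO as
the boot loader (the image path and root device are placeholders):

    # /etc/lilo.conf fragment: make only 256 MB visible to the kernel,
    # to rule out a bad DIMM above that line
    image = /boot/vmlinuz-2.4.27
        label  = linux-256m
        append = "mem=256M"
        root   = /dev/sda2

    # run lilo(8) afterwards and boot the "linux-256m" entry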


Regards,
Andreas Hartmann



