[DRBD-user] Machine crashed repeatedly: drbd16: Epoch set size wrong!!found=1061 reported=1060

Andreas Huck drbd at huck.it
Sat Oct 30 18:09:56 CEST 2004


Hi,

On Friday 29 October 2004 17:55, you wrote:
> Hello Andreas,
Thanks for the personal mail, but keeping the problem discussion on the
list makes it easier for everyone. Maybe it was by accident, so I hope you
don't mind the CC.

> Andreas Huck schrieb:
> > Hi,
> >
> > On Friday 29 October 2004 11:41, Andreas Hartmann wrote:
> >> Hello all,
> >>
> >> I'm wondering about this message, which occurred with drbd 0.6.13 running
> >> with original kernel 2.4.27 on an XSeries 235 machine with serveraid 5,
> >> broadcom gigabit ethernet (bcm5700) while copying data to /dev/nb16.
> >> The machine has 512 MB RAM and 1024 MB cache.
> >>
> >> What does this mean:
> >> Oct 29 05:30:20 FAGINTSC kernel: drbd16: Epoch set size
> >> wrong!!found=1061 reported=1060
> >
> > I recall having read something similar here,
> > google(drbd "Epoch set size wrong") comes
> > up with
> > http://lists.linbit.com/pipermail/drbd-user/2004-May/000796.html
> > http://lists.linbit.com/pipermail/drbd-user/2004-May/000797.html
>
> I found this too, but I couldn't find any answer concerning the epoch size
> itself. The response speaks about performance problems and the tl-size or
> snd-buf size (drbd docu says: these options are for protocol A - but I'm
> using C). Additionally: I cannot detect any performance problem on either
> of my machines.

Lars has given some hints in the meantime. Claiming DRBD being too strong
for your hardware sounds nice :-)
- Did the system die at low load?
- Can you _trigger_ the failure by increasing system load (bonnie etc.),
  a parallel resync or anything else? Try to crash it not by coincidence
  once a day but 3 minutes after some action. Well, easier said than done ...
- Does the failure occur on both systems, regardless of which one is primary?
- "original" kernel means vanilla? Can you reproduce it with a different
  kernel? E.g. RH or SLES8-SP3 (just to verify).
- You define "minor_count=40", but I guess the error occurs with only one
  device as well.
- If you have 2x256MB memory each, try with kernel parameter
  "mem=256M" (and avoid swapping) to detect faulty memory.
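
To make the "trigger it on purpose" point a bit more concrete, here is a
minimal sketch of a write-load generator in Python. It is only an
illustration: the function name, the block size and the example path in the
note below are made up, and it assumes the DRBD device is mounted somewhere
writable. bonnie and similar tools do the same job far more thoroughly.

```python
import os
import time


def generate_write_load(path, total_bytes, block_size=1024 * 1024):
    """Write total_bytes of random data to path in block_size chunks.

    Each chunk is flushed and fsync'ed so the writes really go down to
    the block device instead of sitting in the page cache.
    Returns (bytes_written, seconds_elapsed).
    """
    block = os.urandom(block_size)  # one random block, reused for every write
    written = 0
    start = time.time()
    with open(path, "wb") as f:
        while written < total_bytes:
            # The last chunk may be shorter than block_size.
            chunk = block[: min(block_size, total_bytes - written)]
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
            written += len(chunk)
    return written, time.time() - start
```

Called as, say, generate_write_load("/mnt/drbd16/load.bin", 2 * 1024**3)
(the mount point is hypothetical), it writes 2 GB to the mirrored device.
Starting several of these in parallel and noting the time should tell you
whether the "Epoch set size wrong" message follows the load.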

>
> That's why I'm asking here, hoping to get an answer to the epoch size
> problem: what does it mean? Is there any switch to control it? I couldn't
> find any hint in the docu. Are there any dependencies between tl-size /
> sndbuf-size and epoch size? I would be very glad, if somebody could
> explain the dependencies because I want to understand it and I must get
> rid of the crashes :-).
See the comments from Lars.

Good luck,
Andreas





