[DRBD-user] Machine crashed repeatedly: drbd16: Epoch set size wrong!!found=1061 reported=1060

Sat Oct 30 15:42:14 CEST 2004

/ 2004-10-30 14:50:01 +0200
\ Andreas Hartmann:
> Hello Lars,
> 
> Lars Ellenberg wrote:
> > / 2004-10-29 22:42:00 +0200
> > \ Andreas Hartmann:
> >> Hello!
> >> 
> >> I reported the same problem to the lkml. This is what Marcelo said:
> > 
> > well, he said nothing.
> 
> He asked the same thing that I want to know, too:
> Oct 29 05:30:20 FAGINTSC kernel: drbd16: Epoch set size wrong!!found=1061
> reported=1060
> 
> What does this mean?

you should have seen some "tl messed up, ... too small", too.

it means that drbd 0.6 has a static (configurable) number of slots in
its transfer log (the "tl"), which it uses to guarantee strict write
ordering on both nodes.

and that you currently generate writes so fast that this log becomes
too small, and thus drbd cannot guarantee that ordering anymore.

this usually is no problem, and it is most likely not the cause of the
problem here, but just another symptom.
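
As a rough illustration of the explanation above, here is a toy model in Python of a log with a fixed number of slots. It is only a sketch and not drbd's actual code: the class `TransferLog`, its methods, and the idea of dropping the oldest entry on overflow are all assumptions made for this example. It only shows how a too-small log can make the locally counted epoch size ("found") disagree with the size the peer reports ("reported"); it does not try to reproduce the exact numbers from the log message.

```python
from collections import deque


class TransferLog:
    """Toy fixed-size log of writes and barriers (hypothetical, not drbd code)."""

    def __init__(self, slots):
        self.slots = slots          # static, configurable number of slots
        self.entries = deque()      # oldest entry on the left
        self.overflowed = False     # set once an entry had to be dropped

    def _push(self, entry):
        # Assumption for this sketch: when all slots are in use,
        # the oldest entry is overwritten and therefore lost.
        if len(self.entries) >= self.slots:
            self.entries.popleft()
            self.overflowed = True
        self.entries.append(entry)

    def add_write(self, sector):
        self._push(("write", sector))

    def add_barrier(self):
        # A barrier closes the current epoch (set of unordered writes).
        self._push(("barrier", None))

    def close_epoch(self, reported):
        """Count logged writes up to the oldest barrier and compare that
        count with the epoch size reported by the peer."""
        found = 0
        while self.entries:
            kind, _ = self.entries.popleft()
            if kind == "barrier":
                break
            found += 1
        return found, reported, found == reported


# Log is large enough: both sides agree on the epoch size.
ok_log = TransferLog(slots=8)
for sector in range(3):
    ok_log.add_write(sector)
ok_log.add_barrier()
ok_result = ok_log.close_epoch(reported=3)

# Log is too small: entries were lost, so the counts disagree.
small_log = TransferLog(slots=4)
for sector in range(5):
    small_log.add_write(sector)
small_log.add_barrier()
small_result = small_log.close_epoch(reported=5)
```

In this sketch `ok_result` reports matching counts, while `small_result` reports a mismatch together with `small_log.overflowed` being set, which corresponds loosely to seeing a "too small" warning alongside the "Epoch set size wrong" message.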

> A few seconds later, kswapd crashed the machine!

yes.
you cause high memory and io pressure, drbd notices first,
and a few seconds later kswapd notices too.

> > I doubt it is drbd's fault. more likely some weird memory pressure thing,
> 
> Two machines that run without any problem when drbd is not in use should
> both have the same hw problems? I can hardly believe it :-).
>
> > or kernel code and gcc optimization does not like your xeons (would not
> > be the first time that weird kernel behaviour occurs with cpus/chipsets
> > that are "too new").
> 
> IBM XSeries 235 are pretty old :-). We cannot even buy them any more.

it would not be the first time. drbd stresses network io, disk io and
sometimes memory _at the same time_, which is a slightly unusual pattern.

> > you may want to recompile your kernel with CONFIG_DEBUG_SLAB,
> > recompile your drbd with DBG_ALL_SYMBOLS, and save the module symbol
> > information (after you loaded drbd) for later reference with ksymoops.
> > then see if you can reproduce the event.
> 
> Yesterday evening, one of the machines crashed again during data transfer
> (I don't know why yet, because the machine is at a different location
> than I am. Maybe I can see something on Tuesday). I rebooted the
> secondary to prevent it from crashing, too.
> 
> > the stack trace you provide is pretty boring^W uninformative, it just
> > tells that kswapd thought it wants to shrink a cache (that's its job, it
> > tries to free pages), and that kmem_cache_reap obviously tried to
> > dereference some "next" pointer that happened to point to 0xffffffff.
> > 
> > which pointer, which slab, which page, which process, why it was set
> > that way, why and when this might have happened, who did it...
> > all pure guesswork.
> 
> Yes, you're right! Nevertheless, without drbd, the machines run
> stably.
> 
> > 
> > but just in case:
> > do you see other log messages that may be drbd related?
> 
> Yes, there are others, but not on the running machine. They are all on the
> machine which is dead at the moment. But often there was nothing to be
> seen at all. The machines were suddenly dead. Mostly, the second machine
> died some time after the primary died.
> 
> I could see very strange things:
> -> mount -o remount,rw, even on a non-drbd device on the backup machine,
> suddenly segfaulted after the primary died while receiving data. The
> second remount killed the machine.
> -> After copying data to the drbd device, compiling the bcm5700 driver
> segfaulted. After rebooting, the compile ran without any problem.
> -> more things on Tuesday, after the dead primary has been rebooted.
>
> Problems always happened during or after receiving data, or when creating
> new logical volumes on the primary or secondary. While copying data
> to the primary, the data stream suddenly stopped because the primary died
> (sometimes it could not even be pinged after it died).

bad ram?
dma flipping bits on transfer?
dma/interrupt conflicts between your various drivers only showing when
drbd starts stressing them?

> Some time after the primary died, the secondary usually died, too.
> Therefore I rebooted it before it could die as well.
> 
> I suspect that suddenly no process could write to the disks any more
> and/or read data correctly from the disk.
> 
> Perhaps it's a problem with LVM? But so far, I couldn't see any problem
> with LVM without drbd (I have been using LVM for a long time).
> 
> I suspected the RAID-device, but I couldn't find anything. The whole
> system runs fine without drbd.
> 
> I tested three SLES8-smp-Kernels with drbd 0.6.13, afterwards a minimal
> smp-vanilla 2.4.27 (optimized for Pentium III) - but the problem persists.
> 
> It seems to me as if drbd is corrupting memory all over the place.

unlikely. someone else would have noticed before.

> > if not, I cannot help you.  if yes, I probably still can not help you,
> > but at least I could try to.
> 
> I appreciate your offer very much and I really want to find the problem
> too, but unfortunately I can't spend any more time investigating it,
> because those machines must go into production asap.
> Therefore I changed my approach and am using rsync with cron now. Even
> after rsyncing 40 GB of data to my running machine, there was no crash,
> no error in the messages log, and no other strange behaviour. Compiling
> runs fine. Hope that it stays quiet :-).

you could try the 0.7 series next time.
it does many things better, or at least differently.

	Lars Ellenberg

-- 
please use the "List-Reply" function of your email client.