[DRBD-user] Drbd 0.6.13 cause hard reset

Mon Sep 27 19:28:42 CEST 2004

/ 2004-09-27 17:59:24 +0200
\ Yannick Lecaillez:
> First, thanks for your quick answer.
> 
> Lars Ellenberg wrote:
> 
> >>      * Could this problem be resolved with an upgrade to 0.7.4 ?
> >>   
> >>
> >you could try.
> >
> >since strange problems happened before with "older" kernels and
> >"too new" machines, you may want to try a 2.6 kernel.
> > 
> >
> Generally, or with drbd ?

I meant generally, but I have met it in some drbd-related bug reports.
iirc, back then it was an early redhat 2.4 kernel on dual xeons, and it
did all sorts of weird stuff, including some oopses apparently related
to drbd (but with stack traces showing impossible code paths).

all problems were resolved magically by using a newer kernel.

the problem may be related to assumptions in the kernel code about cpu
write ordering constraints, which were no longer met by the very latest
and greatest cpus.

> Have you got any links about that ?
nope, sorry.

> >>  However, drbd is a nice piece of software since we encountered more
> >>than 10 crashes on node 1 and nothing was lost (at this time ... :-p).
> >>I'm just sad about the fact it seems to crash our node :-( ...
> >>   
> >>
> >I really doubt it crashes your node.
> >_nothing_ in the kernel is supposed to lead
> >to a silent hard reset. 
> > 
> >
> Yes, this is my first experience of this type with Linux ...
> Sometimes we have a load of about 20 for nearly 1 hour without
> problems ... Our home-made app was launched at 01:00 PM and
> the load was near zero (our customers are 99% French).

[ so this is a time zone thing, or a frenchman thing? :) ]

>  However, this
> app manages to do its task when the access log is small (hundreds
> of Mb), but crashes later on the awstats ...
> 
> >this really sounds like some weird
> >hardware related problem triggered by the heavy disk and network
> >load you put on the box using the applications on top of drbd.
> >
> I will try to put a very big load on the server for each component (RAM, HD
> and Network) and try to see something ...

and combinations thereof.
though I don't remember the details, I think there have been
NIC(driver)s that crashed the box, but only if you had heavy DMA
diskload on more than one channel _and_ at the same time loads of
network traffic. maybe in combination with chipset and irq routing
problems. "software only" problems usually at least manage to get some
log messages to the console even if some really bad things happen.
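
a minimal sketch of what "combinations thereof" could look like, assuming
plain Python 3 is available on the box: disk writes (with fsync) and TCP
traffic run at the same time. all names here (disk_load, net_load,
combined_load) are made up for illustration and have nothing to do with
drbd. note the sketch uses the loopback interface, which does not touch
the NIC at all; to actually stress the NIC and its driver you would
point net_load at a sink running on the peer node, and use far larger
byte counts.

```python
import os
import socket
import tempfile
import threading


def disk_load(path, total_bytes, chunk=1 << 20):
    """Write total_bytes to path in chunks, forcing each chunk to disk."""
    buf = os.urandom(chunk)
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk, total_bytes - written)
            f.write(buf[:n])
            f.flush()
            os.fsync(f.fileno())  # push the data past the page cache
            written += n
    return written


def net_sink(server, result):
    """Accept one connection and count the bytes received."""
    conn, _ = server.accept()
    received = 0
    with conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break
            received += len(data)
    result["received"] = received


def net_load(addr, total_bytes, chunk=65536):
    """Send total_bytes of zero bytes to addr over TCP."""
    buf = bytes(chunk)
    sent = 0
    with socket.create_connection(addr) as s:
        while sent < total_bytes:
            n = min(chunk, total_bytes - sent)
            s.sendall(buf[:n])
            sent += n
    return sent


def combined_load(disk_bytes, net_bytes, directory=None):
    """Run disk and network load concurrently; return the byte counts."""
    result = {}
    # Loopback sink (assumption: stands in for a sink on the peer node).
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    addr = server.getsockname()
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)

    def run_disk():
        result["written"] = disk_load(path, disk_bytes)

    def run_net():
        result["sent"] = net_load(addr, net_bytes)

    threads = [
        threading.Thread(target=net_sink, args=(server, result)),
        threading.Thread(target=run_disk),
        threading.Thread(target=run_net),
    ]
    try:
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        server.close()
        os.remove(path)
    return result
```

calling combined_load(10 * 2**30, 10 * 2**30, directory="/some/mount")
in a loop would be one way to keep both subsystems busy together; the
path is a placeholder. run it with the console visible, since any log
message that shows up before a reset is the interesting part.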

	lge