Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
/ 2004-09-27 17:01:02 +0200
\ Yannick Lecaillez:
> Hi ppl !
>
>     I've serious problem with drbd 0.6.13 in a production environment.
> First of all there is my config (two node with heartbeat) :
>
>     Hardware is Dual Xeon 2.4Ghz / 1Gb RAM / SCSI Raid 5 Adaptec
> and OS is Mandrake 10.0 with kernel 2.4.25. On both nodes.
>
>     Node 1               Node 2                Used by
>     com
>     /web   (primary)     /web   (secondary)    Apache
>     /home  (primary)     /home  (secondary)    Qmail / Vpopmail
>     /rdbms (secondary)   /rdbms (primary)      PostgreSQL / MySQL
>
>     All these one are LVM2 devices with ext3 FS.
>
>     Things goes very well since we launch an home made app which split
> the main apache logfiles (more than 1Gb) by vhosts. All created access
> logs are put in /web (this app insure there is no more than 50 files opened
> simultaneously). When we launch this tools on node 1 this one hard reset
> after 1 or 2 minuts of activity (no kernel panic : hard reset). I can
> say it is not a temperature problem and/or home made app (these point
> was tested before this post :-p).
>
>     We first thinked about the HIGHMEM patch so we removed it on the
> Node 1 (only) : no effect. (By the way, node 1 didn't crash when we made
> a test with drbd module loaded, but *not* started, when running our home
> made app.) It crash 100% of the time when drbd module is loaded and started
> with combination of Apache + Qmail + home made app. Apache + Qmail
> seems to works better (we experienced 2 crash in two weeks without running
> the home made app). Perhaps our home made software stress drbd "badly" ?
> This piece of C code do a lot of write on file opened in append mode
> then close, then re-open it etc ... ?
>
>     So, i've got four question :
>
>         * Is there a problem about kernel 2.4.25 + LVM2 + drbd
>           0.6.13 + SMP combination ?

none I know of.
and nothing should lead to hard reset.

>         * Is the "mix" about primary/secondary device on each
>           node could be a problem ?

you can try that one easily, by putting it all on one node.
still, nothing should cause a silent hard reset.

we had some "hangs" in early 0.6 if you used "mixed" setups and smp.
but no resets. drbd 0.6.12 (13) is believed to have them all fixed,
though.

>         * How is it possible to debug that since we haven't got a
>           single line of information in logs ?

test hardware.
stresstest without drbd, and with disconnected drbd.
try to mimic the workload with simple tools such as dd and netcat.

>         * Is this problem could be resolved with an upgrade to 0.7.4 ?

you could try.
since strange problems happened before with "older" kernels and "too
new" machines, you may want to try a 2.6 kernel.

>     However, drbd is a nice piece of software since we encountered more
> than 10 crashes on node 1 and nothing was lost (at this time ... :-p).
> I'm just sad about the fact it seems to crash our node :-( ...

I really doubt it crashes your node.
_nothing_ in the kernel is supposed to lead to a silent hard reset.

this really sounds like some weird hardware related problem triggered
by the heavy disk and network load you put on the box using the
applications on top of drbd.

	Lars Ellenberg

-- 
please use the "List-Reply" function of your email client.
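
[archive note] The advice above to mimic the workload with simple tools can also be approximated with a short script. What follows is only a rough sketch, not the poster's home made C app: it assumes the load is "append a line to one of many per-vhost log files, keep at most 50 open at once, close and re-open as needed", as described in the quoted mail. The function name `append_stress`, the `vhost-NNNN.log` file names, and all sizes and counts are hypothetical. Run it against a scratch directory on the device under test, never against production data.

```python
import os
import random
import tempfile


def append_stress(directory, n_files=200, max_open=50, rounds=1000,
                  line=b"x" * 200 + b"\n", seed=0):
    """Append `rounds` lines to randomly chosen files in `directory`.

    At most `max_open` files are kept open at once; when the limit is
    reached the oldest handle is closed, so files get closed and later
    re-opened in append mode. Returns the total number of bytes written.
    """
    rng = random.Random(seed)
    open_files = {}   # file index -> open handle, oldest first
    written = 0
    try:
        for _ in range(rounds):
            idx = rng.randrange(n_files)
            fh = open_files.get(idx)
            if fh is None:
                if len(open_files) >= max_open:
                    # Close the oldest handle to stay under the limit.
                    oldest = next(iter(open_files))
                    open_files.pop(oldest).close()
                # Hypothetical file naming, one log per "vhost".
                path = os.path.join(directory, "vhost-%04d.log" % idx)
                fh = open(path, "ab")
                open_files[idx] = fh
            written += fh.write(line)
    finally:
        for fh in open_files.values():
            fh.close()
    return written


if __name__ == "__main__":
    # A temporary directory is used here only so the sketch is harmless;
    # replace it with a scratch directory on the filesystem being tested.
    with tempfile.TemporaryDirectory() as scratch:
        total = append_stress(scratch)
        print("wrote", total, "bytes")
```

Running the same script once on a plain local filesystem, once on the drbd device while disconnected, and once while connected would follow the order of tests suggested in the reply. Raising `rounds` and `n_files` increases the load.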