Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all! I have a serious problem with drbd 0.6.13 in a production environment.

First of all, here is my setup (two nodes with heartbeat). Hardware is Dual Xeon 2.4 GHz / 1 GB RAM / Adaptec SCSI RAID 5, and the OS is Mandrake 10.0 with kernel 2.4.25, on both nodes.

  Node 1               Node 2               Used by
  /web   (primary)     /web   (secondary)   Apache
  /home  (primary)     /home  (secondary)   Qmail / Vpopmail
  /rdbms (secondary)   /rdbms (primary)     PostgreSQL / MySQL

All of these are LVM2 devices with ext3 filesystems.

Things went very well until we launched a home-made app which splits the main Apache logfile (more than 1 GB) into per-vhost access logs. All the created access logs are written into /web (the app ensures that no more than 50 files are open simultaneously). When we launch this tool on node 1, the node hard-resets after 1 or 2 minutes of activity (no kernel panic: a hard reset). I can say it is not a temperature problem and/or the home-made app itself (these points were tested before this post :-p).

We first suspected the HIGHMEM patch, so we removed it on node 1 (only): no effect. (By the way, node 1 did not crash when we ran our home-made app with the drbd module loaded but *not* started.) It crashes 100% of the time when the drbd module is loaded and started, in combination with Apache + Qmail + the home-made app. Apache + Qmail alone seems to work better (we experienced 2 crashes in two weeks without running the home-made app).

Perhaps our home-made software stresses drbd "badly"? This piece of C code does a lot of writes to a file opened in append mode, then closes it, then re-opens it, and so on (see the sketch after the config below).

So, I have four questions:

 * Is there a known problem with the kernel 2.4.25 + LVM2 + drbd 0.6.13 + SMP combination?
 * Could the "mix" of primary/secondary devices on each node be a problem?
 * How can we debug this, given that we don't get a single line of information in the logs?
 * Could this problem be resolved by an upgrade to 0.7.4?

That said, drbd is a nice piece of software: we have encountered more than 10 crashes on node 1 and nothing was lost (so far ... :-p). I'm just sad that it seems to crash our node :-( ...

Any comment, proposition, etc. is welcome!

Greets,
Yannick.

Here is the drbd.conf:

resource drbd0 {
  protocol = C
  fsckcmd  = /bin/true

  net {
    sync-min    = 10M
    sync-max    = 50M
    sync-rate   = 50M
    connect-int = 10
    ping-int    = 10
  }

  disk {
    disk-size = 48336896
  }

  on lrvs2 {
    device  = /dev/nbd/0
    disk    = /dev/mapper/vg01-lvrdbms
    address = 192.168.4.2
    port    = 7788
  }

  on lrvs1 {
    device  = /dev/nbd/0
    disk    = /dev/mapper/vg01-lvrdbms
    address = 192.168.4.1
    port    = 7788
  }
}

resource drbd1 {
  protocol = C
  fsckcmd  = /bin/true

  net {
    sync-min    = 10M
    sync-max    = 50M
    sync-rate   = 50M
    connect-int = 10
    ping-int    = 10
  }

  disk {
    disk-size = 89128960
  }

  on lrvs2 {
    device  = /dev/nbd/1
    disk    = /dev/mapper/vg01-lvweb
    address = 192.168.4.2
    port    = 7789
  }

  on lrvs1 {
    device  = /dev/nbd/1
    disk    = /dev/mapper/vg01-lvweb
    address = 192.168.4.1
    port    = 7789
  }
}

resource drbd2 {
  protocol = C
  fsckcmd  = /bin/true

  net {
    sync-min    = 10M
    sync-max    = 50M
    sync-rate   = 50M
    connect-int = 10
    ping-int    = 10
  }

  disk {
    disk-size = 51404800
  }

  on lrvs2 {
    device  = /dev/nbd/2
    disk    = /dev/mapper/vg01-lvhome
    address = 192.168.4.2
    port    = 7790
  }

  on lrvs1 {
    device  = /dev/nbd/2
    disk    = /dev/mapper/vg01-lvhome
    address = 192.168.4.1
    port    = 7790
  }
}
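
For what it's worth, here is a minimal sketch of the kind of append/close/re-open loop the log splitter does on the drbd-backed /web device. The file names, input format, and the one-open-per-line simplification (the real tool caches up to 50 open files) are made up for illustration; this is not the actual program:

  /* Sketch of the write pattern described above: derive a per-vhost log
   * path, open it in append mode, write one line, close, and repeat.
   * Reads the combined Apache log from stdin. Paths/format are assumed. */
  #include <stdio.h>

  int main(void)
  {
      char line[4096];

      while (fgets(line, sizeof(line), stdin) != NULL) {
          char vhost[256];
          char path[512];
          FILE *out;

          /* assume the vhost name is the first whitespace-separated field */
          if (sscanf(line, "%255s", vhost) != 1)
              continue;

          snprintf(path, sizeof(path), "/web/logs/%s-access_log", vhost);

          /* open in append mode, write, close again -- this open/write/close
           * cycle is the pattern that seems to stress drbd on our setup */
          out = fopen(path, "a");
          if (out == NULL)
              continue;
          fputs(line, out);
          fclose(out);
      }
      return 0;
  }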