Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi ppl !
I have a serious problem with drbd 0.6.13 in a production environment.
First of all, here is my setup (two nodes with heartbeat):
Hardware on both nodes is a dual Xeon 2.4 GHz with 1 GB RAM and an Adaptec
SCSI RAID 5 controller, and the OS is Mandrake 10.0 with kernel 2.4.25.
Node 1                Node 2                Used by
/web   (primary)      /web   (secondary)    Apache
/home  (primary)      /home  (secondary)    Qmail / Vpopmail
/rdbms (secondary)    /rdbms (primary)      PostgreSQL / MySQL
All of these are LVM2 logical volumes with an ext3 filesystem.
Everything was going very well until we launched a home-made app that splits
the main Apache log files (more than 1 GB) by vhost. All the access logs it
creates are written to /web (the app ensures that no more than 50 files are
open simultaneously). When we run this tool on node 1, that node hard-resets
after 1 or 2 minutes of activity (no kernel panic: a hard reset). I can say it
is not a temperature problem and/or the home-made app itself (these points
were tested before this post :-p).
We first suspected the HIGHMEM patch, so we removed it on node 1 (only):
no effect. (By the way, node 1 did not crash when we ran our home-made app
with the drbd module loaded but *not* started.) It crashes 100% of the time
when the drbd module is loaded and started in combination with Apache + Qmail
+ the home-made app. Apache + Qmail alone seems to work better (we experienced
2 crashes in two weeks without running the home-made app). Perhaps our
home-made software stresses drbd "badly"? This piece of C code does a lot of
writes to a file opened in append mode, then closes it, then re-opens it,
etc ...
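For what it's worth, the write pattern is roughly the following (a simplified
sketch, not the actual tool: the file names and the way the vhost is extracted
are made up here, and the real program additionally caps the number of
simultaneously open files at 50):

#include <stdio.h>

/* Append one line to the given log file: open in append mode, write,
 * close again.  This open/write/close/re-open cycle is what the tool
 * does for every log line it dispatches. */
static void append_line(const char *path, const char *line)
{
    FILE *fp = fopen(path, "a");
    if (fp == NULL) {
        perror(path);
        return;
    }
    fputs(line, fp);
    fclose(fp);
}

int main(void)
{
    char line[4096];

    /* Read the combined Apache log on stdin; the vhost is assumed to be
     * the first whitespace-separated field (this depends on the LogFormat
     * actually in use). */
    while (fgets(line, sizeof(line), stdin) != NULL) {
        char vhost[256], path[512];

        if (sscanf(line, "%255s", vhost) != 1)
            continue;

        /* Hypothetical output location on the replicated /web volume. */
        snprintf(path, sizeof(path), "/web/logs/%s-access.log", vhost);
        append_line(path, line);
    }
    return 0;
}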
So, I've got four questions:
* Is there a known problem with the kernel 2.4.25 + LVM2 + drbd 0.6.13 +
SMP combination?
* Could the "mix" of primary/secondary devices across the two nodes
be a problem?
* How can we debug this, given that we don't have a single line of
information in the logs?
* Could this problem be resolved by an upgrade to 0.7.4?
That said, drbd is a nice piece of software: we have had more than 10 crashes
on node 1 and nothing was lost (so far ... :-p). I'm just sad that it seems to
crash our node :-( ...
Any comment, proposal, etc. is welcome!
Greets, Yannick.
Here is the drbd.conf:
resource drbd0 {
  protocol = C
  fsckcmd = /bin/true

  net {
    sync-min = 10M
    sync-max = 50M
    sync-rate = 50M
    connect-int = 10
    ping-int = 10
  }

  disk {
    disk-size = 48336896
  }

  on lrvs2 {
    device = /dev/nbd/0
    disk = /dev/mapper/vg01-lvrdmbs
    address = 192.168.4.2
    port = 7788
  }

  on lrvs1 {
    device = /dev/nbd/0
    disk = /dev/mapper/vg01-lvrdbms
    address = 192.168.4.1
    port = 7788
  }
}

resource drbd1 {
  protocol = C
  fsckcmd = /bin/true

  net {
    sync-min = 10M
    sync-max = 50M
    sync-rate = 50M
    connect-int = 10
    ping-int = 10
  }

  disk {
    disk-size = 89128960
  }

  on lrvs2 {
    device = /dev/nbd/1
    disk = /dev/mapper/vg01-lvweb
    address = 192.168.4.2
    port = 7789
  }

  on lrvs1 {
    device = /dev/nbd/1
    disk = /dev/mapper/vg01-lvweb
    address = 192.168.4.1
    port = 7789
  }
}

resource drbd2 {
  protocol = C
  fsckcmd = /bin/true

  net {
    sync-min = 10M
    sync-max = 50M
    sync-rate = 50M
    connect-int = 10
    ping-int = 10
  }

  disk {
    disk-size = 51404800
  }

  on lrvs2 {
    device = /dev/nbd/2
    disk = /dev/mapper/vg01-lvhome
    address = 192.168.4.2
    port = 7790
  }

  on lrvs1 {
    device = /dev/nbd/2
    disk = /dev/mapper/vg01-lvhome
    address = 192.168.4.1
    port = 7790
  }
}