Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi ppl !
I have a serious problem with drbd 0.6.13 in a production environment.
First of all, here is my setup (two nodes with heartbeat):
Hardware on both nodes is a dual Xeon 2.4 GHz with 1 GB RAM and an Adaptec
SCSI RAID 5 controller, and the OS is Mandrake 10.0 with kernel 2.4.25.
Node 1                Node 2                Used by
/web   (primary)      /web   (secondary)    Apache
/home  (primary)      /home  (secondary)    Qmail / Vpopmail
/rdbms (secondary)    /rdbms (primary)      PostgreSQL / MySQL
All of these are LVM2 logical volumes with an ext3 filesystem.
Everything was going very well until we launched a home-made app that splits
the main Apache log files (more than 1 GB) by vhost. All the access logs it
creates are written to /web (the app ensures that no more than 50 files are
open simultaneously). When we run this tool on node 1, that node hard-resets
after 1 or 2 minutes of activity (no kernel panic: a hard reset). I can say it
is not a temperature problem and/or the home-made app itself (these points
were tested before this post :-p).
We first suspected the HIGHMEM patch, so we removed it on node 1 (only):
no effect. (By the way, node 1 did not crash when we ran our home-made app
with the drbd module loaded but *not* started.) It crashes 100% of the time
when the drbd module is loaded and started in combination with Apache + Qmail
+ the home-made app. Apache + Qmail alone seems to work better (we experienced
2 crashes in two weeks without running the home-made app). Perhaps our
home-made software stresses drbd "badly"? This piece of C code does a lot of
writes to a file opened in append mode, then closes it, then re-opens it,
etc ...
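For what it's worth, the write pattern is roughly the following (a simplified
sketch, not the actual tool: the file names and the way the vhost is extracted
are made up here, and the real program additionally caps the number of
simultaneously open files at 50):

#include <stdio.h>

/* Append one line to the given log file: open in append mode, write,
 * close again.  This open/write/close/re-open cycle is what the tool
 * does for every log line it dispatches. */
static void append_line(const char *path, const char *line)
{
    FILE *fp = fopen(path, "a");
    if (fp == NULL) {
        perror(path);
        return;
    }
    fputs(line, fp);
    fclose(fp);
}

int main(void)
{
    char line[4096];

    /* Read the combined Apache log on stdin; the vhost is assumed to be
     * the first whitespace-separated field (this depends on the LogFormat
     * actually in use). */
    while (fgets(line, sizeof(line), stdin) != NULL) {
        char vhost[256], path[512];

        if (sscanf(line, "%255s", vhost) != 1)
            continue;

        /* Hypothetical output location on the replicated /web volume. */
        snprintf(path, sizeof(path), "/web/logs/%s-access.log", vhost);
        append_line(path, line);
    }
    return 0;
}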
So, I've got four questions:
* Is there a known problem with the kernel 2.4.25 + LVM2 + drbd 0.6.13 +
SMP combination?
* Could the "mix" of primary/secondary devices across the two nodes
be a problem?
* How can we debug this, given that we don't have a single line of
information in the logs?
* Could this problem be resolved by an upgrade to 0.7.4?
That said, drbd is a nice piece of software: we have had more than 10 crashes
on node 1 and nothing was lost (so far ... :-p). I'm just sad that it seems to
crash our node :-( ...
Any comment, proposal, etc. is welcome!
Greets, Yannick.
Here is the drbd.conf:
resource drbd0 {
  protocol = C
  fsckcmd = /bin/true

  net {
    sync-min = 10M
    sync-max = 50M
    sync-rate = 50M
    connect-int = 10
    ping-int = 10
  }

  disk {
    disk-size = 48336896
  }

  on lrvs2 {
    device = /dev/nbd/0
    disk = /dev/mapper/vg01-lvrdmbs
    address = 192.168.4.2
    port = 7788
  }

  on lrvs1 {
    device = /dev/nbd/0
    disk = /dev/mapper/vg01-lvrdbms
    address = 192.168.4.1
    port = 7788
  }
}

resource drbd1 {
  protocol = C
  fsckcmd = /bin/true

  net {
    sync-min = 10M
    sync-max = 50M
    sync-rate = 50M
    connect-int = 10
    ping-int = 10
  }

  disk {
    disk-size = 89128960
  }

  on lrvs2 {
    device = /dev/nbd/1
    disk = /dev/mapper/vg01-lvweb
    address = 192.168.4.2
    port = 7789
  }

  on lrvs1 {
    device = /dev/nbd/1
    disk = /dev/mapper/vg01-lvweb
    address = 192.168.4.1
    port = 7789
  }
}

resource drbd2 {
  protocol = C
  fsckcmd = /bin/true

  net {
    sync-min = 10M
    sync-max = 50M
    sync-rate = 50M
    connect-int = 10
    ping-int = 10
  }

  disk {
    disk-size = 51404800
  }

  on lrvs2 {
    device = /dev/nbd/2
    disk = /dev/mapper/vg01-lvhome
    address = 192.168.4.2
    port = 7790
  }

  on lrvs1 {
    device = /dev/nbd/2
    disk = /dev/mapper/vg01-lvhome
    address = 192.168.4.1
    port = 7790
  }
}