Thanks a lot for this quick and long answer!

Lars Ellenberg wrote:
> On Mon, Jun 23, 2008 at 04:32:25PM +0200, Eric Marin wrote:
>> Hello and sorry about the length of this report,
>
> I have a few comments, below.
> (...)
> this does not really help "after the fact".
> it would be of interest to see whether this is in file "payload data"
> area, or in file system "meta data" area (allocation bitmaps and such).
>
> I don't know off the top of my head how to find out for ext3,
> but I remember it was not too difficult.

I'm sorry, I'm not sure I understand...

> there are a few basic ways how data can end up being different on the
> two replicas. let us leave out additional modes that could be seen when
> hard node crashes/power cycles/etc are involved, but focus only on what
> could happen during "normal" operation.
> - hardware problems, like broken RAM
> - a bug in drbd
> - a bug in other drivers
> - someone fiddling around with direct access or otherwise bypassing
>   drbd
>

A consistency check on the RAID array didn't detect any error, so I
suppose the disks are clean.
Right now, I'm running MemTest86+.
Maybe the crossover ethernet cable is simply bad (!), but I don't think
this would explain the crashes. I did this once and got identical
checksums:

    ldap-a:~# dd if=/dev/urandom of=/tmp/foobar bs=1M count=256
    ldap-a:~# md5sum /tmp/foobar
    ldap-b:~# netcat -l -p 4711 | md5sum
    ldap-a:~# netcat -q0 192.168.0.2 4711 < /tmp/foobar

> maybe first enable the "integrity checking",
> see if that detects something,
> then disable the checksum offloading,
> see if it still occurs,
> if not leave offloading disabled,
> but disable again the "integrity checking" for performance reasons.

OK, I'm going to try that next.

>> ldap-a has crashed twice in three weeks,
>
> what exactly does "crashed" mean?
> just "unresponsive"? panic? oops? BUG()?
> any logs? anything from the console?

The screen is completely black and the keyboard unresponsive.
Even Alt+Print+B (magic SysRq) doesn't reboot the server; I have to
press the power button for a few seconds.
I can't check right now, but the first time it crashed, I couldn't find
anything in the logs, except for out-of-sync warnings and a corrupted FS
on a partition mounted by DRBD.
Only ldap-a has crashed so far, but "drbdadm verify all" was always
executed on ldap-a.

> oh, and in that case (crashed a few times),
> we need to consider more failure modes:
> is there any volatile disk cache involved?
> is no-disk-flushes set?
> is no-md-flushes set?

Do you mean volatile disk cache on the RAID card?
I enabled no-disk-flushes and no-md-flushes so as not to pollute the logs
with "local disk flush failed"; I was getting lots of those.
I think this is safe with a hardware RAID card with battery-backed cache
(PERC 6i), isn't it (I hope so!)? The write policy is: write back.

Contents of /etc/drbd.conf:
--------8<----------------------------------------------------------------------
global {
    usage-count no;
}

common {
    handlers {
        outdate-peer "echo 'Cable croise debranche entre ldap-a et ldap-b ?' | /usr/bin/Mail -s 'OUTDATE-PEER SUR LDAP/CAS !' xxxx at utc.fr & /usr/lib/heartbeat/drbd-peer-outdater";
        split-brain "echo 'DRBD a detecte une situation de split-brain. Intervention manuelle necessaire sur ldap-a et ldap-b !' | /usr/bin/Mail -s 'SPLIT-BRAIN SUR LDAP/CAS !' xxxx at utc.fr";
        out-of-sync "echo 'DRBD est desynchronise. Intervention manuelle necessaire sur ldap-a et ldap-b !' | /usr/bin/Mail -s 'OUT-OF-SYNC SUR LDAP/CAS !' xxxx at utc.fr";
    }
    net {
        cram-hmac-alg sha1;
        shared-secret "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
    }
    syncer {
        rate 33M;
        verify-alg sha1;
        cpu-mask 1;
    }
    disk {
        on-io-error detach;
        fencing resource-only;
        no-disk-flushes;
        no-md-flushes;
    }
    startup {
        degr-wfc-timeout 30;
    }
    protocol C;
}

# LDAP, MySQL, Apache
resource drbd0 {
    device /dev/drbd0;
    disk /dev/sda7;
    meta-disk /dev/sda6[0];
    on ldap-a {
        address 192.168.0.1:7788;
    }
    on ldap-b {
        address 192.168.0.2:7788;
    }
}

# Tomcat, CAS
resource drbd1 {
    device /dev/drbd1;
    disk /dev/sda8;
    meta-disk /dev/sda6[1];
    syncer {
        after drbd0;
    }
    on ldap-a {
        address 192.168.0.1:7789;
    }
    on ldap-b {
        address 192.168.0.2:7789;
    }
}
-----------------------------------------------------------------------8<-------

A few other things of interest:

- kernel = 2.6.18-6-686-bigmem #1 SMP Fri Jun 6 23:31:15 UTC 2008 i686

- in /etc/sysctl.conf, I put:
----8<-------------------------------------------------------------------
# In case of a 'kernel panic' or 'oops', I want the node to reboot
# so that the resources are taken by the other node.
kernel.panic_on_oops = 1
kernel.panic = 1
------------------------------------------------------------------->8----
  But it didn't reboot either time it crashed. Maybe these settings
  could be a source of problems?

- about the RAID card, I get this in DELL Open Manage Server
  Administrator:
    Firmware version                  6.0.2-0002
    Driver version                    00.00.03.01
    Minimal required driver version   00.00.03.13
  => I first dismissed this warning, maybe I shouldn't have... though
  I'm not sure how to update this driver while keeping the stock Debian
  kernel. Maybe I could find a new module...

- drbd0 contains live databases for OpenLDAP and MySQL, perhaps these
  could have the usage patterns you refer to? They are not often
  modified, though: only a few entries a day.

> possibly, but I'd say unlikely, until proven otherwise (by an oops stack
> trace for example).

I'm not used to this.
Do I need to recompile the kernel to get this stack trace?

> I'd say the additional load of the verify caused
> some unexpected memory pressure/io-load, and your box was unable to
> handle that. serialize your resources, reduce the syncer rate, maybe
> reduce "max-buffers".

By serializing resources, do you mean this for drbd1?

    syncer { after drbd0; }

Syncer rate is 33M for now, as suggested by the official documentation
for a gigabit connection.
What would you suggest for max-buffers: 32, which is the minimum?

Again, thank you for your answer: I'm still worried, but I have a few
things to test now.
If you need any other piece of information, please don't hesitate!

Eric
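
For reference, the test sequence suggested above (integrity checking
first, then checksum offloading) could be sketched roughly as follows.
This is only a sketch under assumptions, not something tested on these
machines: it assumes the installed DRBD version supports the
data-integrity-alg option in the net section, that sha1 is an acceptable
digest (chosen here only because verify-alg already uses it), and that
the crossover link is eth1 (a hypothetical interface name).

```
# Step 1: enable DRBD's integrity checking of replicated data.
# Added to the existing "net" section of "common" in /etc/drbd.conf
# (the cram-hmac-alg and shared-secret lines stay as they are):
net {
    data-integrity-alg sha1;   # assumption: same digest as verify-alg
}
# Then, on both nodes, apply the changed configuration:
#   drbdadm adjust all
#
# Step 2: if corruption is reported, disable checksum offloading on the
# replication link on both nodes (eth1 is an assumed interface name):
#   ethtool -k eth1                 # show the current offload settings
#   ethtool -K eth1 rx off tx off   # disable RX/TX checksum offloading
#
# Step 3: if the problem no longer occurs with offloading disabled,
# leave offloading disabled and remove data-integrity-alg again for
# performance reasons.
```

Doing the steps in this order keeps the two possible causes apart:
integrity checking shows whether data is damaged in transit at all, and
only then is offloading changed to see whether the NIC is responsible.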