Hi guys,

Background:
- 100 Gig DRBD partition
- DRBD 0.6.12-3
- Fedora Core 1
- GigE net card (e1000-5.2.52)

My DRBD had been working fine for about 9 months. The other day one of the machines crashed, and since then it has kept crashing.

I tried syncing the secondary to the primary a number of times with no luck - the secondary would crash 20%, 30%, 50% through the sync, leaving a whole bunch of RX and TX errors on the network card (I used ifconfig to see this).

Eventually I got a full sync, but a few hours later it crashed out again. The machine stays up - you can ping it, but you cannot ssh to it. Once again there were huge numbers of RX and TX 'dropped', 'errors' and 'overruns'.

I first thought it was DRBD, as it always happened during the sync. However, since it got a full sync, I'm now guessing it might be a broken network card? This is reinforced by the fact that the machine had been running fine for about 9 months and the primary is an identical machine - so this rules out software/driver problems...

I've attached the logs which show what happened. I've run a memtest on the machine and it doesn't look like bad RAM.

Does anyone have any ideas what it might be? :)
- bad network card?
- bad hard drive?
- software misconfiguration?
- bad motherboard?

Any help/opinion is most appreciated :)

Cheers
Jon

On the master machine (which doesn't crash) these are the messages I get:
**************************************************************************
Feb 14 15:32:22 jack kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Feb 14 15:49:06 jack kernel: drbd0: Connection established. size=100179416 KB / blksize=4096 B
Feb 14 15:49:06 jack kernel: drbd0: Synchronisation started blks=15
Feb 14 15:56:08 jack sshd(pam_unix)[25269]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=jill user=root
Feb 14 15:56:48 jack kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Feb 14 15:56:51 jack kernel: drbd0: [drbd_syncer_0/24935] sock_sendmsg time expired, ko = 3
Feb 14 15:56:53 jack kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
Feb 14 15:56:54 jack kernel: drbd0: Syncer send failed.
Feb 14 15:56:54 jack kernel: drbd0: Connection lost.

On the slave machine (which crashes), these are the messages I get:
**************************************************************************
Feb 14 15:49:07 jill drbd: ===> drbd start <===
Feb 14 15:49:07 jill drbd: modprobe -s drbd minor_count=1
Feb 14 15:49:07 jill kernel: drbd: initialised. Version: 0.6.12 (api:64/proto:62)
Feb 14 15:49:07 jill drbd: drbdsetup /dev/nb0 disk /dev/VolGroupData/LogVolData --do-panic --disk-size=100179416k
Feb 14 15:49:07 jill drbd: drbdsetup /dev/nb0 net 10.20.1.2:7788 10.20.1.1:7788 C --sync-min=5M --sync-max=50M --tl-size=5000 --timeout=60 --connect-int=10 --ping-int=10 --ko-count=4
Feb 14 15:49:07 jill drbd: drbdsetup /dev/nb0 wait_connect -t 1
Feb 14 15:49:07 jill kernel: drbd0: Connection established. size=100179416 KB / blksize=4096 B
Feb 14 15:49:07 jill kernel: klogd 1.4.1, ---------- state change ----------
Feb 14 15:49:07 jill drbd: 'drbd0' SyncingAll, waiting for this to finish
Feb 14 15:49:07 jill drbd: drbdsetup /dev/nb0 wait_sync
Feb 14 15:49:13 jill drbd: ERROR: drbdsetup /dev/nb0 wait_sync [20]:
Feb 14 15:49:13 jill drbd: ERROR: ioctl(wait_sync): Interrupted system call
Feb 14 15:49:13 jill drbd: 'drbd0' wait_sync terminated unexpectedly
Feb 14 15:54:25 jill drbd: ===> drbd stop <===

--
**************************************************
Jonathan Soong
Institute of Medical and Veterinary Science
Information, Communication and Technology Services
www.imvs.org
Ph: +61 8 8222 3095
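
A minimal sketch, assuming syslog lines in the same e1000_watchdog format as the ones quoted above, for pulling link transitions out of a log so that a renegotiation to a lower speed stands out (in the master's log, eth1 starts at 1000 Mbps and comes back up at 100 Mbps). The function names link_events and speed_drops are hypothetical helpers for illustration, not part of DRBD or the e1000 driver.

```python
import re

# Matches e1000 watchdog link messages in the format quoted in the logs above.
LINK_RE = re.compile(
    r"e1000: (?P<iface>\w+): e1000_watchdog: NIC Link is "
    r"(?P<state>Up|Down)(?: (?P<speed>\d+) Mbps (?P<duplex>\w+) Duplex)?"
)


def link_events(lines):
    """Return (iface, state, speed_mbps or None) for each link message."""
    events = []
    for line in lines:
        m = LINK_RE.search(line)
        if m:
            speed = int(m.group("speed")) if m.group("speed") else None
            events.append((m.group("iface"), m.group("state"), speed))
    return events


def speed_drops(events):
    """Return (iface, old_mbps, new_mbps) each time a link comes up slower."""
    last = {}   # last known negotiated speed per interface
    drops = []
    for iface, state, speed in events:
        if state != "Up" or speed is None:
            continue
        if iface in last and speed < last[iface]:
            drops.append((iface, last[iface], speed))
        last[iface] = speed
    return drops


# The three link messages from the master's log above.
sample = [
    "Feb 14 15:32:22 jack kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex",
    "Feb 14 15:56:48 jack kernel: e1000: eth1: e1000_watchdog: NIC Link is Down",
    "Feb 14 15:56:53 jack kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex",
]

print(speed_drops(link_events(sample)))
```

Running the same two functions over the full /var/log/messages on both machines (the path is an assumption; it depends on the syslog configuration) would show whether the link flaps or renegotiates before each failed sync.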