[DRBD-user] 0.6.12-3 Crashes - but could be hardware?

Jon Soong jon.soong at imvs.sa.gov.au
Thu Feb 17 05:37:02 CET 2005


Hi guys,

Background:
- 100 Gig DRBD partition
- DRBD 0.6.12-3
- Fedora Core 1
- GigE net card (e1000-5.2.52)

My DRBD setup had been working fine for about 9 months. The other day one
of the machines crashed, and it has kept crashing ever since. I tried
syncing the secondary to the primary a number of times with no luck - the
secondary would crash 20%, 30%, 50% of the way through the sync, leaving a
whole bunch of RX and TX errors on the network card (I used ifconfig to
see this).

Eventually I got a full sync, but a few hours later it crashed again.
The machine does not go down completely - you can ping it, but you cannot
ssh to it. Once again there were huge numbers of RX and TX 'dropped',
'errors' and 'overruns'.
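
For anyone wanting to track those counters over time rather than eyeballing
ifconfig, here is a minimal Python sketch. It assumes a Linux box, where the
per-interface counters are exposed in /proc/net/dev (the 'fifo' column is,
as far as I know, what ifconfig reports as 'overruns'). The sample text and
its numbers are made up for illustration and are not taken from these
machines; the function names are hypothetical.

```python
# Column names as they appear in the two header lines of /proc/net/dev.
RX_FIELDS = ("bytes", "packets", "errs", "drop", "fifo", "frame",
             "compressed", "multicast")
TX_FIELDS = ("bytes", "packets", "errs", "drop", "fifo", "colls",
             "carrier", "compressed")

# Counters that indicate trouble rather than normal traffic.
ERROR_KEYS = ("rx_errs", "rx_drop", "rx_fifo", "rx_frame",
              "tx_errs", "tx_drop", "tx_fifo", "tx_colls", "tx_carrier")

# Made-up sample in /proc/net/dev layout (NOT data from the original post).
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth1: 5000000    4000  120   35   17     3          0         0  7000000    6000   44    9    2     0       1          0
"""


def parse_proc_net_dev(text):
    """Return {interface: {"rx_errs": int, ...}} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        values = [int(v) for v in rest.split()]
        if len(values) < 16:
            continue  # unexpected layout, ignore the line
        counters = {}
        for i, field in enumerate(RX_FIELDS):
            counters["rx_" + field] = values[i]
        for i, field in enumerate(TX_FIELDS):
            counters["tx_" + field] = values[8 + i]
        stats[name.strip()] = counters
    return stats


def nonzero_error_counters(counters):
    """Return only the error-type counters that are greater than zero."""
    return {k: counters[k] for k in ERROR_KEYS if counters[k] > 0}


# On a real machine, pass open("/proc/net/dev").read() instead of SAMPLE.
for iface, counters in parse_proc_net_dev(SAMPLE).items():
    print(iface, nonzero_error_counters(counters))
```

Running this from cron and comparing successive readings would show whether
the error counters climb only during a sync or all the time.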

I first thought it was DRBD, as it always happened during the sync.
However, since it did complete a full sync and then crashed anyway, I'm now
guessing it might be a broken network card?

This is reinforced by the fact that the machine had been running fine for
about 9 months and the primary is an identical machine - so I think this
rules out software/driver problems...

I've attached the logs, which show what happened.

I've run a memtest on the machine and it doesn't look like bad RAM.

Does anyone have any ideas what it might be? :)
   - bad network card?
   - bad hard drive?
   - software misconfiguration?
   - bad motherboard?

Any help/opinion is most appreciated :)

Cheers

Jon


On the master machine (which doesn't crash), these are the messages I get:
**************************************************************************
Feb 14 15:32:22 jack kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Feb 14 15:49:06 jack kernel: drbd0: Connection established. size=100179416 KB / blksize=4096 B
Feb 14 15:49:06 jack kernel: drbd0: Synchronisation started blks=15
Feb 14 15:56:08 jack sshd(pam_unix)[25269]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=jill  user=root
Feb 14 15:56:48 jack kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Feb 14 15:56:51 jack kernel: drbd0: [drbd_syncer_0/24935] sock_sendmsg time expired, ko = 3
Feb 14 15:56:53 jack kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
Feb 14 15:56:54 jack kernel: drbd0: Syncer send failed.
Feb 14 15:56:54 jack kernel: drbd0: Connection lost.
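
The master's log shows eth1 going down at 15:56:48 and coming back five
seconds later at 100 Mbps instead of 1000 Mbps. A small Python sketch for
pulling such link events out of a syslog file and flagging speed drops
follows. It assumes each log entry sits on a single line (the wrapped lines
rejoined) and that the e1000 messages look exactly like the ones in this
post; the function names are made up for illustration.

```python
import re

# Matches e1000 watchdog link messages in the format seen in this post.
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"e1000:\s+(?P<iface>\w+):\s+e1000_watchdog:\s+NIC Link is\s+"
    r"(?P<state>Up|Down)(?:\s+(?P<speed>\d+)\s+Mbps)?"
)

# The three link messages from the master's log, wrapped lines rejoined.
LOG_LINES = [
    "Feb 14 15:32:22 jack kernel: e1000: eth1: e1000_watchdog: "
    "NIC Link is Up 1000 Mbps Full Duplex",
    "Feb 14 15:56:48 jack kernel: e1000: eth1: e1000_watchdog: "
    "NIC Link is Down",
    "Feb 14 15:56:53 jack kernel: e1000: eth1: e1000_watchdog: "
    "NIC Link is Up 100 Mbps Full Duplex",
]


def link_events(lines):
    """Return (timestamp, interface, state, speed_mbps_or_None) tuples."""
    events = []
    for line in lines:
        m = LINK_RE.match(line)
        if m:
            speed = int(m.group("speed")) if m.group("speed") else None
            events.append((m.group("ts"), m.group("iface"),
                           m.group("state"), speed))
    return events


def speed_downgrades(events):
    """Return (timestamp, interface, old_mbps, new_mbps) for each time a
    link came up slower than its previous known speed."""
    last_speed = {}
    drops = []
    for ts, iface, state, speed in events:
        if state != "Up" or speed is None:
            continue
        old = last_speed.get(iface)
        if old is not None and speed < old:
            drops.append((ts, iface, old, speed))
        last_speed[iface] = speed
    return drops


for drop in speed_downgrades(link_events(LOG_LINES)):
    print("link speed dropped:", drop)
```

Feeding /var/log/messages through link_events() on both nodes would show how
often the link has been renegotiating, and whether it started before the
first crash.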


On the slave machine (which crashes), these are the messages I get:
**************************************************************************
Feb 14 15:49:07 jill drbd: ===> drbd start <===
Feb 14 15:49:07 jill drbd: modprobe -s drbd minor_count=1
Feb 14 15:49:07 jill kernel: drbd: initialised. Version: 0.6.12 (api:64/proto:62)
Feb 14 15:49:07 jill drbd: drbdsetup /dev/nb0 disk /dev/VolGroupData/LogVolData --do-panic --disk-size=100179416k
Feb 14 15:49:07 jill drbd: drbdsetup /dev/nb0 net 10.20.1.2:7788 10.20.1.1:7788 C --sync-min=5M --sync-max=50M --tl-size=5000 --timeout=60 --connect-int=10 --ping-int=10 --ko-count=4
Feb 14 15:49:07 jill drbd: drbdsetup /dev/nb0 wait_connect -t 1
Feb 14 15:49:07 jill kernel: drbd0: Connection established. size=100179416 KB / blksize=4096 B
Feb 14 15:49:07 jill kernel: klogd 1.4.1, ---------- state change ----------
Feb 14 15:49:07 jill drbd: 'drbd0' SyncingAll, waiting for this to finish
Feb 14 15:49:07 jill drbd: drbdsetup /dev/nb0 wait_sync
Feb 14 15:49:13 jill drbd: ERROR: drbdsetup /dev/nb0 wait_sync [20]:
Feb 14 15:49:13 jill drbd: ERROR: ioctl(wait_sync): Interrupted system call
Feb 14 15:49:13 jill drbd: 'drbd0' wait_sync terminated unexpectedly
Feb 14 15:54:25 jill drbd: ===> drbd stop <===

-- 
**************************************************
Jonathan Soong
Institute of Medical and Veterinary Science
Information, Communication and Technology Services
www.imvs.org Ph: +61 8 8222 3095


