[DRBD-user] [DRBD 0.6.4] After RAID5 disk crash "SCSI disk error"

Todd Denniston Todd.Denniston at ssa.crane.navy.mil
Fri Oct 14 17:32:14 CEST 2005



Marc Fischer wrote:
> 
> Hello
> 
> We have DRBD 0.6.4 running on two IBM Netfinity 5100 with a RAID5 array.
> (Linux Suse 8.2).
> Primary DRBD server name = pascal
> Secondary DRBD server name = descartes
> 
> After a disk crash on the primary DRBD server (Pascale) we replaced the
> disk and rebuilt it to the RAID.
> The secondary DRBD server (Descartes) is now running as primary:
> descartes:~ # cat /proc/drbd
> version: 0.6.4 (api:61/proto:62)
> 0: cs:WFConnection st:Primary/Unknown ns:53548436 nr:11002921
> dw:68278315 dr:71328254 pe:0 ua:0
> 
> When I "drbd start" on the broken server (Pascale) DRBD starts
> synchronizing but the DRBD drive on the temporary primary server
> (Descartes) is not accessible anymore and the following log messages are
> created in /var/log/messages:
> .
> .
> .
> Oct  4 11:36:20 pascal kernel: SCSI disk error : host 2 channel 0 id 1
> lun 0 return code = 70000
> Oct  4 11:36:20 pascal kernel:  I/O error: dev 08:11, sector 316256
> Oct  4 11:36:20 pascal kernel: drbd0: The lower-level device had an error.
> Oct  4 11:36:20 pascal kernel: SCSI disk error : host 2 channel 0 id 1
> lun 0 return code = 70000
> Oct  4 11:36:20 pascal kernel:  I/O error: dev 08:11, sector 316640
> Oct  4 11:36:20 pascal kernel: drbd0: The lower-level device had an error
> .
> .
> .

Marc, 
have you gotten the system back up yet? (I have not seen any new messages
from you indicating a change in status)

To me it looks like more than one disk in pascal was broken.

> 
> We completely checked the RAID 5 array (sector r/w test) and did not get
> an error.

did you try:
badblocks -sw -c1024 -b4096  /dev/device
so that it will attempt to write in the same size chunks as drbd?

Also, I found that with a RAID 5 array and a bad disk, badblocks will only
really find the bad disk if you break the array up into its member disks and
check each physical disk individually... because the RAID 5 will do its best
to mask the problems when it can (which is why you use it). I believe it is
even a good idea to check all the disks before using the new disk with them,
because I have received several brand new (still in their factory static bag
and Styrofoam) disks that failed a badblocks check immediately after power
up.
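A quick way to do the per-disk check is to generate one badblocks command per
member disk. This is only a sketch: the device names below are assumptions,
and the printed commands are a *destructive* write test, so run them only with
the array stopped and nothing you care about on the disks.

```shell
#!/bin/sh
# Sketch only: preview a per-disk destructive write test. The member
# device names are assumptions; substitute your own. Only run the
# printed commands with the array stopped and the data backed up.
preview_badblocks() {
    for dev in "$@"; do
        # -sw: show progress, destructive read/write test
        # -c1024 -b4096: 4 KB blocks, 1024 at a time (drbd-sized chunks)
        echo "badblocks -sw -c1024 -b4096 $dev"
    done
}
preview_badblocks /dev/sda /dev/sdb /dev/sdc
```

Pipe the output to sh only once you are certain the disks hold no live data.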

> When I start drbd manually I do:
> 1. modprobe drbd
> 2. drbdsetup /dev/nb0 disk /dev/sdb1 -d 8809069
> 3. drbdsetup /dev/nb0 net 1.1.1.13 1.1.1.11 C
> 
> At step 3 the errors start.

This is when the other node starts trying to write to the disk.
Is the new drive the same model & size as the one being replaced, or at
least as big? (though I would expect the array to automatically resize to a
smaller size if the drive were smaller).
Is 8809069 the size that was used in the drbd.conf on both machines?
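One way to sanity-check that is to print the lower-level partition's size on
each node and compare. A sketch, not a definitive check: /dev/sdb1 is the
device from your own steps, and I am assuming the -d value is meant to match
the partition size in KB (blockdev --getsize reports 512-byte sectors, so
halving gives KB).

```shell
#!/bin/sh
# Sketch: print the lower-level partition's size in KB on this node,
# for comparison across both nodes and against the -d value.
dev=/dev/sdb1
if [ -b "$dev" ]; then
    # blockdev --getsize reports 512-byte sectors; /2 converts to KB
    sectors=$(blockdev --getsize "$dev")
    msg="$dev: $((sectors / 2)) KB"
else
    msg="$dev: not a block device on this host"
fi
echo "$msg"
```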

> 
> Can anybody help how I get this system running again? This is a
> productive system and there is not much trying around.
> (... and I know that we should upgrade...:-))

On pascal, issue `drbdsetup /dev/nb0 disconnect` and leave it that way until
you are 150% sure of the array on it.  descartes can thus remain in
production until pascal is healthy. :)
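The order above can be sketched as a command sequence. The disk/net arguments
are copied from Marc's own steps; the disconnect subcommand is my assumption
for DRBD 0.6, so check drbdsetup's usage output first. The script only prints
the commands rather than running them:

```shell
#!/bin/sh
# Print (do not run) the suggested order on pascal: disconnect,
# test the disks, reattach, reconnect. drbdsetup arguments are taken
# from the thread; verify the disconnect subcommand with `drbdsetup`.
plan() {
    echo "drbdsetup /dev/nb0 disconnect   # stop the resync traffic"
    echo "# ...badblocks each member disk, rebuild the RAID 5 array..."
    echo "drbdsetup /dev/nb0 disk /dev/sdb1 -d 8809069"
    echo "drbdsetup /dev/nb0 net 1.1.1.13 1.1.1.11 C   # resync restarts"
}
plan
```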
 


-- 
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane) 
Harnessing the Power of Technology for the Warfighter


