Marc Fischer wrote:
>
> Hello
>
> We have DRBD 0.6.4 running on two IBM Netfinity 5100 with a RAID5 array.
> (Linux Suse 8.2).
> Primary DRBD server name = pascal
> Secondary DRBD server name = descartes
>
> After a disk crash on the primary DRBD server (Pascale) we replaced the
> disk and rebuilt it to the RAID.
> The secondary DRBD server (Descartes) is now running as primary:
> descartes:~ # cat /proc/drbd
> version: 0.6.4 (api:61/proto:62)
> 0: cs:WFConnection st:Primary/Unknown ns:53548436 nr:11002921
> dw:68278315 dr:71328254 pe:0 ua:0
>
> When I "drbd start" on the broken server (Pascale) DRBD starts
> synchronizing but the DRBD drive on the temporary primary server
> (Descartes) is not accessible anymore and the following log messages are
> created in /var/log/messages:
> .
> .
> .
> Oct 4 11:36:20 pascal kernel: SCSI disk error : host 2 channel 0 id 1
> lun 0 return code = 70000
> Oct 4 11:36:20 pascal kernel: I/O error: dev 08:11, sector 316256
> Oct 4 11:36:20 pascal kernel: drbd0: The lower-level device had an error.
> Oct 4 11:36:20 pascal kernel: SCSI disk error : host 2 channel 0 id 1
> lun 0 return code = 70000
> Oct 4 11:36:20 pascal kernel: I/O error: dev 08:11, sector 316640
> Oct 4 11:36:20 pascal kernel: drbd0: The lower-level device had an error
> .
> .
> .

Marc, have you gotten the system back up yet?
(I have not seen any new messages from you indicating a change in status.)

To me it looks like more than one disk in Pascale was broken.

>
> We completely checked the RAID 5 array (sector r/w test) and did not get
> an error.

Did you try:
badblocks -sw -c1024 -b4096 /dev/device
so that it will attempt to write in the same size chunks as drbd?

Also, I found that with a RAID 5 array and a bad disk, badblocks will only
really find the bad disk if you split the array up into a set of disks and
check each physical disk individually, because the RAID 5 will do its best
to mask the problems when it can (which is why you use it).
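
A minimal sketch of the per-disk check described above, assuming the array has already been split into its member disks. The device names /dev/sdX, /dev/sdY and /dev/sdZ and the helper name print_badblocks_cmds are hypothetical placeholders, not taken from Marc's setup. The helper only prints the commands so they can be reviewed first; note that badblocks -w is a destructive write test and overwrites whatever is on the device.

```shell
#!/bin/sh
# Sketch only: print (do not run) one badblocks command per member disk.
# WARNING: badblocks -w is a destructive write-mode test; it overwrites
# the data on the device it is pointed at.
# print_badblocks_cmds and the /dev/sdX-style names are hypothetical.
print_badblocks_cmds() {
    for dev in "$@"; do
        # Same options as suggested above: -s show progress, -w write-mode
        # test, -c1024 test 1024 blocks at a time, -b4096 use 4096-byte blocks.
        printf 'badblocks -sw -c1024 -b4096 %s\n' "$dev"
    done
}

# Replace the placeholders with the real physical disks of the array.
print_badblocks_cmds /dev/sdX /dev/sdY /dev/sdZ
```

Printing instead of running keeps a typo in a device name from wiping the wrong disk; each printed line can then be run by hand, one disk at a time.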

I believe it is even a good idea to check all the disks before using the
new disk with it, because I have received several brand new disks (still in
their factory static bag and Styrofoam) which failed a badblocks check
immediately after power up.

> When I start drbd manually I do:
> 1. modprobe drbd
> 2. drbdsetup /dev/nb0 disk /dev/sdb1 -d 8809069
> 3. drbdsetup /dev/nb0 net 1.1.1.13 1.1.1.11 C
>
> At step 3 the errors start.

This is when the other node starts trying to write to the disk.

Is the new drive the same model & size as the one being replaced, or at
least bigger? (Though I would expect the array to automatically resize to a
smaller size if the drive were smaller.)

Is 8809069 the size that was used in the drbd.conf on both machines?

>
> Can anybody help how I get this system running again? This is a
> productive system and there is not much trying around.
> (... and I know that we should upgrade...:-))

On pascal, issue
`drbd /dev/nb0 disconnect`
and leave it that way until you are 150% sure of the array on it.
descartes can thus remain in production until pascal is healthy. :)

--
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane)
Harnessing the Power of Technology for the Warfighter