On Sat, Apr 07, 2007 at 10:08:09PM +0100, wcsl wrote:
> Hello,
>
> Linux data1.contact-24-7.local 2.6.9-42.0.8.ELsmp #1
>
> [root@data1 ~]# rpm -qa | grep drbd
> kernel-module-drbd-2.6.9-42.0.8.ELsmp-0.7.23-1.el4.centos
> drbd-0.7.23-1.el4.centos
>
> Extract from /var/log/messages

would be helpful if you could persuade your email client to
not break pasted lines :)

> Apr 6 12:10:11 data2 kernel: execute_task_management(1212) 5cd94fc 5ffffffff
> Apr 6 12:10:12 data2 kernel: execute_task_management(1212) 5380c6 5ffffffff
> Apr 6 12:10:13 data2 kernel: execute_task_management(1212) fb84b 5ffffffff
> Apr 6 12:10:14 data2 kernel: execute_task_management(1212) f5097 5ffffffff
> Apr 6 12:10:14 data2 kernel: execute_task_management(1212) 1ac561 5ffffffff
> Apr 6 12:10:31 data2 kernel: 3w-9xxx: scsi0: WARNING: (0x06:0x002C): Unit #3: Command (0x2a) timed out, resetting card.
> Apr 6 12:10:46 data2 kernel: execute_task_management(1212) 5cd94ff 5ffffffff
> Apr 6 12:10:47 data2 kernel: execute_task_management(1212) 5380ca 5ffffffff
> Apr 6 12:10:48 data2 kernel: execute_task_management(1212) fb84e 5ffffffff
> Apr 6 12:10:49 data2 kernel: execute_task_management(1212) f509a 5ffffffff
> Apr 6 12:10:49 data2 kernel: execute_task_management(1212) 1ac564 5ffffffff
> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=11.
> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=3, port=11.
> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=11.
> Apr 6 12:11:11 data2 last message repeated 4 times
> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.
> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=1.
> Apr 6 12:11:11 data2 kernel: Device sdd not ready.
> Apr 6 12:11:11 data2 kernel: end_request: I/O error, dev sdd, sector 1897951751
> Apr 6 12:11:11 data2 kernel: drbd5: error = -5 in /home/buildsvn/rpmbuild/BUILD/drbd-0.7.23/drbd/drbd_worker.c:289
> Apr 6 12:11:11 data2 kernel: drbd5: Local IO failed. Detaching...

In case you wondered: -5 is just -EIO, and this is the drbd bio_end_io
callback complaining about the lower stack returning an error, and which
one. Actually it is probably a debugging left-over from when we tried to
hunt down a suboptimal handling of an -EAGAIN response to a READA request :)

drbd is nice enough to "detach" the failing underlying device (sdd,
apparently), and should now just ship all requests over the network.

> Setup.
>
> LVS cluster
> DRBD devices set up on 3ware SATA RAID card devices

One of them failing (or the controller got confused, whatever).

> iSCSI devices (using iscsi-target) set up on DRBD devices to provide
> resilient iSCSI storage for a back-end W2K3/E2K3 cluster.
>
> Symptom
>
> After these errors were reported I was unable to deallocate the drbd5
> device or shut down the drbd processes other than by rebooting. The
> device was still seen as the primary on the other node in the cluster
> and would not fail over to the secondary member.

Well, it should still have been operational, and it still had references.
For a failover, you should have done hb_standby, which would have shut
down the iSCSI targets, unmounted any mounted drbd devices and so on.
That should have worked.
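Roughly, something like this on the node that is currently primary (the
hb_standby path and the exact /proc/drbd output are from memory of
heartbeat 1.x and drbd 0.7 on CentOS 4, so double check on your boxes):

    cat /proc/drbd                   # should show "st:Primary/Secondary" here
    /usr/lib/heartbeat/hb_standby    # path may differ on your distro; hands the
                                     # heartbeat-managed resources (iscsi-target,
                                     # drbd primary role, service IP) to the peer
    cat /proc/drbd                   # should now show "st:Secondary/Primary"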
> After the reboot; the drbd device was released; heartbeat kicked in and
> the iSCSI targets were presented on the other member machine. Exchange
> restarted and the mailstore started successfully, which is a great test
> of data integrity.
>
> What I would have liked to have happened
>
> . Hard disk failed
> . drbd noticed and released the drbd device making the other node primary

It detached the bad hardware, now shipping every request over the network,
because you configured "on-io-error = detach". Valid other options would be
"panic", causing a kernel panic and therefore very likely a failover (at
least that would be the intention of this option), and "pass on", in which
case the upper layers (file system or iscsi-target or whatever) would have
seen the io error and would do their repertoire of error handling: remount
read-only, panic, bug, whatever.

> . heartbeat to kick in and automatically fail the services over to the other node.

Probably "panic" would be what you expected.

> This as far as I can see will require two things.
>
> 1. The problem that I experienced to be overcome

Um. This was not a drbd problem, but an IO ERROR on sdd. And it all
should still have been working, shouldn't it?

> 2. A fancy heartbeat monitoring script for all drbd devices monitoring
> which side is primary and failing over accordingly.
>
> My iSCSI drbd lvs is configured as active/passive so if there were any
> discrepancies on the primary node I would wish the secondary to take over.
>
> Does anyone have any similar experiences or comments?
> Do these heartbeat scripts exist? If so I would like to see a copy
>
> Thanks in advance
>
> /Steve

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
__
please use the "List-Reply" function of your email client.