Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Lars,

Thanks for the explanation. So if I use the panic option, the kernel
would crash, forcing a failover to occur? Is there no other way for
heartbeat to monitor the status of the drbd devices? Basically I ended
up with a cluster hang.

Regards
/Steve

Lars Ellenberg wrote:
> On Sat, Apr 07, 2007 at 10:08:09PM +0100, wcsl wrote:
>
>> Hello,
>>
>> Linux data1.contact-24-7.local 2.6.9-42.0.8.ELsmp #1
>>
>> [root@data1 ~]# rpm -qa | grep drbd
>> kernel-module-drbd-2.6.9-42.0.8.ELsmp-0.7.23-1.el4.centos
>> drbd-0.7.23-1.el4.centos
>>
>> Extract from /var/log/messages
>
> it would be helpful if you could persuade your email client not to break
> pasted lines :)
>
>> Apr 6 12:10:11 data2 kernel: execute_task_management(1212) 5cd94fc 5ffffffff
>> Apr 6 12:10:12 data2 kernel: execute_task_management(1212) 5380c6 5ffffffff
>> Apr 6 12:10:13 data2 kernel: execute_task_management(1212) fb84b 5ffffffff
>> Apr 6 12:10:14 data2 kernel: execute_task_management(1212) f5097 5ffffffff
>> Apr 6 12:10:14 data2 kernel: execute_task_management(1212) 1ac561 5ffffffff
>> Apr 6 12:10:31 data2 kernel: 3w-9xxx: scsi0: WARNING: (0x06:0x002C): Unit #3: Command (0x2a) timed out, resetting card.
>> Apr 6 12:10:46 data2 kernel: execute_task_management(1212) 5cd94ff 5ffffffff
>> Apr 6 12:10:47 data2 kernel: execute_task_management(1212) 5380ca 5ffffffff
>> Apr 6 12:10:48 data2 kernel: execute_task_management(1212) fb84e 5ffffffff
>> Apr 6 12:10:49 data2 kernel: execute_task_management(1212) f509a 5ffffffff
>> Apr 6 12:10:49 data2 kernel: execute_task_management(1212) 1ac564 5ffffffff
>> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=11.
>> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=3, port=11.
>> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=11.
>> Apr 6 12:11:11 data2 last message repeated 4 times
>> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.
>> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=1.
>> Apr 6 12:11:11 data2 kernel: Device sdd not ready.
>> Apr 6 12:11:11 data2 kernel: end_request: I/O error, dev sdd, sector 1897951751
>
>> Apr 6 12:11:11 data2 kernel: drbd5: error = -5 in /home/buildsvn/rpmbuild/BUILD/drbd-0.7.23/drbd/drbd_worker.c:289
>> Apr 6 12:11:11 data2 kernel: drbd5: Local IO failed. Detaching...
>
> in case you wondered: -5 is just -EIO, and this is the drbd bio_end_io
> callback complaining about the lower stack returning an error, and which
> one. actually it is probably a debugging left-over from when we tried to
> hunt down a suboptimal handling of an -EAGAIN response to a READA request
> :)
>
> drbd is nice enough to "detach" the failing underlying device (sdd,
> apparently), and should now just ship all requests over the network.
>
>> Setup
>>
>> LVS cluster
>> DRBD devices set up on 3ware SATA RAID card devices
>
> one of them failing (or the controller got confused, whatever).
>
>> iSCSI devices (using iscsi-target) set up on DRBD devices to provide
>> resilient iSCSI storage for a back-end W2K3/E2K3 cluster.
>>
>> Symptom
>>
>> After these errors were reported I was unable to deallocate the drbd5
>> device or shut down the drbd processes other than by rebooting. The
>> device was still seen as the primary on the other node in the cluster
>> and would not fail over to the secondary member.
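(For context, a manual release of a DRBD 0.7 resource, when nothing still
holds the device open, would look roughly like the sketch below. The
resource name r5 for /dev/drbd5 is an assumption; an iscsi-target session
still exporting the device keeps its reference count up, which would make
exactly these steps hang or fail, matching the symptom above.)

    # demote the resource, then tear it down (detach + disconnect)
    drbdadm secondary r5
    drbdadm down r5

    # if that refuses, see what still holds the device open
    fuser -v /dev/drbd5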
> well, it should still have been operational,
> and it still had references.
> for a failover, you should have done hb_standby, which would have shut
> down the iscsi targets, unmounted any mounted drbd and so on.
> should have worked.
>
>> After the reboot, the drbd device was released; heartbeat kicked in and
>> the iSCSI targets were presented on the other member machine. Exchange
>> restarted and the mailstore started successfully, which is a great test
>> of data integrity.
>>
>> What I would have liked to have happened
>>
>> . Hard disk failed
>> . drbd noticed and released the drbd device, making the other node primary
>
> it detached the bad hardware,
> now shipping every request over the network,
> because you configured "on-io-error = detach".
> other valid options would be "panic", causing a kernel panic,
> and therefore very likely a failover (at least that would be the
> intention of this option), and "pass_on", in which case the upper
> layers (file system or iscsi-target or whatever) would have seen the io
> error, and would do their repertoire of error handling: remount read
> only, panic, bug, whatever.
>
>> . heartbeat to kick in and automatically fail the services over to the other node.
>
> probably "panic" would be what you expected.
>
>> This, as far as I can see, will require two things.
>>
>> 1. The problem that I experienced to be overcome
>
> um.
> this was not a drbd problem, but an IO ERROR on sdd.
> and, everything should still have been working?
>
>> 2. A fancy heartbeat monitoring script for all drbd devices, monitoring
>> which side is primary and failing over accordingly.
>>
>> My iSCSI drbd LVS is configured as active/passive, so if there were any
>> discrepancies on the primary node I would wish the secondary to take over.
>>
>> Does anyone have any similar experiences or comments?
>> Do these heartbeat scripts exist? If so I would like to see a copy.
>>
>> Thanks in advance
>>
>> /Steve
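In drbd.conf terms, the three behaviours Lars lists are selected per
resource in the disk section. A minimal sketch, assuming DRBD 0.7-era
syntax and an invented resource name r5; check drbd.conf(5) for your
exact version:

    resource r5 {
      protocol C;

      disk {
        # reaction to an I/O error on the lower-level device:
        #   pass_on - hand the error up to the file system / iscsi-target
        #   panic   - kernel panic, so heartbeat declares the node dead
        #             and fails over (what Steve seems to have wanted)
        #   detach  - drop the local disk and serve all I/O via the peer
        #             (what actually happened here)
        on-io-error panic;
      }

      # on <hostname> { device /dev/drbd5; disk /dev/sdd1; ... }
    }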
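The hb_standby helper Lars refers to ships with heartbeat itself. Run on
the active node, it asks the local heartbeat to hand its resources to the
peer, which would stop the configured resource scripts in reverse order
(iscsi-target, then the drbd demotion, and so on). The path below is an
assumption, as packages install it in different places:

    # on the node that should give up its resources
    /usr/lib/heartbeat/hb_standby
    # some heartbeat versions also accept a scope argument: all, foreign or local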
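As for the monitoring-script question: nothing stock is confirmed in this
thread, but a minimal watchdog along these lines could bridge the gap
between "detach" and an actual failover. Everything here is an assumption
to adapt: the /proc/drbd field names are the DRBD 0.7 ones (st: for
roles, ld: for local data state), and the hb_standby path varies by
package:

    #!/bin/sh
    # drbd-watch: request a heartbeat failover if any local drbd resource
    # is Primary but no longer has consistent local data (e.g. after an
    # on-io-error detach). Sketch only -- adjust patterns per DRBD version.

    HB_STANDBY=/usr/lib/heartbeat/hb_standby

    if grep 'st:Primary/' /proc/drbd | grep -qv 'ld:Consistent'; then
        logger -t drbd-watch "Primary drbd device without consistent local disk, requesting standby"
        $HB_STANDBY
    fi

Run it from cron (or a loop) on whichever node is active. One caveat
worth weighing: after a detach the degraded node keeps serving I/O over
the network, so failing over automatically trades a working degraded
state for a brief service interruption; decide whether you really want
that to be automatic.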