[DRBD-user] drbd error -5 and lvm thoughts and observations
Lars Ellenberg
lars.ellenberg at linbit.com
Mon Apr 9 22:47:47 CEST 2007
On Sat, Apr 07, 2007 at 10:08:09PM +0100, wcsl wrote:
> Hello,
>
> Linux data1.contact-24-7.local 2.6.9-42.0.8.ELsmp #1
>
> [root at data1 ~]# rpm -qa | grep drbd
> kernel-module-drbd-2.6.9-42.0.8.ELsmp-0.7.23-1.el4.centos
> drbd-0.7.23-1.el4.centos
>
> Extract for /var/log/messages
would be helpful if you could persuade your email client not to break
pasted lines :)
> Apr 6 12:10:11 data2 kernel: execute_task_management(1212) 5cd94fc 5ffffffff
> Apr 6 12:10:12 data2 kernel: execute_task_management(1212) 5380c6 5ffffffff
> Apr 6 12:10:13 data2 kernel: execute_task_management(1212) fb84b 5ffffffff
> Apr 6 12:10:14 data2 kernel: execute_task_management(1212) f5097 5ffffffff
> Apr 6 12:10:14 data2 kernel: execute_task_management(1212) 1ac561 5ffffffff
> Apr 6 12:10:31 data2 kernel: 3w-9xxx: scsi0: WARNING: (0x06:0x002C): Unit #3: Command (0x2a) timed out, resetting card.
> Apr 6 12:10:46 data2 kernel: execute_task_management(1212) 5cd94ff 5ffffffff
> Apr 6 12:10:47 data2 kernel: execute_task_management(1212) 5380ca 5ffffffff
> Apr 6 12:10:48 data2 kernel: execute_task_management(1212) fb84e 5ffffffff
> Apr 6 12:10:49 data2 kernel: execute_task_management(1212) f509a 5ffffffff
> Apr 6 12:10:49 data2 kernel: execute_task_management(1212) 1ac564 5ffffffff
> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=11.
> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=3, port=11.
> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=11.
> Apr 6 12:11:11 data2 last message repeated 4 times
> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.
> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=1.
> Apr 6 12:11:11 data2 kernel: Device sdd not ready.
> Apr 6 12:11:11 data2 kernel: end_request: I/O error, dev sdd, sector 1897951751
> Apr 6 12:11:11 data2 kernel: drbd5: error = -5 in /home/buildsvn/rpmbuild/BUILD/drbd-0.7.23/drbd/drbd_worker.c:289
> Apr 6 12:11:11 data2 kernel: drbd5: Local IO failed. Detaching...
in case you wondered: -5 is just -EIO, and this is the drbd bio_end_io
callback complaining about the lower stack returning an error, and which
one. actually it is probably a debugging left-over from when we tried to
hunt down a suboptimal handling of -EAGAIN response to a READA request
:)
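if you want to check such a code yourself, here is a small Python sketch
(not part of the original log, it only uses the standard errno table) that
maps the number drbd printed back to its symbolic name. the one assumption
is that the kernel reports errors as negated errno values, which is the
usual convention:

```python
import errno
import os

# drbd logged "error = -5"; kernel code returns negated errno values,
# so flip the sign before looking it up.
logged = -5
code = -logged

name = errno.errorcode[code]      # symbolic name for errno 5
message = os.strerror(code)       # platform's human-readable text

print(f"{logged} is -{name}: {message}")
```

the message text comes from the local C library, so the exact wording
depends on the platform.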
drbd is nice enough to "detach" the failing underlying device (sdd,
apparently), and should now just ship all requests over the network.
> Setup.
>
> LVS cluster
> DRBD devices setup on 3ware SATA RAID card devices
one of them failing (or the controller got confused, whatever).
> iSCSI devices (using iscsi-target) setup on DRBD devices to provide
> resilient iSCSI storage for backed W2K3/E2K3 cluster.
>
> Symptom
>
> After these errors were reported I was unable to deallocate drbd5:
> device or shutdown the drbd processes other than by rebooting. The
> device was still seen as the primary on the other node in the cluster
> and would not failover to the secondary member.
well, it should still have been operational,
and it still had references (it was still in use).
for a failover, you should have done hb_standby, which would have shut
down the iscsi targets, unmounted any mounted drbd and so on.
should have worked.
> After the reboot; the drbd device was released; heartbeat kicked in and
> the iSCSI targets were presented on the other member machine. Exchange
> restarted and the mailstore started successfully, which is a great test
> of data integrity.
>
> What I would have liked to have happened
>
> . Hard disk failed
> . drbd noticed and released the drbd device making the other node primary
it detached the bad hardware,
and is now shipping every request over the network,
because you configured "on-io-error = detach".
valid other options would be "panic", causing a kernel panic,
and therefore very likely a failover (at least that would be the
intention of this option), and "pass_on", in which case the upper
layers (file system or iscsi-target or whatever) would have seen the io
error, and would do their repertoire of error handling: remount read
only, panic, bug, whatever.
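for reference, a minimal sketch of where this setting lives in a drbd 0.7
drbd.conf. the resource name r5 is an assumption for illustration only, and
everything except the disk section is left out, so this is a fragment, not
a complete configuration:

```text
# hypothetical resource name; use the one from your own drbd.conf
resource r5 {
  # protocol, net, syncer and "on <host>" sections omitted here

  disk {
    # detach  : drop the backing device, continue diskless via the peer
    # panic   : leave the cluster by doing a kernel panic
    # pass_on : report the io error to the upper layers
    on-io-error detach;
  }
}
```

which of the three fits depends on whether you would rather keep serving
from the peer's disk (detach) or force the node out so that the cluster
manager fails over (panic).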
> . heartbeat to kick and automatically fail the services to the other node.
probably "panic" would be what you expected.
> This as far as I can see will require two things.
>
> 1. The problem that I experienced to be overcome
um.
this was not a drbd problem, but an IO ERROR on sdd.
and everything should still have been working?
> 2. A fancy heartbeat monitoring script for all drdb devices monitoring
> which side is primary and failing over accordingly.
>
> My iSCSI drbd lvs is configured as active/passive so if there were any
> discrepancies on the primary node I would wish the secondary to take over.
>
> Does anyone have any similar experiences or comments?
> Do these heartbeat scripts exist? If so I would like to see a copy
>
> Thanks in advance
>
> /Steve
--
: Lars Ellenberg Tel +43-1-8178292-0 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :
__
please use the "List-Reply" function of your email client.