On Sat, Apr 07, 2007 at 10:08:09PM +0100, wcsl wrote:
> Hello,
>
> Linux data1.contact-24-7.local 2.6.9-42.0.8.ELsmp #1
>
> [root@data1 ~]# rpm -qa | grep drbd
> kernel-module-drbd-2.6.9-42.0.8.ELsmp-0.7.23-1.el4.centos
> drbd-0.7.23-1.el4.centos
>
> Extract from /var/log/messages

would be helpful if you could persuade your email client to
not break pasted lines :)

> Apr 6 12:10:11 data2 kernel: execute_task_management(1212) 5cd94fc 5ffffffff
> Apr 6 12:10:12 data2 kernel: execute_task_management(1212) 5380c6 5ffffffff
> Apr 6 12:10:13 data2 kernel: execute_task_management(1212) fb84b 5ffffffff
> Apr 6 12:10:14 data2 kernel: execute_task_management(1212) f5097 5ffffffff
> Apr 6 12:10:14 data2 kernel: execute_task_management(1212) 1ac561 5ffffffff
> Apr 6 12:10:31 data2 kernel: 3w-9xxx: scsi0: WARNING: (0x06:0x002C): Unit #3: Command (0x2a) timed out, resetting card.
> Apr 6 12:10:46 data2 kernel: execute_task_management(1212) 5cd94ff 5ffffffff
> Apr 6 12:10:47 data2 kernel: execute_task_management(1212) 5380ca 5ffffffff
> Apr 6 12:10:48 data2 kernel: execute_task_management(1212) fb84e 5ffffffff
> Apr 6 12:10:49 data2 kernel: execute_task_management(1212) f509a 5ffffffff
> Apr 6 12:10:49 data2 kernel: execute_task_management(1212) 1ac564 5ffffffff
> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=11.
> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=3, port=11.
> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=11.
> Apr 6 12:11:11 data2 last message repeated 4 times
> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.
> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=1.
> Apr 6 12:11:11 data2 kernel: Device sdd not ready.
> Apr 6 12:11:11 data2 kernel: end_request: I/O error, dev sdd, sector 1897951751
> Apr 6 12:11:11 data2 kernel: drbd5: error = -5 in /home/buildsvn/rpmbuild/BUILD/drbd-0.7.23/drbd/drbd_worker.c:289
> Apr 6 12:11:11 data2 kernel: drbd5: Local IO failed. Detaching...

In case you wondered: -5 is just -EIO, and this is the drbd bio_end_io
callback complaining about the lower stack returning an error, and which
one. Actually it is probably a debugging left-over from when we tried to
hunt down a suboptimal handling of an -EAGAIN response to a READA request :)

drbd is nice enough to "detach" the failing underlying device (sdd,
apparently), and should now just ship all requests over the network.

> Setup.
>
> LVS cluster
> DRBD devices set up on 3ware SATA RAID card devices

One of them failing (or the controller got confused, whatever).

> iSCSI devices (using iscsi-target) set up on DRBD devices to provide
> resilient iSCSI storage for a back-end W2K3/E2K3 cluster.
>
> Symptom
>
> After these errors were reported I was unable to deallocate the drbd5
> device or shut down the drbd processes other than by rebooting. The
> device was still seen as the primary on the other node in the cluster
> and would not fail over to the secondary member.

Well, it should still have been operational, and it still had references.
For a failover, you should have done hb_standby, which would have shut
down the iSCSI targets, unmounted any mounted drbd devices and so on.
That should have worked.
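Roughly, something like this on the node that is currently primary (the
hb_standby path and the exact /proc/drbd output are from memory of
heartbeat 1.x and drbd 0.7 on CentOS 4, so double check on your boxes):

    cat /proc/drbd                   # should show "st:Primary/Secondary" here
    /usr/lib/heartbeat/hb_standby    # path may differ on your distro; hands the
                                     # heartbeat-managed resources (iscsi-target,
                                     # drbd primary role, service IP) to the peer
    cat /proc/drbd                   # should now show "st:Secondary/Primary"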
> After the reboot; the drbd device was released; heartbeat kicked in and
> the iSCSI targets were presented on the other member machine. Exchange
> restarted and the mailstore started successfully, which is a great test
> of data integrity.
>
> What I would have liked to have happened
>
> . Hard disk failed
> . drbd noticed and released the drbd device making the other node primary

It detached the bad hardware, now shipping every request over the network,
because you configured "on-io-error = detach". Valid other options would be
"panic", causing a kernel panic and therefore very likely a failover (at
least that would be the intention of this option), and "pass on", in which
case the upper layers (file system or iscsi-target or whatever) would have
seen the io error and would do their repertoire of error handling: remount
read-only, panic, bug, whatever.

> . heartbeat to kick in and automatically fail the services over to the other node.

Probably "panic" would be what you expected.

> This as far as I can see will require two things.
>
> 1. The problem that I experienced to be overcome

Um. This was not a drbd problem, but an IO ERROR on sdd. And it all
should still have been working, shouldn't it?

> 2. A fancy heartbeat monitoring script for all drbd devices monitoring
> which side is primary and failing over accordingly.
>
> My iSCSI drbd lvs is configured as active/passive so if there were any
> discrepancies on the primary node I would wish the secondary to take over.
>
> Does anyone have any similar experiences or comments?
> Do these heartbeat scripts exist? If so I would like to see a copy
>
> Thanks in advance
>
> /Steve

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
__
please use the "List-Reply" function of your email client.