[DRBD-user] drbd error -5 and lvm thoughts and observations
wcsl
drbd at wcsl.net
Sat Apr 7 23:08:09 CEST 2007
Hello,
Linux data1.contact-24-7.local 2.6.9-42.0.8.ELsmp #1
[root at data1 ~]# rpm -qa | grep drbd
kernel-module-drbd-2.6.9-42.0.8.ELsmp-0.7.23-1.el4.centos
drbd-0.7.23-1.el4.centos
Extract from /var/log/messages:
Apr 6 12:10:11 data2 kernel: execute_task_management(1212) 5cd94fc 5 ffffffff
Apr 6 12:10:12 data2 kernel: execute_task_management(1212) 5380c6 5 ffffffff
Apr 6 12:10:13 data2 kernel: execute_task_management(1212) fb84b 5 ffffffff
Apr 6 12:10:14 data2 kernel: execute_task_management(1212) f5097 5 ffffffff
Apr 6 12:10:14 data2 kernel: execute_task_management(1212) 1ac561 5 ffffffff
Apr 6 12:10:31 data2 kernel: 3w-9xxx: scsi0: WARNING: (0x06:0x002C): Unit #3: Command (0x2a) timed out, resetting card.
Apr 6 12:10:46 data2 kernel: execute_task_management(1212) 5cd94ff 5 ffffffff
Apr 6 12:10:47 data2 kernel: execute_task_management(1212) 5380ca 5 ffffffff
Apr 6 12:10:48 data2 kernel: execute_task_management(1212) fb84e 5 ffffffff
Apr 6 12:10:49 data2 kernel: execute_task_management(1212) f509a 5 ffffffff
Apr 6 12:10:49 data2 kernel: execute_task_management(1212) 1ac564 5 ffffffff
Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=11.
Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=3, port=11.
Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=11.
Apr 6 12:11:11 data2 last message repeated 4 times
Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.
Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=1.
Apr 6 12:11:11 data2 kernel: Device sdd not ready.
Apr 6 12:11:11 data2 kernel: end_request: I/O error, dev sdd, sector 1897951751
Apr 6 12:11:11 data2 kernel: drbd5: error = -5 in /home/buildsvn/rpmbuild/BUILD/drbd-0.7.23/drbd/drbd_worker.c:289
Apr 6 12:11:11 data2 kernel: drbd5: Local IO failed. Detaching...
Apr 6 12:11:11 data2 kernel: Device sdd not ready.
Apr 6 12:11:11 data2 kernel: end_request: I/O error, dev sdd, sector 1849493623
Apr 6 12:11:11 data2 kernel: drbd5: error = -5 in /home/buildsvn/rpmbuild/BUILD/drbd-0.7.23/drbd/drbd_worker.c:289
Apr 6 12:11:11 data2 kernel: Device sdd not ready.
Apr 6 12:11:11 data2 kernel: end_request: I/O error, dev sdd, sector 1849365383
Apr 6 12:11:11 data2 kernel: drbd5: error = -5 in /home/buildsvn/rpmbuild/BUILD/drbd-0.7.23/drbd/drbd_worker.c:289
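For reference, the "error = -5" in the drbd lines above is a negated Linux errno value, and errno 5 is EIO (I/O error), which matches the end_request messages for sdd. Below is a minimal Python sketch that pulls these entries out of a log and decodes the number. The function name drbd_io_errors and the regular expression are illustrative assumptions, not part of drbd or of any existing tool.

```python
import errno
import re

# Matches kernel log entries such as "kernel: drbd5: error = -5 in ...".
# The pattern is an assumption based only on the log lines quoted above.
DRBD_ERROR = re.compile(r"kernel: (drbd\d+): error = -(\d+) in")

def drbd_io_errors(log_text):
    """Return (device, errno name) pairs for each drbd error entry found."""
    found = []
    for match in DRBD_ERROR.finditer(log_text):
        code = int(match.group(2))
        # errno.errorcode maps numeric codes to symbolic names, e.g. 5 -> "EIO".
        found.append((match.group(1), errno.errorcode.get(code, "unknown")))
    return found

# One line copied from the log extract above.
sample = "Apr 6 12:11:11 data2 kernel: drbd5: error = -5 in"
print(drbd_io_errors(sample))
```

The same function could be pointed at the contents of /var/log/messages to list which drbd devices reported local I/O errors.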
Setup:
LVS cluster
DRBD devices set up on 3ware SATA RAID card devices
iSCSI devices (using iscsi-target) set up on the DRBD devices to provide
resilient iSCSI storage for a back-end W2K3/E2K3 cluster.
Symptom:
After these errors were reported I was unable to deallocate the drbd5
device or shut down the drbd processes other than by rebooting. The
other node in the cluster still saw the device as primary, and it
would not fail over to the secondary member.
After the reboot the drbd device was released, heartbeat kicked in, and
the iSCSI targets were presented on the other member machine. Exchange
restarted and the mailstore started successfully, which is a great test
of data integrity.
What I would have liked to have happened
. Hard disk failed
. drbd noticed and released the drbd device making the other node primary
. heartbeat kicked in and automatically failed the services over to the other node.
As far as I can see, this will require two things:
1. The problem that I experienced needs to be overcome.
2. A fancy heartbeat monitoring script for all drbd devices that monitors
which side is primary and fails over accordingly.
My iSCSI drbd lvs is configured as active/passive, so if there were any
problems on the primary node I would want the secondary to take over.
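To make the monitoring idea more concrete, here is a rough Python sketch of the check such a script might perform: read the drbd status text, and flag any device where this node is primary but its local disk is no longer reported as consistent. The status line layout (cs:, st:, ld: fields), the state names, and the SAMPLE text are my assumptions about what drbd 0.7 prints in /proc/drbd; they are invented for illustration and not captured from a real system. The function names are hypothetical too.

```python
import re

# Assumed 0.7-style status line, e.g.
#   " 5: cs:Connected st:Primary/Secondary ld:Consistent"
STATUS = re.compile(
    r"^\s*(\d+):\s+cs:(\S+)\s+st:(\w+)/(\w+)\s+ld:(\S+)", re.MULTILINE
)

def parse_status(text):
    """Map each device minor number to its connection, role and disk state."""
    devices = {}
    for minor, cs, local, peer, ld in STATUS.findall(text):
        devices[int(minor)] = {"cs": cs, "local": local, "peer": peer, "ld": ld}
    return devices

def needs_failover(dev):
    """True when this node is primary but its local disk is not consistent."""
    return dev["local"] == "Primary" and dev["ld"] != "Consistent"

def devices_to_fail_over(text):
    """Return the sorted minor numbers of devices that should be failed over."""
    return sorted(m for m, d in parse_status(text).items() if needs_failover(d))

# Hypothetical status text, made up for this example.
SAMPLE = """\
 4: cs:Connected st:Primary/Secondary ld:Consistent
 5: cs:DiskLessClient st:Primary/Secondary ld:Inconsistent
"""

print(devices_to_fail_over(SAMPLE))
```

On a live node the text would come from reading /proc/drbd instead of SAMPLE. How the failover itself is triggered through heartbeat is deliberately left out, since that depends on the heartbeat configuration.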
Does anyone have any similar experiences or comments?
Do these heartbeat scripts exist? If so, I would like to see a copy.
Thanks in advance
/Steve