Hello,

Linux data1.contact-24-7.local 2.6.9-42.0.8.ELsmp #1

[root@data1 ~]# rpm -qa | grep drbd
kernel-module-drbd-2.6.9-42.0.8.ELsmp-0.7.23-1.el4.centos
drbd-0.7.23-1.el4.centos

Extract from /var/log/messages

Apr 6 12:10:11 data2 kernel: execute_task_management(1212) 5cd94fc 5 ffffffff
Apr 6 12:10:12 data2 kernel: execute_task_management(1212) 5380c6 5 ffffffff
Apr 6 12:10:13 data2 kernel: execute_task_management(1212) fb84b 5 ffffffff
Apr 6 12:10:14 data2 kernel: execute_task_management(1212) f5097 5 ffffffff
Apr 6 12:10:14 data2 kernel: execute_task_management(1212) 1ac561 5 ffffffff
Apr 6 12:10:31 data2 kernel: 3w-9xxx: scsi0: WARNING: (0x06:0x002C): Unit #3: Command (0x2a) timed out, resetting card.
Apr 6 12:10:46 data2 kernel: execute_task_management(1212) 5cd94ff 5 ffffffff
Apr 6 12:10:47 data2 kernel: execute_task_management(1212) 5380ca 5 ffffffff
Apr 6 12:10:48 data2 kernel: execute_task_management(1212) fb84e 5 ffffffff
Apr 6 12:10:49 data2 kernel: execute_task_management(1212) f509a 5 ffffffff
Apr 6 12:10:49 data2 kernel: execute_task_management(1212) 1ac564 5 ffffffff
Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=11.
Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=3, port=11.
Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=11.
Apr 6 12:11:11 data2 last message repeated 4 times
Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.
Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=1.
Apr 6 12:11:11 data2 kernel: Device sdd not ready.
Apr 6 12:11:11 data2 kernel: end_request: I/O error, dev sdd, sector 1897951751
Apr 6 12:11:11 data2 kernel: drbd5: error = -5 in /home/buildsvn/rpmbuild/BUILD/drbd-0.7.23/drbd/drbd_worker.c:289
Apr 6 12:11:11 data2 kernel: drbd5: Local IO failed. Detaching...
Apr 6 12:11:11 data2 kernel: Device sdd not ready.
Apr 6 12:11:11 data2 kernel: end_request: I/O error, dev sdd, sector 1849493623
Apr 6 12:11:11 data2 kernel: drbd5: error = -5 in /home/buildsvn/rpmbuild/BUILD/drbd-0.7.23/drbd/drbd_worker.c:289
Apr 6 12:11:11 data2 kernel: Device sdd not ready.
Apr 6 12:11:11 data2 kernel: end_request: I/O error, dev sdd, sector 1849365383
Apr 6 12:11:11 data2 kernel: drbd5: error = -5 in /home/buildsvn/rpmbuild/BUILD/drbd-0.7.23/drbd/drbd_worker.c:289

Setup

LVS cluster
DRBD devices set up on 3ware SATA RAID card devices
iSCSI devices (using iscsi-target) set up on the DRBD devices to provide resilient iSCSI storage for a backend W2K3/E2K3 cluster

Symptom

After these errors were reported I was unable to deallocate the drbd5 device or shut down the drbd processes other than by rebooting. The device was still seen as the primary on the other node in the cluster and would not fail over to the secondary member.

After the reboot, the drbd device was released, heartbeat kicked in, and the iSCSI targets were presented on the other member machine. Exchange restarted and the mailstore started successfully, which is a great test of data integrity.

What I would have liked to have happened

- The hard disk fails.
- drbd notices and releases the drbd device, making the other node primary.
- heartbeat kicks in and automatically fails the services over to the other node.

As far as I can see, this will require two things:

1. The problem that I experienced to be overcome.
2. A fancy heartbeat monitoring script for all drbd devices, monitoring which side is primary and failing over accordingly.

My iSCSI drbd LVS setup is configured as active/passive, so if there were any discrepancies on the primary node I would want the secondary to take over.

Does anyone have any similar experiences or comments? Do these heartbeat scripts exist? If so, I would like to see a copy.

Thanks in advance

/Steve
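
For reference, the monitoring script described in item 2 could start from something like the rough Python sketch below. It is not an existing heartbeat script. It assumes that /proc/drbd on DRBD 0.7 prints one status line per device of the form "N: cs:... st:Local/Peer ld:...", and it treats any device that is Primary locally but not reported as ld:Consistent as a reason to fail over. What 0.7.23 actually reports after "Local IO failed. Detaching..." is not shown in this post, so that test is an assumption to check on a real node. SAMPLE is hand-written for illustration, not captured output, and the function names are made up.

```python
import re

# Assumed shape of a DRBD 0.7 status line in /proc/drbd, for example:
#   " 5: cs:Connected st:Primary/Secondary ld:Consistent"
# Verify this against a real node before relying on it.
STATUS_RE = re.compile(
    r"^\s*(\d+):\s+cs:(\S+)\s+st:([^/\s]+)/([^/\s]+)\s+ld:(\S+)"
)

# Hand-written illustration only, NOT real output from the cluster above.
SAMPLE = """\
version: 0.7.23
 4: cs:Connected st:Primary/Secondary ld:Consistent
 5: cs:Connected st:Primary/Secondary ld:Inconsistent
"""


def parse_proc_drbd(text):
    """Return one dict per device status line found in the given text."""
    devices = []
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match is None:
            continue  # version banner, counter lines, blank lines
        minor, cs, local_role, peer_role, ld = match.groups()
        devices.append({
            "minor": int(minor),
            "cs": cs,
            "local_role": local_role,
            "peer_role": peer_role,
            "ld": ld,
        })
    return devices


def needs_failover(devices):
    """Minor numbers of devices that are Primary here but not ld:Consistent.

    Using ld != "Consistent" as the health test is an assumption.
    """
    return [
        d["minor"]
        for d in devices
        if d["local_role"] == "Primary" and d["ld"] != "Consistent"
    ]


def check(text):
    """Return 0 if every local Primary looks healthy, 1 otherwise."""
    return 1 if needs_failover(parse_proc_drbd(text)) else 0
```

On a live node the input would come from open("/proc/drbd").read(), and a wrapper run from cron or a monitoring tool could use the result of check() to decide whether to ask heartbeat to move the resources. Note that this only covers detection; it does not help if the device cannot be released without a reboot, as happened here.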