Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Lars,

Thanks for the explanation. So if I use the panic option, the kernel
would crash, forcing a failover to occur? Is there no other way for
heartbeat to monitor the status of the drbd devices? Basically I ended
up with a cluster hang.

Regards
/Steve

Lars Ellenberg wrote:
> On Sat, Apr 07, 2007 at 10:08:09PM +0100, wcsl wrote:
>
>> Hello,
>>
>> Linux data1.contact-24-7.local 2.6.9-42.0.8.ELsmp #1
>>
>> [root@data1 ~]# rpm -qa | grep drbd
>> kernel-module-drbd-2.6.9-42.0.8.ELsmp-0.7.23-1.el4.centos
>> drbd-0.7.23-1.el4.centos
>>
>> Extract from /var/log/messages
>
> it would be helpful if you could persuade your email client not to break
> pasted lines :)
>
>> Apr 6 12:10:11 data2 kernel: execute_task_management(1212) 5cd94fc 5ffffffff
>> Apr 6 12:10:12 data2 kernel: execute_task_management(1212) 5380c6 5ffffffff
>> Apr 6 12:10:13 data2 kernel: execute_task_management(1212) fb84b 5ffffffff
>> Apr 6 12:10:14 data2 kernel: execute_task_management(1212) f5097 5ffffffff
>> Apr 6 12:10:14 data2 kernel: execute_task_management(1212) 1ac561 5ffffffff
>> Apr 6 12:10:31 data2 kernel: 3w-9xxx: scsi0: WARNING: (0x06:0x002C): Unit #3: Command (0x2a) timed out, resetting card.
>> Apr 6 12:10:46 data2 kernel: execute_task_management(1212) 5cd94ff 5ffffffff
>> Apr 6 12:10:47 data2 kernel: execute_task_management(1212) 5380ca 5ffffffff
>> Apr 6 12:10:48 data2 kernel: execute_task_management(1212) fb84e 5ffffffff
>> Apr 6 12:10:49 data2 kernel: execute_task_management(1212) f509a 5ffffffff
>> Apr 6 12:10:49 data2 kernel: execute_task_management(1212) 1ac564 5ffffffff
>> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=11.
>> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=3, port=11.
>> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=11.
>> Apr 6 12:11:11 data2 last message repeated 4 times
>> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.
>> Apr 6 12:11:11 data2 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=1.
>> Apr 6 12:11:11 data2 kernel: Device sdd not ready.
>> Apr 6 12:11:11 data2 kernel: end_request: I/O error, dev sdd, sector 1897951751
>
>> Apr 6 12:11:11 data2 kernel: drbd5: error = -5 in /home/buildsvn/rpmbuild/BUILD/drbd-0.7.23/drbd/drbd_worker.c:289
>> Apr 6 12:11:11 data2 kernel: drbd5: Local IO failed. Detaching...
>
> in case you wondered: -5 is just -EIO, and this is the drbd bio_end_io
> callback complaining about the lower stack returning an error, and which
> one. actually it is probably a debugging left-over from when we tried to
> hunt down a suboptimal handling of an -EAGAIN response to a READA request
> :)
>
> drbd is nice enough to "detach" the failing underlying device (sdd,
> apparently), and should now just ship all requests over the network.
>
>> Setup
>>
>> LVS cluster
>> DRBD devices set up on 3ware SATA RAID card devices
>
> one of them failing (or the controller got confused, whatever).
>
>> iSCSI devices (using iscsi-target) set up on DRBD devices to provide
>> resilient iSCSI storage for a back-end W2K3/E2K3 cluster.
>>
>> Symptom
>>
>> After these errors were reported I was unable to deallocate the drbd5
>> device or shut down the drbd processes other than by rebooting. The
>> device was still seen as the primary on the other node in the cluster
>> and would not fail over to the secondary member.
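(For context, a manual release of a DRBD 0.7 resource, when nothing still
holds the device open, would look roughly like the sketch below. The
resource name r5 for /dev/drbd5 is an assumption; an iscsi-target session
still exporting the device keeps its reference count up, which would make
exactly these steps hang or fail, matching the symptom above.)

    # demote the resource, then tear it down (detach + disconnect)
    drbdadm secondary r5
    drbdadm down r5

    # if that refuses, see what still holds the device open
    fuser -v /dev/drbd5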
> well, it should still have been operational,
> and it still had references.
> for a failover, you should have done hb_standby, which would have shut
> down the iscsi targets, unmounted any mounted drbd and so on.
> should have worked.
>
>> After the reboot, the drbd device was released; heartbeat kicked in and
>> the iSCSI targets were presented on the other member machine. Exchange
>> restarted and the mailstore started successfully, which is a great test
>> of data integrity.
>>
>> What I would have liked to have happened
>>
>> . Hard disk failed
>> . drbd noticed and released the drbd device, making the other node primary
>
> it detached the bad hardware,
> now shipping every request over the network,
> because you configured "on-io-error = detach".
> other valid options would be "panic", causing a kernel panic,
> and therefore very likely a failover (at least that would be the
> intention of this option), and "pass_on", in which case the upper
> layers (file system or iscsi-target or whatever) would have seen the io
> error, and would do their repertoire of error handling: remount read
> only, panic, bug, whatever.
>
>> . heartbeat to kick in and automatically fail the services over to the other node.
>
> probably "panic" would be what you expected.
>
>> This, as far as I can see, will require two things.
>>
>> 1. The problem that I experienced to be overcome
>
> um.
> this was not a drbd problem, but an IO ERROR on sdd.
> and, everything should still have been working?
>
>> 2. A fancy heartbeat monitoring script for all drbd devices, monitoring
>> which side is primary and failing over accordingly.
>>
>> My iSCSI drbd LVS is configured as active/passive, so if there were any
>> discrepancies on the primary node I would wish the secondary to take over.
>>
>> Does anyone have any similar experiences or comments?
>> Do these heartbeat scripts exist? If so I would like to see a copy.
>>
>> Thanks in advance
>>
>> /Steve
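In drbd.conf terms, the three behaviours Lars lists are selected per
resource in the disk section. A minimal sketch, assuming DRBD 0.7-era
syntax and an invented resource name r5; check drbd.conf(5) for your
exact version:

    resource r5 {
      protocol C;

      disk {
        # reaction to an I/O error on the lower-level device:
        #   pass_on - hand the error up to the file system / iscsi-target
        #   panic   - kernel panic, so heartbeat declares the node dead
        #             and fails over (what Steve seems to have wanted)
        #   detach  - drop the local disk and serve all I/O via the peer
        #             (what actually happened here)
        on-io-error panic;
      }

      # on <hostname> { device /dev/drbd5; disk /dev/sdd1; ... }
    }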
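The hb_standby helper Lars refers to ships with heartbeat itself. Run on
the active node, it asks the local heartbeat to hand its resources to the
peer, which would stop the configured resource scripts in reverse order
(iscsi-target, then the drbd demotion, and so on). The path below is an
assumption, as packages install it in different places:

    # on the node that should give up its resources
    /usr/lib/heartbeat/hb_standby
    # some heartbeat versions also accept a scope argument: all, foreign or local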
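As for the monitoring-script question: nothing stock is confirmed in this
thread, but a minimal watchdog along these lines could bridge the gap
between "detach" and an actual failover. Everything here is an assumption
to adapt: the /proc/drbd field names are the DRBD 0.7 ones (st: for
roles, ld: for local data state), and the hb_standby path varies by
package:

    #!/bin/sh
    # drbd-watch: request a heartbeat failover if any local drbd resource
    # is Primary but no longer has consistent local data (e.g. after an
    # on-io-error detach). Sketch only -- adjust patterns per DRBD version.

    HB_STANDBY=/usr/lib/heartbeat/hb_standby

    if grep 'st:Primary/' /proc/drbd | grep -qv 'ld:Consistent'; then
        logger -t drbd-watch "Primary drbd device without consistent local disk, requesting standby"
        $HB_STANDBY
    fi

Run it from cron (or a loop) on whichever node is active. One caveat
worth weighing: after a detach the degraded node keeps serving I/O over
the network, so failing over automatically trades a working degraded
state for a brief service interruption; decide whether you really want
that to be automatic.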