Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Furthermore, there are logs from the secondary node: http://pastebin.com/A2ySXDCB
Please compare the timestamps. It seems that DRBD goes into diskless mode on the
secondary node as well. Why?

2016-09-20 8:44 GMT+02:00 Marco Marino <marino.mrc at gmail.com>:

> Hi, logs can be found here: http://pastebin.com/BGR33jN6
>
> @digimer:
> Using local-io-error should power off the node and switch the cluster to
> the remaining node... is this a good idea?
>
> Regards,
> Marco
>
> 2016-09-19 12:58 GMT+02:00 Adam Goryachev <adam at websitemanagers.com.au>:
>
>> On 19/09/2016 19:06, Marco Marino wrote:
>>
>> 2016-09-19 10:50 GMT+02:00 Igor Cicimov <igorc at encompasscorporation.com>:
>>
>>> On 19 Sep 2016 5:45 pm, "Marco Marino" <marino.mrc at gmail.com> wrote:
>>> >
>>> > Hi, I'm trying to build an active/passive cluster with DRBD and
>>> > pacemaker for a SAN. I'm using 2 nodes, each with one RAID controller
>>> > (MegaRAID). Each node has an SSD that works as a cache for reads (and
>>> > writes?), implementing the proprietary CacheCade technology.
>>> >
>>> Did you configure CacheCade? If the write cache was enabled in
>>> write-back mode, then suddenly removing the device from under the
>>> controller would have caused serious problems, I guess, since the
>>> controller expects to write to the SSD cache first and then flush to
>>> the HDDs. Maybe this explains the read-only mode?
>>>
>> Good point. It is exactly as you wrote. How can I mitigate this behavior
>> in a clustered (active/passive) environment? As I said in the other post,
>> I think the best solution is to power off the node using local-io-error
>> and switch all resources to the other node... But please give me some
>> suggestions...
>>
>>> > Basically, the structure of the SAN is:
>>> >
>>> > Physical disks -> RAID -> device /dev/sdb in the OS -> DRBD resource
>>> > (that uses /dev/sdb as its backing device, managed by pacemaker as a
>>> > master/slave resource) -> VG (managed with pacemaker) -> iSCSI target
>>> > (with pacemaker) -> iSCSI LUNs (one for each logical volume in the
>>> > VG, managed with pacemaker)
>>> >
>>> > A few days ago, the SSD was wrongly removed from the primary node of
>>> > the cluster, and this caused a lot of problems: the DRBD resource and
>>> > all logical volumes went into read-only mode with a lot of I/O
>>> > errors, but the cluster did not switch to the other node. All
>>> > filesystems on the initiators went into read-only mode. There are 2
>>> > problems involved here (I think): 1) Why does removing the SSD cause
>>> > read-only mode with I/O errors? This means that the SSD is a single
>>> > point of failure for a single-node SAN with MegaRAID controllers and
>>> > CacheCade technology... and 2) Why did DRBD not work as expected?
>>>
>>> What was the state in /proc/drbd?
>>>
>> I think you will need to examine the logs to find out what happened. It
>> would appear (just making a wild guess) that the caching is happening
>> between DRBD and iSCSI instead of between DRBD and the RAID. If it
>> happened under DRBD, then DRBD should see the read/write error and
>> should automatically fail the local storage. It wouldn't necessarily
>> fail over to the secondary, but it would do all reads/writes from the
>> secondary node. The fact that this didn't happen makes it look like the
>> failure happened above DRBD.
>>
>> At least, that is my understanding of how it would work in that scenario.
>>
>> Regards,
>> Adam
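
On the diskless question at the top of the thread: the disk and connection
states of a DRBD resource can be checked with the standard DRBD tools. A
node whose backing device has been detached reports "Diskless" as its local
disk state. The resource name "r0" below is an assumption for illustration:

    cat /proc/drbd        # overall status, e.g. ds:UpToDate/UpToDate
    drbdadm dstate r0     # local/peer disk state, e.g. Diskless/UpToDate
    drbdadm cstate r0     # connection state, e.g. Connected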
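On the local-io-error idea: a minimal sketch of the handler approach
discussed above, assuming DRBD 8.4 configuration syntax; the resource name
"r0" and the exact shutdown command are illustrative assumptions, not taken
from Marco's configuration:

    resource r0 {
      disk {
        # Run the local-io-error handler on a backing-device failure,
        # rather than detaching into diskless mode.
        on-io-error call-local-io-error;
      }
      handlers {
        # Power the node off immediately so pacemaker promotes the peer.
        local-io-error "echo o > /proc/sysrq-trigger";
      }
    }

Note that hard-killing the node this way relies on working cluster fencing,
otherwise the surviving node may refuse to promote.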
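On the CacheCade write-back point: with LSI/Avago MegaRAID controllers the
write policy of the logical drives can be inspected and forced to
write-through with MegaCli. A hedged sketch, assuming the MegaCli64 binary
and that all logical drives on all adapters should be changed:

    MegaCli64 -LDGetProp -Cache -LAll -aAll   # show current cache policy
    MegaCli64 -LDSetProp WT -LAll -aAll       # force write-through (WT)

Write-through removes the SSD cache from the write path, so a failed cache
device is no longer a single point of failure there, at the cost of write
latency.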
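And for the master/slave DRBD resource in the stack described above, a
minimal crm-shell sketch of the pacemaker side; the resource names
(p_drbd_r0, ms_drbd_r0, r0) are assumptions for illustration:

    primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
    ms ms_drbd_r0 p_drbd_r0 \
        meta master-max=1 clone-max=2 clone-node-max=1 notify=true

The VG, iSCSI target, and LUN resources would then be grouped and ordered
after the Master role, so that a failover (e.g. one triggered by the
local-io-error handler) moves the whole stack to the surviving node.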