Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, logs can be found here: http://pastebin.com/BGR33jN6

@digimer: Using local-io-error should power off the node and switch the
cluster to the remaining node.... is this a good idea?

Regards,
Marco

2016-09-19 12:58 GMT+02:00 Adam Goryachev <adam at websitemanagers.com.au>:
>
> On 19/09/2016 19:06, Marco Marino wrote:
>
> 2016-09-19 10:50 GMT+02:00 Igor Cicimov <igorc at encompasscorporation.com>:
>
>> On 19 Sep 2016 5:45 pm, "Marco Marino" <marino.mrc at gmail.com> wrote:
>> >
>> > Hi, I'm trying to build an active/passive cluster with DRBD and
>> > Pacemaker for a SAN. I'm using 2 nodes, each with one RAID controller
>> > (MegaRAID). Each node has an SSD disk that works as a cache for reads
>> > (and writes?) using the proprietary CacheCade technology.
>> >
>> Did you configure CacheCade? If the write cache was enabled in
>> write-back mode, then suddenly removing the device from under the
>> controller would have caused serious problems, I guess, since the
>> controller expects to write to the SSD cache first and then flush to
>> the HDDs. Maybe this explains the read-only mode?
>>
> Good point. It is exactly as you wrote. How can I mitigate this behavior
> in a clustered (active/passive) environment? As I said in the other post,
> I think the best solution is to power off the node using local-io-error
> and switch all resources to the other node.... But please give me some
> suggestions....
>
>> > Basically, the structure of the SAN is:
>> >
>> > Physical disks -> RAID -> device /dev/sdb in the OS -> DRBD resource
>> > (that uses /dev/sdb as its backing device, managed by Pacemaker as a
>> > master/slave resource) -> VG (managed with Pacemaker) -> iSCSI target
>> > (with Pacemaker) -> iSCSI LUNs (one for each logical volume in the
>> > VG, managed with Pacemaker)
>> >
>> > A few days ago, the SSD disk was wrongly removed from the primary
>> > node of the cluster and this caused a lot of problems: the DRBD
>> > resource and all logical volumes went into read-only mode with a lot
>> > of I/O errors, but the cluster did not switch to the other node. All
>> > filesystems on the initiators went read-only. There are 2 problems
>> > involved here (I think): 1) Why does removing the SSD disk cause
>> > read-only mode with I/O errors? This means that the SSD is a single
>> > point of failure for a single-node SAN with MegaRAID controllers and
>> > CacheCade technology..... and 2) Why did DRBD not work as expected?
>>
>> What was the state in /proc/drbd?
>>
> I think you will need to examine the logs to find out what happened. It
> would appear (just making a wild guess) that the caching is happening
> between DRBD and iSCSI instead of between DRBD and the RAID. If it
> happened under DRBD, then DRBD should see the read/write error and
> should automatically fail the local storage. It wouldn't necessarily
> fail over to the secondary, but it would do all reads/writes from the
> secondary node. The fact this didn't happen makes it look like the
> failure happened above DRBD.
>
> At least that is my understanding of how it will work in that scenario.
>
> Regards,
> Adam
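
For reference, a minimal sketch of the drbd.conf pieces being discussed
(assuming DRBD 8.4 syntax; the resource name "r0", hostnames "san1"/"san2"
and the addresses below are placeholders, not values taken from this thread):

    resource r0 {
      disk {
        # "detach" drops a failing backing device and keeps serving I/O
        # diskless from the peer over the replication link.
        on-io-error detach;
        # Alternatively, escalate the local error to the handler below:
        # on-io-error call-local-io-error;
      }
      handlers {
        # Invoked only with "on-io-error call-local-io-error;".
        # Powering the node off lets Pacemaker promote the peer and move
        # the iSCSI target there, which is the approach proposed above.
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
      }
      on san1 {
        device    /dev/drbd0;
        disk      /dev/sdb;
        address   10.0.0.1:7788;
        meta-disk internal;
      }
      on san2 {
        device    /dev/drbd0;
        disk      /dev/sdb;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }

Either setting only helps if the error is actually visible to DRBD; as Adam
points out, a failure that happens above DRBD (e.g. in the iSCSI layer) will
not trigger it.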
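
And a rough outline of how the stack described above (DRBD master/slave ->
VG -> iSCSI target -> LUNs) is typically wired up with pcs, so that the whole
group follows the DRBD master on failover. This is only a sketch: the resource
names, IQN, VG and LV names are made-up placeholders and the agent options are
not complete:

    pcs resource create drbd_r0 ocf:linbit:drbd drbd_resource=r0 \
        op monitor interval=30s
    pcs resource master drbd_r0_ms drbd_r0 master-max=1 master-node-max=1 \
        clone-max=2 clone-node-max=1 notify=true
    pcs resource create san_vg ocf:heartbeat:LVM volgrpname=vg_san exclusive=true
    pcs resource create san_tgt ocf:heartbeat:iSCSITarget iqn=iqn.2016-09.local.san:tgt1
    pcs resource create san_lun1 ocf:heartbeat:iSCSILogicalUnit \
        target_iqn=iqn.2016-09.local.san:tgt1 lun=1 path=/dev/vg_san/lv1
    pcs resource group add san_group san_vg san_tgt san_lun1
    pcs constraint colocation add san_group with master drbd_r0_ms INFINITY
    pcs constraint order promote drbd_r0_ms then start san_group

With the colocation and order constraints in place, powering off (or fencing)
the node that hit the I/O error should result in the peer being promoted and
the VG, target and LUNs following it.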