[DRBD-user] Drbd/pacemaker active/passive san failover

Tue Sep 20 10:33:44 CEST 2016

On 20 Sep 2016 5:00 pm, "Marco Marino" <marino.mrc at gmail.com> wrote:
>
> Furthermore there are logs from the secondary node:
>
> http://pastebin.com/A2ySXDCB
>
>
> Please compare the timestamps. It seems that drbd also goes to diskless mode on the secondary node. Why?
>
In the secondary log you can see I/O errors too:

Sep  7 19:55:19 iscsi2 kernel: end_request: I/O error, dev sdb, sector 685931856
Sep  7 19:55:19 iscsi2 kernel: block drbd1: write: error=-5 s=685931856s
Sep  7 19:55:19 iscsi2 kernel: block drbd1: disk( UpToDate -> Failed )
Sep  7 19:55:19 iscsi2 kernel: block drbd1: Local IO failed in drbd_endio_write_sec_final. Detaching...

and since your policy is:

disk {
                on-io-error     detach;
        }

that's what drbd did. No disk => no master.
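
As for the local-io-error idea raised further down, here is a minimal sketch of what that could look like, assuming DRBD 8.4 syntax. The resource name r0 is a placeholder, and the handler command line is the one shipped as a commented-out example in DRBD's sample global_common.conf; script paths may differ on your distribution, and I have not tested this on your setup.

```
resource r0 {                      # r0 is a hypothetical resource name
        disk {
                # run the local-io-error handler instead of detaching
                on-io-error     call-local-io-error;
        }
        handlers {
                # notify, then power the node off immediately
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        }
}
```

With this, an I/O error on the backing device takes the whole node down instead of leaving it diskless, so the cluster should then move the resources to the peer. That only helps if the peer's disk is healthy, which was not the case in the logs above.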

>
>
> 2016-09-20 8:44 GMT+02:00 Marco Marino <marino.mrc at gmail.com>:
>>
>> Hi, logs can be found here: http://pastebin.com/BGR33jN6
>>
>> @digimer:
>> Using local-io-error should power off the node and switch the cluster over to the remaining node... is this a good idea?
>>
>> Regards,
>> Marco
>>
>> 2016-09-19 12:58 GMT+02:00 Adam Goryachev <adam at websitemanagers.com.au>:
>>>
>>>
>>>
>>> On 19/09/2016 19:06, Marco Marino wrote:
>>>>
>>>>
>>>>
>>>> 2016-09-19 10:50 GMT+02:00 Igor Cicimov <igorc at encompasscorporation.com>:
>>>>>
>>>>> On 19 Sep 2016 5:45 pm, "Marco Marino" <marino.mrc at gmail.com> wrote:
>>>>> >
>>>>> > Hi, I'm trying to build an active/passive cluster with drbd and pacemaker for a SAN. I'm using 2 nodes with one RAID controller (MegaRAID) on each one. Each node has an SSD that works as a cache for reads (and writes?), implementing the proprietary CacheCade technology.
>>>>> >
>>>>> Did you configure the CacheCade? If the write cache was enabled in write-back mode, then suddenly removing the device from under the controller would have caused serious problems, I guess, since the controller expects to write to the SSD cache first and then flush to the HDDs. Maybe this explains the read-only mode?
>>>>
>>>> Good point. It is exactly as you wrote. How can I mitigate this behavior in a clustered (active/passive) environment? As I said in the other post, I think the best solution is to power off the node using local-io-error and switch all resources to the other node... But please give me some suggestions.
>>>>
>>>
>>>>
>>>>>
>>>>> > Basically, the structure of the san is:
>>>>> >
>>>>> > Physical disks -> RAID -> Device /dev/sdb in the OS -> Drbd resource (that uses /dev/sdb as backend) (using pacemaker with a master/slave resource) -> VG (managed with pacemaker) -> iSCSI target (with pacemaker) -> iSCSI LUNs (one for each logical volume in the VG, managed with pacemaker)
>>>>> >
>>>>> > A few days ago, the SSD was mistakenly removed from the primary node of the cluster and this caused a lot of problems: the drbd resource and all logical volumes went into read-only mode with a lot of I/O errors, but the cluster did not switch to the other node. All filesystems on the initiators went into read-only mode. There are 2 problems involved here (I think): 1) Why does removing the SSD cause read-only mode with I/O errors? This means that the SSD is a single point of failure for a single-node SAN with MegaRAID controllers and CacheCade technology... and 2) Why did drbd not work as expected?
>>>>> What was the state in /proc/drbd ?
>>>>
>>>>
>>> I think you will need to examine the logs to find out what happened. It would appear (just making a wild guess) that the caching is happening between DRBD and iSCSI instead of between DRBD and RAID. If the failure had happened under DRBD, then DRBD should have seen the read/write error and automatically failed the local storage. It wouldn't necessarily fail over to the secondary, but it would do all reads/writes from the secondary node. The fact that this didn't happen makes it look like the failure happened above DRBD.
>>>
>>> At least that is my understanding of how it will work in that scenario.
>>>
>>> Regards,
>>> Adam
>>>
>>> _______________________________________________
>>> drbd-user mailing list
>>> drbd-user at lists.linbit.com
>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>>
>>
>
>