Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 19/09/2016 19:06, Marco Marino wrote:
> 2016-09-19 10:50 GMT+02:00 Igor Cicimov <igorc at encompasscorporation.com>:
>
> > On 19 Sep 2016 5:45 pm, "Marco Marino" <marino.mrc at gmail.com> wrote:
> > >
> > > Hi, I'm trying to build an active/passive cluster with DRBD and
> > > pacemaker for a SAN. I'm using 2 nodes with one RAID controller
> > > (MegaRAID) on each one. Each node has an SSD disk that works as a
> > > cache for reads (and writes?) using the proprietary CacheCade
> > > technology.
> >
> > Did you configure the CacheCade? If the write cache was enabled in
> > write-back mode, then suddenly removing the device from under the
> > controller would have caused serious problems, I guess, since the
> > controller expects to write to the SSD cache first and then flush
> > to the HDDs. Maybe this explains the read-only mode?
>
> Good point. It is exactly as you wrote. How can I mitigate this
> behavior in a clustered (active/passive) environment? As I said in
> the other post, I think the best solution is to power off the node
> using local-io-error and switch all resources to the other node.
> But please give me some suggestions.
>
> > > Basically, the structure of the SAN is:
> > >
> > > Physical disks -> RAID -> device /dev/sdb in the OS -> DRBD
> > > resource (that uses /dev/sdb as backend, managed by pacemaker as
> > > a master/slave resource) -> VG (managed with pacemaker) -> iSCSI
> > > target (with pacemaker) -> iSCSI LUNs (one for each logical
> > > volume in the VG, managed with pacemaker)
> > >
> > > A few days ago, the SSD disk was wrongly removed from the primary
> > > node of the cluster and this caused a lot of problems: the DRBD
> > > resource and all logical volumes went into read-only mode with a
> > > lot of I/O errors, but the cluster did not switch to the other
> > > node. All filesystems on the initiators went read-only. There are
> > > 2 problems involved here (I think): 1) Why does removing the SSD
> > > disk cause read-only mode with I/O errors? This means that the
> > > SSD is a single point of failure for a single-node SAN with
> > > MegaRAID controllers and CacheCade technology... and 2) Why did
> > > DRBD not work as expected?

What was the state in /proc/drbd?

I think you will need to examine the logs to find out what happened.
It would appear (just making a wild guess) that the cache is happening
between DRBD and iSCSI instead of between DRBD and the RAID. If it
happened under DRBD, then DRBD should see the read/write error and
should automatically fail the local storage. It wouldn't necessarily
fail over to the secondary, but it would do all reads and writes from
the secondary node. The fact that this didn't happen makes it look
like the failure happened above DRBD. At least that is my
understanding of how it would work in that scenario.

Regards,
Adam
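
The state Adam asks about can be read back with standard drbd-utils
commands (DRBD 8.4 era; the resource name "r0" and the log path are
assumptions, not taken from the thread):

    cat /proc/drbd                 # ds:Diskless/UpToDate would show a detached backing disk
    drbdadm dstate r0              # disk state of local/peer backing devices
    drbdadm cstate r0              # connection state (Connected, StandAlone, ...)
    grep -i drbd /var/log/syslog   # kernel messages from around the time the SSD was pulled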
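
The mitigation Marco mentions (power the node off via local-io-error so
pacemaker promotes the peer) maps onto DRBD's on-io-error policy. A
minimal sketch of the relevant drbd.conf fragment, assuming DRBD 8.4
syntax; the resource name and the exact power-off command are
illustrative choices, not confirmed by the thread:

    resource r0 {
      disk {
        # run the local-io-error handler when the backing device
        # errors out; the alternative "detach" instead drops the disk
        # and continues diskless from the peer, which is the behaviour
        # Adam describes
        on-io-error call-local-io-error;
      }
      handlers {
        # notify-io-error.sh ships with drbd-utils; "echo o" powers
        # the node off at once so pacemaker fails everything over
        local-io-error "/usr/lib/drbd/notify-io-error.sh; echo o > /proc/sysrq-trigger";
      }
    }

Note that either handler only helps if the errors actually reach DRBD;
if the controller hides the CacheCade failure, or surfaces it only as a
read-only device above DRBD, neither fires, which would be consistent
with what Marco observed.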
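
For completeness, the stack Marco lists is usually wired up in
pacemaker along these lines (a sketch in crm shell syntax; every name,
IQN, volume group and device path here is hypothetical):

    primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
    ms ms_drbd_r0 p_drbd_r0 \
        meta master-max=1 clone-max=2 notify=true
    primitive p_lvm_vg0 ocf:heartbeat:LVM params volgrpname=vg0
    primitive p_target ocf:heartbeat:iSCSITarget \
        params iqn=iqn.2016-09.local.san:target0
    primitive p_lun1 ocf:heartbeat:iSCSILogicalUnit \
        params target_iqn=iqn.2016-09.local.san:target0 lun=1 path=/dev/vg0/lv1
    group g_san p_lvm_vg0 p_target p_lun1
    colocation co_san_on_master inf: g_san ms_drbd_r0:Master
    order o_drbd_before_san inf: ms_drbd_r0:promote g_san:start

The colocation and order constraints are what make "switch all
resources to the other node" happen automatically once the DRBD master
role moves.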