[DRBD-user] DRBD/Pacemaker active/passive SAN failover

Marco Marino marino.mrc at gmail.com
Tue Sep 20 11:13:38 CEST 2016

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


mmm... This means that I did not understand this policy. I thought that the
I/O error happened only on the primary node, but it seems that all nodes
become diskless in this case. Why? Basically, I had an I/O error on the
primary node because I wrongly removed the SSD (CacheCade) disk. Why is the
secondary node also affected? And furthermore, using

local-io-error "/usr/lib/drbd/notify-io-error.sh;
/usr/lib/drbd/notify-emergency-shutdown.sh; echo o >
/proc/sysrq-trigger ; halt -f";

will both nodes be shut down? And again, should I remove on-io-error
detach; if I use local-io-error?
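
From re-reading the man page, I would guess that local-io-error only fires
when on-io-error is set to call-local-io-error, and that the handler runs
only on the node that hit the error, so only that node would halt. A minimal
sketch of what I mean, assuming DRBD 8.4 syntax and a placeholder resource
name r1:

resource r1 {
        disk {
                # call-local-io-error replaces detach; the policies
                # pass_on, detach and call-local-io-error are exclusive
                on-io-error     call-local-io-error;
        }
        handlers {
                # executed only on the node where the local I/O error occurred
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        }
}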

Thank you



2016-09-20 10:33 GMT+02:00 Igor Cicimov <igorc at encompasscorporation.com>:

> On 20 Sep 2016 5:00 pm, "Marco Marino" <marino.mrc at gmail.com> wrote:
> >
> > Furthermore, here are the logs from the secondary node:
> >
> > http://pastebin.com/A2ySXDCB
> >
> >
> > Please compare the timestamps. It seems that DRBD goes into diskless
> > mode on the secondary node as well. Why?
> >
> In the secondary log you can see I/O errors too:
>
> Sep  7 19:55:19 iscsi2 kernel: end_request: I/O error, dev sdb, sector 685931856
> Sep  7 19:55:19 iscsi2 kernel: block drbd1: write: error=-5 s=685931856s
> Sep  7 19:55:19 iscsi2 kernel: block drbd1: disk( UpToDate -> Failed )
> Sep  7 19:55:19 iscsi2 kernel: block drbd1: Local IO failed in drbd_endio_write_sec_final. Detaching...
>
> and since your policy is:
>
> disk {
>                 on-io-error     detach;
>         }
>
> that's what DRBD did. No disk => no master.
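>
> A minimal recovery sketch for after the backing disk is repaired, assuming
> your resource is named r1 (adjust the name):
>
>     cat /proc/drbd       # "ds:Diskless/..." confirms the detached state
>     drbdadm attach r1    # reattach the repaired backing device
>     cat /proc/drbd       # watch the resync: Inconsistent -> UpToDate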
>
> >
> >
> > 2016-09-20 8:44 GMT+02:00 Marco Marino <marino.mrc at gmail.com>:
> >>
> >> Hi, logs can be found here: http://pastebin.com/BGR33jN6
> >>
> >> @digimer:
> >> Using local-io-error should power off the node and switch the cluster
> >> to the remaining node.... is this a good idea?
> >>
> >> Regards,
> >> Marco
> >>
> >> 2016-09-19 12:58 GMT+02:00 Adam Goryachev <adam at websitemanagers.com.au>:
> >>>
> >>>
> >>>
> >>> On 19/09/2016 19:06, Marco Marino wrote:
> >>>>
> >>>>
> >>>>
> >>>> 2016-09-19 10:50 GMT+02:00 Igor Cicimov <igorc at encompasscorporation.com>:
> >>>>>
> >>>>> On 19 Sep 2016 5:45 pm, "Marco Marino" <marino.mrc at gmail.com> wrote:
> >>>>> >
> >>>>> > Hi, I'm trying to build an active/passive cluster with DRBD and
> >>>>> > Pacemaker for a SAN. I'm using 2 nodes, each with one RAID
> >>>>> > controller (MegaRAID). Each node has an SSD disk that works as a
> >>>>> > cache for reads (and writes?) using the proprietary CacheCade
> >>>>> > technology.
> >>>>> >
> >>>>> Did you configure the CacheCade? If the write cache was enabled in
> >>>>> write-back mode, then suddenly removing the device from under the
> >>>>> controller would have caused serious problems, I guess, since the
> >>>>> controller expects to write to the SSD cache first and then flush to
> >>>>> the HDDs. Maybe this explains the read-only mode?
> >>>>
> >>>> Good point. It is exactly as you wrote. How can I mitigate this
> >>>> behavior in a clustered (active/passive) environment? As I said in the
> >>>> other post, I think the best solution is to power off the node using
> >>>> local-io-error and switch all resources to the other node.... But
> >>>> please give me some suggestions....
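> >>>>
> >>>> For the survivor to take over cleanly after such a self-shutdown,
> >>>> Pacemaker would also need working fencing. A rough sketch of what I
> >>>> have in mind, assuming IPMI-based fencing; node names and addresses
> >>>> are placeholders:
> >>>>
> >>>>     pcs stonith create fence_iscsi1 fence_ipmilan pcmk_host_list=iscsi1 \
> >>>>         ipaddr=10.0.0.11 login=admin passwd=secret
> >>>>     pcs stonith create fence_iscsi2 fence_ipmilan pcmk_host_list=iscsi2 \
> >>>>         ipaddr=10.0.0.12 login=admin passwd=secret
> >>>>     pcs property set stonith-enabled=true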
> >>>>
> >>>
> >>>>
> >>>>>
> >>>>> > Basically, the structure of the SAN is:
> >>>>> >
> >>>>> > Physical disks -> RAID -> device /dev/sdb in the OS -> DRBD
> >>>>> > resource (that uses /dev/sdb as its backend, managed by Pacemaker
> >>>>> > as a master/slave resource) -> VG (managed with Pacemaker) -> iSCSI
> >>>>> > target (with Pacemaker) -> iSCSI LUNs (one for each logical volume
> >>>>> > in the VG, managed with Pacemaker)
> >>>>> >
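> >>>>> > A rough sketch of the corresponding Pacemaker resources; resource
> >>>>> > and IQN names below are placeholders, assuming pcs and the standard
> >>>>> > OCF agents:
> >>>>> >
> >>>>> >     pcs resource create drbd_r1 ocf:linbit:drbd drbd_resource=r1
> >>>>> >     pcs resource master ms_drbd_r1 drbd_r1 master-max=1 clone-max=2 notify=true
> >>>>> >     pcs resource create vg_san ocf:heartbeat:LVM volgrpname=vg_san exclusive=true
> >>>>> >     pcs resource create tgt ocf:heartbeat:iSCSITarget iqn=iqn.2016-09.local:san
> >>>>> >     pcs resource create lun1 ocf:heartbeat:iSCSILogicalUnit \
> >>>>> >         target_iqn=iqn.2016-09.local:san lun=1 path=/dev/vg_san/lv1
> >>>>> >     pcs constraint order promote ms_drbd_r1 then start vg_san
> >>>>> >     # (matching colocation constraints omitted for brevity)
> >>>>> >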
> >>>>> > A few days ago, the SSD disk was wrongly removed from the primary
> >>>>> > node of the cluster and this caused a lot of problems: the DRBD
> >>>>> > resource and all logical volumes went into read-only mode with a
> >>>>> > lot of I/O errors, but the cluster did not switch to the other
> >>>>> > node. All filesystems on the initiators went into read-only mode.
> >>>>> > There are 2 problems involved here (I think): 1) Why does removing
> >>>>> > the SSD disk cause read-only mode with I/O errors? This means that
> >>>>> > the SSD is a single point of failure for a single-node SAN with
> >>>>> > MegaRAID controllers and CacheCade technology..... and 2) Why did
> >>>>> > DRBD not work as expected?
> >>>>> What was the state in /proc/drbd ?
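> >>>>> (Illustrative only, not output from these nodes: a detached device
> >>>>> would show a Diskless disk state in /proc/drbd, something like
> >>>>>
> >>>>>      1: cs:Connected ro:Primary/Secondary ds:Diskless/UpToDate C r-----
> >>>>>
> >>>>> where ds: is the local/peer disk state.)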
> >>>>
> >>>>
> >>> I think you will need to examine the logs to find out what happened.
> >>> It would appear (just making a wild guess) that the caching is
> >>> happening between DRBD and iSCSI instead of between DRBD and the RAID.
> >>> If it happened under DRBD, then DRBD should see the read/write error
> >>> and should automatically fail the local storage. It wouldn't
> >>> necessarily fail over to the secondary, but it would do all reads and
> >>> writes from the secondary node. The fact this didn't happen makes it
> >>> look like the failure happened above DRBD.
> >>>
> >>> At least that is my understanding of how it would work in that
> >>> scenario.
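> >>>
> >>> One way to verify the software stacking around DRBD (a quick sketch;
> >>> the resource name r1 is assumed):
> >>>
> >>>     lsblk /dev/sdb        # shows the sdb -> drbd1 -> LVM stacking
> >>>     drbdadm dump r1       # the "disk" line shows DRBD's backing device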
> >>>
> >>> Regards,
> >>> Adam
> >>>
> >>
> >
> >
>

