[DRBD-user] Drbd/pacemaker active/passive san failover

Tue Sep 20 13:12:53 CEST 2016

On Tue, Sep 20, 2016 at 7:13 PM, Marco Marino <marino.mrc at gmail.com> wrote:

> mmm... This means that I did not understand this policy. I thought that the
> I/O error happens only on the primary node, but it seems that all nodes
> become diskless in this case. Why? Basically I have an I/O error on the
> primary node because I wrongly removed the ssd (cachecade) disk. Why is the
> secondary node also affected??
>

As I see it, the problem is that when the I/O error happened on the
secondary, its disk was no longer UpToDate:

Sep  7 19:55:19 iscsi2 kernel: block drbd1: disk( *UpToDate -> Failed* )

in which case it cannot be promoted to primary. I don't think any policy
you had in those handlers would have made a difference in your case. By
removing the write-back cache drive in the middle of operation you caused
damage on both ends. Even if you could force the promotion, would you
really want to promote a secondary that has corrupt data to primary at
this point?

You might try the call-local-io-error option as suggested by Lars, or even
pass_on, and let the file system handle it. You should also take Digimer's
suggestion and let Pacemaker take care of everything; you already have it
installed, so why not use it? You need properly functioning fencing in that
case, though.
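
To make the options above concrete, here is a rough sketch of how they
could look in the resource definition. It assumes DRBD 8.4 syntax (in other
versions some options live in different sections), a hypothetical resource
name "r0", and the stock script paths under /usr/lib/drbd, so check it
against the drbd.conf man page for your version before using any of it. As
far as I know the local-io-error handler is only invoked when on-io-error is
set to call-local-io-error, so it takes the place of "on-io-error detach"
rather than being combined with it.

```
# Sketch only: "r0" is a placeholder resource name, DRBD 8.4 syntax assumed.
resource r0 {
  disk {
    # Pick exactly one on-io-error policy:
    #   detach              - drop the backing device, continue diskless
    #   pass_on             - report the error to the upper layers
    #   call-local-io-error - run the local-io-error handler below
    on-io-error call-local-io-error;

    # Fencing policy used together with Pacemaker (8.4 keeps it in "disk").
    fencing resource-and-stonith;
  }

  handlers {
    # Runs only on the node that saw the I/O error, and only when
    # on-io-error is set to call-local-io-error.
    local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";

    # Handlers shipped with DRBD for Pacemaker integration.
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

The handler is a local action, so by itself it would only take down the
node whose backing device failed. In your case both backing devices failed,
which is why both nodes would have reacted.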

As someone else suggested, you should also move the root file system off
the CacheCade virtual drive (just an assumption, but it looks like that is
the case). Creating a mirror of SSD drives for the CacheCade is also an
option to avoid similar accidents in the future (what is the chance that
someone removes 2 drives at the same time??). And finally, putting a
"DON'T REMOVE" sticker on the drive might work if nothing else does :-D

> And furthermore, using
>
> local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>
> will both nodes be shut down? and again, should I remove on-io-error detach; if I use local-io-error?
>
> Thank you
>
>