<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Sep 21, 2016 at 1:17 AM, Marco Marino <span dir="ltr"><<a href="mailto:marino.mrc@gmail.com" target="_blank">marino.mrc@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div>As told by Lars Ellenberg, one first problem with the configuration <a href="http://pastebin.com/r3N1gzwx" target="_blank">http://pastebin.com/r3N1gzwx</a><br></div>is that on-io-error should be<br>on-io-error call-local-io-error;<br></div></div></div></blockquote><div><br></div><div>And in your specific case that would have shut down both servers since both had io-error. Don't see how could that help.<br> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div></div>and not detach. Furthermore, in the configuration there is also another error:<br></div>fencing should be<br>fencing resource-and-stonith;<font face="arial,helvetica,sans-serif"><br>and not resource-only.</font></div></blockquote><div><br></div><div>Only if you have fencing configured in Pacemaker. Do you?<br> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><br><pre><font face="arial,helvetica,sans-serif">But I don't understand (again) why the secondary node becomes diskless (UpToDate -> Failed and then Failed -> Diskless). </font> <br></pre></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><pre><font face="arial,helvetica,sans-serif"></font></pre><pre><font face="arial,helvetica,sans-serif">I'd like to do one (stupid) example: if I have 2 nodes with 1 disk for each node used as backend for a drbd resource and one of these disks fails, nothing should happen on the secondary node.....<br></font></pre><pre><font face="arial,helvetica,sans-serif">Igor Cicimov: why removing the write-back cache drive on the primary node cause problems also on the secondary node? What is the dynamics involved?<br></font></pre></div></blockquote><div>As Lars pointed out it is up to you to figure it out by examining the
> But I don't understand (again) why the secondary node becomes diskless
> (UpToDate -> Failed and then Failed -> Diskless).
>
> Let me give one (stupid) example: if I have 2 nodes with 1 disk on each
> node used as the backend for a DRBD resource, and one of these disks
> fails, nothing should happen on the secondary node...
>
> Igor Cicimov: why does removing the write-back cache drive on the
> primary node cause problems on the secondary node as well? What are the
> dynamics involved?

As Lars pointed out, it is up to you to figure that out by examining the logs and your setup. One possible reason is that there was in-flight data that was flushed from the cache and replicated to the secondary when you removed the SSD, and the secondary received a corrupt stream that could not be written to its disk. It is also possible that you already had a problem on the secondary, which went diskless even *before* you created the issue on the primary. Comparing the timestamps in both servers' logs should tell you if that was the case.
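Something like this on each node should be enough to line up the two histories (a sketch; I'm assuming the kernel log ends up in /var/log/messages and that the affected minor is drbd1, adjust to your setup):

    # DRBD state transitions for the minor in question
    grep -E 'drbd1.*(disk|pdsk)\(' /var/log/messages

    # current state of the local and peer backing devices
    cat /proc/drbd
    drbdadm dstate <resource>

If the secondary logged its own UpToDate -> Failed transition before anything happened on the primary, you were dealing with two separate failures.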
> However, the root file system is not part of the CacheCade virtual
> drive, and yes, one possible solution could be to create a mirror of
> SSD drives for CacheCade. But I'm using DRBD/Pacemaker because in a
> similar situation I need to switch resources automatically to the
> other node.
>
> 2016-09-20 13:12 GMT+02:00 Igor Cicimov <igorc@encompasscorporation.com>:
>> On Tue, Sep 20, 2016 at 7:13 PM, Marco Marino <marino.mrc@gmail.com> wrote:
>>> mmm... This means that I did not understand this policy. I thought
>>> that the I/O error happens only on the primary node, but it seems
>>> that all nodes become diskless in this case. Why? Basically I have an
>>> I/O error on the primary node because I wrongly removed the SSD
>>> (CacheCade) disk. Why is the secondary node also affected?
>>
>> The problem, as I see it, is that when the io-error happened on the
>> secondary the disk was not UpToDate any more:
>>
>>     Sep 7 19:55:19 iscsi2 kernel: block drbd1: disk( UpToDate -> Failed )
>>
>> in which case it cannot be promoted to primary. I don't think whatever
>> policy you had in those handlers would have made any difference in
>> your case. By removing the write-back cache drive in the middle of
>> operation you caused damage on both ends. Even if you could force it,
>> would you really want to promote a secondary that has corrupt data to
>> primary at this point?
>>
>> You might try the call-local-io-error option as suggested by Lars, or
>> even pass_on and let the file system handle it. You should also take
>> Digimer's suggestion and let Pacemaker take care of everything; you
>> have it installed already, so why not use it. You need properly
>> functioning fencing in that case, though.
>>
>> As someone else suggested, you should also remove the root file system
>> from the CacheCade virtual drive (just an assumption, but it looks like
>> that is the case). Creating a mirror of SSD drives for the CacheCade is
>> also an option to avoid similar accidents in the future (what is the
>> chance that someone removes 2 drives at the same time??). And finally,
>> putting a "DON'T REMOVE" sticker on the drive might work if nothing
>> else does :-D
>>
>>> And furthermore, using
>>>
>>>     local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>>>
>>> will that shut down both nodes? And again, should I remove
>>> "on-io-error detach;" if I use local-io-error?
>>>
>>> Thank you