
<p dir="ltr">On 20 Sep 2016 5:00 pm, &quot;Marco Marino&quot; &lt;<a href="mailto:marino.mrc@gmail.com">marino.mrc@gmail.com</a>&gt; wrote:<br>

&gt;<br>

&gt; Furthermore there are logs from the secondary node:<br>

&gt;<br>

&gt; <a href="http://pastebin.com/A2ySXDCB">http://pastebin.com/A2ySXDCB</a><br>

&gt;<br>

&gt;<br>

&gt; Please compare the timestamps. It seems that DRBD also goes into diskless mode on the secondary node. Why?<br>

&gt;<br>

In the secondary log you can see I/O errors too:</p>

<p dir="ltr">Sep  7 19:55:19 iscsi2 kernel: end_request: I/O error, dev sdb, sector 685931856<br>

Sep  7 19:55:19 iscsi2 kernel: block drbd1: write: error=-5 s=685931856s<br>

Sep  7 19:55:19 iscsi2 kernel: block drbd1: disk( UpToDate -&gt; Failed )<br>

Sep  7 19:55:19 iscsi2 kernel: block drbd1: Local IO failed in drbd_endio_write_sec_final. Detaching...</p>

<p dir="ltr">and since your policy is:</p>

<p dir="ltr">disk {<br>

                on-io-error     detach;<br>

        }</p>

<p dir="ltr">that&#39;s exactly what DRBD did: it detached the backing device. No disk =&gt; no master.</p>
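
<p dir="ltr">To compare the timing on both nodes, the same two events can be pulled out of each log. The following is only a minimal Python sketch: the sample text is the four lines quoted above, while the regular expressions and the scan helper are my own illustration and are not part of DRBD or any of its tools.</p>

```python
import re

# Sample lines copied from the secondary node's log quoted above.
SAMPLE_LOG = """\
Sep  7 19:55:19 iscsi2 kernel: end_request: I/O error, dev sdb, sector 685931856
Sep  7 19:55:19 iscsi2 kernel: block drbd1: write: error=-5 s=685931856s
Sep  7 19:55:19 iscsi2 kernel: block drbd1: disk( UpToDate -> Failed )
Sep  7 19:55:19 iscsi2 kernel: block drbd1: Local IO failed in drbd_endio_write_sec_final. Detaching...
"""

# Patterns written for the message formats seen in this thread only (assumption:
# other kernel or DRBD versions may word these messages differently).
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
DISK_CHANGE = re.compile(r"block (drbd\d+): disk\( (\w+) -> (\w+) \)")


def scan(log_text):
    """Return (io_errors, disk_changes) found in the given log text."""
    io_errors = []      # list of (device, sector)
    disk_changes = []   # list of (drbd_device, old_state, new_state)
    for line in log_text.splitlines():
        match = IO_ERROR.search(line)
        if match:
            io_errors.append((match.group(1), int(match.group(2))))
            continue
        match = DISK_CHANGE.search(line)
        if match:
            disk_changes.append((match.group(1), match.group(2), match.group(3)))
    return io_errors, disk_changes


io_errors, disk_changes = scan(SAMPLE_LOG)
print(io_errors)
print(disk_changes)
```

<p dir="ltr">Running the same scan over the primary&#39;s log shows which node saw the backing-device error first.</p>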

<p dir="ltr">&gt;<br>

&gt;<br>

&gt; 2016-09-20 8:44 GMT+02:00 Marco Marino &lt;<a href="mailto:marino.mrc@gmail.com">marino.mrc@gmail.com</a>&gt;:<br>

&gt;&gt;<br>

&gt;&gt; Hi, logs can be found here: <a href="http://pastebin.com/BGR33jN6">http://pastebin.com/BGR33jN6</a><br>

&gt;&gt;<br>

&gt;&gt; @digimer:<br>

&gt;&gt; Using local-io-error should power off the node and switch the cluster to the remaining node... Is this a good idea?<br>

&gt;&gt;<br>

&gt;&gt; Regards,<br>

&gt;&gt; Marco<br>

&gt;&gt;<br>

&gt;&gt; 2016-09-19 12:58 GMT+02:00 Adam Goryachev &lt;<a href="mailto:adam@websitemanagers.com.au">adam@websitemanagers.com.au</a>&gt;:<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; On 19/09/2016 19:06, Marco Marino wrote:<br>

&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt; 2016-09-19 10:50 GMT+02:00 Igor Cicimov &lt;<a href="mailto:igorc@encompasscorporation.com">igorc@encompasscorporation.com</a>&gt;:<br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt; On 19 Sep 2016 5:45 pm, &quot;Marco Marino&quot; &lt;<a href="mailto:marino.mrc@gmail.com">marino.mrc@gmail.com</a>&gt; wrote:<br>

&gt;&gt;&gt;&gt;&gt; &gt;<br>

&gt;&gt;&gt;&gt;&gt; &gt; Hi, I&#39;m trying to build an active/passive cluster with DRBD and Pacemaker for a SAN. I&#39;m using 2 nodes with one RAID controller (MegaRAID) on each. Each node has an SSD that works as a read (and write?) cache, using the proprietary CacheCade technology.<br>

&gt;&gt;&gt;&gt;&gt; &gt;<br>

&gt;&gt;&gt;&gt;&gt; Did you configure the CacheCade? If the write cache was enabled in write-back mode, then suddenly removing the device from under the controller would have caused serious problems, I guess, since the controller expects to write to the SSD cache first and then flush to the HDDs. Maybe this explains the read-only mode?<br>

&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt; Good point. It is exactly as you wrote. How can I mitigate this behavior in a clustered (active/passive) environment? As I said in the other post, I think the best solution is to power off the node using local-io-error and switch all resources to the other node... But please give me some suggestions.<br>

&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;  <br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt; &gt; Basically, the structure of the san is:<br>

&gt;&gt;&gt;&gt;&gt; &gt;<br>

&gt;&gt;&gt;&gt;&gt; &gt; Physical disks -&gt; RAID -&gt; Device /dev/sdb in the OS -&gt; DRBD resource (that uses /dev/sdb as backend) (using Pacemaker with a master/slave resource) -&gt; VG (managed with Pacemaker) -&gt; iSCSI target (with Pacemaker) -&gt; iSCSI LUNs (one for each logical volume in the VG, managed with Pacemaker)<br>

&gt;&gt;&gt;&gt;&gt; &gt;<br>

&gt;&gt;&gt;&gt;&gt; &gt; A few days ago, the SSD was mistakenly removed from the primary node of the cluster, and this caused a lot of problems: the DRBD resource and all logical volumes went into read-only mode with a lot of I/O errors, but the cluster did not switch to the other node. All filesystems on the initiators went into read-only mode. There are 2 problems involved here (I think): 1) Why does removing the SSD cause read-only mode with I/O errors? This means that the SSD is a single point of failure for a single-node SAN with MegaRAID controllers and CacheCade technology. 2) Why did DRBD not work as expected?<br>

&gt;&gt;&gt;&gt;&gt; What was the state in /proc/drbd ?<br>

&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt; I think you will need to examine the logs to find out what happened. It would appear (just making a wild guess) that the caching is happening between DRBD and iSCSI instead of between DRBD and RAID. If the failure had happened under DRBD, then DRBD should have seen the read/write error and automatically failed the local storage. It wouldn&#39;t necessarily fail over to the secondary, but it would do all reads/writes from the secondary node. The fact that this didn&#39;t happen makes it look like the failure happened above DRBD.<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; At least that is my understanding of how it will work in that scenario.<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; Regards,<br>

&gt;&gt;&gt; Adam<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; _______________________________________________<br>

&gt;&gt;&gt; drbd-user mailing list<br>

&gt;&gt;&gt; <a href="mailto:drbd-user@lists.linbit.com">drbd-user@lists.linbit.com</a><br>

&gt;&gt;&gt; <a href="http://lists.linbit.com/mailman/listinfo/drbd-user">http://lists.linbit.com/mailman/listinfo/drbd-user</a><br>

&gt;&gt;&gt;<br>

&gt;&gt;<br>

&gt;<br>

&gt;<br>


&gt;</p>
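
<p dir="ltr">On the /proc/drbd question quoted above: the connection, role and disk states are in the cs:, ro: and ds: fields of each resource line, local side first. Below is a rough Python sketch for pulling them out. The sample line is hypothetical (it is not taken from your nodes) and assumes the DRBD 8.x layout; parse_status is just an illustrative helper name.</p>

```python
import re

# Hypothetical status line in the DRBD 8.x /proc/drbd layout (assumption:
# not copied from the nodes in this thread).
SAMPLE_LINE = " 1: cs:Connected ro:Secondary/Primary ds:Diskless/UpToDate C r-----"

# Matches the cs:, ro: and ds: fields and captures their values.
FIELD = re.compile(r"\b(cs|ro|ds):(\S+)")


def parse_status(line):
    """Split one /proc/drbd resource line into connection, role and disk states."""
    fields = dict(FIELD.findall(line))
    # ro: and ds: are written as local/peer.
    local_role, _, peer_role = fields["ro"].partition("/")
    local_disk, _, peer_disk = fields["ds"].partition("/")
    return {
        "connection": fields["cs"],
        "local_role": local_role,
        "peer_role": peer_role,
        "local_disk": local_disk,
        "peer_disk": peer_disk,
    }


status = parse_status(SAMPLE_LINE)
print(status)
# A local disk state of Diskless means the backing device has been detached.
print(status["local_disk"] == "Diskless")
```

<p dir="ltr">To use it on a real node, feed it the resource lines read from /proc/drbd instead of the sample string.</p>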