<div dir="ltr"><div><div><div><div><div>Hi, I'm trying to build an active/passive cluster with DRBD and Pacemaker for a SAN. I'm using two nodes, each with one RAID controller (MegaRAID). Each node has an SSD that acts as a cache for reads (and writes?) via the proprietary CacheCade technology. <br><br></div>Basically, the structure of the SAN is:<br><br></div>Physical disks -> RAID -> device /dev/sdb in the OS -> DRBD resource (using /dev/sdb as its backing device, managed by Pacemaker as a master/slave resource) -> VG (managed with Pacemaker) -> iSCSI target (with Pacemaker) -> iSCSI LUNs (one for each logical volume in the VG, managed with Pacemaker)<br><br></div>A few days ago, the SSD was accidentally removed from the primary node of the cluster, and this caused a lot of problems: the DRBD resource and all logical volumes went read-only with a lot of I/O errors, but the cluster did not switch to the other node. All filesystems on the initiators went read-only as well. I think there are two problems involved here: 1) Why does removing the SSD cause read-only mode with I/O errors? This would mean the SSD is a single point of failure for a single-node SAN with MegaRAID controllers and CacheCade technology... and 2) Why did DRBD not work as expected?<br></div>For point 1) I'm checking with the vendor, and I doubt there is much I can do.<br></div>For point 2) there are errors in my DRBD configuration. My idea is that when an I/O error happens on the primary node, the cluster should switch to the secondary node and shut down the damaged node. 
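<br><div>For reference, this is roughly how I plan to wire the stack into Pacemaker with the crm shell. It is only a sketch: the resource names, the VG name, and the IQN below are placeholders, not my real values.<br><pre class="gmail-de1"># Illustrative crmsh configuration for the stack described above
# (r0, vg0, the IQN, and the LV path are placeholders)
primitive p_drbd ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="29s" role="Master" \
        op monitor interval="31s" role="Slave"
ms ms_drbd p_drbd \
        meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" notify="true"
primitive p_lvm ocf:heartbeat:LVM params volgrpname="vg0"
primitive p_target ocf:heartbeat:iSCSITarget \
        params iqn="iqn.2016-01.com.example:san"
primitive p_lun1 ocf:heartbeat:iSCSILogicalUnit \
        params target_iqn="iqn.2016-01.com.example:san" lun="1" path="/dev/vg0/lv1"
group g_san p_lvm p_target p_lun1
# the iSCSI stack must run where DRBD is Master, and only after promotion
colocation col_san_with_drbd inf: g_san ms_drbd:Master
order ord_drbd_before_san inf: ms_drbd:promote g_san:start</pre></div>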
<br><div><div>Here -> <a href="http://pastebin.com/79dDK66m">http://pastebin.com/79dDK66m</a> you can see the current DRBD configuration, but I need to change a lot of things, and I want to share my ideas here:<br><br>1) The "handlers" section should be moved into the "common" section of global_common.conf instead of the resource file.<br><br></div><div>2) I'm thinking of modifying the "handlers" section as follows:<br><pre class="gmail-de1">handlers <span class="gmail-br0">{</span>
                pri-on-incon-degr <span class="gmail-st0">"/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f"</span>;
                pri-lost-after-sb <span class="gmail-st0">"/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f"</span>;
                local-io-error <span class="gmail-st0">"/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f"</span>;
                <span class="gmail-co0"># Hook into Pacemaker's fencing.</span>
                fence-peer <span class="gmail-st0">"/usr/lib/drbd/crm-fence-peer.sh"</span>;
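                <span class="gmail-co0"># Note: if I understand the docs correctly, the local-io-error</span>
                <span class="gmail-co0"># handler above only runs when the disk section also sets</span>
                <span class="gmail-co0"># "on-io-error call-local-io-error;".</span>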
        <span class="gmail-br0">}</span></pre><br></div><div>This way, when an I/O error happens, the node will be shut down and Pacemaker will move the resources to the other node (or at least this should avoid problematic behavior...)<br></div><div><br></div><div>3) I'm thinking of moving the "fencing" directive from the resource file to global_common.conf. Furthermore, I want to change it to:<br><pre class="gmail-de1">fencing resource-and-stonith;</pre><br></div><div>4) Finally, in the common "net" section I need to add:<br><pre class="gmail-de1">after-sb-0pri discard-zero-changes;
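# (as I understand it) auto-resolve only when one side made no
# writes during the split brain; that side syncs from the other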
after-sb-1pri discard-secondary;
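# (as I understand it) the node that is secondary during the split
# brain discards its changes and resyncs from the primary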
after-sb-2pri disconnect;<br><br></pre>The final configuration will be -> <a href="http://pastebin.com/r3N1gzwx">http://pastebin.com/r3N1gzwx</a><br><br></div><div>Please give me suggestions about mistakes and possible changes.<br><br></div><div>Thank you<br></div><div><br></div><div><br></div></div></div>