<div dir="ltr">Furthermore there are logs from the secondary node:<br><div><br><a href="http://pastebin.com/A2ySXDCB">http://pastebin.com/A2ySXDCB</a><br><br><br></div><div>Please compare time. It seems that also on the secondary node drbd goes to diskless mode. Why?<br><br><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">2016-09-20 8:44 GMT+02:00 Marco Marino <span dir="ltr"><<a href="mailto:marino.mrc@gmail.com" target="_blank">marino.mrc@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div>Hi, logs can be found here: <a href="http://pastebin.com/BGR33jN6" target="_blank">http://pastebin.com/BGR33jN6</a><br><br></div>@digimer:<br>Using local-io-error should power off the node and switch the cluster on the remaing node.... is this a good idea?<br><br></div>Regards,<br></div>Marco<br></div><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="h5">2016-09-19 12:58 GMT+02:00 Adam Goryachev <span dir="ltr"><<a href="mailto:adam@websitemanagers.com.au" target="_blank">adam@websitemanagers.com.au</a>></span>:<br></div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">
<div bgcolor="#FFFFFF" text="#000000"><span>
<p><br>
</p>
<br>
<div>On 19/09/2016 19:06, Marco Marino
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote">2016-09-19 10:50 GMT+02:00 Igor
Cicimov <span dir="ltr"><<a href="mailto:igorc@encompasscorporation.com" target="_blank">igorc@encompasscorporation.co<wbr>m</a>></span>:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<p dir="ltr"><span>On 19 Sep 2016 5:45 pm, "Marco
Marino" <<a href="mailto:marino.mrc@gmail.com" target="_blank">marino.mrc@gmail.com</a>>
wrote:<br>
><br>
> Hi, I'm trying to build an active/passive cluster
with drbd and pacemaker for a san. I'm using 2 nodes
with one raid controller (megaraid) on each one. Each
node has an ssd disk that works as cache for read (and
write?) realizing the CacheCade proprietary tecnology.
<br>
><br>
</span>
Did you configure the CacheCade? If the write cache was
enabled in write-back mode then suddenly removing the
device from under the controller would have caused
serious problems I guess since the controller expects to
write to the ssd cache firts and then flush to the
hdd's. Maybe this explains the read only mode?</p>
</blockquote>

Good point. It is exactly as you wrote. How can I mitigate this behavior in a clustered (active/passive) environment? As I said in the other post, I think the best solution is to power off the node via local-io-error and switch all resources to the other node, but please give me some suggestions.
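
Whatever the local-io-error handler does, the "switch all resources to the other node" part only works safely if Pacemaker fencing (STONITH) is in place, so the surviving node knows the failed one is really down. A minimal sketch using pcs with IPMI fence agents (node names, addresses and credentials below are placeholders, not from this setup):

    # One fence device per node, monitored by the cluster (placeholder values).
    pcs stonith create fence-node1 fence_ipmilan pcmk_host_list=node1 \
        ip=192.168.1.101 username=admin password=secret op monitor interval=60s
    pcs stonith create fence-node2 fence_ipmilan pcmk_host_list=node2 \
        ip=192.168.1.102 username=admin password=secret op monitor interval=60s
    pcs property set stonith-enabled=true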
<p dir="ltr"><span>> Basically, the structure
of the san is:<br>
><br>
> Physycal disks -> RAID -> Device /dev/sdb
in the OS -> Drbd resource (that use /dev/sdb as
backend) (using pacemaker with a master/slave
resource) -> VG (managed with pacemaker) ->
Iscsi target (with pacemaker) -> Iscsi LUNS (one
for each logical volume in the VG, managed with
pacemaker)<br>
><br>
> Few days ago, the ssd disk was wrongly removed
from the primary node of the cluster and this caused a
lot of problems: drbd resource and all logical volumes
went in readonly mode with a lot of I/O errors but the
cluster did not switched to the other node. All
filesystem on initiators went to readonly mode. There
are 2 problems involved here (I think): 1) Why
removing the ssd disk cause a readonly mode with I/O
errors? This means that the ssd is a single point of
failure for a single node san with megaraid
controllers and CacheCade tecnology..... and 2) Why
drbd not worked as espected?<br>
</span>
What was the state in /proc/drbd ?</p>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote></span>
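
(The ds: field in /proc/drbd is what answers that question. Illustrative lines only, not output from these nodes: on DRBD 8.x a healthy primary reports something like the first line below, while a primary that has detached a failed backing device and is serving I/O through the peer reports Diskless, as in the second.)

    0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    0: cs:Connected ro:Primary/Secondary ds:Diskless/UpToDate C r-----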

I think you will need to examine the logs to find out what happened. It would appear (just making a wild guess) that the caching is happening between DRBD and iSCSI instead of between DRBD and the RAID. If the error happened under DRBD, then DRBD should see the read/write error and should automatically fail the local storage. It wouldn't necessarily fail over to the secondary, but it would do all reads/writes from the secondary node. The fact that this didn't happen makes it look like the failure happened above DRBD.

At least that is my understanding of how it would work in that scenario.

Regards,
Adam
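
The behaviour Adam describes (failing the local storage and transparently reading/writing via the peer) corresponds to DRBD's on-io-error policy in the disk section. A minimal sketch, again assuming a resource named r0:

    resource r0 {
      disk {
        # On a backing-device I/O error, detach the local disk (go Diskless)
        # and keep serving reads/writes through the connected peer.
        on-io-error detach;
      }
    }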