<p dir="ltr">On 20 Sep 2016 5:00 pm, "Marco Marino" <<a href="mailto:marino.mrc@gmail.com">marino.mrc@gmail.com</a>> wrote:<br>
><br>
> Furthermore there are logs from the secondary node:<br>
><br>
> <a href="http://pastebin.com/A2ySXDCB">http://pastebin.com/A2ySXDCB</a><br>
><br>
><br>
> Please compare the timestamps. It seems that DRBD goes into diskless mode on the secondary node as well. Why?<br>
><br>
In the secondary log you can see I/O errors too:</p>
<p dir="ltr">Sep 7 19:55:19 iscsi2 kernel: end_request: I/O error, dev sdb, sector 685931856<br>
Sep 7 19:55:19 iscsi2 kernel: block drbd1: write: error=-5 s=685931856s<br>
Sep 7 19:55:19 iscsi2 kernel: block drbd1: disk( UpToDate -> Failed )<br>
Sep 7 19:55:19 iscsi2 kernel: block drbd1: Local IO failed in drbd_endio_write_sec_final. Detaching...</p>
<p dir="ltr">and since your policy is:</p>
<p dir="ltr">disk {<br>
on-io-error detach;<br>
}</p>
<p dir="ltr">that's exactly what DRBD did: it detached the backing device. No disk => no master.</p>
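<p dir="ltr">For comparison, the alternative you ask about further down would look roughly like the fragment below. It is only a sketch assuming DRBD 8.4 syntax: the resource name r0 and the handler command are placeholders I made up, not taken from your configuration, so check them against the drbd.conf man page for your version before using anything like this.</p>
<p dir="ltr">resource r0 {<br>
handlers {<br>
# placeholder command: power the node off via sysrq<br>
local-io-error "echo o &gt; /proc/sysrq-trigger";<br>
}<br>
disk {<br>
# run the local-io-error handler instead of detaching<br>
on-io-error call-local-io-error;<br>
}<br>
}</p>
<p dir="ltr">With a policy like this, an I/O error on the backing device takes the whole node down rather than leaving it diskless, so it only helps if the peer still has a healthy, UpToDate disk.</p>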
<p dir="ltr">><br>
><br>
> 2016-09-20 8:44 GMT+02:00 Marco Marino <<a href="mailto:marino.mrc@gmail.com">marino.mrc@gmail.com</a>>:<br>
>><br>
>> Hi, logs can be found here: <a href="http://pastebin.com/BGR33jN6">http://pastebin.com/BGR33jN6</a><br>
>><br>
>> @digimer:<br>
>> Using local-io-error should power off the node and switch the cluster over to the remaining node. Is this a good idea?<br>
>><br>
>> Regards,<br>
>> Marco<br>
>><br>
>> 2016-09-19 12:58 GMT+02:00 Adam Goryachev <<a href="mailto:adam@websitemanagers.com.au">adam@websitemanagers.com.au</a>>:<br>
>>><br>
>>><br>
>>><br>
>>> On 19/09/2016 19:06, Marco Marino wrote:<br>
>>>><br>
>>>><br>
>>>><br>
>>>> 2016-09-19 10:50 GMT+02:00 Igor Cicimov <<a href="mailto:igorc@encompasscorporation.com">igorc@encompasscorporation.com</a>>:<br>
>>>>><br>
>>>>> On 19 Sep 2016 5:45 pm, "Marco Marino" <<a href="mailto:marino.mrc@gmail.com">marino.mrc@gmail.com</a>> wrote:<br>
>>>>> ><br>
>>>>> > Hi, I'm trying to build an active/passive cluster with DRBD and Pacemaker for a SAN. I'm using 2 nodes with one RAID controller (MegaRAID) on each. Each node has an SSD that works as a read (and write?) cache, using the proprietary CacheCade technology.<br>
>>>>> ><br>
>>>>> Did you configure CacheCade? If the write cache was enabled in write-back mode, then suddenly removing the device from under the controller would, I guess, have caused serious problems, since the controller expects to write to the SSD cache first and then flush to the HDDs. Maybe this explains the read-only mode?<br>
>>>><br>
>>>> Good point. It is exactly as you wrote. How can I mitigate this behavior in a clustered (active/passive) environment? As I said in the other post, I think the best solution is to power off the node using local-io-error and switch all resources to the other node. But please give me some suggestions.<br>
>>>><br>
>>><br>
>>>> <br>
>>>>><br>
>>>>> > Basically, the structure of the SAN is:<br>
>>>>> ><br>
>>>>> > Physical disks -> RAID -> device /dev/sdb in the OS -> DRBD resource (which uses /dev/sdb as its backing device, managed by Pacemaker as a master/slave resource) -> VG (managed by Pacemaker) -> iSCSI target (managed by Pacemaker) -> iSCSI LUNs (one for each logical volume in the VG, managed by Pacemaker)<br>
>>>>> ><br>
>>>>> > A few days ago, the SSD was mistakenly removed from the primary node of the cluster, and this caused a lot of problems: the DRBD resource and all logical volumes went into read-only mode with a lot of I/O errors, but the cluster did not switch to the other node. All filesystems on the initiators went into read-only mode. There are 2 problems involved here (I think): 1) Why does removing the SSD cause read-only mode with I/O errors? This means that the SSD is a single point of failure for a single-node SAN with MegaRAID controllers and CacheCade technology. 2) Why did DRBD not work as expected?<br>
>>>>> What was the state in /proc/drbd ?<br>
>>>><br>
>>>><br>
>>> I think you will need to examine the logs to find out what happened. It would appear (just making a wild guess) that the caching is happening between DRBD and iSCSI instead of between DRBD and RAID. If the failure had happened underneath DRBD, then DRBD should have seen the read/write error and automatically failed the local storage. It wouldn't necessarily fail over to the secondary, but it would do all reads and writes via the secondary node. The fact that this didn't happen makes it look like the failure happened above DRBD.<br>
>>><br>
>>> At least that is my understanding of how it will work in that scenario.<br>
>>><br>
>>> Regards,<br>
>>> Adam<br>
>>><br>
>>> _______________________________________________<br>
>>> drbd-user mailing list<br>
>>> <a href="mailto:drbd-user@lists.linbit.com">drbd-user@lists.linbit.com</a><br>
>>> <a href="http://lists.linbit.com/mailman/listinfo/drbd-user">http://lists.linbit.com/mailman/listinfo/drbd-user</a><br>
>>><br>
>><br>
><br>
><br>
></p>