Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Furthermore, there are logs from the secondary node: http://pastebin.com/A2ySXDCB
Please compare the timestamps. It seems that DRBD goes into diskless mode on the
secondary node as well. Why?

2016-09-20 8:44 GMT+02:00 Marco Marino <marino.mrc at gmail.com>:

> Hi, logs can be found here: http://pastebin.com/BGR33jN6
>
> @digimer:
> Using local-io-error should power off the node and switch the cluster to
> the remaining node... is this a good idea?
>
> Regards,
> Marco
>
> 2016-09-19 12:58 GMT+02:00 Adam Goryachev <adam at websitemanagers.com.au>:
>
>> On 19/09/2016 19:06, Marco Marino wrote:
>>
>> 2016-09-19 10:50 GMT+02:00 Igor Cicimov <igorc at encompasscorporation.com>:
>>
>>> On 19 Sep 2016 5:45 pm, "Marco Marino" <marino.mrc at gmail.com> wrote:
>>> >
>>> > Hi, I'm trying to build an active/passive cluster with DRBD and
>>> > pacemaker for a SAN. I'm using 2 nodes, each with one RAID controller
>>> > (MegaRAID). Each node has an SSD that works as a cache for reads (and
>>> > writes?), implementing the proprietary CacheCade technology.
>>> >
>>> Did you configure CacheCade? If the write cache was enabled in
>>> write-back mode, then suddenly removing the device from under the
>>> controller would have caused serious problems, I guess, since the
>>> controller expects to write to the SSD cache first and then flush to
>>> the HDDs. Maybe this explains the read-only mode?
>>>
>> Good point. It is exactly as you wrote. How can I mitigate this behavior
>> in a clustered (active/passive) environment? As I said in the other post,
>> I think the best solution is to power off the node using local-io-error
>> and switch all resources to the other node... But please give me some
>> suggestions...
>>
>>> > Basically, the structure of the SAN is:
>>> >
>>> > Physical disks -> RAID -> device /dev/sdb in the OS -> DRBD resource
>>> > (that uses /dev/sdb as its backing device, managed by pacemaker as a
>>> > master/slave resource) -> VG (managed with pacemaker) -> iSCSI target
>>> > (with pacemaker) -> iSCSI LUNs (one for each logical volume in the
>>> > VG, managed with pacemaker)
>>> >
>>> > A few days ago, the SSD was wrongly removed from the primary node of
>>> > the cluster, and this caused a lot of problems: the DRBD resource and
>>> > all logical volumes went into read-only mode with a lot of I/O
>>> > errors, but the cluster did not switch to the other node. All
>>> > filesystems on the initiators went into read-only mode. There are 2
>>> > problems involved here (I think): 1) Why does removing the SSD cause
>>> > read-only mode with I/O errors? This means that the SSD is a single
>>> > point of failure for a single-node SAN with MegaRAID controllers and
>>> > CacheCade technology... and 2) Why did DRBD not work as expected?
>>>
>>> What was the state in /proc/drbd?
>>>
>> I think you will need to examine the logs to find out what happened. It
>> would appear (just making a wild guess) that the caching is happening
>> between DRBD and iSCSI instead of between DRBD and the RAID. If it
>> happened under DRBD, then DRBD should see the read/write error and
>> should automatically fail the local storage. It wouldn't necessarily
>> fail over to the secondary, but it would do all reads/writes from the
>> secondary node. The fact that this didn't happen makes it look like the
>> failure happened above DRBD.
>>
>> At least, that is my understanding of how it would work in that scenario.
>>
>> Regards,
>> Adam
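
On the diskless question at the top of the thread: the disk and connection
states of a DRBD resource can be checked with the standard DRBD tools. A
node whose backing device has been detached reports "Diskless" as its local
disk state. The resource name "r0" below is an assumption for illustration:

    cat /proc/drbd        # overall status, e.g. ds:UpToDate/UpToDate
    drbdadm dstate r0     # local/peer disk state, e.g. Diskless/UpToDate
    drbdadm cstate r0     # connection state, e.g. Connected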
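On the local-io-error idea: a minimal sketch of the handler approach
discussed above, assuming DRBD 8.4 configuration syntax; the resource name
"r0" and the exact shutdown command are illustrative assumptions, not taken
from Marco's configuration:

    resource r0 {
      disk {
        # Run the local-io-error handler on a backing-device failure,
        # rather than detaching into diskless mode.
        on-io-error call-local-io-error;
      }
      handlers {
        # Power the node off immediately so pacemaker promotes the peer.
        local-io-error "echo o > /proc/sysrq-trigger";
      }
    }

Note that hard-killing the node this way relies on working cluster fencing,
otherwise the surviving node may refuse to promote.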
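On the CacheCade write-back point: with LSI/Avago MegaRAID controllers the
write policy of the logical drives can be inspected and forced to
write-through with MegaCli. A hedged sketch, assuming the MegaCli64 binary
and that all logical drives on all adapters should be changed:

    MegaCli64 -LDGetProp -Cache -LAll -aAll   # show current cache policy
    MegaCli64 -LDSetProp WT -LAll -aAll       # force write-through (WT)

Write-through removes the SSD cache from the write path, so a failed cache
device is no longer a single point of failure there, at the cost of write
latency.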
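And for the master/slave DRBD resource in the stack described above, a
minimal crm-shell sketch of the pacemaker side; the resource names
(p_drbd_r0, ms_drbd_r0, r0) are assumptions for illustration:

    primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
    ms ms_drbd_r0 p_drbd_r0 \
        meta master-max=1 clone-max=2 clone-node-max=1 notify=true

The VG, iSCSI target, and LUN resources would then be grouped and ordered
after the Master role, so that a failover (e.g. one triggered by the
local-io-error handler) moves the whole stack to the surviving node.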