Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, logs can be found here: http://pastebin.com/BGR33jN6

@digimer: Using local-io-error should power off the node and switch the
cluster to the remaining node.... is this a good idea?

Regards,
Marco

2016-09-19 12:58 GMT+02:00 Adam Goryachev <adam at websitemanagers.com.au>:
>
> On 19/09/2016 19:06, Marco Marino wrote:
>
> 2016-09-19 10:50 GMT+02:00 Igor Cicimov <igorc at encompasscorporation.com>:
>
>> On 19 Sep 2016 5:45 pm, "Marco Marino" <marino.mrc at gmail.com> wrote:
>> >
>> > Hi, I'm trying to build an active/passive cluster with DRBD and
>> > Pacemaker for a SAN. I'm using 2 nodes, each with one RAID controller
>> > (MegaRAID). Each node has an SSD disk that works as a cache for reads
>> > (and writes?) using the proprietary CacheCade technology.
>> >
>> Did you configure CacheCade? If the write cache was enabled in
>> write-back mode, then suddenly removing the device from under the
>> controller would have caused serious problems, I guess, since the
>> controller expects to write to the SSD cache first and then flush to
>> the HDDs. Maybe this explains the read-only mode?
>>
> Good point. It is exactly as you wrote. How can I mitigate this behavior
> in a clustered (active/passive) environment? As I said in the other post,
> I think the best solution is to power off the node using local-io-error
> and switch all resources to the other node.... But please give me some
> suggestions....
>
>> > Basically, the structure of the SAN is:
>> >
>> > Physical disks -> RAID -> device /dev/sdb in the OS -> DRBD resource
>> > (that uses /dev/sdb as its backing device, managed by Pacemaker as a
>> > master/slave resource) -> VG (managed with Pacemaker) -> iSCSI target
>> > (with Pacemaker) -> iSCSI LUNs (one for each logical volume in the
>> > VG, managed with Pacemaker)
>> >
>> > A few days ago, the SSD disk was wrongly removed from the primary
>> > node of the cluster and this caused a lot of problems: the DRBD
>> > resource and all logical volumes went into read-only mode with a lot
>> > of I/O errors, but the cluster did not switch to the other node. All
>> > filesystems on the initiators went read-only. There are 2 problems
>> > involved here (I think): 1) Why does removing the SSD disk cause
>> > read-only mode with I/O errors? This means that the SSD is a single
>> > point of failure for a single-node SAN with MegaRAID controllers and
>> > CacheCade technology..... and 2) Why did DRBD not work as expected?
>>
>> What was the state in /proc/drbd?
>>
> I think you will need to examine the logs to find out what happened. It
> would appear (just making a wild guess) that the caching is happening
> between DRBD and iSCSI instead of between DRBD and the RAID. If it
> happened under DRBD, then DRBD should see the read/write error and
> should automatically fail the local storage. It wouldn't necessarily
> fail over to the secondary, but it would do all reads/writes from the
> secondary node. The fact this didn't happen makes it look like the
> failure happened above DRBD.
>
> At least that is my understanding of how it will work in that scenario.
>
> Regards,
> Adam
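
For reference, a minimal sketch of the drbd.conf pieces being discussed
(assuming DRBD 8.4 syntax; the resource name "r0", hostnames "san1"/"san2"
and the addresses below are placeholders, not values taken from this thread):

    resource r0 {
      disk {
        # "detach" drops a failing backing device and keeps serving I/O
        # diskless from the peer over the replication link.
        on-io-error detach;
        # Alternatively, escalate the local error to the handler below:
        # on-io-error call-local-io-error;
      }
      handlers {
        # Invoked only with "on-io-error call-local-io-error;".
        # Powering the node off lets Pacemaker promote the peer and move
        # the iSCSI target there, which is the approach proposed above.
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
      }
      on san1 {
        device    /dev/drbd0;
        disk      /dev/sdb;
        address   10.0.0.1:7788;
        meta-disk internal;
      }
      on san2 {
        device    /dev/drbd0;
        disk      /dev/sdb;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }

Either setting only helps if the error is actually visible to DRBD; as Adam
points out, a failure that happens above DRBD (e.g. in the iSCSI layer) will
not trigger it.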
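
And a rough outline of how the stack described above (DRBD master/slave ->
VG -> iSCSI target -> LUNs) is typically wired up with pcs, so that the whole
group follows the DRBD master on failover. This is only a sketch: the resource
names, IQN, VG and LV names are made-up placeholders and the agent options are
not complete:

    pcs resource create drbd_r0 ocf:linbit:drbd drbd_resource=r0 \
        op monitor interval=30s
    pcs resource master drbd_r0_ms drbd_r0 master-max=1 master-node-max=1 \
        clone-max=2 clone-node-max=1 notify=true
    pcs resource create san_vg ocf:heartbeat:LVM volgrpname=vg_san exclusive=true
    pcs resource create san_tgt ocf:heartbeat:iSCSITarget iqn=iqn.2016-09.local.san:tgt1
    pcs resource create san_lun1 ocf:heartbeat:iSCSILogicalUnit \
        target_iqn=iqn.2016-09.local.san:tgt1 lun=1 path=/dev/vg_san/lv1
    pcs resource group add san_group san_vg san_tgt san_lun1
    pcs constraint colocation add san_group with master drbd_r0_ms INFINITY
    pcs constraint order promote drbd_r0_ms then start san_group

With the colocation and order constraints in place, powering off (or fencing)
the node that hit the I/O error should result in the peer being promoted and
the VG, target and LUNs following it.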