Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hmm... then I have not understood this policy. I thought that the I/O
error happened only on the primary node, but it seems that all nodes
become diskless in this case. Why? Basically, I got an I/O error on the
primary node because I wrongly removed the SSD (CacheCade) disk. Why is
the secondary node affected as well?

And furthermore, using

local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";

will both nodes be shut down? And again, should I remove "on-io-error
detach;" if I use local-io-error? (See the configuration sketch at the
end of this thread.)

Thank you

2016-09-20 10:33 GMT+02:00 Igor Cicimov <igorc at encompasscorporation.com>:
> On 20 Sep 2016 5:00 pm, "Marco Marino" <marino.mrc at gmail.com> wrote:
> >
> > Furthermore, there are logs from the secondary node:
> >
> > http://pastebin.com/A2ySXDCB
> >
> > Please compare the timestamps. It seems that drbd goes into diskless
> > mode on the secondary node as well. Why?
>
> In the secondary log you can see I/O errors too:
>
> Sep 7 19:55:19 iscsi2 kernel: end_request: I/O error, dev sdb, sector 685931856
> Sep 7 19:55:19 iscsi2 kernel: block drbd1: write: error=-5 s=685931856s
> Sep 7 19:55:19 iscsi2 kernel: block drbd1: disk( UpToDate -> Failed )
> Sep 7 19:55:19 iscsi2 kernel: block drbd1: Local IO failed in drbd_endio_write_sec_final. Detaching...
>
> and since your policy is:
>
> disk {
>     on-io-error detach;
> }
>
> that's what drbd did. No disk => no master.
>
> > 2016-09-20 8:44 GMT+02:00 Marco Marino <marino.mrc at gmail.com>:
> >>
> >> Hi, logs can be found here: http://pastebin.com/BGR33jN6
> >>
> >> @digimer:
> >> Using local-io-error should power off the node and switch the cluster
> >> to the remaining node.... is this a good idea?
> >>
> >> Regards,
> >> Marco
> >>
> >> 2016-09-19 12:58 GMT+02:00 Adam Goryachev <adam at websitemanagers.com.au>:
> >>>
> >>> On 19/09/2016 19:06, Marco Marino wrote:
> >>>>
> >>>> 2016-09-19 10:50 GMT+02:00 Igor Cicimov <igorc at encompasscorporation.com>:
> >>>>>
> >>>>> On 19 Sep 2016 5:45 pm, "Marco Marino" <marino.mrc at gmail.com> wrote:
> >>>>> >
> >>>>> > Hi, I'm trying to build an active/passive cluster with drbd and
> >>>>> > pacemaker for a SAN. I'm using 2 nodes, each with one RAID
> >>>>> > controller (MegaRAID). Each node has an SSD disk that works as a
> >>>>> > cache for reads (and writes?) via the proprietary CacheCade
> >>>>> > technology.
> >>>>>
> >>>>> Did you configure the CacheCade? If the write cache was enabled in
> >>>>> write-back mode, then suddenly removing the device from under the
> >>>>> controller would have caused serious problems, I guess, since the
> >>>>> controller expects to write to the SSD cache first and then flush
> >>>>> to the HDDs. Maybe this explains the read-only mode?
> >>>>
> >>>> Good point. It is exactly as you wrote. How can I mitigate this
> >>>> behavior in a clustered (active/passive) environment? As I said in
> >>>> the other post, I think the best solution is to power off the node
> >>>> using local-io-error and switch all resources to the other node....
> >>>> But please give me some suggestions....
> >>>>> > Basically, the structure of the SAN is:
> >>>>> >
> >>>>> > Physical disks -> RAID -> device /dev/sdb in the OS -> drbd
> >>>>> > resource (that uses /dev/sdb as its backend, as a pacemaker
> >>>>> > master/slave resource) -> VG (managed with pacemaker) -> iSCSI
> >>>>> > target (with pacemaker) -> iSCSI LUNs (one for each logical
> >>>>> > volume in the VG, managed with pacemaker)
> >>>>> >
> >>>>> > A few days ago, the SSD disk was wrongly removed from the
> >>>>> > primary node of the cluster, and this caused a lot of problems:
> >>>>> > the drbd resource and all logical volumes went into read-only
> >>>>> > mode with a lot of I/O errors, but the cluster did not switch to
> >>>>> > the other node. All filesystems on the initiators went
> >>>>> > read-only. There are two problems involved here (I think):
> >>>>> > 1) Why does removing the SSD disk cause read-only mode with I/O
> >>>>> > errors? This means that the SSD is a single point of failure for
> >>>>> > a single-node SAN with MegaRAID controllers and CacheCade
> >>>>> > technology..... and 2) Why did drbd not work as expected?
> >>>>>
> >>>>> What was the state in /proc/drbd ?
> >>>
> >>> I think you will need to examine the logs to find out what
> >>> happened. It would appear (just making a wild guess) that the
> >>> caching is happening between DRBD and iSCSI instead of between DRBD
> >>> and the RAID. If it happened under DRBD, then DRBD should see the
> >>> read/write error and should automatically fail the local storage.
> >>> It wouldn't necessarily fail over to the secondary, but it would do
> >>> all reads/writes from the secondary node. The fact this didn't
> >>> happen makes it look like the failure happened above DRBD.
> >>>
> >>> At least that is my understanding of how it will work in that
> >>> scenario.
> >>>
> >>> Regards,
> >>> Adam
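To the question above of whether both nodes get shut down: DRBD invokes
the local-io-error handler only on the node whose backing device
reported the error, so the peer stays up and pacemaker can promote it.
The handler also only fires when the on-io-error policy is set to
call-local-io-error, so that setting replaces detach rather than being
combined with it. A minimal sketch of the relevant configuration,
assuming DRBD 8.4 syntax and a placeholder resource name r0:

    resource r0 {
      disk {
        # On a lower-level I/O error, run the local-io-error handler
        # instead of silently detaching from the backing device.
        on-io-error call-local-io-error;
      }
      handlers {
        # Executed only on the node that saw the error: send the
        # notifications, then hard power-off so the peer takes over.
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
      }
    }

Note that a clean takeover still depends on pacemaker fencing/stonith
being configured; the handler itself does nothing on the surviving node.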
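For Igor's question about the state in /proc/drbd, the standard
drbd-utils commands capture it (r0 is again a placeholder resource
name):

    # Connection state (cs), roles (ro) and disk states (ds) of all
    # resources, e.g. "cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate"
    cat /proc/drbd

    # Per-resource disk state (local/peer) and connection state
    drbdadm dstate r0
    drbdadm cstate r0

A detached resource shows up as ds:Diskless in /proc/drbd and as
Diskless/... in the drbdadm dstate output.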
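On mitigating the CacheCade write-back risk Igor describes: forcing the
logical drives to write-through means a lost SSD cache cannot hold
unflushed writes, at the cost of write performance. As an illustration
only, with MegaCli-style commands; the exact flags vary between
controller and utility versions, so verify them against your own
MegaCli documentation before use:

    # Show the current cache policy of every logical drive on every adapter
    MegaCli64 -LDGetProp -Cache -LAll -aAll

    # Force write-through on all logical drives
    MegaCli64 -LDSetProp WT -LAll -aAll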