Hi,

I'm trying to build an active/passive cluster with DRBD and Pacemaker for a SAN. I'm using 2 nodes with one RAID controller (MegaRAID) on each. Each node has an SSD that works as a read (and write?) cache, implementing the proprietary CacheCade technology.

Basically, the structure of the SAN is:

Physical disks
  -> RAID
  -> Device /dev/sdb in the OS
  -> DRBD resource (that uses /dev/sdb as its backing device; managed by Pacemaker as a master/slave resource)
  -> VG (managed by Pacemaker)
  -> iSCSI target (managed by Pacemaker)
  -> iSCSI LUNs (one for each logical volume in the VG, managed by Pacemaker)

A few days ago, the SSD was mistakenly removed from the primary node of the cluster, and this caused a lot of problems: the DRBD resource and all logical volumes went into read-only mode with a lot of I/O errors, but the cluster did not switch to the other node. All filesystems on the initiators went into read-only mode.

There are 2 problems involved here (I think):

1) Why does removing the SSD cause read-only mode with I/O errors? This would mean that the SSD is a single point of failure for a single-node SAN with MegaRAID controllers and CacheCade technology.

2) Why did DRBD not work as expected?

For point 1) I'm checking with the vendor, and I doubt that I can do anything about it.

For point 2) I have errors in the DRBD configuration. My idea is that when an I/O error happens on the primary node, the cluster should switch to the secondary node and shut down the damaged node.

The current DRBD configuration can be seen here -> http://pastebin.com/79dDK66m but I need to change a lot of things and I want to share my ideas here:

1) The "handlers" section should be moved into the "common" section of global_common.conf instead of staying in the resource file.

2) I'm thinking of modifying the "handlers" section as follows:

    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        # Hook into Pacemaker's fencing.
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    }

This way, when an I/O error happens, the node will be powered off and Pacemaker will switch the resources to the other node (or at least it won't cause problematic behavior...).

3) I'm thinking of moving the "fencing" directive from the resource file to global_common.conf. Furthermore, I want to change it to:

    fencing resource-and-stonith;

4) Finally, in the global "net" section I need to add:

    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;

At the end of this work the configuration will be -> http://pastebin.com/r3N1gzwx

Please give me suggestions about mistakes and possible changes.

Thank you
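
For reference, points 1) to 4) above could be combined in global_common.conf roughly as sketched below. This is only an illustration assembled from the directives quoted in this message, not the content of the linked pastebin. It assumes DRBD 8.4 syntax, where "fencing" is a disk option (DRBD 9 moved it to the net section). The "on-io-error" line is an assumption that does not appear in the post: as far as I know, the local-io-error handler is only invoked when on-io-error is set to call-local-io-error.

```
# global_common.conf (sketch, DRBD 8.4 syntax assumed)
common {
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        # Hook into Pacemaker's fencing.
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    }

    disk {
        # Point 3: fencing policy moved here from the resource file.
        # Assumption: DRBD 8.4 placement; under DRBD 9 this belongs in "net".
        fencing resource-and-stonith;

        # Assumption, not in the original post: makes DRBD call the
        # local-io-error handler when the backing device reports an error.
        on-io-error call-local-io-error;
    }

    net {
        # Point 4: automatic split-brain recovery policies.
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
}
```

After editing, running "drbdadm dump" on both nodes should show whether the file parses and which values each resource actually inherits from the common section.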