Hello,

I am building a true no-SPOF network/storage cluster that consists of two storage servers, named SAN-n1 and SAN-n2, with the following config: three ethernet interfaces (Management, Storage, and Xover).

The DRBD resource looks like:

resource rsdb1 {
    device /dev/drbd0;
    disk /dev/sdb1;
    meta-disk internal;
    on san01-n1 {
        address 10.0.0.1:7789;   # Use 10Gb Xover
    }
    on san01-n2 {
        address 10.0.0.2:7789;   # Use 10Gb Xover
    }
}

As configured, DRBD connects to its peer over a 10G crossover. Each SAN node connects on eth3 up to its own respective switch, which are then stacked with dual 10G stack cables. All clients of this SAN also connect to both switches with bonded ethernet interfaces. ANY component can fail, and the storage unit will stay online.

First, my CIB:

node san01-n1
node san01-n2
primitive drbd_disk ocf:linbit:drbd \
    params drbd_resource="rsdb1" \
    op monitor interval="9s" role="Master" \
    op monitor interval="11s" role="Slave"
primitive ip_mgmt ocf:heartbeat:IPaddr2 \
    params ip="172.16.5.10" cidr_netmask="24" \
    op monitor interval="10s"
primitive ip_storage ocf:heartbeat:IPaddr2 \
    params ip="172.16.10.10" cidr_netmask="24" \
    op monitor interval="10s"
primitive lvm_nfs ocf:heartbeat:LVM \
    params volgrpname="vg_vmstore" \
    op monitor interval="10s" timeout="30s" depth="0" \
    op start interval="0" timeout="30s" \
    op stop interval="0" timeout="30s"
primitive res_iSCSILogicalUnit_1 ocf:heartbeat:iSCSILogicalUnit \
    params target_iqn="iqn.2012-01.com.nfinausa:san01" lun="1" path="/dev/vg_vmstore/lv_vmstore" \
    operations $id="res_iSCSILogicalUnit_1-operations" \
    op start interval="0" timeout="10" \
    op stop interval="0" timeout="10" \
    op monitor interval="10" timeout="10" start-delay="0"
primitive res_iSCSITarget_p_iscsitarget ocf:heartbeat:iSCSITarget \
    params implementation="tgt" iqn="iqn.2012-01.com.nfinausa:san01" tid="1" \
    operations $id="res_iSCSITarget_p_iscsitarget-operations" \
    op start interval="0" timeout="10" \
    op stop interval="0" timeout="10" \
    op monitor interval="10" timeout="10" start-delay="0"
group rg_vmstore lvm_nfs ip_storage ip_mgmt res_iSCSITarget_p_iscsitarget res_iSCSILogicalUnit_1 \
    meta target-role="started"
ms ms_drbd_disk drbd_disk \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location cli-standby-rg_vmstore rg_vmstore \
    rule $id="cli-standby-rule-rg_vmstore" -inf: #uname eq san01-n2
colocation colo_drbd_with_lvm inf: rg_vmstore ms_drbd_disk:Master
order o_drbd_bef_nfs inf: ms_drbd_disk:promote rg_vmstore:start
property $id="cib-bootstrap-options" \
    dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    stonith-enabled="false" \
    no-quorum-policy="ignore" \
    last-lrm-refresh="1343691857"

DRBD is controlled by Pacemaker. Also, in the global_common config for DRBD, I have the fence-peer handlers configured:

    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";

and:

    disk {
        fencing resource-only;
        resync-rate 2000M;
        on-io-error detach;
        c-max-rate 2000M;
        # we use RAID battery-backed cache + SSD cache, so don't cache
        disk-flushes no;
        md-flushes no;
    }

The failure mode which is not being handled properly by this configuration is the failure of the storage network interface on the Primary (relative to DRBD) node. DRBD communicates over the xover connection, so it remains in state Connected:Primary/Secondary if the storage network interface on the primary server dies.
What I see on the secondary unit is an error promoting DRBD to primary, stating that there can only be one primary. So my question is: how do I demote the primary to secondary when this failure mode occurs on either SAN node? I don't see any demotion logic in the DRBD resource agent.

--
Regards,
Nik
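P.S. To help frame what I am after: the only idea I have come up with so far is to add a connectivity monitor and tie the Master role to it, roughly like the sketch below. This is untested and only a guess at the approach; ocf:pacemaker:ping is the stock agent, but the resource/constraint names and the 172.16.10.1 ping target (meant to be something reachable only via the storage network, e.g. its gateway) are placeholders, not taken from my config above.

# ping a host on the storage network; the agent records the result in the "pingd" node attribute
primitive p_ping_storage ocf:pacemaker:ping \
    params host_list="172.16.10.1" multiplier="1000" dampen="5s" \
    op monitor interval="10s" timeout="20s"
# run the connectivity check on both SAN nodes
clone cl_ping_storage p_ping_storage
# forbid the Master role on a node that cannot reach the storage network
location loc_master_on_connected_node ms_drbd_disk \
    rule $role="Master" -inf: not_defined pingd or pingd lte 0

The intent is that when the storage NIC on the current Primary dies, its pingd attribute goes to 0, the Master role becomes ineligible there, and Pacemaker demotes it and promotes the peer.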