[Pacemaker] no failover if fencing device is unreachable (i.e. power loss)
Digimer
lists at alteeve.ca
Mon Aug 18 19:53:46 CEST 2014
On 18/08/14 01:50 PM, Felix Schrage wrote:
> Hi,
>
> I am building a two-node cluster running XenServer, pacemaker and DRBD. There's a problem when testing the failover by powering off the currently active node.
> When using the fence_xenapi agent, the resource ClusterIP is not moved to the 2nd node until the first node has been successfully shut down.
> However, because the XenAPI is unreachable when the machine is powered off, the 2nd node keeps trying to shut down the node and the resource is never moved.
>
> To check whether it's an error in the fence_xenapi agent, I tried fence_ipmilan, which works fine as long as the IPMI is reachable. When pulling the power cords from the machine,
> however, the behavior is the same as with the fence_xenapi agent.
> Am I missing an option which should be set? A timeout or a retry counter?
This is the expected behaviour. Being unable to connect to the fence
device (or failing to confirm the "off" action) cannot be treated as a
successful fence. Without a successful fence, the cluster cannot assume
that the peer is gone. To assume so would risk a split-brain, so the
cluster's only sane and safe option is to block.
For this reason, we always use switched PDUs as a backup fence method.
You can see how to configure this with STONITH levels:
http://clusterlabs.org/wiki/STONITH_Levels
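
As a rough sketch of what that could look like on top of the
fence_ipmilan setup quoted below: IPMI stays as level 1 and a switched
PDU is tried as level 2 when IPMI cannot be reached. The PDU device
names, the PDU addresses, the outlet numbers and the choice of
fence_apc_snmp are assumptions for illustration only; use whichever
agent and parameter names match your PDU and your fence-agents version.
The script only prints the commands unless PCS is set to "pcs".

```shell
#!/bin/sh
# Sketch only: add a switched PDU as a second fence level behind IPMI.
# Hypothetical values: pdu-fence-ftp1/2, pdu-0x.example.lan, outlet 1.
# Dry-run by default; run with PCS="pcs" to apply to the CIB file.
PCS="${PCS:-echo pcs}"

add_pdu_backup() {
    # One PDU-backed stonith device per node (agent choice is an assumption).
    $PCS -f ftp_ha_cluster stonith create pdu-fence-ftp1 fence_apc_snmp \
        pcmk_host_list=ftp-test01 ipaddr=pdu-01.example.lan port=1
    $PCS -f ftp_ha_cluster stonith create pdu-fence-ftp2 fence_apc_snmp \
        pcmk_host_list=ftp-test02 ipaddr=pdu-02.example.lan port=1

    # Level 1: IPMI (device names taken from the quoted configuration).
    $PCS -f ftp_ha_cluster stonith level add 1 ftp-test01 ipmi-fence-test-xen-01
    $PCS -f ftp_ha_cluster stonith level add 1 ftp-test02 ipmi-fence-test-xen-02

    # Level 2: PDU, only tried if every device in level 1 fails.
    $PCS -f ftp_ha_cluster stonith level add 2 ftp-test01 pdu-fence-ftp1
    $PCS -f ftp_ha_cluster stonith level add 2 ftp-test02 pdu-fence-ftp2
}

add_pdu_backup
```

The point of the second level is that the PDU stays powered and
reachable when the node (and with it the BMC) loses power, so the
surviving node can still confirm that the outlet is off and then
recover the resources.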
> Here's how I set up the cluster (fence_xenapi) using pcs:
>
> pcs cluster cib ftp_ha_cluster
> pcs -f ftp_ha_cluster resource create ClusterIP IPaddr2 ip=172.20.150.150 cidr_netmask=32 op monitor interval=20s
> pcs -f ftp_ha_cluster constraint location ClusterIP prefers ftp-test01=50
> pcs -f ftp_ha_cluster stonith create xenvm-fence-ftp1 fence_xenapi pcmk_host_list="ftp-test01" action="off" session_url="https://test-xen-01" port="ftp-test01" login="root" passwd="****" delay=15 op monitor interval=40s
> pcs -f ftp_ha_cluster stonith create xenvm-fence-ftp2 fence_xenapi pcmk_host_list="ftp-test02" action="off" session_url="https://test-xen-02" port="ftp-test02" login="root" passwd="****" delay=15 op monitor interval=40s
> pcs -f ftp_ha_cluster constraint location xenvm-fence-ftp1 prefers ftp-test01=-INFINITY
> pcs -f ftp_ha_cluster constraint location xenvm-fence-ftp2 prefers ftp-test02=-INFINITY
> pcs -f ftp_ha_cluster property set stonith-enabled=true
> pcs -f ftp_ha_cluster property set stonith-action=off
> pcs -f ftp_ha_cluster property set stonith-timeout=40s
> pcs -f ftp_ha_cluster property set no-quorum-policy=ignore
> pcs -f ftp_ha_cluster resource create Ping ocf:pacemaker:ping dampen="5s" multiplier="100" host_list="172.20.150.1 172.20.150.151 172.20.150.152" attempts="3" op monitor interval=20s
> pcs -f ftp_ha_cluster resource clone Ping
> pcs -f ftp_ha_cluster constraint location ClusterIP rule score=-INF not_defined pingd or pingd lte 0
> pcs -f ftp_ha_cluster constraint location ClusterIP rule score=pingd defined pingd
> pcs cluster cib-push ftp_ha_cluster
>
> For testing with fence_ipmilan, I replaced the corresponding lines with the following:
>
> pcs -f ftp_ha_cluster stonith create ipmi-fence-test-xen-01 fence_ipmilan pcmk_host_list="ftp-test01" action="off" ipaddr="test-xen-01-bmc.mercateo.lan" auth="password" login="admin" passwd="****" delay=15 op monitor interval=40s
> pcs -f ftp_ha_cluster stonith create ipmi-fence-test-xen-02 fence_ipmilan pcmk_host_list="ftp-test02" action="off" ipaddr="test-xen-02-bmc.mercateo.lan" auth="password" login="admin" passwd="****" delay=15 op monitor interval=40s
> pcs -f ftp_ha_cluster constraint location ipmi-fence-test-xen-01 prefers ftp-test01=-INFINITY
> pcs -f ftp_ha_cluster constraint location ipmi-fence-test-xen-02 prefers ftp-test02=-INFINITY
>
>
> The content of /etc/corosync/corosync.conf:
>
> compatibility: whitetank
>
> totem {
>     version: 2
>     secauth: off
>     threads: 0
>     interface {
>         ringnumber: 0
>         bindnetaddr: 192.168.199.0
>         mcastaddr: 226.94.1.1
>         mcastport: 5405
>         ttl: 1
>     }
> }
>
> logging {
>     fileline: off
>     to_stderr: no
>     to_logfile: yes
>     to_syslog: no
>     logfile: /var/log/cluster/corosync.log
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: AMF
>         debug: off
>     }
> }
>
> amf {
>     mode: disabled
> }
>
> service {
>     ver: 1
>     name: pacemaker
> }
>
> Any idea what could be missing/wrong?
>
> Kind regards,
>
> Felix
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?