[Pacemaker] can't get external/xen0 fencing to work on debian wheezy

Alexandre alxgomz at gmail.com
Thu Feb 6 16:54:25 EST 2014


Hi,

I am setting up a cluster with Corosync 1.4.2 (openais stack), Pacemaker
1.1.7 (running as a corosync plugin) and the fence-agents package,
version 3.1.5-2. All of these are the stock Debian packages for wheezy.
The cluster nodes are Xen virtual machines running on a single dom0.
The cluster works as I want, but now that I am trying to set up fencing
I am facing a problem.

I use external/xen0 as the fencing agent. It is installed on all nodes,
and I have set up each node's ssh public key on the dom0 so that root
can log in without any user interaction (tested from all hosts).
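
The ssh test was essentially the following, run from every node
(barneygumble is the dom0, as in the config further down; "xm list" is
just an arbitrary command to prove that non-interactive root access
works):

# must succeed from every node without any prompt, otherwise the
# external/xen0 agent cannot reach the dom0
ssh -o BatchMode=yes root@barneygumble "xm list"
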
The problem is that if I brutally break communication on one host, that
node moves to the offline/unclean state, but fencing then fails with a
timeout, as you can see in the logs below:

Feb  6 22:43:05 sanaoe01 pengine: [2553]: notice: unpack_rsc_op:
Preventing cln_shared_storage from re-starting on mx01: operation
monitor failed 'not installed' (rc=5)
Feb  6 22:43:05 sanaoe01 pengine: [2553]: notice: unpack_rsc_op:
Preventing pri_openldap from re-starting on mx01: operation monitor
failed 'not installed' (rc=5)
Feb  6 22:43:05 sanaoe01 pengine: [2553]: notice: unpack_rsc_op:
Preventing cln_aoe from re-starting on mx01: operation monitor failed
'not installed' (rc=5)
Feb  6 22:43:05 sanaoe01 pengine: [2553]: notice: unpack_rsc_op:
Preventing pri_openldap from re-starting on ms02: operation monitor
failed 'not installed' (rc=5)
Feb  6 22:43:05 sanaoe01 pengine: [2553]: notice: unpack_rsc_op:
Preventing cln_aoe from re-starting on ms02: operation monitor failed
'not installed' (rc=5)
Feb  6 22:43:05 sanaoe01 pengine: [2553]: WARN: custom_action: Action
pri_dlm:1_stop_0 on ms02 is unrunnable (offline)
Feb  6 22:43:05 sanaoe01 pengine: [2553]: WARN: custom_action: Marking
node ms02 unclean
Feb  6 22:43:05 sanaoe01 pengine: [2553]: WARN: custom_action: Action
pri_clvmd:1_stop_0 on ms02 is unrunnable (offline)
Feb  6 22:43:05 sanaoe01 pengine: [2553]: WARN: custom_action: Marking
node ms02 unclean
Feb  6 22:43:05 sanaoe01 pengine: [2553]: WARN: custom_action: Action
pri_spamassassin:1_stop_0 on ms02 is unrunnable (offline)
Feb  6 22:43:05 sanaoe01 pengine: [2553]: WARN: custom_action: Marking
node ms02 unclean
Feb  6 22:43:05 sanaoe01 pengine: [2553]: WARN: custom_action: Action
pri_dovecot:1_stop_0 on ms02 is unrunnable (offline)
Feb  6 22:43:05 sanaoe01 pengine: [2553]: WARN: custom_action: Marking
node ms02 unclean
Feb  6 22:43:05 sanaoe01 pengine: [2553]: WARN: custom_action: Action
pri_exim4:6_stop_0 on ms02 is unrunnable (offline)
Feb  6 22:43:05 sanaoe01 pengine: [2553]: WARN: custom_action: Marking
node ms02 unclean
Feb  6 22:43:05 sanaoe01 pengine: [2553]: WARN: stage6: Scheduling
Node ms02 for STONITH
Feb  6 22:43:05 sanaoe01 pengine: [2553]: notice: LogActions: Stop
pri_dlm:1#011(ms02)
Feb  6 22:43:05 sanaoe01 pengine: [2553]: notice: LogActions: Stop
pri_clvmd:1#011(ms02)
Feb  6 22:43:05 sanaoe01 pengine: [2553]: notice: LogActions: Stop
pri_spamassassin:1#011(ms02)
Feb  6 22:43:05 sanaoe01 pengine: [2553]: notice: LogActions: Stop
pri_dovecot:1#011(ms02)
Feb  6 22:43:05 sanaoe01 pengine: [2553]: notice: LogActions: Stop
pri_exim4:6#011(ms02)
Feb  6 22:43:05 sanaoe01 pengine: [2553]: WARN: process_pe_message:
Transition 562: WARNINGs found during PE processing. PEngine Input
stored in: /var/lib/pengine/pe-warn-0.bz2
Feb  6 22:43:05 sanaoe01 pengine: [2553]: notice: process_pe_message:
Configuration WARNINGs found during PE processing.  Please run
"crm_verify -L" to identify issues.
Feb  6 22:43:05 sanaoe01 crmd: [2554]: notice: do_state_transition:
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [
input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Feb  6 22:43:05 sanaoe01 crmd: [2554]: info: do_te_invoke: Processing
graph 562 (ref=pe_calc-dc-1391722985-1174) derived from
/var/lib/pengine/pe-warn-0.bz2
Feb  6 22:43:05 sanaoe01 crmd: [2554]: notice: te_fence_node:
Executing reboot fencing operation (141) on ms02 (timeout=60000)
Feb  6 22:43:05 sanaoe01 stonith-ng: [2550]: info:
initiate_remote_stonith_op: Initiating remote operation reboot for
ms02: 47e0bc91-ddda-4f5a-b09d-f360f911a8af
Feb  6 22:43:05 sanaoe01 stonith-ng: [2550]: info: stonith_command:
Processed st_query from sanaoe01: rc=0
Feb  6 22:43:11 sanaoe01 stonith-ng: [2550]: ERROR: remote_op_done:
Operation reboot of ms02 by <no-one> for
sanaoe01[df395707-b30d-4b3e-a35d-bc3bd2e2f78f]: Operation timed out
Feb  6 22:43:11 sanaoe01 crmd: [2554]: info: tengine_stonith_callback:
StonithOp <remote-op state="0" st_target="ms02" st_op="reboot" />
Feb  6 22:43:11 sanaoe01 crmd: [2554]: notice:
tengine_stonith_callback: Stonith operation 533 for ms02 failed
(Operation timed out): aborting transition.
Feb  6 22:43:11 sanaoe01 crmd: [2554]: info: abort_transition_graph:
tengine_stonith_callback:454 - Triggered transition abort (complete=0)
: Stonith failed
Feb  6 22:43:11 sanaoe01 crmd: [2554]: notice: tengine_stonith_notify:
Peer ms02 was not terminated (reboot) by <anyone> for sanaoe01:
Operation timed out (ref=47e0bc91-ddda-4f5a-b09d-f360f911a8af)
Feb  6 22:43:11 sanaoe01 crmd: [2554]: notice: run_graph: ====
Transition 562 (Complete=2, Pending=0, Fired=0, Skipped=16,
Incomplete=3, Source=/var/lib/pengine/pe-warn-0.bz2): Stopped
Feb  6 22:43:11 sanaoe01 crmd: [2554]: notice: do_state_transition:
State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [
input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Feb  6 22:43:11 sanaoe01 pengine: [2553]: WARN: pe_fence_node: Node
ms02 will be fenced because it is un-expectedly down
Feb  6 22:43:11 sanaoe01 pengine: [2553]: WARN:
determine_online_status: Node ms02 is unclean
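
The "by <no-one>" part of the timeout message makes me think that no
device was ever selected to carry out the reboot. I suppose the stonith
layer and the agent could be poked at directly with something along
these lines (parameters copied from my config below; I am not 100% sure
of the exact option syntax), but I would first like to understand what
the cluster itself is doing:

# list all registered stonith devices, then those claiming to be able
# to fence ms02
stonith_admin --list-registered
stonith_admin --list ms02

# exercise the agent completely outside pacemaker (cluster-glue stonith CLI)
stonith -t external/xen0 hostlist="ms02:/etc/xen/ms02" dom0="barneygumble" -T reset ms02
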


My crm configuration is shown below:

node dir01 \
        attributes standby="off"
node dir02
node ms01 \
        attributes standby="off"
node ms02 \
        attributes standby="off"
node mta01
node mta02
node mx01
node sanaoe01 \
        attributes standby="off"
node sanaoe02 \
        attributes standby="false"
primitive dom0Fence stonith:external/xen0 \
        params hostlist="ms02:/etc/xen/ms02" dom0="barneygumble" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="45" \
        op monitor interval="3600" timeout="30"
xml <primitive class="ocf" id="pri_aoe1" provider="heartbeat"
type="AoEtarget"> \
        <instance_attributes id="pri_aoe1.1-instance_attributes"> \
                <rule id="node-sanaoe01" score="1"> \
                        <expression attribute="#uname"
id="expr-node-sanaoe01" operation="eq" value="sanaoe01"/> \
                </rule> \
                <nvpair id="pri_aoe1.1-instance_attributes-device"
name="device" value="/dev/xvdb"/> \
                <nvpair id="pri_aoe1.1-instance_attributes-nic"
name="nic" value="eth0"/> \
                <nvpair id="pri_aoe1.1-instance_attributes-shelf"
name="shelf" value="1"/> \
                <nvpair id="pri_aoe1.1-instance_attributes-slot"
name="slot" value="1"/> \
        </instance_attributes> \
        <instance_attributes id="pri_aoe2.1-instance_attributes"> \
                <rule id="node-sanaoe02" score="2"> \
                        <expression attribute="#uname"
id="expr-node-sanaoe2" operation="eq" value="sanaoe02"/> \
                </rule> \
                <nvpair id="pri_aoe2.1-instance_attributes-device"
name="device" value="/dev/xvdb"/> \
                <nvpair id="pri_aoe2.1-instance_attributes-nic"
name="nic" value="eth1"/> \
                <nvpair id="pri_aoe2.1-instance_attributes-shelf"
name="shelf" value="2"/> \
                <nvpair id="pri_aoe2.1-instance_attributes-slot"
name="slot" value="1"/> \
        </instance_attributes> \
        <meta_attributes id="pri_aoe1.1-meta_attributes"> \
                <nvpair id="pri_aoe1.1-meta_attributes-target-role"
name="target-role" value="Started"/> \
        </meta_attributes> \
        <meta_attributes id="pri_aoe2.1-meta_attributes"> \
                <nvpair id="pri_aoe2.1-meta_attributes-target-role"
name="target-role" value="Started"/> \
        </meta_attributes> \
</primitive>
primitive pri_clvmd lsb:clvm \
        op monitor interval="20" timeout="20" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
primitive pri_dlm ocf:pacemaker:controld \
        op monitor interval="120" timeout="30" \
        op start interval="0" timeout="90" \
        op stop interval="0" timeout="100"
primitive pri_dovecot lsb:dovecot \
        op start interval="0" timeout="20" \
        op stop interval="0" timeout="30" \
        op monitor interval="5" timeout="10"
primitive pri_exim4 lsb:exim4 \
        op start interval="0" timeout="20" \
        op stop interval="0" timeout="30" \
        op monitor interval="5" timeout="10"
primitive pri_openldap lsb:slapd \
        op monitor interval="10" timeout="30" \
        op start interval="0" timeout="90" \
        op stop interval="0" timeout="100"
primitive pri_spamassassin lsb:spamassassin \
        op start interval="0" timeout="50" \
        op stop interval="0" timeout="60" \
        op monitor interval="5" timeout="20"
group grp_aoe pri_aoe1
group grp_cluster_volumes pri_dlm pri_clvmd
group grp_mail_delivery pri_spamassassin pri_dovecot
group grp_mta pri_exim4
clone cln_aoe grp_aoe \
        meta ordered="true" interleave="true" clone-max="2"
clone cln_mail_delivery grp_mail_delivery \
        meta ordered="false" interleave="true" clone-max="2"
clone cln_mta grp_mta \
        meta ordered="false" interleave="true"
clone cln_shared_storage grp_cluster_volumes \
        meta ordered="false" interleave="true" clone-max="2"
location LOC_AOE_ETHERD1 cln_aoe inf: sanaoe01
location LOC_AOE_ETHERD2 cln_aoe inf: sanaoe02
location LOC_CLUSTER-STORAGE1 cln_shared_storage inf: ms01
location LOC_CLUSTER-STORAGE2 cln_shared_storage inf: ms02
location LOC_MAIL-STORE1 cln_mail_delivery inf: ms01
location LOC_MAIL-STORE2 cln_mail_delivery inf: ms02
location LOC_MTA01 cln_mta inf: ms01
location LOC_MTA02 cln_mta inf: ms02
location LOC_MTA03 cln_mta inf: dir01
location LOC_MTA04 cln_mta inf: dir02
location LOC_MTA05 cln_mta inf: sanaoe01
location LOC_MTA06 cln_mta inf: sanaoe02
location LOC_MTA07 cln_mta inf: mta01
location LOC_MTA08 cln_mta inf: mta02
location LOC_MTA09 cln_mta inf: mx01
location LOC_MTA10 cln_mta inf: mx02
location LOC_OPENLDAP1 pri_openldap 200: dir01
location LOC_OPENLDAP2 pri_openldap 100: dir02
location cli-prefer-pri_openldap pri_openldap \
        rule $id="cli-prefer-rule-pri_openldap" inf: #uname eq dir01
order ORD_vols-then-mailstore inf: cln_shared_storage:start cln_mail_delivery:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="9" \
        stonith-enabled="true" \
        last-lrm-refresh="1391557394" \
        no-quorum-policy="freeze" \
        symmetric-cluster="false"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100" \
        migration-threshold="3" \
        failure-timeout="600"


On the dom0 host, I can't see any ssh connection attempts.
I feel stuck, as I have no idea what the issue could be here, and I
don't know how to debug it any further.
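
The only other checks I can think of are along these lines (just a
sketch of what I would run next):

# is the dom0Fence resource actually started anywhere?
crm_mon -1 | grep -B1 -A1 dom0Fence

# configuration warnings, as the pengine log message suggests
crm_verify -L -V

# any trace of stonith-ng invoking the agent, checked on every node
grep stonith-ng /var/log/syslog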



