[Pacemaker] newb - stonith not working - require others to stonith node

Brett Lee brettlee at yahoo.com
Fri Jun 29 11:43:24 EDT 2012


Hi - 


I am new to Pacemaker and now have a shiny new configuration that will not STONITH. This is a test system using KVM and external/libvirt; all VMs are running CentOS 5.

I am (really) hoping someone might be willing to help troubleshoot this configuration. Thank you for your time and effort!



The items that look suspect to me are:
1.  st-nodes has no 'location' entry (see the sketch after this list)
2.  logs report node_list=
3.  resource st-nodes is Stopped
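
On item 1: if a location constraint is actually needed for the stonith resource, I am guessing it would look something like the following (the constraint id and score are made up, and this is not in my current configuration):

location loc-st-nodes st-nodes 100: st15-mds2

Please correct me if cloning st-nodes is the preferred approach instead.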

I have attached a clip of the configuration below. The full configuration and log file can be found at http://pastebin.com/bS87FXUr

Per 'stonith -t external/libvirt -h' I have configured stonith using:

primitive st-nodes stonith:external/libvirt \
        params hostlist="st15-mds1,st15-mds2,st15-oss1,st15-oss2" hypervisor_uri="qemu+ssh://wc0008/system" stonith-timeout="30" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="60"
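
Before involving Pacemaker, I can also test the plugin by hand. Assuming I have the syntax right, something like this (run from st15-mds2) should reset a node directly through libvirt:

stonith -t external/libvirt hostlist="st15-mds1,st15-mds2,st15-oss1,st15-oss2" hypervisor_uri="qemu+ssh://wc0008/system" -T reset st15-mds1

If that works, the problem would seem to be in the cluster configuration rather than in the plugin itself.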

And a section of the log file is:

Jun 29 11:02:07 st15-mds2 stonithd: [4485]: ERROR: Failed to STONITH the node st15-mds1: optype=RESET, op_result=TIMEOUT
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: tengine_stonith_callback: call=-65, optype=1, node_name=st15-mds1, result=2, node_list=, action=23:90:0:aac961e7-b06b-4dfd-ae60-c882407b16b5
Jun 29 11:02:07 st15-mds2 crmd: [4490]: ERROR: tengine_stonith_callback: Stonith of st15-mds1 failed (2)... aborting transition.
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: abort_transition_graph: tengine_stonith_callback:409 - Triggered transition abort (complete=0) : Stonith failed
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: update_abort_priority: Abort action done superceeded by restart
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: run_graph: ====================================================
Jun 29 11:02:07 st15-mds2 crmd: [4490]: notice: run_graph: Transition 90 (Complete=2, Pending=0, Fired=0, Skipped=5, Incomplete=0, Source=/var/lib/pengine/pe-warn-173.bz2): Stopped
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: te_graph_trigger: Transition 90 is now complete
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_state_transition: All 3 cluster nodes are eligible to run resources.
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_pe_invoke: Query 299: Requesting the current CIB: S_POLICY_ENGINE
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_pe_invoke_callback: Invoking the PE: query=299, ref=pe_calc-dc-1340982127-223, seq=396, quorate=1
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status: Node st15-mds2 is online
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: pe_fence_node: Node st15-mds1 will be fenced because it is un-expectedly down
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status_fencing:     ha_state=active, ccm_state=false, crm_state=online, join_state=member, expected=member
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: determine_online_status: Node st15-mds1 is unclean
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status: Node st15-oss1 is online
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status: Node st15-oss2 is online
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: lustre-OST0000    (ocf::heartbeat:Filesystem):    Started st15-oss1
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: lustre-OST0001    (ocf::heartbeat:Filesystem):    Started st15-oss1
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: lustre-OST0002    (ocf::heartbeat:Filesystem):    Started st15-oss2
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: lustre-OST0003    (ocf::heartbeat:Filesystem):    Started st15-oss2
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: lustre-MDT0000    (ocf::heartbeat:Filesystem):    Started st15-mds1
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: st-nodes    (stonith:external/libvirt):    Stopped 
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: native_color: Resource st-nodes cannot run anywhere
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: custom_action: Action lustre-MDT0000_stop_0 on st15-mds1 is unrunnable (offline)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: custom_action: Marking node st15-mds1 unclean
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: RecurringOp:  Start recurring monitor (120s) for lustre-MDT0000 on st15-mds2
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: stage6: Scheduling Node st15-mds1 for STONITH
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: native_stop_constraints: lustre-MDT0000_stop_0 is implicit after st15-mds1 is fenced
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave   resource lustre-OST0000    (Started st15-oss1)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave   resource lustre-OST0001    (Started st15-oss1)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave   resource lustre-OST0002    (Started st15-oss2)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave   resource lustre-OST0003    (Started st15-oss2)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Move    resource lustre-MDT0000    (Started st15-mds1 -> st15-mds2)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave   resource st-nodes    (Stopped)
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: process_pe_message: Transition 91: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-174.bz2
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: unpack_graph: Unpacked transition 91: 7 actions in 7 synapses
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: process_pe_message: Configuration WARNINGs found during PE processing.  Please run "crm_verify -L" to identify issues.
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_te_invoke: Processing graph 91 (ref=pe_calc-dc-1340982127-223) derived from /var/lib/pengine/pe-warn-174.bz2
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: te_pseudo_action: Pseudo action 21 fired and confirmed
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: te_fence_node: Executing reboot fencing operation (23) on st15-mds1 (timeout=60000)
Jun 29 11:02:07 st15-mds2 stonithd: [4485]: info: client tengine [pid: 4490] requests a STONITH operation RESET on node st15-mds1
Jun 29 11:02:07 st15-mds2 stonithd: [4485]: info: we can't manage st15-mds1, broadcast request to other nodes
Jun 29 11:02:07 st15-mds2 stonithd: [4485]: info: Broadcasting the message succeeded: require others to stonith node st15-mds1.
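
In case it helps, here is what I was planning to run next to dig into why st-nodes stays Stopped (assuming these are the right commands):

crm_verify -L -V                # re-check the configuration warnings mentioned in the log
crm_mon -1 -f                   # one-shot cluster status including fail counts
crm resource cleanup st-nodes   # clear any failed start so the resource can be retried

Happy to post the output of any of these if it would be useful.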

Thank you!

 
Brett Lee
Everything Penguin - http://etpenguin.com