[Pacemaker] newb - stonith not working - require others to stonith node

Andrew Beekhof andrew at beekhof.net
Sun Jul 1 21:23:16 UTC 2012


On Sat, Jun 30, 2012 at 5:41 AM, Brett Lee <brettlee at yahoo.com> wrote:
> Hello - I think this is progress.
>
> I have made some updates, but am still getting the same result ("require
> others to stonith node st15-mds1").

We'd still need to see a hb_report containing the logs and PE files
before we can comment.
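
For reference, a report covering the window around the failed fence can
usually be generated with something like the following (the exact options
vary between cluster-glue versions; the time window and destination path
here are only placeholders):

    # run on the DC (st15-mds2); collects logs, the CIB and the PE input
    # files from all reachable nodes into a single tarball
    hb_report -f "2012-06-29 10:45" -t "2012-06-29 11:15" /tmp/stonith-issue

Attaching the resulting tarball (it includes the pe-warn-*.bz2 files the
log mentions) gives enough context to see why the device never started.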

> I referenced this link for the updates:
> http://www.hastexo.com/resources/hints-and-kinks/fencing-libvirtkvm-virtualized-cluster-nodes
>
> Updates include removing the previous 'primitive st-nodes' entry and adding
> the following:
>
> primitive stonith_st15-mds1 stonith:external/libvirt \
>         params hostlist="st15-mds1" hypervisor_uri="qemu+ssh://wc0008/system" stonith-timeout="30" \
>         op start interval="0" timeout="60" \
>         op stop interval="0" timeout="60" \
>         op monitor interval="60"
> primitive stonith_st15-mds2 stonith:external/libvirt \
>         params hostlist="st15-mds2" hypervisor_uri="qemu+ssh://wc0008/system" stonith-timeout="30" \
>         op start interval="0" timeout="60" \
>         op stop interval="0" timeout="60" \
>         op monitor interval="60"
> location l_stonith_st15-mds1 stonith_st15-mds1 -inf: st15-mds1
> location l_stonith_st15-mds2 stonith_st15-mds2 -inf: st15-mds2
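
Before letting the cluster drive these devices, it is worth confirming that
the agent can actually reach the hypervisor from the node that will do the
fencing. A manual test along these lines (run as root on st15-mds2, with the
parameters copied from the primitive above) should reboot the victim VM; if
it hangs or times out here, the cluster will see the same TIMEOUT that shows
up in your logs:

    # check the device status first
    stonith -t external/libvirt hostlist="st15-mds1" \
            hypervisor_uri="qemu+ssh://wc0008/system" -S
    # then perform an actual reset
    stonith -t external/libvirt hostlist="st15-mds1" \
            hypervisor_uri="qemu+ssh://wc0008/system" -T reset st15-mds1

Non-interactive (key-based) ssh from the cluster nodes to wc0008 is assumed
for the qemu+ssh URI; a missing key is a frequent cause of external/libvirt
timing out.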
>
> Any suggestions would certainly be appreciated.  Thanks!
>
> Brett Lee
> Everything Penguin - http://etpenguin.com
>
> ________________________________
> From: Brett Lee <brettlee at yahoo.com>
> To: "pacemaker at oss.clusterlabs.org" <pacemaker at oss.clusterlabs.org>
> Sent: Friday, June 29, 2012 9:43 AM
> Subject: [Pacemaker] newb - stonith not working - require others to stonith
> node
>
> Hi -
>
> I am new to Pacemaker and now have a shiny new configuration that will not
> stonith.  This is a test system using KVM and external/libvirt - all VMs are
> running CentOS 5.
>
> I am (really) hoping someone might be willing to help troubleshoot this
> configuration.  Thank you for your time and effort!
>
> The items that are suspect to me are:
> 1.  st-nodes has no 'location' entry
> 2.  logs report node_list=
> 3.  resource st-nodes is Stopped
>
> I have attached a clip of the configuration below.  The full configuration
> and log file may be found at http://pastebin.com/bS87FXUr
>
> Per 'stonith -t external/libvirt -h' I have configured stonith using:
>
> primitive st-nodes stonith:external/libvirt \
>         params hostlist="st15-mds1,st15-mds2,st15-oss1,st15-oss2" hypervisor_uri="qemu+ssh://wc0008/system" stonith-timeout="30" \
>         op start interval="0" timeout="60" \
>         op stop interval="0" timeout="60" \
>         op monitor interval="60"
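
Your suspicions #1 and #3 point at the actual problem: the PE decides that
st-nodes cannot run anywhere, so no node has a running device able to fence
st15-mds1, and stonithd can only broadcast the request in the hope that some
other node can handle it. With a single device listing every host, one
pattern in use at the time was to clone it so an instance can run on each
surviving node; the per-node primitives with -inf location constraints from
the hastexo article (as in your update above) are the cleaner variant. A
sketch only, with a resource name I made up:

    # hypothetical alternative to per-node devices (crm configure syntax):
    # run an instance of the single device on every node
    clone fencing-by-libvirt st-nodes

Either way, the stonith resource should show as Started somewhere in crm_mon
before a fence is ever requested.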
>
> And a section of the log file is:
>
> Jun 29 11:02:07 st15-mds2 stonithd: [4485]: ERROR: Failed to STONITH the
> node st15-mds1: optype=RESET, op_result=TIMEOUT
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: tengine_stonith_callback:
> call=-65, optype=1, node_name=st15-mds1, result=2, node_list=,
> action=23:90:0:aac961e7-b06b-4dfd-ae60-c882407b16b5
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: ERROR: tengine_stonith_callback:
> Stonith of st15-mds1 failed (2)... aborting transition.
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: abort_transition_graph:
> tengine_stonith_callback:409 - Triggered transition abort (complete=0) :
> Stonith failed
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: update_abort_priority: Abort
> priority upgraded from 0 to 1000000
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: update_abort_priority: Abort
> action done superceeded by restart
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: run_graph:
> ====================================================
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: notice: run_graph: Transition 90
> (Complete=2, Pending=0, Fired=0, Skipped=5, Incomplete=0,
> Source=/var/lib/pengine/pe-warn-173.bz2): Stopped
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: te_graph_trigger: Transition
> 90 is now complete
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_state_transition: All 3
> cluster nodes are eligible to run resources.
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_pe_invoke: Query 299:
> Requesting the current CIB: S_POLICY_ENGINE
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_pe_invoke_callback:
> Invoking the PE: query=299, ref=pe_calc-dc-1340982127-223, seq=396,
> quorate=1
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: unpack_config: Node scores:
> 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status:
> Node st15-mds2 is online
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: pe_fence_node: Node
> st15-mds1 will be fenced because it is un-expectedly down
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info:
> determine_online_status_fencing:     ha_state=active, ccm_state=false,
> crm_state=online, join_state=member, expected=member
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: determine_online_status:
> Node st15-mds1 is unclean
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status:
> Node st15-oss1 is online
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status:
> Node st15-oss2 is online
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print:
> lustre-OST0000    (ocf::heartbeat:Filesystem):    Started st15-oss1
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print:
> lustre-OST0001    (ocf::heartbeat:Filesystem):    Started st15-oss1
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print:
> lustre-OST0002    (ocf::heartbeat:Filesystem):    Started st15-oss2
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print:
> lustre-OST0003    (ocf::heartbeat:Filesystem):    Started st15-oss2
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print:
> lustre-MDT0000    (ocf::heartbeat:Filesystem):    Started st15-mds1
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: st-nodes
> (stonith:external/libvirt):    Stopped
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: native_color: Resource
> st-nodes cannot run anywhere
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: custom_action: Action
> lustre-MDT0000_stop_0 on st15-mds1 is unrunnable (offline)
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: custom_action: Marking node
> st15-mds1 unclean
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: RecurringOp:  Start
> recurring monitor (120s) for lustre-MDT0000 on st15-mds2
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: stage6: Scheduling Node
> st15-mds1 for STONITH
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: native_stop_constraints:
> lustre-MDT0000_stop_0 is implicit after st15-mds1 is fenced
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave
> resource lustre-OST0000    (Started st15-oss1)
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave
> resource lustre-OST0001    (Started st15-oss1)
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave
> resource lustre-OST0002    (Started st15-oss2)
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave
> resource lustre-OST0003    (Started st15-oss2)
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Move
> resource lustre-MDT0000    (Started st15-mds1 -> st15-mds2)
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave
> resource st-nodes    (Stopped)
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=handle_response ]
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: process_pe_message:
> Transition 91: WARNINGs found during PE processing. PEngine Input stored in:
> /var/lib/pengine/pe-warn-174.bz2
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: unpack_graph: Unpacked
> transition 91: 7 actions in 7 synapses
> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: process_pe_message:
> Configuration WARNINGs found during PE processing.  Please run "crm_verify
> -L" to identify issues.
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_te_invoke: Processing graph
> 91 (ref=pe_calc-dc-1340982127-223) derived from
> /var/lib/pengine/pe-warn-174.bz2
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: te_pseudo_action: Pseudo
> action 21 fired and confirmed
> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: te_fence_node: Executing
> reboot fencing operation (23) on st15-mds1 (timeout=60000)
> Jun 29 11:02:07 st15-mds2 stonithd: [4485]: info: client tengine [pid: 4490]
> requests a STONITH operation RESET on node st15-mds1
> Jun 29 11:02:07 st15-mds2 stonithd: [4485]: info: we can't manage st15-mds1,
> broadcast request to other nodes
> Jun 29 11:02:07 st15-mds2 stonithd: [4485]: info: Broadcasting the message
> succeeded: require others to stonith node st15-mds1.
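
The last three messages are the crux: stonithd on the DC has no running
device capable of fencing st15-mds1 (st-nodes is Stopped and "cannot run
anywhere"), so it broadcasts the request, and with no other node able to
answer, the operation eventually fails, which matches the TIMEOUT at the top
of the excerpt. The log's own hint is the quickest next step; run on the DC,
something like the following repeats the PE warnings with more detail and
confirms whether any fencing resource is running at all (only the grep
pattern is my own):

    # check the live CIB; repeat -V for extra verbosity
    crm_verify -LVV
    # one-shot cluster status, filtered to the fencing resource(s)
    crm_mon -1 | grep -i stonith

Once the device shows as Started, the fence request should be handled
locally instead of being broadcast.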
>
> Thank you!
>
> Brett Lee
> Everything Penguin - http://etpenguin.com
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
