[Pacemaker] Question about recovery policy after "Too many failures to fence"
Andrew Beekhof
andrew at beekhof.net
Mon Apr 8 03:01:16 UTC 2013
On 27/03/2013, at 7:45 PM, Kazunori INOUE <inouekazu at intellilink.co.jp> wrote:
> Hi,
>
> I'm using pacemaker-1.1 (c7910371a5, the latest devel).
>
> After fencing fails 10 times, the crmd stays in the S_TRANSITION_ENGINE state.
> (related: https://github.com/ClusterLabs/pacemaker/commit/e29d2f9)
>
> How should I recover? What procedure should I follow to bring it back to S_IDLE?
The intention was that the node should proceed to S_IDLE when this occurs, so you shouldn't have to do anything. The cluster would then try again once the recheck-interval expired or a config change was made.
I assume you're saying this does not occur?
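
To check whether the DC ever leaves S_TRANSITION_ENGINE, something like the following could be used. It is only a rough sketch built around the `crmadmin -S` output quoted below. The function names `crmd_state` and `wait_for_idle` and the `STATUS_CMD` override are hypothetical helpers, not part of Pacemaker, and the `2>&1` is there on the assumption that some builds print the status line on stderr.

```shell
#!/bin/sh
# Extract the state token from a crmadmin status line such as:
#   Status of crmd at dev2: S_TRANSITION_ENGINE (ok)
# Prints nothing if the line does not have that shape.
crmd_state() {
    printf '%s\n' "$1" |
        sed -n 's/^Status of crmd at [^:]*: \([A-Z_]*\).*$/\1/p'
}

# Poll a node until it reports S_IDLE or the attempts run out.
#   $1 = node name, $2 = number of attempts, $3 = seconds between attempts
# STATUS_CMD is a hypothetical override so the check can be stubbed;
# by default the real "crmadmin -S <node>" is run.
wait_for_idle() {
    node=$1
    tries=${2:-30}
    delay=${3:-10}
    while [ "$tries" -gt 0 ]; do
        state=$(crmd_state "$(${STATUS_CMD:-crmadmin -S} "$node" 2>&1)")
        [ "$state" = "S_IDLE" ] && return 0
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# Example (not run here): wait_for_idle dev2 30 10 || echo "dev2 never reached S_IDLE"
```

If this still times out after the recheck-interval has passed, that would confirm the node is not proceeding to S_IDLE as intended.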
>
>
> Mar 27 15:34:34 dev2 crmd[17937]: notice: tengine_stonith_callback: Stonith operation 12/22:14:0:0927a8a0-8e09-494e-acf8-7fb273ca8c9e: Generic Pacemaker error (-1001)
> Mar 27 15:34:34 dev2 crmd[17937]: notice: tengine_stonith_callback: Stonith operation 12 for dev2 failed (Generic Pacemaker error): aborting transition.
> Mar 27 15:34:34 dev2 crmd[17937]: info: abort_transition_graph: tengine_stonith_callback:426 - Triggered transition abort (complete=0) : Stonith failed
> Mar 27 15:34:34 dev2 crmd[17937]: notice: tengine_stonith_notify: Peer dev2 was not terminated (st_notify_fence) by dev1 for dev2: Generic Pacemaker error (ref=05f75ab8-34ae-4aae-bbc6-aa20dbfdc845) by client crmd.17937
> Mar 27 15:34:34 dev2 crmd[17937]: notice: run_graph: Transition 14 (Complete=1, Pending=0, Fired=0, Skipped=8, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
> Mar 27 15:34:34 dev2 crmd[17937]: notice: too_many_st_failures: Too many failures to fence dev2 (11), giving up
>
> $ crmadmin -S dev2
> Status of crmd at dev2: S_TRANSITION_ENGINE (ok)
>
> $ crm_mon
> Last updated: Wed Mar 27 15:35:12 2013
> Last change: Wed Mar 27 15:33:16 2013 via cibadmin on dev1
> Stack: corosync
> Current DC: dev2 (3232261523) - partition with quorum
> Version: 1.1.10-1.el6-c791037
> 2 Nodes configured, unknown expected votes
> 3 Resources configured.
>
>
> Node dev2 (3232261523): UNCLEAN (online)
> Online: [ dev1 ]
>
> prmDummy (ocf::pacemaker:Dummy): Started dev2 FAILED
> Resource Group: grpStonith1
> prmStonith1 (stonith:external/stonith-helper): Started dev2
> Resource Group: grpStonith2
> prmStonith2 (stonith:external/stonith-helper): Started dev1
>
> Failed actions:
> prmDummy_monitor_10000 (node=dev2, call=23, rc=7, status=complete): not running
>
> ----
> Best Regards,
> Kazunori INOUE
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org