[Pacemaker] Question about recovery policy after "Too many failures to fence"

Thu Apr 11 18:22:33 EDT 2013

On 11/04/2013, at 7:23 PM, Kazunori INOUE <inouekazu at intellilink.co.jp> wrote:

> Hi Andrew,
> 
> (13.04.08 12:01), Andrew Beekhof wrote:
>> 
>> On 27/03/2013, at 7:45 PM, Kazunori INOUE <inouekazu at intellilink.co.jp> wrote:
>> 
>>> Hi,
>>> 
>>> I'm using pacemaker-1.1 (c7910371a5, the latest devel).
>>> 
>>> After fencing has failed more than 10 times, the crmd stays in the
>>> S_TRANSITION_ENGINE state.
>>> (related: https://github.com/ClusterLabs/pacemaker/commit/e29d2f9)
>>> 
>>> How should I recover? What procedure should I follow to get back to S_IDLE?
>> 
>> The intention was that the node should proceed to S_IDLE when this occurs, so you shouldn't have to do anything and the cluster would try again once the recheck-interval expired or a config change was made.
>> 
>> I assume you're saying this does not occur?
>> 
> 
> My understanding is that the cluster-recheck-interval timer does not
> run while the crmd is in S_TRANSITION_ENGINE.
> So even after waiting a long time, the state was still S_TRANSITION_ENGINE.
> * I attached crm_report.
> 
> What do I have to do in order to make the cluster retry STONITH?
> For example, do I need to run 'crmadmin -E' to change the config?

Basically, you'd need to fix the crmd, which is rather bad.
cibadmin -E is a big stick, though; any config change (once this bug is fixed) would be enough.

Actually, you should be able to trigger fencing yourself with stonith_admin.
As long as that fencing operation succeeds, it should also reset the fencing fail count.
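
A minimal sketch of what triggering fencing by hand might look like, assuming the node names from the output quoted below (dev2 is the node that could not be fenced, dev1 is its peer). The helper names run and retry_fencing and the DRY_RUN switch are made up for illustration and are not part of Pacemaker. Check the --reboot option against the stonith_admin man page for your version before relying on it.

```shell
#!/bin/sh
# Sketch only: manually request fencing, then look at the crmd state.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: fence $1, then query the crmd state on $2.
retry_fencing() {
    target="$1"   # node to fence
    query="$2"    # node whose crmd state is checked afterwards
    # Ask the fencing daemon to reboot the target node.
    run stonith_admin --reboot "$target" || return 1
    # Print the crmd state of the queried node (S_IDLE on the DC is the goal).
    run crmadmin -S "$query"
}
```

Running retry_fencing dev2 dev1 with the default DRY_RUN=1 only prints the two commands; set DRY_RUN=0 on a real cluster to execute them. Since dev2 is the DC in this report, fencing it moves the DC role, so the second argument should be whichever node is expected to be DC afterwards.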

> 
> ----
> Best Regards,
> Kazunori INOUE
> 
>>> 
>>> 
>>> Mar 27 15:34:34 dev2 crmd[17937]:   notice: tengine_stonith_callback: Stonith operation 12/22:14:0:0927a8a0-8e09-494e-acf8-7fb273ca8c9e: Generic Pacemaker error (-1001)
>>> Mar 27 15:34:34 dev2 crmd[17937]:   notice: tengine_stonith_callback: Stonith operation 12 for dev2 failed (Generic Pacemaker error): aborting transition.
>>> Mar 27 15:34:34 dev2 crmd[17937]:     info: abort_transition_graph: tengine_stonith_callback:426 - Triggered transition abort (complete=0) : Stonith failed
>>> Mar 27 15:34:34 dev2 crmd[17937]:   notice: tengine_stonith_notify: Peer dev2 was not terminated (st_notify_fence) by dev1 for dev2: Generic Pacemaker error (ref=05f75ab8-34ae-4aae-bbc6-aa20dbfdc845) by client crmd.17937
>>> Mar 27 15:34:34 dev2 crmd[17937]:   notice: run_graph: Transition 14 (Complete=1, Pending=0, Fired=0, Skipped=8, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
>>> Mar 27 15:34:34 dev2 crmd[17937]:   notice: too_many_st_failures: Too many failures to fence dev2 (11), giving up
>>> 
>>> $ crmadmin -S dev2
>>> Status of crmd at dev2: S_TRANSITION_ENGINE (ok)
>>> 
>>> $ crm_mon
>>> Last updated: Wed Mar 27 15:35:12 2013
>>> Last change: Wed Mar 27 15:33:16 2013 via cibadmin on dev1
>>> Stack: corosync
>>> Current DC: dev2 (3232261523) - partition with quorum
>>> Version: 1.1.10-1.el6-c791037
>>> 2 Nodes configured, unknown expected votes
>>> 3 Resources configured.
>>> 
>>> 
>>> Node dev2 (3232261523): UNCLEAN (online)
>>> Online: [ dev1 ]
>>> 
>>> prmDummy       (ocf::pacemaker:Dummy): Started dev2 FAILED
>>> Resource Group: grpStonith1
>>>     prmStonith1        (stonith:external/stonith-helper):      Started dev2
>>> Resource Group: grpStonith2
>>>     prmStonith2        (stonith:external/stonith-helper):      Started dev1
>>> 
>>> Failed actions:
>>>    prmDummy_monitor_10000 (node=dev2, call=23, rc=7, status=complete): not running
>>> 
>>> ----
>>> Best Regards,
>>> Kazunori INOUE
>>> 
>>> 
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
> <too-many-failures-to-fence.tar.bz2>