[Pacemaker] Question about recovery policy after "Too many failures to fence"

Kazunori INOUE inouekazu at intellilink.co.jp
Wed Apr 17 04:54:01 EDT 2013


Hi Andrew,

> -----Original Message-----
> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> Sent: Wednesday, April 17, 2013 2:28 PM
> To: The Pacemaker cluster resource manager
> Cc: shimazakik at intellilink.co.jp
> Subject: Re: [Pacemaker] Question about recovery policy after "Too many
> failures to fence"
> 
> 
> On 11/04/2013, at 7:23 PM, Kazunori INOUE <inouekazu at intellilink.co.jp> wrote:
> 
> > Hi Andrew,
> >
> > (13.04.08 12:01), Andrew Beekhof wrote:
> >>
> >> On 27/03/2013, at 7:45 PM, Kazunori INOUE <inouekazu at intellilink.co.jp> wrote:
> >>
> >>> Hi,
> >>>
> >>> I'm using pacemaker-1.1 (c7910371a5. the latest devel).
> >>>
> >>> When fencing failed 10 times, S_TRANSITION_ENGINE state is kept.
> >>> (related: https://github.com/ClusterLabs/pacemaker/commit/e29d2f9)
> >>>
> >>> How should I recover?  what kind of procedure should I make S_IDLE in?
> >>
> >> The intention was that the node should proceed to S_IDLE when this occurs, so you shouldn't have to do anything and the cluster would try again once the recheck-interval expired or a config change was made.
> >>
> >> I assume you're saying this does not occur?
> >>
> >
> > I found that the cluster-recheck-interval timer is not active while the
> > cluster is in S_TRANSITION_ENGINE, so even after waiting a long time, it
> > remained in S_TRANSITION_ENGINE.
> > * I attached a crm_report.
> 
> I think
>    https://github.com/beekhof/pacemaker/commit/ef8068e9
> should fix this part of the problem.
> 

I confirmed that this fixes the problem.
Thanks!
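For anyone following along, the fix can be verified from the CLI using the commands already shown in this thread: once the DC drops back to S_IDLE, the recheck timer should re-run the policy engine and retry fencing. A sketch, assuming the two-node setup from this thread ("dev2" is the DC) and an illustrative 60s interval:

```shell
# Confirm the DC has returned to S_IDLE instead of sticking in
# S_TRANSITION_ENGINE after the failed fencing attempts.
crmadmin -S dev2

# Optionally shorten cluster-recheck-interval so the policy engine
# re-runs (and re-attempts fencing) sooner. The 60s value is only
# an example; pick something appropriate for your cluster.
crm_attribute --name cluster-recheck-interval --update 60s
```

These commands need a running Pacemaker cluster, so they are shown here as a procedure rather than something you can try in isolation.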


> >
> > What do I have to do to make the cluster retry STONITH?
> > For example, should I run 'crmadmin -E' or make a configuration change?
> >
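Until the fix lands, a manual nudge along these lines might force a new transition; this is only a sketch ("dev2" is the DC from this thread, and whether '-E' alone is sufficient is exactly the open question above):

```shell
# Force a DC election; the resulting state change makes the
# policy engine run again, which re-attempts the pending fencing.
crmadmin -E dev2

# Alternatively, any CIB configuration change also triggers a new
# transition. Updating a throwaway cluster property is one way;
# "last-manual-retry" is a made-up name used purely as an example.
crm_attribute --name last-manual-retry --update "$(date +%s)"
```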
> > ----
> > Best Regards,
> > Kazunori INOUE
> >
> >>>
> >>>
> >>> Mar 27 15:34:34 dev2 crmd[17937]:   notice: tengine_stonith_callback: Stonith operation 12/22:14:0:0927a8a0-8e09-494e-acf8-7fb273ca8c9e: Generic Pacemaker error (-1001)
> >>> Mar 27 15:34:34 dev2 crmd[17937]:   notice: tengine_stonith_callback: Stonith operation 12 for dev2 failed (Generic Pacemaker error): aborting transition.
> >>> Mar 27 15:34:34 dev2 crmd[17937]:     info: abort_transition_graph: tengine_stonith_callback:426 - Triggered transition abort (complete=0): Stonith failed
> >>> Mar 27 15:34:34 dev2 crmd[17937]:   notice: tengine_stonith_notify: Peer dev2 was not terminated (st_notify_fence) by dev1 for dev2: Generic Pacemaker error (ref=05f75ab8-34ae-4aae-bbc6-aa20dbfdc845) by client crmd.17937
> >>> Mar 27 15:34:34 dev2 crmd[17937]:   notice: run_graph: Transition 14 (Complete=1, Pending=0, Fired=0, Skipped=8, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
> >>> Mar 27 15:34:34 dev2 crmd[17937]:   notice: too_many_st_failures: Too many failures to fence dev2 (11), giving up
> >>>
> >>> $ crmadmin -S dev2
> >>> Status of crmd at dev2: S_TRANSITION_ENGINE (ok)
> >>>
> >>> $ crm_mon
> >>> Last updated: Wed Mar 27 15:35:12 2013
> >>> Last change: Wed Mar 27 15:33:16 2013 via cibadmin on dev1
> >>> Stack: corosync
> >>> Current DC: dev2 (3232261523) - partition with quorum
> >>> Version: 1.1.10-1.el6-c791037
> >>> 2 Nodes configured, unknown expected votes
> >>> 3 Resources configured.
> >>>
> >>>
> >>> Node dev2 (3232261523): UNCLEAN (online)
> >>> Online: [ dev1 ]
> >>>
> >>> prmDummy       (ocf::pacemaker:Dummy): Started dev2 FAILED
> >>> Resource Group: grpStonith1
> >>>     prmStonith1        (stonith:external/stonith-helper):      Started dev2
> >>> Resource Group: grpStonith2
> >>>     prmStonith2        (stonith:external/stonith-helper):      Started dev1
> >>>
> >>> Failed actions:
> >>>    prmDummy_monitor_10000 (node=dev2, call=23, rc=7, status=complete): not running
> >>>
> >>> ----
> >>> Best Regards,
> >>> Kazunori INOUE
> >>>
> >>>
> >>> _______________________________________________
> >>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>
> >>> Project Home: http://www.clusterlabs.org
> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> Bugs: http://bugs.clusterlabs.org
> >>
> >
> <too-many-failures-to-fence.tar.bz2>
> 
> 
