[Pacemaker] S_POLICY_ENGINE state continues being maintained
Kazunori INOUE
inouekazu at intellilink.co.jp
Fri May 24 08:02:38 UTC 2013
(13.05.24 13:38), Andrew Beekhof wrote:
>
> On 24/05/2013, at 2:19 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>
>>
>> On 23/05/2013, at 4:44 PM, Kazunori INOUE <inouekazu at intellilink.co.jp> wrote:
>>
>>> Hi,
>>>
>>> I'm using pacemaker-1.1 (c3486a4a8d. the latest devel).
>>> After fencing triggered by a split-brain has failed 11 times, the cluster stays in the S_POLICY_ENGINE state even after I recover from the split-brain.
>>
>> Odd, I get:
>>
>> May 24 00:17:08 corosync-host-1 crmd[3056]: notice: tengine_stonith_callback: Stonith operation 12/69:23:0:9b069b96-3565-4219-85a5-8782bdb5d9d3: No route to host (-113)
>> May 24 00:17:08 corosync-host-1 crmd[3056]: notice: tengine_stonith_callback: Stonith operation 12 for corosync-host-6 failed (No route to host): aborting transition.
>> May 24 00:17:08 corosync-host-1 crmd[3056]: notice: run_graph: Transition 23 (Complete=1, Pending=0, Fired=0, Skipped=2, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-110.bz2): Stopped
>> May 24 00:17:08 corosync-host-1 crmd[3056]: notice: too_many_st_failures: Too many failures to fence corosync-host-6 (11), giving up
>> May 24 00:17:08 corosync-host-1 crmd[3056]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>> May 24 00:17:08 corosync-host-1 crmd[3056]: notice: tengine_stonith_notify: Peer corosync-host-6 was not terminated (reboot) by corosync-host-1 for corosync-host-1: No route to host (ref=9dd3711e-c87d-4b2e-acd1-854391a6fa9d) by client crmd.3056
>
> Same for you:
>
> May 23 13:17:28 [24868] dev1 crmd: notice: too_many_st_failures: Too many failures to fence dev2 (11), giving up
> May 23 13:17:28 [24868] dev1 crmd: debug: notify_crmd: Transition 10 status: restart - Stonith failed
> May 23 13:17:28 [24868] dev1 crmd: debug: s_crmd_fsa: Processing I_TE_SUCCESS: [ state=S_TRANSITION_ENGINE cause=C_FSA_INTERNAL origin=notify_crmd ]
> May 23 13:17:28 [24868] dev1 crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
>
> and
>
> May 23 13:17:28 [7107] dev2 crmd: notice: too_many_st_failures: Too many failures to fence dev1 (11), giving up
> May 23 13:17:28 [7107] dev2 crmd: debug: notify_crmd: Transition 13 status: restart - Stonith failed
> May 23 13:17:28 [7107] dev2 crmd: debug: s_crmd_fsa: Processing I_TE_SUCCESS: [ state=S_TRANSITION_ENGINE cause=C_FSA_INTERNAL origin=notify_crmd ]
> May 23 13:17:28 [7107] dev2 crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
> May 23 13:17:28 [7107] dev2 crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>
> oh, but not here:
>
> May 23 13:24:23 [7107] dev2 crmd: debug: do_te_invoke: Cancelling the transition: inactive
> May 23 13:24:23 [7107] dev2 crmd: info: abort_transition_graph: do_te_invoke:155 - Triggered transition abort (complete=1) : Peer Cancelled
> May 23 13:24:23 [7107] dev2 crmd: notice: too_many_st_failures: Too many failures to fence dev1 (11), giving up
> May 23 13:24:23 [7107] dev2 crmd: debug: s_crmd_fsa: Processing I_TE_SUCCESS: [ state=S_POLICY_ENGINE cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> May 23 13:24:23 [7107] dev2 crmd: warning: do_log: FSA: Input I_TE_SUCCESS from abort_transition_graph() received in state S_POLICY_ENGINE
> May 23 13:24:23 [7107] dev2 crmd: debug: te_update_diff: Processing diff (cib_modify): 0.5.24 -> 0.5.25 (S_POLICY_ENGINE)
> May 23 13:24:23 [7107] dev2 crmd: debug: te_update_diff: Processing diff (cib_modify): 0.5.25 -> 0.5.26 (S_POLICY_ENGINE)
> May 23 13:24:23 [7107] dev2 crmd: debug: join_update_complete_callback: Join update 95 complete
> May 23 13:24:23 [7107] dev2 crmd: debug: check_join_state: Invoked by join_update_complete_callback in state: S_POLICY_ENGINE
> May 23 13:47:54 [7107] dev2 crmd: notice: handle_request: Current ping state: S_POLICY_ENGINE
>
> Can you try the following patch?
>
> diff --git a/crmd/te_utils.c b/crmd/te_utils.c
> index ae4c5de..f3e0d9f 100644
> --- a/crmd/te_utils.c
> +++ b/crmd/te_utils.c
> @@ -408,15 +408,11 @@ abort_transition_graph(int abort_priority, enum transition_action abort_action,
> fsa_pe_ref = NULL;
>
> if (transition_graph->complete) {
> - if (too_many_st_failures() == FALSE) {
> - if (transition_timer->period_ms > 0) {
> - crm_timer_stop(transition_timer);
> - crm_timer_start(transition_timer);
> - } else {
> - register_fsa_input(C_FSA_INTERNAL, I_PE_CALC, NULL);
> - }
> + if (transition_timer->period_ms > 0) {
> + crm_timer_stop(transition_timer);
> + crm_timer_start(transition_timer);
> } else {
> - register_fsa_input(C_FSA_INTERNAL, I_TE_SUCCESS, NULL);
> + register_fsa_input(C_FSA_INTERNAL, I_PE_CALC, NULL);
> }
> return;
> }
>
Hi Andrew,
I confirmed that this patch fixes the problem.
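With the patch applied and the network restored, the check from step 3 below should come back with the DC in S_IDLE instead of S_POLICY_ENGINE, along these lines (illustrative, not a verbatim capture):

[dev1 ~]$ crmadmin -S dev2
Status of crmd at dev2: S_IDLE (ok)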
>
> The expected behavior is that after too_many_st_failures() returns true, we will retry once per re-check interval until either the node is confirmed down with stonith_admin -C or fencing succeeds.
> If the node comes back and fencing is no longer needed, but fencing has still not been confirmed to work, then the count in too_many_st_failures() is not cleared.
>
> Make sense?
>
It makes sense.
Thanks!
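For reference, the two recovery paths described above look roughly like this on the command line; the node name is the one from this thread and the interval value is only an example, not a recommendation:

# Tell the cluster that dev2 is already down, which satisfies the pending
# fencing (only do this after verifying the node really is powered off):
[dev1 ~]$ stonith_admin -C dev2

# Or shorten the interval at which the DC re-checks and re-schedules the
# transition, so a fencing retry happens sooner:
[dev1 ~]$ crm_attribute --type crm_config --name cluster-recheck-interval --update 2min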
>
>>
>>
>>
>>>
>>> 1. Disconnect the network between the nodes
>>> [dev1 ~]$ crm_mon
>>> Last updated: Thu May 23 13:16:41 2013
>>> Last change: Thu May 23 13:15:30 2013 via cibadmin on dev1
>>> Stack: corosync
>>> Current DC: dev1 (3232261525) - partition WITHOUT quorum
>>> Version: 1.1.10-0.122.c3486a4.git.el6-c3486a4
>>> 2 Nodes configured, unknown expected votes
>>> 2 Resources configured.
>>>
>>>
>>> Node dev2 (3232261523): UNCLEAN (offline)
>>> Online: [ dev1 ]
>>>
>>> f1 (stonith:external/libvirt.NG): Started dev2
>>> f2 (stonith:external/libvirt.NG): Started dev1
>>>
>>> [dev2 ~]$ crm_mon
>>> Last updated: Thu May 23 13:16:41 2013
>>> Last change: Thu May 23 13:15:30 2013 via cibadmin on dev1
>>> Stack: corosync
>>> Current DC: dev2 (3232261523) - partition WITHOUT quorum
>>> Version: 1.1.10-0.122.c3486a4.git.el6-c3486a4
>>> 2 Nodes configured, unknown expected votes
>>> 2 Resources configured.
>>>
>>>
>>> Node dev1 (3232261525): UNCLEAN (offline)
>>> Online: [ dev2 ]
>>>
>>> f1 (stonith:external/libvirt.NG): Started dev2
>>> f2 (stonith:external/libvirt.NG): Started dev1
>>>
>>>
>>> 2. Wait until fencing has failed 11 times
>>> [dev1 ~]$ egrep "CRIT|too_many_st_failures" /var/log/ha-log
>>> May 23 13:16:46 dev1 stonith: [24981]: CRIT: external_reset_req: 'libvirt.NG reset' for host dev2 failed with rc 1
>>> (snip)
>>> May 23 13:17:24 dev1 stonith: [25105]: CRIT: external_reset_req: 'libvirt.NG reset' for host dev2 failed with rc 1
>>> May 23 13:17:28 dev1 stonith: [25118]: CRIT: external_reset_req: 'libvirt.NG reset' for host dev2 failed with rc 1
>>> May 23 13:17:28 dev1 crmd[24868]: notice: too_many_st_failures: Too many failures to fence dev2 (11), giving up
>>>
>>> [dev2 ~]$ egrep "CRIT|too_many_st_failures" /var/log/ha-log
>>> May 23 13:16:46 dev2 stonith: [7177]: CRIT: external_reset_req: 'libvirt.NG reset' for host dev1 failed with rc 1
>>> (snip)
>>> May 23 13:17:23 dev2 stonith: [7295]: CRIT: external_reset_req: 'libvirt.NG reset' for host dev1 failed with rc 1
>>> May 23 13:17:28 dev2 stonith: [7309]: CRIT: external_reset_req: 'libvirt.NG reset' for host dev1 failed with rc 1
>>> May 23 13:17:28 dev2 crmd[7107]: notice: too_many_st_failures: Too many failures to fence dev1 (11), giving up
>>>
>>>
>>> 3. Restore the network connection
>>> [dev1 ~]$ crm_mon
>>> Last updated: Thu May 23 13:24:23 2013
>>> Last change: Thu May 23 13:15:30 2013 via cibadmin on dev1
>>> Stack: corosync
>>> Current DC: dev2 (3232261523) - partition with quorum
>>> Version: 1.1.10-0.122.c3486a4.git.el6-c3486a4
>>> 2 Nodes configured, unknown expected votes
>>> 2 Resources configured.
>>>
>>>
>>> Online: [ dev1 dev2 ]
>>>
>>> f1 (stonith:external/libvirt.NG): Started dev2
>>> f2 (stonith:external/libvirt.NG): Started dev1
>>>
>>>
>>> The S_POLICY_ENGINE state persists even though the member's join seems to have succeeded.
>>>
>>> [13:47:54 root at dev1 ~]$ crmadmin -S dev2
>>> Status of crmd at dev2: S_POLICY_ENGINE (ok)
>>>
>>>
>>> Best Regards,
>>> Kazunori INOUE
>>> <keeping-S_POLICY_ENGINE.tar.bz2>
>>
>
>