[Pacemaker] About behavior in "Action Lost".

Wed Sep 22 04:52:00 EDT 2010

On Tue, Sep 21, 2010 at 8:59 AM,  <renayama19661014 at ybb.ne.jp> wrote:
> Hi,
>
> The node was under very high load, and we checked how Pacemaker's monitor operation behaved.
> After the monitor operation failed, an "Action Lost" occurred during the stop operation.
>
> Sep  8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]:
> In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
> Sep  8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 -
> Triggered transition abort (complete=0) : Action lost
>
>
> We think the stop operation did not go well because of the load on the node.
> But can't the nodes execute stonith in this case?

A long time ago in a galaxy far away, some messaging layers used to
lose quite a few actions, including stops.
At about the same time, we decided that fencing because a stop action
was lost wasn't a good idea.

The rationale was that if the operation eventually completed, it would
end up in the CIB anyway.
And even if it didn't, the PE would keep retrying the operation until
the whole node fell over, at which point it would get shot anyway.
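
To make that rationale concrete, here is a rough sketch of the decision
as described above. It is a toy model in Python, not Pacemaker code: the
names Decision, LostAction and on_action_lost are made up for
illustration, and the real crmd/PE logic is considerably more involved.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    DONE = "done"    # the result showed up in the CIB after all
    RETRY = "retry"  # abort the transition and schedule the action again
    FENCE = "fence"  # the node itself has failed, so it gets shot


@dataclass
class LostAction:
    # Hypothetical record of an action whose result never arrived.
    resource: str
    task: str
    node: str


def on_action_lost(action: LostAction, recorded_in_cib: bool,
                   node_alive: bool) -> Decision:
    """Toy model of the policy described above (an assumption, not real code).

    A lost action, even a lost stop, does not trigger fencing by itself.
    """
    if recorded_in_cib:
        # The operation completed late and its result reached the CIB.
        return Decision.DONE
    if not node_alive:
        # The node fell over; fencing happens for that reason.
        return Decision.FENCE
    # Otherwise keep trying the operation.
    return Decision.RETRY


lost = LostAction("prmApPostgreSQLDB1", "stop", "cgl49")
print(on_action_lost(lost, recorded_in_cib=False, node_alive=True).value)
```

Under this model, the lost stop in the log above would simply be retried
for as long as cgl49 stays up.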

Now, having said that, things have improved since then and perhaps,
in the interest of speeding up recovery in these situations, it is time
to stop treating stop operations differently.
Would you agree?