[Pacemaker] About behavior in "Action Lost".

Wed Sep 22 09:18:24 UTC 2010

Hi Andrew,

Thank you for your comment.

> A long time ago in a galaxy far away, some messaging layers used to
> lose quite a few actions, including stops.
> About the same time, we decided that fencing because a stop action was
> lost wasn't a good idea.
>
> The rationale was that if the operation eventually completed, it would
> end up in the CIB anyway.
> And even if it didn't, the PE would continue to try the operation
> again until the whole node fell over at which point it would get shot
> anyway.

Sorry, I did not know that this had already been discussed in the past.

> Now, having said that, things have improved since then and perhaps,
> in the interest of speeding up recovery in these situations, it is time
> to stop treating stop operations differently.
> Would you agree?

Does that mean you will change the behavior so that STONITH is carried out when a stop operation ends in "Action Lost"?
If my understanding is right, I agree too.

if(timer->action->type != action_type_rsc) {
    send_update = FALSE;
} else if(safe_str_eq(task, "cancel")) {
    /* we dont need to update the CIB with these */
    send_update = FALSE;
}
---> delete "else if(safe_str_eq(task, "stop")){..}" ?

if(send_update) {
    /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
    cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
}
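
To make the question concrete, here is a minimal standalone sketch of how I understand the decision would look once the "stop" branch is removed. It is not the crmd code: sketch_action_type and sketch_should_send_update are hypothetical names invented for this illustration, and strcmp only stands in for safe_str_eq.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the crmd action types; not the real definitions. */
enum sketch_action_type {
    SKETCH_ACTION_RSC,     /* a resource operation (start, stop, monitor, ...) */
    SKETCH_ACTION_PSEUDO,  /* an action that never runs on a node */
    SKETCH_ACTION_CRM      /* a cluster-level action */
};

/* Returns true when a lost (timed-out) action should be written to the CIB
 * as a timeout. Assumption: "stop" is no longer treated specially. */
bool sketch_should_send_update(enum sketch_action_type type, const char *task)
{
    if (type != SKETCH_ACTION_RSC) {
        return false;  /* only resource actions are recorded */
    }
    if (task != NULL && strcmp(task, "cancel") == 0) {
        return false;  /* cancels do not need a CIB update */
    }
    /* Every other resource operation, now including "stop", is recorded. */
    return true;
}
```

With this logic a lost stop would be reported as a timed-out operation like any other, and I assume the PE would then handle it as a failed stop.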

Best Regards,
Hideo Yamauchi.

--- Andrew Beekhof <andrew at beekhof.net> wrote:

> On Tue, Sep 21, 2010 at 8:59 AM,  <renayama19661014 at ybb.ne.jp> wrote:
> > Hi,
> >
> > The node was under very high load, and we checked the monitor behavior of Pacemaker.
> > "Action Lost" occurred in the stop operation that followed a monitor error.
> >
> > Sep  8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
> > Sep  8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
> >
> >
> > We think the stop operation did not complete because of the load on the node.
> > But can the nodes not execute STONITH in this case?
> 
> A long time ago in a galaxy far away, some messaging layers used to
> lose quite a few actions, including stops.
> About the same time, we decided that fencing because a stop action was
> lost wasn't a good idea.
> 
> The rationale was that if the operation eventually completed, it would
> end up in the CIB anyway.
> And even if it didn't, the PE would continue to try the operation
> again until the whole node fell over at which point it would get shot
> anyway.
> 
> Now, having said that, things have improved since then and perhaps,
> in the interest of speeding up recovery in these situations, it is time
> to stop treating stop operations differently.
> Would you agree?
> 