[Pacemaker] cluster got stuck on stopping resources

Mon Jun 7 12:36:47 UTC 2010

Hi,

On Mon, Jun 07, 2010 at 12:13:41PM +0200, Andreas Kurz wrote:
> Hi all,
> 
> I observed strange behaviour when trying to stop two resources with the
> latest pacemaker:
> 
> I updated two resources (ping) and changed some constraints. One of the 
> changed resources is mentioned in the logs with "strange" lrmd messages:
> 
> ...
> Jun 07 10:16:58 emahqwienfw1b crmd: [31354]: ERROR: do_lrm_rsc_op: Operation monitor on res_ping_ABC failed: -1
> Jun 07 10:16:58 emahqwienfw1b lrmd: [31351]: notice: on_msg_perform_op: resource res_ping_ABC is frozen, no ops can run.

This happens when the resource is being deleted, or its operations
are being flushed, while an operation is still running on it. lrmd
waits for that operation to finish, and until it does, no new
operations can run on the resource.
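
To make the "frozen" state concrete, here is a toy model in Python.
It is only a sketch of the behaviour described above, not lrmd's
actual code: the class, its methods, and the use of -1 as the
rejection code (chosen to echo the crmd log line) are all
assumptions, and operation queuing is ignored.

```python
# Toy model only: NOT lrmd's implementation. All names are hypothetical.
class Resource:
    def __init__(self, name):
        self.name = name
        self.running_op = None  # operation currently executing, if any
        self.frozen = False     # True while a delete/flush waits on running_op

    def start_op(self, op):
        """Try to run an operation; return 0 on success, -1 if frozen."""
        if self.frozen:
            # Mirrors "resource ... is frozen, no ops can run."
            return -1
        self.running_op = op
        return 0

    def flush_or_delete(self):
        """Request a flush/delete; freeze if an operation is still in flight."""
        if self.running_op is not None:
            self.frozen = True

    def op_finished(self):
        """The in-flight operation completed, so the resource thaws."""
        self.running_op = None
        self.frozen = False
```

In this model, a monitor requested between flush_or_delete() and
op_finished() is rejected, which is the situation the crmd ERROR
above appears to reflect.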

> Jun 07 10:16:58 emahqwienfw1b lrmd: [31351]: debug: RA output [dummy status to fool heartbeat
> ] didn't match any pattern
> Jun 07 10:16:58 emahqwienfw1b crmd: [31354]: WARN: do_log: FSA: Input I_FAIL from do_lrm_rsc_op() received in state S_TRANSITION_ENGINE
> Jun 07 10:16:58 emahqwienfw1b crmd: [31354]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_FAIL cause=C_FSA_INTERNAL origin=do_lrm_rsc_op ]
> ....
> 
> Then I tried to stop two other resources (part of a group) and nothing
> happened. One of these resources is a dependency of res_ping_ABC, which is
> reported as "frozen" by the lrmd.
> 
> Running ptest -L shows that pengine knows what to do (stop the two resources 
> and all dependencies).

Jun 07 10:16:57 emahqwienfw1b pengine: [31711]: notice: native_print: res_ping_ABC      (ocf::pacemaker:ping):  Started emahqwienfw1b
Jun 07 10:16:57 emahqwienfw1b pengine: [31711]: WARN: check_action_definition: Parameters to res_ping_ABC_start_0 on emahqwienfw1b changed: recorded 3e6589d0db01fb229fd441bb0d1d50f3 vs. 584dbc4ad2ec43013bd447445557c554 (all:3.0.1) 0:0;22:344:0:8e44c059-ca7d-41ce-b81a-793882819347
Jun 07 10:16:57 emahqwienfw1b pengine: [31711]: notice: RecurringOp:  Start recurring monitor (30s) for res_ping_ABC on emahqwienfw1b
Jun 07 10:16:57 emahqwienfw1b pengine: [31711]: notice: LogActions: Restart resource res_ping_ABC       (Started emahqwienfw1b)
Jun 07 10:16:58 emahqwienfw1b crmd: [31354]: info: te_rsc_command: Initiating action 42: monitor res_ping_ABC_monitor_0 on emahqwienfw1a

The PE decides to restart the resource, but then a probe
(res_ping_ABC_monitor_0) is run even though the resource's state
is Started. That operation fails, but it should be retried.
Obviously we need to improve the interaction between lrmd and
crmd. Please file a bug in bugzilla.
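
For the bug report it may help to list which resources lrmd
reported as frozen. A rough Python sketch follows; the function
name is made up for illustration, the pattern is modelled only on
the lrmd message quoted in this thread, and it assumes each log
entry sits on a single line (unlike the wrapped quotes above).

```python
import re
from collections import Counter

# Pattern modelled on the lrmd notice quoted in this thread (an assumption:
# other lrmd versions may word the message differently).
FROZEN_RE = re.compile(
    r"lrmd: \[\d+\]: notice: on_msg_perform_op: resource (\S+) is frozen"
)

def frozen_resources(lines):
    """Count 'is frozen' notices per resource in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = FROZEN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It can be fed an open log file, e.g. frozen_resources(open(path)),
where path is wherever your syslog setup writes the cluster log.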

Thanks,

Dejan

> Any ideas? hb_report is attached. I left the cluster in this state, so if
> there is anything else I should provide for debugging, please tell me.
> 
> Regards,
> Andreas