[Pacemaker] Apparent delay in starting resource during migration

Pavel Levshin pavel at levshin.spb.ru
Tue Mar 22 18:53:39 UTC 2011


22.03.2011 17:35, Phil Armstrong:
> I have 4 primitive resources that I was in the process of migrating 
> from one node (lotus) to the other (aston). I was using the 'crm 
> resource migrate' command (4 times). Normally this process works just 
> fine, but in this instance, one of the resources (copan_shelf_07) got 
> 'stuck', in that it didn't get started on the target node for about 15 
> minutes after the others (04, 05, 06), and then it seems it was kicked 
> off by crm_timer_popped. I'm hoping someone can tell me why 
> copan_shelf_07 took so long to be started. I have the complete 
> hb_report if that is required. Here are the basics:
>

I may well be wrong about why exactly all this happened, but the core of 
your problem is here:

Mar 22 07:45:12 aston crmd: [10962]: WARN: status_from_rc: Action 17 
(copan_shelf_07_stop_0) on lotus failed (target: 0 vs. rc: 1): Error

A resource agent should never fail to stop a resource. In your case the 
action failed within 2 seconds, so it was not a timeout.

The lotus node should have been fenced after this if you had stonith 
enabled, or the resource should have been left unmanaged if not.
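
For reference, whether a failed stop escalates to fencing is controlled 
by the stonith-enabled cluster property, for example (illustrative only, 
I do not know your actual settings):

    crm configure property stonith-enabled=true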

But in your configuration (which you did not show), the resource was 
stopped two more times, and the third stop eventually succeeded. Do you 
have an on_fail='stop' parameter set for the stop action?
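
If so, in crm configure syntax it would look roughly like this (a guess, 
since you did not post your configuration; only shelf_name and the 
monitor interval are taken from your log):

    primitive copan_shelf_07 ocf:sgi:copan_ov_client \
        params shelf_name="c07" \
        op monitor interval="30s" \
        op stop interval="0" on-fail="stop"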

After the cluster recheck timer popped, the pengine realized that the 
service was stopped but should be started.
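
That matches the roughly 15-minute gap in your log (07:45 to 08:00): the 
default cluster-recheck-interval is 15 minutes, and it is set as a 
cluster property, for example:

    crm configure property cluster-recheck-interval="15min"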

So, basically, you need to fix your Resource Agent.
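
In particular, an OCF stop action is expected to report success when the 
resource is already stopped, and to return an error only when it truly 
cannot bring the resource down. A minimal sketch of that pattern 
(copan_is_running and copan_shutdown are made-up placeholders, not your 
agent's real functions):

    copan_ov_client_stop() {
        # "Already stopped" counts as success for a stop action.
        if ! copan_is_running; then
            return $OCF_SUCCESS
        fi

        # Try to shut the shelf down cleanly.
        copan_shutdown "$OCF_RESKEY_shelf_name"

        if ! copan_is_running; then
            return $OCF_SUCCESS
        fi

        # Only report failure if the resource really cannot be stopped;
        # the cluster will then escalate (fencing or unmanaged resource).
        return $OCF_ERR_GENERIC
    }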


--
Pavel Levshin

> Here are the copan_shelf_07 syslog messages:
>
> Mar 22 07:45:07 aston pengine: [10961]: notice: LogActions: Leave 
> resource copan_shelf_07       (Started lotus)
> Mar 22 07:45:08 aston cib: [10958]: info: log_data_element: cib:diff: 
> + <rsc_location id="cli-prefer-copan_shelf_07" rsc="copan_shelf_07" 
> __crm_diff_marker__="added:top" >
> Mar 22 07:45:08 aston cib: [10958]: info: log_data_element: cib:diff: 
> + <rule id="cli-prefer-rule-copan_shelf_07" score="INFINITY" 
> boolean-op="and" >
> Mar 22 07:45:08 aston cib: [10958]: info: log_data_element: cib:diff: 
> + <expression id="cli-prefer-expr-copan_shelf_07" attribute="#uname" 
> operation="eq" value="aston" type="string" />
> Mar 22 07:45:10 aston pengine: [10961]: notice: native_print: 
> copan_shelf_07    (ocf::sgi:copan_ov_client):     Started lotus
> Mar 22 07:45:10 aston pengine: [10961]: notice: RecurringOp:  Start 
> recurring monitor (30s) for copan_shelf_07 on aston
> Mar 22 07:45:10 aston pengine: [10961]: notice: LogActions: Move 
> resource copan_shelf_07        (Started lotus -> aston)
> Mar 22 07:45:10 aston pengine: [10961]: notice: native_print: 
> copan_shelf_07    (ocf::sgi:copan_ov_client):     Started lotus
> Mar 22 07:45:10 aston pengine: [10961]: notice: RecurringOp:  Start 
> recurring monitor (30s) for copan_shelf_07 on aston
> Mar 22 07:45:10 aston pengine: [10961]: notice: LogActions: Move 
> resource copan_shelf_07        (Started lotus -> aston)
> Mar 22 07:45:10 aston crmd: [10962]: info: te_rsc_command: Initiating 
> action 17: stop copan_shelf_07_stop_0 on lotus
> Mar 22 07:45:12 aston crmd: [10962]: WARN: status_from_rc: Action 17 
> (copan_shelf_07_stop_0) on lotus failed (target: 0 vs. rc: 1): Error
> Mar 22 07:45:12 aston crmd: [10962]: WARN: update_failcount: Updating 
> failcount for copan_shelf_07 on lotus after failed stop: rc=1 
> (update=INFINITY, time=1300797912)
> Mar 22 07:45:12 aston crmd: [10962]: info: abort_transition_graph: 
> match_graph_event:276 - Triggered transition abort (complete=0, 
> tag=lrm_rsc_op, id=copan_shelf_07_stop_0, 
> magic=0:1;17:509:0:cebe491c-0397-4ee8-bdd0-ae6ccd1e5c7d, cib=0.920.5) 
> : Event failed
> Mar 22 07:45:12 aston crmd: [10962]: info: match_graph_event: Action 
> copan_shelf_07_stop_0 (17) confirmed on lotus (rc=4)
> Mar 22 07:45:17 aston pengine: [10961]: WARN: unpack_rsc_op: 
> Processing failed op copan_shelf_07_stop_0 on lotus: unknown error (1)
> Mar 22 07:45:17 aston pengine: [10961]: ERROR: unpack_rsc_op: Making 
> sure copan_shelf_07 doesn't come up again
> Mar 22 07:45:17 aston pengine: [10961]: notice: native_print: 
> copan_shelf_07    (ocf::sgi:copan_ov_client):     Started lotus FAILED
> Mar 22 07:45:17 aston pengine: [10961]: info: get_failcount: 
> copan_shelf_07 has failed INFINITY times on lotus
> Mar 22 07:45:17 aston pengine: [10961]: WARN: common_apply_stickiness: 
> Forcing copan_shelf_07 away from lotus after 1000000 failures (max=1)
> Mar 22 07:45:17 aston pengine: [10961]: info: rsc_merge_weights: 
> cxfs-client-clone: Rolling back scores from copan_shelf_07
> Mar 22 07:45:17 aston pengine: [10961]: info: native_color: Resource 
> copan_shelf_07 cannot run anywhere
> Mar 22 07:45:17 aston pengine: [10961]: notice: LogActions: Stop 
> resource copan_shelf_07        (lotus)
> Mar 22 07:45:17 aston pengine: [10961]: WARN: unpack_rsc_op: 
> Processing failed op copan_shelf_07_stop_0 on lotus: unknown error (1)
> Mar 22 07:45:17 aston pengine: [10961]: ERROR: unpack_rsc_op: Making 
> sure copan_shelf_07 doesn't come up again
> Mar 22 07:45:17 aston pengine: [10961]: notice: native_print: 
> copan_shelf_07    (ocf::sgi:copan_ov_client):     Started lotus FAILED
> Mar 22 07:45:17 aston pengine: [10961]: info: get_failcount: 
> copan_shelf_07 has failed INFINITY times on lotus
> Mar 22 07:45:17 aston pengine: [10961]: WARN: common_apply_stickiness: 
> Forcing copan_shelf_07 away from lotus after 1000000 failures (max=1)
> Mar 22 07:45:17 aston pengine: [10961]: info: rsc_merge_weights: 
> cxfs-client-clone: Rolling back scores from copan_shelf_07
> Mar 22 07:45:17 aston pengine: [10961]: info: native_color: Resource 
> copan_shelf_07 cannot run anywhere
> Mar 22 07:45:17 aston pengine: [10961]: notice: LogActions: Stop 
> resource copan_shelf_07        (lotus)
> Mar 22 07:45:18 aston crmd: [10962]: info: te_rsc_command: Initiating 
> action 11: stop copan_shelf_07_stop_0 on lotus
> Mar 22 07:45:19 aston crmd: [10962]: WARN: status_from_rc: Action 11 
> (copan_shelf_07_stop_0) on lotus failed (target: 0 vs. rc: 1): Error
> Mar 22 07:45:19 aston crmd: [10962]: WARN: update_failcount: Updating 
> failcount for copan_shelf_07 on lotus after failed stop: rc=1 
> (update=INFINITY, time=1300797919)
> Mar 22 07:45:19 aston crmd: [10962]: info: abort_transition_graph: 
> match_graph_event:276 - Triggered transition abort (complete=0, 
> tag=lrm_rsc_op, id=copan_shelf_07_stop_0, 
> magic=0:1;11:511:0:cebe491c-0397-4ee8-bdd0-ae6ccd1e5c7d, cib=0.920.16) 
> : Event failed
> Mar 22 07:45:19 aston crmd: [10962]: info: match_graph_event: Action 
> copan_shelf_07_stop_0 (11) confirmed on lotus (rc=4)
> Mar 22 07:45:22 aston pengine: [10961]: WARN: unpack_rsc_op: 
> Processing failed op copan_shelf_07_stop_0 on lotus: unknown error (1)
> Mar 22 07:45:22 aston pengine: [10961]: ERROR: unpack_rsc_op: Making 
> sure copan_shelf_07 doesn't come up again
> Mar 22 07:45:22 aston pengine: [10961]: notice: native_print: 
> copan_shelf_07    (ocf::sgi:copan_ov_client):     Started lotus FAILED
> Mar 22 07:45:22 aston pengine: [10961]: info: get_failcount: 
> copan_shelf_07 has failed INFINITY times on lotus
> Mar 22 07:45:22 aston pengine: [10961]: WARN: common_apply_stickiness: 
> Forcing copan_shelf_07 away from lotus after 1000000 failures (max=1)
> Mar 22 07:45:22 aston pengine: [10961]: info: rsc_merge_weights: 
> cxfs-client-clone: Rolling back scores from copan_shelf_07
> Mar 22 07:45:22 aston pengine: [10961]: info: native_color: Resource 
> copan_shelf_07 cannot run anywhere
> Mar 22 07:45:22 aston pengine: [10961]: notice: LogActions: Stop 
> resource copan_shelf_07        (lotus)
> Mar 22 07:45:23 aston crmd: [10962]: info: te_rsc_command: Initiating 
> action 14: stop copan_shelf_07_stop_0 on lotus
> Mar 22 07:45:24 aston crmd: [10962]: info: match_graph_event: Action 
> copan_shelf_07_stop_0 (14) confirmed on lotus (rc=0)
> Mar 22 08:00:25 aston pengine: [10961]: notice: native_print: 
> copan_shelf_07    (ocf::sgi:copan_ov_client):     Stopped
> Mar 22 08:00:25 aston pengine: [10961]: info: get_failcount: 
> copan_shelf_07 has failed INFINITY times on lotus
> Mar 22 08:00:25 aston pengine: [10961]: WARN: common_apply_stickiness: 
> Forcing copan_shelf_07 away from lotus after 1000000 failures (max=1)
> Mar 22 08:00:25 aston pengine: [10961]: notice: RecurringOp:  Start 
> recurring monitor (30s) for copan_shelf_07 on aston
> Mar 22 08:00:25 aston pengine: [10961]: notice: LogActions: Start 
> copan_shelf_07        (aston)
> Mar 22 08:00:25 aston crmd: [10962]: info: te_rsc_command: Initiating 
> action 16: start copan_shelf_07_start_0 on aston (local)
> Mar 22 08:00:25 aston crmd: [10962]: info: do_lrm_rsc_op: Performing 
> key=16:513:0:cebe491c-0397-4ee8-bdd0-ae6ccd1e5c7d 
> op=copan_shelf_07_start_0 )
> Mar 22 08:00:25 aston lrmd: [10959]: debug: on_msg_perform_op:2359: 
> copying parameters for rsc copan_shelf_07
> Mar 22 08:00:25 aston lrmd: [10959]: debug: on_msg_perform_op: add an 
> operation operation start[142] on ocf::copan_ov_client::copan_shelf_07 
> for client 10962, its parameters: CRM_meta_name=[start] 
> crm_feature_set=[3.0.2] CRM_meta_on_fail=[restart] 
> CRM_meta_timeout=[60000] shelf_name=[c07]  to the operation list.
> Mar 22 08:00:25 aston lrmd: [10959]: info: rsc:copan_shelf_07:142: start
>
>
> Thanks,
>
> Phil
>



