[Pacemaker] Time out issue while stopping resource in pacemaker
Lax
lkota at cisco.com
Fri Oct 10 08:03:48 UTC 2014
Andrew Beekhof <andrew at ...> writes:
>
>
> On 10 Oct 2014, at 12:12 pm, Lax <lkota at ...> wrote:
>
> > Hi All,
> >
> > I ran into a timeout issue while failing over from the master to the peer
> > server. I have a 2-node setup with 2 resources. Failover had been working
> > all along; this is the first time I have seen this issue.
> >
> > It fails with the following error: 'error: process_lrm_event: LRM operation
> > resourceB_stop_0 (40) Timed Out (timeout=20000ms)'.
> >
>
> Have you considered making the timeout longer?
>
> >
> >
> > Here is the complete log snippet from Pacemaker; I appreciate your help on
> > this.
> >
> >
> > Oct 9 14:57:38 server1 cib[368]: notice: cib:diff: Diff: +++ 0.3.1
> > 4e9bfa03cf2fef61843c18e127044d81
> > Oct 9 14:57:38 server1 cib[368]: notice: cib:diff: -- <cib
> > admin_epoch="0" epoch="2" num_updates="8" />
> > Oct 9 14:57:38 server1 crmd[373]: notice: do_state_transition: State
> > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> > origin=abort_transition_graph ]
> > Oct 9 14:57:38 server1 cib[368]: notice: cib:diff: ++
> > <instance_attributes id="nodes-server1" >
> > Oct 9 14:57:38 server1 cib[368]: notice: cib:diff: ++ <nvpair
> > id="nodes-server1-standby" name="standby" value="true" />
> > Oct 9 14:57:38 server1 cib[368]: notice: cib:diff: ++
> > </instance_attributes>
> > Oct 9 14:57:38 server1 pengine[372]: notice: unpack_config: On loss of
> > CCM Quorum: Ignore
> > Oct 9 14:57:38 server1 pengine[372]: notice: LogActions: Move
> > ClusterIP#011(Started server1 -> 172.28.0.64)
> > Oct 9 14:57:38 server1 pengine[372]: notice: LogActions: Move
> > resourceB#011(Started server1 -> 172.28.0.64)
> > Oct 9 14:57:38 server1 pengine[372]: notice: process_pe_message:
> > Calculated Transition 11: /var/lib/pacemaker/pengine/pe-input-1710.bz2
> > Oct 9 14:57:58 server1 lrmd[370]: warning: child_timeout_callback:
> > resourceB_stop_0 process (PID 17327) timed out
> > Oct 9 14:57:58 server1 lrmd[370]: warning: operation_finished:
> > resourceB_stop_0:17327 - timed out after 20000ms
> > Oct 9 14:57:58 server1 lrmd[370]: notice: operation_finished:
> > resourceB_stop_0:17327 [ % Total % Received % Xferd Average Speed
> > Time Time Time Current ]
> > Oct 9 14:57:58 server1 lrmd[370]: notice: operation_finished:
> > resourceB_stop_0:17327 [ Dload Upload
> > Total Spent Left Speed ]
> > Oct 9 14:57:58 server1 lrmd[370]: notice: operation_finished:
> > resourceB_stop_0:17327 [ #015 0 0 0 0 0 0 0 0
> > --:--:-- --:--:-- --:--:-- 0#015 0 0 0 0 0 0 0
> > 0 --:--:-- 0:00:01 --:--:-- 0#015 0 0 0 0 0 0
> > 0 0 --:--:-- 0:00:02 --:--:-- 0#015 0 0 0 0 0
> > 0 0 0 --:--:-- 0:00:03 --:--:-- 0#015 0 0 0 0
> > 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0#015 0 0 0
> > 0 0 0 0 0 --:--:-- 0:00:05 -
> > Oct 9 14:57:58 server1 crmd[373]: error: process_lrm_event: LRM
> > operation resourceB_stop_0 (40) Timed Out (timeout=20000ms)
> > Oct 9 14:57:58 server1 crmd[373]: warning: status_from_rc: Action 10
> > (resourceB_stop_0) on server1 failed (target: 0 vs. rc: 1): Error
> > Oct 9 14:57:58 server1 crmd[373]: warning: update_failcount: Updating
> > failcount for resourceB on server1 after failed stop: rc=1 (update=INFINITY,
> > time=1412891878)
> > Oct 9 14:57:58 server1 attrd[371]: notice: attrd_trigger_update: Sending
> > flush op to all hosts for: fail-count-resourceB (INFINITY)
> > Oct 9 14:57:58 server1 crmd[373]: warning: update_failcount: Updating
> > failcount for resourceB on server1 after failed stop: rc=1 (update=INFINITY,
> > time=1412891878)
> > Oct 9 14:57:58 server1 crmd[373]: notice: run_graph: Transition 11
> > (Complete=2, Pending=0, Fired=0, Skipped=9, Incomplete=0,
> > Source=/var/lib/pacemaker/pengine/pe-input-1710.bz2): Stopped
> > Oct 9 14:57:58 server1 attrd[371]: notice: attrd_perform_update: Sent
> > update 11: fail-count-resourceB=INFINITY
> >
> >
> > Thanks
> > Lax
> >
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at ...
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at ...
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
Thanks for getting back, Andrew.
I have only given a timeout for monitor, not for stop, in the resource
definition. Do you mean I should increase the timeout for stop?
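If I understand correctly, that would look something like this with the crm
shell (the agent name lsb:resourceB and the 60s value below are just my
guesses; I would pick something longer than the stop action really needs):

   crm configure edit resourceB
   # then give the primitive an explicit stop timeout, e.g.
   primitive resourceB lsb:resourceB \
           op monitor interval=10s timeout=20s \
           op stop interval=0 timeout=60s

or alternatively raise the default for every operation that has no explicit
timeout of its own:

   crm configure op_defaults timeout=60s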
Also, when I try to force-stop the pacemaker service, it keeps saying 'Waiting
for shutdown of managed resources' and does not stop.
The Pacemaker log says the stop failed because of an unknown error:
Oct 10 00:36:25 server1 crmd[373]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 10 00:51:25 server1 crmd[373]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
origin=crm_timer_popped ]
Oct 10 00:51:25 server1 pengine[372]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Oct 10 00:51:25 server1 pengine[372]: warning: unpack_rsc_op: Processing
failed op stop for resourceB on server1.cisco.com: unknown error (1)
Oct 10 00:51:25 server1 pengine[372]: warning: common_apply_stickiness:
Forcing resourceB away from server1.cisco.com after 1000000 failures
(max=1000000)
Oct 10 00:51:25 server1 pengine[372]: notice: LogActions: Stop
ClusterIP#011(Started 172.28.0.64)
Oct 10 00:51:25 server1 pengine[372]: notice: process_pe_message:
Calculated Transition 56: (null)
Oct 10 00:51:25 server1 crmd[373]: notice: run_graph: Transition 56
(Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=unknown):
Complete
Oct 10 00:51:25 server1 crmd[373]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 10 00:53:03 server1 attrd[371]: notice: attrd_trigger_update: Sending
flush op to all hosts for: standby (true)
Oct 10 00:53:03 server1 attrd[371]: notice: attrd_perform_update: Sent
update 21: standby=true
Oct 10 00:53:03 server1 crmd[373]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Oct 10 00:53:03 server1 pengine[372]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Oct 10 00:53:03 server1 pengine[372]: warning: unpack_rsc_op: Processing
failed op stop for resourceB on server1.cisco.com: unknown error (1)
Oct 10 00:53:03 server1 pengine[372]: warning: common_apply_stickiness:
Forcing resourceB away from server1.cisco.com after 1000000 failures
(max=1000000)
Oct 10 00:53:03 server1 pengine[372]: notice: LogActions: Stop
ClusterIP#011(Started 172.28.0.64)
Oct 10 00:53:03 server1 pengine[372]: notice: process_pe_message:
Calculated Transition 57: /var/lib/pacemaker/pengine/pe-input-1717.bz2
Oct 10 00:53:03 server1 crmd[373]: notice: run_graph: Transition 57
(Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-1717.bz2): Complete
Oct 10 00:53:03 server1 crmd[373]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
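
I am also guessing that, even with a longer stop timeout, I will have to clear
the failed stop before resourceB is allowed back on server1, presumably with
something like (again via the crm shell):

   crm resource cleanup resourceB server1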
Thanks
Lax