[Pacemaker] master/slave resource does not stop (tries to start repeatedly)
Kazunori INOUE
inouekazu at intellilink.co.jp
Fri Sep 14 09:26:52 UTC 2012
Hi Andrew,
I confirmed that this problem has been resolved by:
- ClusterLabs/pacemaker : 7a9bf21cfc
However, I found two new problems.

(1) The resource is shown as an "orphan" in crm_mon's migration summary.
# crm_mon -rf1
:
Full list of resources:

 Master/Slave Set: msAP [prmAP]
     Stopped: [ prmAP:0 prmAP:1 ]

Migration summary:
* Node vm5:
   prmAP: orphan
* Node vm6:
   prmAP: orphan

Failed actions:
    prmAP_monitor_10000 (node=vm5, call=15, rc=1, status=complete): unknown error
    prmAP_monitor_10000 (node=vm6, call=21, rc=1, status=complete): unknown error
(2) The failure status cannot be cleared: the CIB is not updated even when
I execute 'crm_resource -C'.
# crm_resource -C -r msAP
Cleaning up prmAP:0 on vm5
Cleaning up prmAP:0 on vm6
Cleaning up prmAP:1 on vm5
Cleaning up prmAP:1 on vm6
Waiting for 1 replies from the CRMd. OK
The fail-count attributes and the failed operation entries remain in the
status section:

# cibadmin -Q -o status
<status>
  <node_state id="2439358656" uname="vm5" in_ccm="true" crmd="online" join="member" expected="member" crm-debug-origin="do_update_resource">
    <transient_attributes id="2439358656">
      <instance_attributes id="status-2439358656">
        <nvpair id="status-2439358656-probe_complete" name="probe_complete" value="true"/>
        <nvpair id="status-2439358656-fail-count-prmAP" name="fail-count-prmAP" value="1"/>
        <nvpair id="status-2439358656-last-failure-prmAP" name="last-failure-prmAP" value="1347598951"/>
      </instance_attributes>
    </transient_attributes>
    <lrm id="2439358656">
      <lrm_resources>
        <lrm_resource id="prmAP" type="Stateful" class="ocf" provider="pacemaker">
          <lrm_rsc_op id="prmAP_last_0" operation_key="prmAP_stop_0" operation="stop" crm-debug-origin="do_update_resource" crm_feature_set="3.0.6" transition-key="1:5:0:2935833e-7e6f-4931-9da8-f13f7de7aafc" transition-magic="0:0;1:5:0:2935833e-7e6f-4931-9da8-f13f7de7aafc" call-id="24" rc-code="0" op-status="0" interval="0" last-run="1347598936" last-rc-change="0" exec-time="205" queue-time="0" op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
          <lrm_rsc_op id="prmAP_monitor_10000" operation_key="prmAP_monitor_10000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.6" transition-key="10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc" transition-magic="0:8;10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc" call-id="15" rc-code="8" op-status="0" interval="10000" last-rc-change="1347598916" exec-time="40" queue-time="0" op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
          <lrm_rsc_op id="prmAP_last_failure_0" operation_key="prmAP_monitor_10000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.6" transition-key="10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc" transition-magic="0:1;10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc" call-id="15" rc-code="1" op-status="0" interval="10000" last-rc-change="1347598936" exec-time="0" queue-time="0" op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
        </lrm_resource>
      </lrm_resources>
    </lrm>
  </node_state>
  <node_state id="2456135872" uname="vm6" in_ccm="true" crmd="online" join="member" expected="member" crm-debug-origin="do_update_resource">
    <transient_attributes id="2456135872">
      <instance_attributes id="status-2456135872">
        <nvpair id="status-2456135872-probe_complete" name="probe_complete" value="true"/>
        <nvpair id="status-2456135872-fail-count-prmAP" name="fail-count-prmAP" value="1"/>
        <nvpair id="status-2456135872-last-failure-prmAP" name="last-failure-prmAP" value="1347598962"/>
      </instance_attributes>
    </transient_attributes>
    <lrm id="2456135872">
      <lrm_resources>
        <lrm_resource id="prmAP" type="Stateful" class="ocf" provider="pacemaker">
          <lrm_rsc_op id="prmAP_last_0" operation_key="prmAP_stop_0" operation="stop" crm-debug-origin="do_update_resource" crm_feature_set="3.0.6" transition-key="1:9:0:2935833e-7e6f-4931-9da8-f13f7de7aafc" transition-magic="0:0;1:9:0:2935833e-7e6f-4931-9da8-f13f7de7aafc" call-id="30" rc-code="0" op-status="0" interval="0" last-run="1347598962" last-rc-change="0" exec-time="230" queue-time="0" op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
          <lrm_rsc_op id="prmAP_monitor_10000" operation_key="prmAP_monitor_10000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.6" transition-key="9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc" transition-magic="0:8;9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc" call-id="21" rc-code="8" op-status="0" interval="10000" last-rc-change="1347598952" exec-time="43" queue-time="0" op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
          <lrm_rsc_op id="prmAP_last_failure_0" operation_key="prmAP_monitor_10000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.6" transition-key="9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc" transition-magic="0:1;9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc" call-id="21" rc-code="1" op-status="0" interval="10000" last-rc-change="1347598962" exec-time="0" queue-time="0" op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
        </lrm_resource>
      </lrm_resources>
    </lrm>
  </node_state>
</status>
I wrote a patch for crm_mon and crm_resource.
(I have not checked whether other commands have a similar problem.)
https://github.com/inouekazu/pacemaker/commit/36cf730751080de197438cfaa34163150059d89c
- When searching for the data of a resource_s structure, the resource id
  with the instance number attached (e.g. ":0") is used as the search key
  when needed.
- The resource id with the instance number removed is used for the update
  request to the CIB (both conversions are sketched below).
Is the approach of this patch correct?
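
To illustrate the idea, here is a minimal, self-contained C sketch of the
two conversions. This is illustration only: the helper names
(strip_instance, find_with_instance) and the plain string table standing
in for the resource list are hypothetical, not Pacemaker's actual data
structures or the code in the patch.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not the actual Pacemaker code): return a copy of a
 * resource id with a trailing clone instance number removed, for example
 * "prmAP:0" -> "prmAP".  This is the form a CIB update/cleanup request
 * would use, since the status section stores history under the base id. */
static char *strip_instance(const char *rsc_id)
{
    const char *colon = strrchr(rsc_id, ':');

    if (colon != NULL && colon[1] != '\0'
        && strspn(colon + 1, "0123456789") == strlen(colon + 1)) {
        size_t len = (size_t)(colon - rsc_id);
        char *base = malloc(len + 1);

        memcpy(base, rsc_id, len);
        base[len] = '\0';
        return base;
    }
    return strdup(rsc_id);  /* no ":N" suffix to strip */
}

/* Hypothetical lookup: if the plain id ("prmAP") is not in the table of
 * known resources, retry with instance numbers appended ("prmAP:0",
 * "prmAP:1", ...), so that a fail-count entry can be matched to a clone
 * child instead of being reported as an orphan.  `known`/`n_known` stand
 * in for the real resource list; `max_inst` bounds the instances tried. */
static int find_with_instance(const char **known, int n_known,
                              const char *rsc_id, int max_inst)
{
    char buf[256];

    for (int i = 0; i < n_known; i++) {
        if (strcmp(known[i], rsc_id) == 0) {
            return i;                       /* exact match */
        }
    }
    for (int inst = 0; inst < max_inst; inst++) {
        snprintf(buf, sizeof(buf), "%s:%d", rsc_id, inst);
        for (int i = 0; i < n_known; i++) {
            if (strcmp(known[i], buf) == 0) {
                return i;                   /* matched "id:N" */
            }
        }
    }
    return -1;                              /* truly unknown: an orphan */
}

int main(void)
{
    const char *known[] = { "prmAP:0", "prmAP:1" };
    char *base = strip_instance("prmAP:0");

    printf("id for CIB update : %s\n", base);   /* prints "prmAP" */
    printf("lookup of \"prmAP\": index %d\n",
           find_with_instance(known, 2, "prmAP", 2));  /* prints 0 */

    free(base);
    return 0;
}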
Best Regards,
Kazunori INOUE
(12.09.11 20:17), Andrew Beekhof wrote:
> On Tue, Sep 11, 2012 at 9:13 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>> Yikes!
>>
>> Fixed in:
>> https://github.com/beekhof/pacemaker/commit/7d098ce
>
> That link should have been:
>
> https://github.com/beekhof/pacemaker/commit/c1f409baaaf388d03f6124ec0d9da440445c4a23
>
>>
>> On Fri, Sep 7, 2012 at 7:49 PM, Kazunori INOUE
>> <inouekazu at intellilink.co.jp> wrote:
>>> Hi,
>>>
>>> I am using Pacemaker-1.1.
>>> - ClusterLabs/pacemaker : 872a2f1af1 (Sep 07)
>>>
>>> Even though the monitor of the master resource fails and there is no
>>> node on which the master/slave resource can run, the master/slave
>>> resource does not stop.
>>>
>>> [test case]
>>> 1. use the Stateful RA, with on-fail="restart" set on the monitor
>>>    operation and migration-threshold set to 1.
>>>
>>> # crm_mon
>>>
>>> Online: [ vm5 vm6 ]
>>>
>>> Master/Slave Set: msAP [prmAP]
>>> Masters: [ vm5 ]
>>> Slaves: [ vm6 ]
>>>
>>> 2. make the master resource on vm5 fail; it moves to vm6.
>>>
>>> Online: [ vm5 vm6 ]
>>>
>>> Master/Slave Set: msAP [prmAP]
>>> Masters: [ vm6 ]
>>> Stopped: [ prmAP:1 ]
>>>
>>> Failed actions:
>>> prmAP_monitor_10000 (node=vm5, call=14, rc=1, status=complete): unknown error
>>>
>>> 3. make the master resource on vm6 fail as well; the master/slave
>>>    resource then tries to start repeatedly, alternating between
>>>    states (a) and (b) below.
>>>
>>> (a)
>>> Online: [ vm5 vm6 ]
>>>
>>>
>>> Failed actions:
>>> prmAP_monitor_10000 (node=vm5, call=14, rc=1, status=complete): unknown error
>>> prmAP_monitor_10000 (node=vm6, call=20, rc=1, status=complete): unknown error
>>>
>>> (b)
>>> Online: [ vm5 vm6 ]
>>>
>>> Master/Slave Set: msAP [prmAP]
>>> Slaves: [ vm5 vm6 ]
>>>
>>> Failed actions:
>>> prmAP_monitor_10000 (node=vm5, call=14, rc=1, status=complete): unknown error
>>> prmAP_monitor_10000 (node=vm6, call=20, rc=1, status=complete): unknown error
>>>
>>> # grep -e run_graph: -e common_apply_stickiness: -e LogActions: ha-log
>>>
>>>>> after the master resource on vm5 failed
>>> Sep 7 16:06:03 vm5 pengine[23199]: notice: LogActions: Recover prmAP:0 (Master vm5)
>>> Sep 7 16:06:03 vm5 crmd[23200]: notice: run_graph: Transition 4 (Complete=3, Pending=0, Fired=0, Skipped=8, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-4.bz2): Stopped
>>> Sep 7 16:06:03 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>> Sep 7 16:06:03 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>> Sep 7 16:06:03 vm5 pengine[23199]: notice: LogActions: Stop prmAP:0 (vm5)
>>> Sep 7 16:06:03 vm5 pengine[23199]: notice: LogActions: Promote prmAP:1 (Slave -> Master vm6)
>>> Sep 7 16:06:03 vm5 crmd[23200]: notice: run_graph: Transition 5 (Complete=4, Pending=0, Fired=0, Skipped=4, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-5.bz2): Stopped
>>> Sep 7 16:06:03 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>> Sep 7 16:06:03 vm5 pengine[23199]: notice: LogActions: Promote prmAP:0 (Slave -> Master vm6)
>>> Sep 7 16:06:03 vm5 crmd[23200]: notice: run_graph: Transition 6 (Complete=3, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Stopped
>>> Sep 7 16:06:03 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>> Sep 7 16:06:03 vm5 crmd[23200]: notice: run_graph: Transition 7 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-7.bz2): Complete
>>>
>>>>> after the master resource on vm6 failed
>>> Sep 7 16:06:33 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>> Sep 7 16:06:33 vm5 pengine[23199]: notice: LogActions: Recover prmAP:0 (Master vm6)
>>> Sep 7 16:06:34 vm5 crmd[23200]: notice: run_graph: Transition 8 (Complete=3, Pending=0, Fired=0, Skipped=8, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-8.bz2): Stopped
>>> Sep 7 16:06:34 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>> Sep 7 16:06:34 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm6 after 1 failures (max=1)
>>> Sep 7 16:06:34 vm5 pengine[23199]: notice: LogActions: Stop prmAP:0 (vm6)
>>> Sep 7 16:06:34 vm5 crmd[23200]: notice: run_graph: Transition 9 (Complete=3, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-9.bz2): Stopped
>>> Sep 7 16:06:34 vm5 pengine[23199]: notice: LogActions: Start prmAP:0 (vm5)
>>> Sep 7 16:06:34 vm5 pengine[23199]: notice: LogActions: Promote prmAP:0 (Stopped -> Master vm5)
>>> Sep 7 16:06:34 vm5 pengine[23199]: notice: LogActions: Start prmAP:1 (vm6)
>>> Sep 7 16:06:35 vm5 crmd[23200]: notice: run_graph: Transition 10 (Complete=4, Pending=0, Fired=0, Skipped=4, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): Stopped
>>> Sep 7 16:06:35 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>> Sep 7 16:06:35 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>> Sep 7 16:06:35 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm6 after 1 failures (max=1)
>>> Sep 7 16:06:35 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm6 after 1 failures (max=1)
>>> Sep 7 16:06:35 vm5 pengine[23199]: notice: LogActions: Stop prmAP:0 (vm5)
>>> Sep 7 16:06:35 vm5 pengine[23199]: notice: LogActions: Stop prmAP:1 (vm6)
>>> Sep 7 16:06:35 vm5 crmd[23200]: notice: run_graph: Transition 11 (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-11.bz2): Stopped
>>> Sep 7 16:06:35 vm5 pengine[23199]: notice: LogActions: Start prmAP:0 (vm5)
>>> Sep 7 16:06:35 vm5 pengine[23199]: notice: LogActions: Promote prmAP:0 (Stopped -> Master vm5)
>>> Sep 7 16:06:35 vm5 pengine[23199]: notice: LogActions: Start prmAP:1 (vm6)
>>> Sep 7 16:06:35 vm5 crmd[23200]: notice: run_graph: Transition 12 (Complete=4, Pending=0, Fired=0, Skipped=4, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-12.bz2): Stopped
>>> :
>>>
>>> Is this a known issue?
>>>
>>> Best Regards,
>>> Kazunori INOUE
>>>