[Pacemaker] master/slave resource does not stop (tries start repeatedly)

Thu Sep 20 02:54:32 UTC 2012

Hi Andrew,

I confirmed that both problems were fixed.
Thanks.

(12.09.20 07:58), Andrew Beekhof wrote:
> On Fri, Sep 14, 2012 at 7:26 PM, Kazunori INOUE
> <inouekazu at intellilink.co.jp> wrote:
>> Hi Andrew,
>>
>> I confirmed that this problem has been resolved.
>> - ClusterLabs/pacemaker : 7a9bf21cfc
>>
>> However, I found two problems.
>
> Ah, I see what you mean.
> I believe https://github.com/beekhof/pacemaker/commit/7ecc279 should
> fix both problems.
> Can you confirm please?
>
>>
>> (1) crm_mon reports the resource as an orphan.
>>
>>    # crm_mon -rf1
>>     :
>>    Full list of resources:
>>
>>     Master/Slave Set: msAP [prmAP]
>>         Stopped: [ prmAP:0 prmAP:1 ]
>>
>>    Migration summary:
>>    * Node vm5:
>>       prmAP: orphan
>>    * Node vm6:
>>       prmAP: orphan
>>
>>    Failed actions:
>>        prmAP_monitor_10000 (node=vm5, call=15, rc=1, status=complete): unknown error
>>        prmAP_monitor_10000 (node=vm6, call=21, rc=1, status=complete): unknown error
>>
>> (2) The failure status cannot be cleared.
>>
>>    The CIB is not updated even after I run 'crm_resource -C'.
>>
>>    # crm_resource -C -r msAP
>>    Cleaning up prmAP:0 on vm5
>>    Cleaning up prmAP:0 on vm6
>>    Cleaning up prmAP:1 on vm5
>>    Cleaning up prmAP:1 on vm6
>>    Waiting for 1 replies from the CRMd. OK
>>
>>    # cibadmin -Q -o status
>>    <status>
>>      <node_state id="2439358656" uname="vm5" in_ccm="true" crmd="online"
>> join="member" expected="member" crm-debug-origin="do_update_resource">
>>        <transient_attributes id="2439358656">
>>          <instance_attributes id="status-2439358656">
>>            <nvpair id="status-2439358656-probe_complete"
>> name="probe_complete" value="true"/>
>>            <nvpair id="status-2439358656-fail-count-prmAP"
>> name="fail-count-prmAP" value="1"/>
>>            <nvpair id="status-2439358656-last-failure-prmAP"
>> name="last-failure-prmAP" value="1347598951"/>
>>          </instance_attributes>
>>        </transient_attributes>
>>        <lrm id="2439358656">
>>          <lrm_resources>
>>            <lrm_resource id="prmAP" type="Stateful" class="ocf"
>> provider="pacemaker">
>>              <lrm_rsc_op id="prmAP_last_0" operation_key="prmAP_stop_0"
>> operation="stop" crm-debug-origin="do_update_resource"
>> crm_feature_set="3.0.6"
>> transition-key="1:5:0:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>> transition-magic="0:0;1:5:0:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>> call-id="24" rc-code="0" op-status="0" interval="0" last-run="1347598936"
>> last-rc-change="0" exec-time="205" queue-time="0"
>> op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
>>              <lrm_rsc_op id="prmAP_monitor_10000"
>> operation_key="prmAP_monitor_10000" operation="monitor"
>> crm-debug-origin="do_update_resource" crm_feature_set="3.0.6"
>> transition-key="10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>> transition-magic="0:8;10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>> call-id="15" rc-code="8" op-status="0" interval="10000"
>> last-rc-change="1347598916" exec-time="40" queue-time="0"
>> op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
>>              <lrm_rsc_op id="prmAP_last_failure_0"
>> operation_key="prmAP_monitor_10000" operation="monitor"
>> crm-debug-origin="do_update_resource" crm_feature_set="3.0.6"
>> transition-key="10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>> transition-magic="0:1;10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>> call-id="15" rc-code="1" op-status="0" interval="10000"
>> last-rc-change="1347598936" exec-time="0" queue-time="0"
>> op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
>>            </lrm_resource>
>>          </lrm_resources>
>>        </lrm>
>>      </node_state>
>>      <node_state id="2456135872" uname="vm6" in_ccm="true" crmd="online"
>> join="member" expected="member" crm-debug-origin="do_update_resource">
>>        <transient_attributes id="2456135872">
>>          <instance_attributes id="status-2456135872">
>>            <nvpair id="status-2456135872-probe_complete"
>> name="probe_complete" value="true"/>
>>            <nvpair id="status-2456135872-fail-count-prmAP"
>> name="fail-count-prmAP" value="1"/>
>>            <nvpair id="status-2456135872-last-failure-prmAP"
>> name="last-failure-prmAP" value="1347598962"/>
>>          </instance_attributes>
>>        </transient_attributes>
>>        <lrm id="2456135872">
>>          <lrm_resources>
>>            <lrm_resource id="prmAP" type="Stateful" class="ocf"
>> provider="pacemaker">
>>              <lrm_rsc_op id="prmAP_last_0" operation_key="prmAP_stop_0"
>> operation="stop" crm-debug-origin="do_update_resource"
>> crm_feature_set="3.0.6"
>> transition-key="1:9:0:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>> transition-magic="0:0;1:9:0:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>> call-id="30" rc-code="0" op-status="0" interval="0" last-run="1347598962"
>> last-rc-change="0" exec-time="230" queue-time="0"
>> op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
>>              <lrm_rsc_op id="prmAP_monitor_10000"
>> operation_key="prmAP_monitor_10000" operation="monitor"
>> crm-debug-origin="do_update_resource" crm_feature_set="3.0.6"
>> transition-key="9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>> transition-magic="0:8;9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>> call-id="21" rc-code="8" op-status="0" interval="10000"
>> last-rc-change="1347598952" exec-time="43" queue-time="0"
>> op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
>>              <lrm_rsc_op id="prmAP_last_failure_0"
>> operation_key="prmAP_monitor_10000" operation="monitor"
>> crm-debug-origin="do_update_resource" crm_feature_set="3.0.6"
>> transition-key="9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>> transition-magic="0:1;9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>> call-id="21" rc-code="1" op-status="0" interval="10000"
>> last-rc-change="1347598962" exec-time="0" queue-time="0"
>> op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
>>            </lrm_resource>
>>          </lrm_resources>
>>        </lrm>
>>      </node_state>
>>    </status>
>>
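One way to check whether the cleanup took effect is to look for leftover
fail-count attributes in the status section. The following is only a minimal
sketch, assuming the output of 'cibadmin -Q -o status' is available as a
string. The function name leftover_fail_counts is hypothetical, and SAMPLE is
trimmed from the output quoted above.

```python
import xml.etree.ElementTree as ET

# Trimmed from the quoted 'cibadmin -Q -o status' output; only the
# attributes needed for this check are kept.
SAMPLE = """
<status>
  <node_state id="2439358656" uname="vm5">
    <transient_attributes id="2439358656">
      <instance_attributes id="status-2439358656">
        <nvpair id="status-2439358656-probe_complete" name="probe_complete" value="true"/>
        <nvpair id="status-2439358656-fail-count-prmAP" name="fail-count-prmAP" value="1"/>
      </instance_attributes>
    </transient_attributes>
  </node_state>
  <node_state id="2456135872" uname="vm6">
    <transient_attributes id="2456135872">
      <instance_attributes id="status-2456135872">
        <nvpair id="status-2456135872-probe_complete" name="probe_complete" value="true"/>
        <nvpair id="status-2456135872-fail-count-prmAP" name="fail-count-prmAP" value="1"/>
      </instance_attributes>
    </transient_attributes>
  </node_state>
</status>
"""

PREFIX = "fail-count-"


def leftover_fail_counts(status_xml):
    """Map (node name, resource id) to the fail-count value still in the CIB."""
    root = ET.fromstring(status_xml.strip())
    found = {}
    for node in root.iter("node_state"):
        for nvpair in node.iter("nvpair"):
            name = nvpair.get("name", "")
            if name.startswith(PREFIX):
                key = (node.get("uname"), name[len(PREFIX):])
                found[key] = nvpair.get("value")
    return found


for (node, rsc), value in sorted(leftover_fail_counts(SAMPLE).items()):
    print(f"{node}: {rsc} fail-count={value}")
```

An empty result after 'crm_resource -C' would indicate that the fail counts
were actually cleared. With the status shown above, both nodes are still listed.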
>>
>> I wrote a patch for crm_mon and crm_resource.
>> (I have not checked whether other commands have a similar problem.)
>>
>> https://github.com/inouekazu/pacemaker/commit/36cf730751080de197438cfaa34163150059d89c
>>
>> - When looking up the data of a resource_s structure, the resource ID
>>    with the instance number appended (e.g. prmAP:0) is used as the key
>>    where needed.
>> - For the update request to the CIB, the resource ID with the instance
>>    number removed is used.
>>
>> Is the approach of this patch correct?
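
For reference, the ID handling described in the two points above can be
sketched as follows. This is an illustration in Python only, not the code in
the patch (Pacemaker itself is written in C). The helper names strip_instance
and with_instance are hypothetical and are not part of any Pacemaker API.

```python
import re

# Hypothetical helpers for illustration; not Pacemaker's API.
_INSTANCE_SUFFIX = re.compile(r":\d+$")


def strip_instance(rsc_id):
    """Return the resource ID without a trailing clone instance number."""
    return _INSTANCE_SUFFIX.sub("", rsc_id)


def with_instance(rsc_id, instance):
    """Return the resource ID with the given instance number appended."""
    return f"{strip_instance(rsc_id)}:{instance}"


# Key used to look up per-instance data.
print(with_instance("prmAP", 0))
# ID sent in the update request to the CIB.
print(strip_instance("prmAP:0"))
```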
>>
>> Best Regards,
>> Kazunori INOUE
>>
>>
>> (12.09.11 20:17), Andrew Beekhof wrote:
>>>
>>> On Tue, Sep 11, 2012 at 9:13 PM, Andrew Beekhof <andrew at beekhof.net>
>>> wrote:
>>>>
>>>> Yikes!
>>>>
>>>> Fixed in:
>>>>      https://github.com/beekhof/pacemaker/commit/7d098ce
>>>
>>>
>>> That link should have been:
>>>
>>>
>>> https://github.com/beekhof/pacemaker/commit/c1f409baaaf388d03f6124ec0d9da440445c4a23
>>>
>>>>
>>>> On Fri, Sep 7, 2012 at 7:49 PM, Kazunori INOUE
>>>> <inouekazu at intellilink.co.jp> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am using Pacemaker-1.1.
>>>>> - ClusterLabs/pacemaker : 872a2f1af1 (Sep 07)
>>>>>
>>>>> Even though the monitor of the master resource fails and no node
>>>>> remains on which the master/slave resource can run, the master/slave
>>>>> resource does not stop.
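
The expected behaviour can be sketched with a simplified model. This is not
the policy engine's actual code: it only assumes, as the "Forcing msAP away
from ... after 1 failures (max=1)" log messages suggest, that a node becomes
unavailable once its fail count reaches migration-threshold. The function name
allowed_nodes is hypothetical.

```python
def allowed_nodes(fail_counts, migration_threshold):
    """Return the nodes whose fail count is still below the threshold."""
    return [
        node
        for node, count in sorted(fail_counts.items())
        if count < migration_threshold
    ]


# After the first failure on vm5, only vm6 remains available.
print(allowed_nodes({"vm5": 1, "vm6": 0}, 1))  # ['vm6']

# After the second failure on vm6, no node remains, so the resource
# would be expected to stay stopped rather than be started again.
print(allowed_nodes({"vm5": 1, "vm6": 1}, 1))  # []
```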
>>>>>
>>>>> [test case]
>>>>> 1. Use the Stateful RA with on-fail="restart" set on the monitor
>>>>>      operation and migration-threshold set to 1.
>>>>>
>>>>>      # crm_mon
>>>>>
>>>>>      Online: [ vm5 vm6 ]
>>>>>
>>>>>       Master/Slave Set: msAP [prmAP]
>>>>>           Masters: [ vm5 ]
>>>>>           Slaves: [ vm6 ]
>>>>>
>>>>> 2. Make the master resource on vm5 fail so that it moves to vm6.
>>>>>
>>>>>      Online: [ vm5 vm6 ]
>>>>>
>>>>>       Master/Slave Set: msAP [prmAP]
>>>>>           Masters: [ vm6 ]
>>>>>           Stopped: [ prmAP:1 ]
>>>>>
>>>>>      Failed actions:
>>>>>          prmAP_monitor_10000 (node=vm5, call=14, rc=1, status=complete): unknown error
>>>>>
>>>>> 3. Make the master resource on vm6 fail as well. The master/slave
>>>>>      resource then tries to start repeatedly, alternating between
>>>>>      states (a) and (b) below.
>>>>>
>>>>>     (a)
>>>>>      Online: [ vm5 vm6 ]
>>>>>
>>>>>
>>>>>      Failed actions:
>>>>>          prmAP_monitor_10000 (node=vm5, call=14, rc=1, status=complete): unknown error
>>>>>          prmAP_monitor_10000 (node=vm6, call=20, rc=1, status=complete): unknown error
>>>>>
>>>>>     (b)
>>>>>      Online: [ vm5 vm6 ]
>>>>>
>>>>>       Master/Slave Set: msAP [prmAP]
>>>>>           Slaves: [ vm5 vm6 ]
>>>>>
>>>>>      Failed actions:
>>>>>          prmAP_monitor_10000 (node=vm5, call=14, rc=1, status=complete): unknown error
>>>>>          prmAP_monitor_10000 (node=vm6, call=20, rc=1, status=complete): unknown error
>>>>>
>>>>> # grep -e run_graph: -e common_apply_stickiness: -e LogActions: ha-log
>>>>>
>>>>>>> after the master resource on vm5 failed
>>>>>
>>>>> Sep  7 16:06:03 vm5 pengine[23199]:   notice: LogActions: Recover prmAP:0       (Master vm5)
>>>>> Sep  7 16:06:03 vm5 crmd[23200]:   notice: run_graph: Transition 4 (Complete=3, Pending=0, Fired=0, Skipped=8, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-4.bz2): Stopped
>>>>> Sep  7 16:06:03 vm5 pengine[23199]:  warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>>> Sep  7 16:06:03 vm5 pengine[23199]:  warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>>> Sep  7 16:06:03 vm5 pengine[23199]:   notice: LogActions: Stop prmAP:0       (vm5)
>>>>> Sep  7 16:06:03 vm5 pengine[23199]:   notice: LogActions: Promote prmAP:1       (Slave -> Master vm6)
>>>>> Sep  7 16:06:03 vm5 crmd[23200]:   notice: run_graph: Transition 5 (Complete=4, Pending=0, Fired=0, Skipped=4, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-5.bz2): Stopped
>>>>> Sep  7 16:06:03 vm5 pengine[23199]:  warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>>> Sep  7 16:06:03 vm5 pengine[23199]:   notice: LogActions: Promote prmAP:0       (Slave -> Master vm6)
>>>>> Sep  7 16:06:03 vm5 crmd[23200]:   notice: run_graph: Transition 6 (Complete=3, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Stopped
>>>>> Sep  7 16:06:03 vm5 pengine[23199]:  warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>>> Sep  7 16:06:03 vm5 crmd[23200]:   notice: run_graph: Transition 7 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-7.bz2): Complete
>>>>>
>>>>>>> after the master resource on vm6 failed
>>>>>
>>>>> Sep  7 16:06:33 vm5 pengine[23199]:  warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>>> Sep  7 16:06:33 vm5 pengine[23199]:   notice: LogActions: Recover prmAP:0       (Master vm6)
>>>>> Sep  7 16:06:34 vm5 crmd[23200]:   notice: run_graph: Transition 8 (Complete=3, Pending=0, Fired=0, Skipped=8, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-8.bz2): Stopped
>>>>> Sep  7 16:06:34 vm5 pengine[23199]:  warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>>> Sep  7 16:06:34 vm5 pengine[23199]:  warning: common_apply_stickiness: Forcing msAP away from vm6 after 1 failures (max=1)
>>>>> Sep  7 16:06:34 vm5 pengine[23199]:   notice: LogActions: Stop prmAP:0       (vm6)
>>>>> Sep  7 16:06:34 vm5 crmd[23200]:   notice: run_graph: Transition 9 (Complete=3, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-9.bz2): Stopped
>>>>> Sep  7 16:06:34 vm5 pengine[23199]:   notice: LogActions: Start prmAP:0       (vm5)
>>>>> Sep  7 16:06:34 vm5 pengine[23199]:   notice: LogActions: Promote prmAP:0       (Stopped -> Master vm5)
>>>>> Sep  7 16:06:34 vm5 pengine[23199]:   notice: LogActions: Start prmAP:1       (vm6)
>>>>> Sep  7 16:06:35 vm5 crmd[23200]:   notice: run_graph: Transition 10 (Complete=4, Pending=0, Fired=0, Skipped=4, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): Stopped
>>>>> Sep  7 16:06:35 vm5 pengine[23199]:  warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>>> Sep  7 16:06:35 vm5 pengine[23199]:  warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>>> Sep  7 16:06:35 vm5 pengine[23199]:  warning: common_apply_stickiness: Forcing msAP away from vm6 after 1 failures (max=1)
>>>>> Sep  7 16:06:35 vm5 pengine[23199]:  warning: common_apply_stickiness: Forcing msAP away from vm6 after 1 failures (max=1)
>>>>> Sep  7 16:06:35 vm5 pengine[23199]:   notice: LogActions: Stop prmAP:0       (vm5)
>>>>> Sep  7 16:06:35 vm5 pengine[23199]:   notice: LogActions: Stop prmAP:1       (vm6)
>>>>> Sep  7 16:06:35 vm5 crmd[23200]:   notice: run_graph: Transition 11 (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-11.bz2): Stopped
>>>>> Sep  7 16:06:35 vm5 pengine[23199]:   notice: LogActions: Start prmAP:0       (vm5)
>>>>> Sep  7 16:06:35 vm5 pengine[23199]:   notice: LogActions: Promote prmAP:0       (Stopped -> Master vm5)
>>>>> Sep  7 16:06:35 vm5 pengine[23199]:   notice: LogActions: Start prmAP:1       (vm6)
>>>>> Sep  7 16:06:35 vm5 crmd[23200]:   notice: run_graph: Transition 12 (Complete=4, Pending=0, Fired=0, Skipped=4, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-12.bz2): Stopped
>>>>>    :
>>>>>
>>>>> Is this a known issue?
>>>>>
>>>>> Best Regards,
>>>>> Kazunori INOUE