[Pacemaker] master/slave resource does not stop (tries start repeatedly)

Tue Sep 18 20:24:19 UTC 2012

----- Original Message -----
> From: "Kazunori INOUE" <inouekazu at intellilink.co.jp>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Friday, September 14, 2012 4:26:52 AM
> Subject: Re: [Pacemaker] master/slave resource does not stop (tries start repeatedly)
> 
> Hi Andrew,
> 
> I confirmed that this problem has been resolved.
> - ClusterLabs/pacemaker : 7a9bf21cfc
> 
> However, I found two problems.
> 
> (1) crm_mon reports the resource as "orphan".
> 
>    # crm_mon -rf1
>     :
>    Full list of resources:
> 
>     Master/Slave Set: msAP [prmAP]
>         Stopped: [ prmAP:0 prmAP:1 ]
> 
>    Migration summary:
>    * Node vm5:
>       prmAP: orphan
>    * Node vm6:
>       prmAP: orphan
> 
>    Failed actions:
>        prmAP_monitor_10000 (node=vm5, call=15, rc=1,
>        status=complete): unknown error
>        prmAP_monitor_10000 (node=vm6, call=21, rc=1,
>        status=complete): unknown error
> 
> (2) The failure status cannot be cleared.
> 
>    The CIB is not updated even when I run 'crm_resource -C'.
> 
>    # crm_resource -C -r msAP
>    Cleaning up prmAP:0 on vm5
>    Cleaning up prmAP:0 on vm6
>    Cleaning up prmAP:1 on vm5
>    Cleaning up prmAP:1 on vm6
>    Waiting for 1 replies from the CRMd. OK
> 
>    # cibadmin -Q -o status
>    <status>
>      <node_state id="2439358656" uname="vm5" in_ccm="true"
>      crmd="online" join="member" expected="member"
>      crm-debug-origin="do_update_resource">
>        <transient_attributes id="2439358656">
>          <instance_attributes id="status-2439358656">
>            <nvpair id="status-2439358656-probe_complete"
>            name="probe_complete" value="true"/>
>            <nvpair id="status-2439358656-fail-count-prmAP"
>            name="fail-count-prmAP" value="1"/>
>            <nvpair id="status-2439358656-last-failure-prmAP"
>            name="last-failure-prmAP" value="1347598951"/>
>          </instance_attributes>
>        </transient_attributes>
>        <lrm id="2439358656">
>          <lrm_resources>
>            <lrm_resource id="prmAP" type="Stateful" class="ocf"
>            provider="pacemaker">
>              <lrm_rsc_op id="prmAP_last_0"
>              operation_key="prmAP_stop_0" operation="stop"
>              crm-debug-origin="do_update_resource"
>              crm_feature_set="3.0.6"
>              transition-key="1:5:0:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>              transition-magic="0:0;1:5:0:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>              call-id="24" rc-code="0" op-status="0" interval="0"
>              last-run="1347598936" last-rc-change="0"
>              exec-time="205" queue-time="0"
>              op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
>              <lrm_rsc_op id="prmAP_monitor_10000"
>              operation_key="prmAP_monitor_10000" operation="monitor"
>              crm-debug-origin="do_update_resource"
>              crm_feature_set="3.0.6"
>              transition-key="10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>              transition-magic="0:8;10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>              call-id="15" rc-code="8" op-status="0" interval="10000"
>              last-rc-change="1347598916" exec-time="40"
>              queue-time="0"
>              op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
>              <lrm_rsc_op id="prmAP_last_failure_0"
>              operation_key="prmAP_monitor_10000" operation="monitor"
>              crm-debug-origin="do_update_resource"
>              crm_feature_set="3.0.6"
>              transition-key="10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>              transition-magic="0:1;10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>              call-id="15" rc-code="1" op-status="0" interval="10000"
>              last-rc-change="1347598936" exec-time="0"
>              queue-time="0"
>              op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
>            </lrm_resource>
>          </lrm_resources>
>        </lrm>
>      </node_state>
>      <node_state id="2456135872" uname="vm6" in_ccm="true"
>      crmd="online" join="member" expected="member"
>      crm-debug-origin="do_update_resource">
>        <transient_attributes id="2456135872">
>          <instance_attributes id="status-2456135872">
>            <nvpair id="status-2456135872-probe_complete"
>            name="probe_complete" value="true"/>
>            <nvpair id="status-2456135872-fail-count-prmAP"
>            name="fail-count-prmAP" value="1"/>
>            <nvpair id="status-2456135872-last-failure-prmAP"
>            name="last-failure-prmAP" value="1347598962"/>
>          </instance_attributes>
>        </transient_attributes>
>        <lrm id="2456135872">
>          <lrm_resources>
>            <lrm_resource id="prmAP" type="Stateful" class="ocf"
>            provider="pacemaker">
>              <lrm_rsc_op id="prmAP_last_0"
>              operation_key="prmAP_stop_0" operation="stop"
>              crm-debug-origin="do_update_resource"
>              crm_feature_set="3.0.6"
>              transition-key="1:9:0:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>              transition-magic="0:0;1:9:0:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>              call-id="30" rc-code="0" op-status="0" interval="0"
>              last-run="1347598962" last-rc-change="0"
>              exec-time="230" queue-time="0"
>              op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
>              <lrm_rsc_op id="prmAP_monitor_10000"
>              operation_key="prmAP_monitor_10000" operation="monitor"
>              crm-debug-origin="do_update_resource"
>              crm_feature_set="3.0.6"
>              transition-key="9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>              transition-magic="0:8;9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>              call-id="21" rc-code="8" op-status="0" interval="10000"
>              last-rc-change="1347598952" exec-time="43"
>              queue-time="0"
>              op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
>              <lrm_rsc_op id="prmAP_last_failure_0"
>              operation_key="prmAP_monitor_10000" operation="monitor"
>              crm-debug-origin="do_update_resource"
>              crm_feature_set="3.0.6"
>              transition-key="9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>              transition-magic="0:1;9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>              call-id="21" rc-code="1" op-status="0" interval="10000"
>              last-rc-change="1347598962" exec-time="0"
>              queue-time="0"
>              op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
>            </lrm_resource>
>          </lrm_resources>
>        </lrm>
>      </node_state>
>    </status>
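To check the status section above for leftover failure records without reading it by eye, something like the following could be used. This is only a rough sketch, not part of Pacemaker: it assumes the raw output of `cibadmin -Q -o status` (without the "> " quoting) is available as a string, and the function name `remaining_failures` is made up for illustration. It lists the `fail-count-*` attributes and `*_last_failure_0` operation entries, which are the entries a cleanup would presumably be expected to remove.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def remaining_failures(status_xml: str) -> List[Tuple[str, str, str]]:
    """Return (node name, entry name, value) for each leftover failure record.

    Hypothetical helper: it scans a <status> section for fail-count-*
    attributes and *_last_failure_0 operation history entries.
    """
    root = ET.fromstring(status_xml)
    found = []
    for node in root.iter("node_state"):
        uname = node.get("uname", "")
        # Transient attributes such as fail-count-prmAP
        for nv in node.iter("nvpair"):
            name = nv.get("name", "")
            if name.startswith("fail-count-"):
                found.append((uname, name, nv.get("value", "")))
        # Operation history entries such as prmAP_last_failure_0
        for op in node.iter("lrm_rsc_op"):
            op_id = op.get("id", "")
            if op_id.endswith("_last_failure_0"):
                found.append((uname, op_id, op.get("rc-code", "")))
    return found
```

An empty list would suggest the failure status is gone; for the output quoted above, the function would report entries for both vm5 and vm6.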
> 
> 
> I wrote a patch for crm_mon and crm_resource.
> (I have not checked whether other commands have a similar problem.)
> 
> https://github.com/inouekazu/pacemaker/commit/36cf730751080de197438cfaa34163150059d89c

I took a look at the patch. I can see why it fixes the output, but I'm not sure the method used to find the resource is a good idea. I know that appending ":0" to the resource ID is already done in some places, and that this patch copies logic that already exists, but it feels like a hack. I'm working through some ideas to fix this.

-- Vossel 

> - When searching the data of a resource_s structure, the resource ID
>    with the instance number (:0) appended is used as the key, as needed.
> - The resource ID with the instance number removed is used for the
>    update request to the CIB.
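For readers following along, the two ID forms described in the points above can be sketched roughly as follows. This is an illustration only, in Python rather than Pacemaker's C, and it is not the code from the patch. The helper names (`strip_instance`, `lookup_candidates`, `find_resource`) and the plain dictionary standing in for the resource table are assumptions made for the example.

```python
import re
from typing import Dict, List, Optional

# Matches a trailing clone instance suffix such as ":0" or ":12".
_INSTANCE_SUFFIX = re.compile(r":\d+$")


def strip_instance(rsc_id: str) -> str:
    """Return the ID without its instance number (the form sent to the CIB)."""
    return _INSTANCE_SUFFIX.sub("", rsc_id)


def lookup_candidates(rsc_id: str) -> List[str]:
    """IDs to try, in order, when searching the in-memory resource table."""
    if _INSTANCE_SUFFIX.search(rsc_id):
        return [rsc_id]
    # Fall back to the first instance when the bare ID is not found.
    return [rsc_id, rsc_id + ":0"]


def find_resource(resources: Dict[str, object], rsc_id: str) -> Optional[object]:
    """Look up a resource, appending ":0" to the key only as needed."""
    for candidate in lookup_candidates(rsc_id):
        if candidate in resources:
            return resources[candidate]
    return None
```

With a table keyed by "prmAP:0" and "prmAP:1", `find_resource(table, "prmAP")` would return the "prmAP:0" entry, while `strip_instance("prmAP:0")` gives "prmAP" for the update request. The ":0" fallback is the part that feels like a hack.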
> 
> Is the approach taken in this patch correct?
> 
> Best Regards,
> Kazunori INOUE
>