[Pacemaker] Question on ILO stonith resource config and restarting

Tue Nov 4 11:24:36 EST 2008

Hi,

On Thu, Oct 30, 2008 at 02:51:45PM -0400, Aaron Bush wrote:
> >> I have a 0.6 pacemaker/heartbeat cluster setup in a lab with resources
> >> as follows:
> >> 
> >> Group-lvs(ordered): two primitives -> ocf/IPaddr2 and ocf/ldirectord.
> >> Clone-pingd: set to monitor a couple of IPs and used to set a weight for
> >> where to run the LVS group.
> >> 
> >> -- This is the area that I have a question on --
> >> Clone-stonith-node1: HP ILO to shoot node1
> >> Clone-stonith-node2: HP ILO to shoot node2
> >> 
> >> I read on the old linux-ha site that using a clone for ILO/stonith was
> >> the way to go.  I'm not sure I see how this would work correctly and be
> >> preferred over a standard resource.  What I am confused about is this:
> >> the external/riloe stonith plugin only knows how to shoot one node so
> >
> >Please make sure that you use the latest edition of
> >external/riloe. The previous one didn't work under all
> >circumstances.
> 
> I am using the version that came with heartbeat-common-2.99.0-3.1
> (according to rpm -qf)

I don't think that that release includes the fix. You need at
least 2.99.2.

> To clear my current issue where the stonith resource was not started
> (and since this is still in the lab) I have rebooted both nodes to start
> with a somewhat clean slate.  I have attempted to grab some more useful
> information from the logs on why the resource is not restarting.
> Again I disconnect the LAN cable connecting a node to the rest of the
> network (a private HB channel is still available and the ILO is still
> up).  I noticed these entries in the log:
> 
> Oct 30 13:33:07 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing
> op=cl_stonith_lb02:0_start_0
> key=18:7:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
> Oct 30 13:33:07 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: start
> Oct 30 13:33:07 wwwlb02 lrmd: [30788]: info: Try to start STONITH
> resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter
> ilo_can_reset from StonithNVpair
> Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter
> ilo_protocol from StonithNVpair
> Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter
> ilo_powerdown_method from StonithNVpair
> Oct 30 13:33:08 wwwlb02 heartbeat: [6202]: info: Link
> wwwlb01.microcenter.com:eth0 dead.
> Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_lstatus_callback:
> Status update: Ping node wwwlb01.microcenter.com now has status [dead]
> Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_nstatus_callback:
> Status update: Ping node wwwlb01.microcenter.com now has status [dead]

Ping node? Why use a cluster node as a ping node too?

> Oct 30 13:33:12 wwwlb02 stonithd: [30790]: WARN: host list for
> cl_stonith_lb02:0 is empty, please fix your constraints
> Oct 30 13:33:12 wwwlb02 stonithd: [6413]: WARN: start cl_stonith_lb02:0
> failed, because its hostlist is empty
> Oct 30 13:33:12 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_start_0 (call=12, rc=2) complete
> Oct 30 13:33:13 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: stop
> Oct 30 13:33:13 wwwlb02 stonithd: [6413]: notice: try to stop a resource
> cl_stonith_lb02:0 who is not in started resource queue.
> Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing
> op=cl_stonith_lb02:0_stop_0
> key=1:8:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
> Oct 30 13:33:13 wwwlb02 lrmd: [30842]: info: Try to stop STONITH
> resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_stop_0 (call=13, rc=0) complete
> 
> 
> 
> Looks like I should specify some additional nvpairs for the ILOs.

Only if you really need them.

> The WARN host list empty message is what looks bad to me.

That looks OK to me. The cluster tried to start the stonith
resource that is meant to manage (shoot) wwwlb02 on wwwlb02 itself.

> Here is the cib
> section for the clone resource and the cib constraint for this resource.
> Please let me know if there are any obvious errors in this
> configuration.  This is the stonith resource that is to shoot the 02
> node, intended to run on the 01 node (the 01 node was the node that had a
> network cable disconnect).
> 
> 
> 	<clone id="cl_stonithset_lb02">
>          <meta_attributes id="cl_stonithset_lb02_meta_attrs">
>            <attributes>
>              <nvpair id="cl_stonithset_lb02_metaattr_target_role"
> name="target_role" value="started"/>
>              <nvpair id="cl_stonithset_lb02_metaattr_clone_max"
> name="clone_max" value="1"/>
>              <nvpair id="cl_stonithset_lb02_metaattr_clone_node_max"
> name="clone_node_max" value="1"/>
>            </attributes>
>          </meta_attributes>
>          <primitive id="cl_stonith_lb02" class="stonith"
> type="external/riloe" provider="heartbeat">
>            <instance_attributes id="cl_stonith_lb02_instance_attrs">
>              <attributes>
>                <nvpair id="76163fb5-05ea-4cff-9786-a817774d8224"
> name="hostlist" value="wwwlb02.microcenter.com"/>
>                <nvpair id="238e0158-81d3-48fd-879a-494c76d96b80"
> name="ilo_hostname" value="10.100.254.162"/>
>                <nvpair id="82de3d5d-6f96-44f0-b98f-6eea75704b33"
> name="ilo_user" value="Administrator"/>
>                <nvpair id="0fdef60a-fe62-4a0d-8f8f-d8da1d42082a"
> name="ilo_password" value="PASSWORD"/>
>              </attributes>
>            </instance_attributes>
>            <operations>
>              <op id="2a33ffe8-371f-4d08-a1ea-373135e85aeb"
> name="monitor" interval="30" timeout="20" start_delay="15"
> disabled="false" role="Started" on_fail="restart"/>
>              <op id="4694393c-e89b-4371-af1c-a60d7f305e2f" name="start"
> timeout="20" start_delay="0" disabled="false" role="Started"
> on_fail="restart"/>
>            </operations>
>            <meta_attributes id="cl_stonith_lb02:0_meta_attrs">
>              <attributes>
>                <nvpair id="cl_stonith_lb02:0_metaattr_target_role"
> name="target_role" value="started"/>
>              </attributes>
>            </meta_attributes>
>          </primitive>
>        </clone>
> 
>      <constraints>
>        <rsc_location id="location_on_lb01" rsc="cl_stonithset_lb02">
>          <rule id="prefered_location_on_lb01" score="INFINITY">
>            <expression attribute="#uname"
> id="c9e30917-97e2-4c35-86e7-9df6c7abc497" operation="eq"
> value="wwwlb01.microcenter.com"/>
>          </rule>
>        </rsc_location>
>      </constraints>

This configuration is fine. Not sure if that constraint would
make any difference though.
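
If you do want a constraint here, a commonly used alternative is to
keep the stonith resource off the node it is supposed to shoot,
instead of preferring the other node. A positive INFINITY preference
still lets the resource fall back to wwwlb02 when wwwlb01 is
unavailable, whereas -INFINITY forbids it outright. A minimal sketch
in the same 0.6 CIB syntax as your configuration; the ids are made up
for illustration and I have not tested this on your cluster:

```xml
<!-- Sketch only: the ids below are hypothetical, pick your own. -->
<rsc_location id="never_on_lb02" rsc="cl_stonithset_lb02">
  <!-- Never run the resource that shoots wwwlb02 on wwwlb02 itself. -->
  <rule id="never_on_lb02_rule" score="-INFINITY">
    <expression id="never_on_lb02_expr" attribute="#uname"
                operation="eq" value="wwwlb02.microcenter.com"/>
  </rule>
</rsc_location>
```

The element would go inside the existing <constraints> section,
either next to or instead of location_on_lb01.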

Thanks,

Dejan

> 
> Thanks,
> -ab
> 