[Pacemaker] Question on ILO stonith resource config and restarting

Tue Nov 4 16:26:11 UTC 2008

On Thu, Oct 30, 2008 at 03:07:24PM -0400, Aaron Bush wrote:
> Just realized that I only included the log entries from the node that
> was not experiencing a network disconnect.  Attached are the log entries
> from the node (01) that had the stonith resource running before the
> cable disconnect and looks like they provide some more useful
> information.  Also included up through when the network cable was
> reconnected.

The monitor operation on riloe failed. You should definitely
upgrade external/riloe to the latest version.

Thanks,

Dejan

> 
> -ab
> 
> >> I have a 0.6 pacemaker/heartbeat cluster set up in a lab with resources
> >> as follows:
> >> 
> >> Group-lvs (ordered): two primitives -> ocf/IPaddr2 and ocf/ldirectord.
> >> Clone-pingd: set to monitor a couple of IPs and used to set a weight for
> >> where to run the LVS group.
> >> 
> >> -- This is the area that I have a question on --
> >> Clone-stonith-node1: HP ILO to shoot node1
> >> Clone-stonith-node2: HP ILO to shoot node2
> >> 
> >> I read on the old linux-ha site that using a clone for ILO/stonith was
> >> the way to go.  I'm not sure I see how this would work correctly and be
> >> preferred over a standard resource.  What I am confused about is this:
> >> the external/riloe stonith plugin only knows how to shoot one node so
> >
> >Please make sure that you use the latest edition of
> >external/riloe. The previous one didn't work under all
> >circumstances.
> 
> I am using the version that came with heartbeat-common-2.99.0-3.1
> (according to rpm -qf)
> 
> To clear my current issue where the stonith resource was not started
> (and since this is still in the lab) I have rebooted both nodes to start
> with a somewhat clean slate.  I have attempted to grab some more useful
> information from the logs on why the resource is not restarting.
> Again I disconnected the LAN cable connecting a node to the rest of the
> network (a private HB channel is still available and the ILO is still
> up).  I noticed these entries in the log:
> 
> Oct 30 13:33:07 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing
> op=cl_stonith_lb02:0_start_0
> key=18:7:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
> Oct 30 13:33:07 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: start
> Oct 30 13:33:07 wwwlb02 lrmd: [30788]: info: Try to start STONITH
> resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter
> ilo_can_reset from StonithNVpair
> Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter
> ilo_protocol from StonithNVpair
> Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter
> ilo_powerdown_method from StonithNVpair
> Oct 30 13:33:08 wwwlb02 heartbeat: [6202]: info: Link
> wwwlb01.microcenter.com:eth0 dead.
> Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_lstatus_callback:
> Status update: Ping node wwwlb01.microcenter.com now has status [dead]
> Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_nstatus_callback:
> Status update: Ping node wwwlb01.microcenter.com now has status [dead]
> Oct 30 13:33:12 wwwlb02 stonithd: [30790]: WARN: host list for
> cl_stonith_lb02:0 is empty, please fix your constraints
> Oct 30 13:33:12 wwwlb02 stonithd: [6413]: WARN: start cl_stonith_lb02:0
> failed, because its hostlist is empty
> Oct 30 13:33:12 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_start_0 (call=12, rc=2) complete
> Oct 30 13:33:13 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: stop
> Oct 30 13:33:13 wwwlb02 stonithd: [6413]: notice: try to stop a resource
> cl_stonith_lb02:0 who is not in started resource queue.
> Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing
> op=cl_stonith_lb02:0_stop_0
> key=1:8:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
> Oct 30 13:33:13 wwwlb02 lrmd: [30842]: info: Try to stop STONITH
> resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_stop_0 (call=13, rc=0) complete
> 
> 
> 
> Looks like I should specify some additional nvpairs for the ILOs.  The
> WARN host list empty message is what looks bad to me.  Here is the cib
> section for the clone resource and the cib constraint for this resource.
> Please let me know if there are any obvious errors in this
> configuration.  This is the stonith resource that is to shoot the 02
> node, intended to run on the 01 node (the 01 node was the node that had
> a network cable disconnect).
> 
> 
> 	<clone id="cl_stonithset_lb02">
>          <meta_attributes id="cl_stonithset_lb02_meta_attrs">
>            <attributes>
>              <nvpair id="cl_stonithset_lb02_metaattr_target_role"
> name="target_role" value="started"/>
>              <nvpair id="cl_stonithset_lb02_metaattr_clone_max"
> name="clone_max" value="1"/>
>              <nvpair id="cl_stonithset_lb02_metaattr_clone_node_max"
> name="clone_node_max" value="1"/>
>            </attributes>
>          </meta_attributes>
>          <primitive id="cl_stonith_lb02" class="stonith"
> type="external/riloe" provider="heartbeat">
>            <instance_attributes id="cl_stonith_lb02_instance_attrs">
>              <attributes>
>                <nvpair id="76163fb5-05ea-4cff-9786-a817774d8224"
> name="hostlist" value="wwwlb02.microcenter.com"/>
>                <nvpair id="238e0158-81d3-48fd-879a-494c76d96b80"
> name="ilo_hostname" value="10.100.254.162"/>
>                <nvpair id="82de3d5d-6f96-44f0-b98f-6eea75704b33"
> name="ilo_user" value="Administrator"/>
>                <nvpair id="0fdef60a-fe62-4a0d-8f8f-d8da1d42082a"
> name="ilo_password" value="PASSWORD"/>
>              </attributes>
>            </instance_attributes>
>            <operations>
>              <op id="2a33ffe8-371f-4d08-a1ea-373135e85aeb"
> name="monitor" interval="30" timeout="20" start_delay="15"
> disabled="false" role="Started" on_fail="restart"/>
>              <op id="4694393c-e89b-4371-af1c-a60d7f305e2f" name="start"
> timeout="20" start_delay="0" disabled="false" role="Started"
> on_fail="restart"/>
>            </operations>
>            <meta_attributes id="cl_stonith_lb02:0_meta_attrs">
>              <attributes>
>                <nvpair id="cl_stonith_lb02:0_metaattr_target_role"
> name="target_role" value="started"/>
>              </attributes>
>            </meta_attributes>
>          </primitive>
>        </clone>
> 
>      <constraints>
>        <rsc_location id="location_on_lb01" rsc="cl_stonithset_lb02">
>          <rule id="prefered_location_on_lb01" score="INFINITY">
>            <expression attribute="#uname"
> id="c9e30917-97e2-4c35-86e7-9df6c7abc497" operation="eq"
> value="wwwlb01.microcenter.com"/>
>          </rule>
>        </rsc_location>
>      </constraints>
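
For reference, the three "Cannot get parameter" messages in the log name
ilo_can_reset, ilo_protocol and ilo_powerdown_method.  They are logged at
info level, so they may well be harmless, and setting them is not known to
fix the empty host list warning.  If you do want to set them explicitly, a
minimal sketch in the same 0.6 CIB syntax could look like the fragment
below.  The nvpair ids and all three values are hypothetical placeholders,
not taken from this thread; check the external/riloe plugin's own
documentation for the values your version accepts.

```xml
<instance_attributes id="cl_stonith_lb02_instance_attrs">
  <attributes>
    <!-- The existing nvpairs (hostlist, ilo_hostname, ilo_user,
         ilo_password) stay exactly as they are in the primitive. -->
    <!-- Assumption: ids and values below are illustrative placeholders. -->
    <nvpair id="cl_stonith_lb02_ilo_can_reset"
            name="ilo_can_reset" value="0"/>
    <nvpair id="cl_stonith_lb02_ilo_protocol"
            name="ilo_protocol" value="1.2"/>
    <nvpair id="cl_stonith_lb02_ilo_powerdown_method"
            name="ilo_powerdown_method" value="power"/>
  </attributes>
</instance_attributes>
```

The parameter names come straight from the stonithd log lines above; only
the ids and values are made up for illustration.
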
> 
> Thanks,
> -ab
> 
> _______________________________________________
> Pacemaker mailing list
> Pacemaker at clusterlabs.org
> http://list.clusterlabs.org/mailman/listinfo/pacemaker
> 
