[Pacemaker] Question on ILO stonith resource config and restarting

Takenaka Kazuhiro takenaka.kazuhiro at oss.ntt.co.jp
Wed Nov 5 03:24:44 CET 2008


Hi Aaron.

First of all, what I say in this message is based on my experience
with Heartbeat 2.1.3 and external/ibmrsa-telnet several months ago,
so be careful when applying it to your situation.

 > What I am confused about is this:
 > the external/riloe stonith plugin only knows how to shoot one node so
 > why would you want to run it as a clone since each external/riloe is
 > configured differently.
...
 > I then noticed that my ILO clones were starting on the 'wrong' nodes.
 > As in the stonith resource to kill node 2 was actually running on node
 > 2; which is pointless if node 2 locks up.  So I added resource
 > constraints to force the stonith clone to stay on a node that was not
 > the one to be shot.  This seemed to work well.

I don't think external/riloe is meant to run as a clone resource,
just like external/ibmrsa-telnet. Both are meant to run as ordinary
(primitive) resources. See the attached sample configuration.
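
For external/riloe the shape should be the same as the ibmrsa-telnet
sample below, just with riloe's own parameters. The following is only
a rough sketch; the parameter names (hostlist, ilo_hostname, ilo_user,
ilo_password) and all the values are guesses on my part, so please
check them with "stonith -t external/riloe -n" before using them:

  <primitive id="prmStonithRiloeN2" class="stonith" type="external/riloe" provider="heartbeat">
    <operations>
      <op name="monitor" interval="20" timeout="300" prereq="nothing" id="prmStonithRiloeN2:monitor"/>
    </operations>
    <instance_attributes id="prmStonithRiloeN2:attr">
      <attributes>
        <!-- parameter names and values are guesses; verify with: stonith -t external/riloe -n -->
        <nvpair id="prmStonithRiloeN2:hostlist" name="hostlist" value="node02"/>
        <nvpair id="prmStonithRiloeN2:ilo_hostname" name="ilo_hostname" value="192.168.16.227"/>
        <nvpair id="prmStonithRiloeN2:ilo_user" name="ilo_user" value="Administrator"/>
        <nvpair id="prmStonithRiloeN2:ilo_password" name="ilo_password" value="***"/>
      </attributes>
    </instance_attributes>
  </primitive>

  <!-- keep the plugin that shoots node02 away from node02 itself -->
  <rsc_location id="prmStonithRiloeN2_hates_node02" rsc="prmStonithRiloeN2">
    <rule id="prmStonithRiloeN2_hates_node02_rule" score="-INFINITY">
      <expression attribute="#uname" operation="eq" value="node02" id="prmStonithRiloeN2_hates_N2_expr"/>
    </rule>
  </rsc_location>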

 > The next issue I have is that when I disconnect the LAN cable on a
 > single node that connects it to the rest of the network the clone
 > stonith monitor will fail since it can't connect to the other nodes ILO
 > for status.  After some time (minutes let's say) I reconnect the LAN
 > cable but never see the clone stonith come back to life, just stays
 > failed.  What should I be looking at to make sure that the clone stonith
 > restarts properly.

I presume you want to know how to recover from a monitor failure of
a stonith plugin. Is my guess right? If so, run the following commands
(the first clears the failcount, the second cleans up the resource so
the CRM re-probes it):

# crm_failcount -D -r prmStonithN2 -U node01
# crm_resource -C -r prmStonithN2 -H node02

!!! Caution !!!
Some of these options may have changed in the latest Pacemaker.
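
To confirm that the failure is really gone, the failcount and the
failed actions can be checked before and after; for example (these
option letters are also from the 0.6-era tools, so they may differ
on newer versions):

# crm_failcount -G -r prmStonithN2 -U node01
# crm_mon -1 -f

The first command prints the current failcount of prmStonithN2 on
node01, and the second gives a one-shot status display that includes
failcounts and failed actions.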

Aaron Bush wrote:
> I have a 0.6 pacemaker/heartbeat cluster setup in a lab with resources
> as follows:
>
> Group-lvs (ordered): two primitives -> ocf/IPaddr2 and ocf/ldirectord.
> Clone-pingd: set to monitor a couple of IPs and used to set a weight for
> where to run the LVS group.
>
> -- This is the area that I have a question on --
> Clone-stonith-node1: HP ILO to shoot node1
> Clone-stonith-node2: HP ILO to shoot node2
>
> I read on the old linux-ha site that using a clone for ILO/stonith was
> the way to go.  I'm not sure I see how this would work correctly and be
> preferred over a standard resource.  What I am confused about is this:
> the external/riloe stonith plugin only knows how to shoot one node so
> why would you want to run it as a clone since each external/riloe is
> configured differently.  I went ahead and configured the riloes as
> clones feeling that the docs are correct and that the reason would
> become obvious to me later.  (I also saw a similar post with no
> response:
> http://www.gossamer-threads.com/lists/linuxha/users/35685?nohighlight=1#35685)
>
> I then noticed that my ILO clones were starting on the 'wrong' nodes.
> As in the stonith resource to kill node 2 was actually running on node
> 2; which is pointless if node 2 locks up.  So I added resource
> constraints to force the stonith clone to stay on a node that was not
> the one to be shot.  This seemed to work well.
>
> The next issue I have is that when I disconnect the LAN cable on a
> single node that connects it to the rest of the network the clone
> stonith monitor will fail since it can't connect to the other nodes ILO
> for status.  After some time (minutes let's say) I reconnect the LAN
> cable but never see the clone stonith come back to life, just stays
> failed.  What should I be looking at to make sure that the clone stonith
> restarts properly.
>
> Any advice on how to more properly setup an HP ILO stonith in this
> scenario would be greatly appreciated.  (I can see where a clone stonith
> would be useful in a large cluster of n>2 nodes since all nodes could
> have a chance to shoot a failed node and maybe this is the reason for
> cloned stonith with ILO?  Basically in a cluster of N nodes each node
> would be running N-1 stonith resources, ready to shoot a dead node.)
>
> Thanks in advance,
> -ab
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker at clusterlabs.org
> http://list.clusterlabs.org/mailman/listinfo/pacemaker
>
--
Takenaka Kazuhiro <takenaka.kazuhiro at oss.ntt.co.jp>
NTT Open Source Software Center

-------------- next part --------------
 <cib admin_epoch="0" epoch="0" num_updates="0">
   <configuration>
     <crm_config>
... snip ...
     <resources>
... write configurations for resources that you really want to use ...

<!-- the configuration for the stonith plugin that shoots node01 from node02 -->
       <primitive id="prmStonithN1" class="stonith" type="external/ibmrsa-telnet" provider="heartbeat" resource_stickiness="INFINITY">
         <operations>
           <op name="monitor" interval="20" timeout="300" prereq="nothing" id="prmStonithN1:monitor"/>
           <op name="start" timeout="180" id="prmStonithN1:start"/>
           <op name="stop" timeout="180" id="prmStonithN1:stop"/>
         </operations>
         <instance_attributes id="prmStonithN1:attr">
           <attributes>
             <nvpair id="prmStonithN1:nodename" name="nodename" value="node01"/>
             <nvpair id="prmStonithN1:ipaddr" name="ip_address" value="192.168.16.126"/>
             <nvpair id="prmStonithN1:userid" name="username" value="USERID"/>
             <nvpair id="prmStonithN1:passwd" name="password" value="***"/>
           </attributes>
         </instance_attributes>
       </primitive>

<!-- the configuration for the stonith plugin that shoots node02 from node01 -->
       <primitive id="prmStonithN2" class="stonith" type="external/ibmrsa-telnet" provider="heartbeat" resource_stickiness="INFINITY">
         <operations>
           <op name="monitor" interval="20" timeout="300" prereq="nothing" id="prmStonithN2:monitor"/>
           <op name="start" timeout="180" id="prmStonithN2:start"/>
           <op name="stop" timeout="180" id="prmStonithN2:stop"/>
         </operations>
         <instance_attributes id="prmStonithN2:attr">
           <attributes>
             <nvpair id="prmStonithN2:nodename" name="nodename" value="node02"/>
             <nvpair id="prmStonithN2:ipaddr" name="ip_address" value="192.168.16.127"/>
             <nvpair id="prmStonithN2:userid" name="username" value="USERID"/>
             <nvpair id="prmStonithN2:passwd" name="password" value="###"/>
           </attributes>
         </instance_attributes>
       </primitive>
     </resources>

     <constraints>
... write constraints for resources that you really want to use ...

<!-- the constraint to keep the stonith plugin that shoots node01 on node02 -->
       <rsc_location id="prmStonithN1_hates_node01" rsc="prmStonithN1">
         <rule id="prmStonithN1_hates_node01_rule" score="-INFINITY">
           <expression attribute="#uname" operation="eq" value="node01" id="prmStonithN1_hates_N1_expr"/>
         </rule>
       </rsc_location>

<!-- the constraint to keep the stonith plugin that shoots node02 on node01 -->
       <rsc_location id="prmStonithN2_hates_node02" rsc="prmStonithN2">
         <rule id="prmStonithN2_hates_node02_rule" score="-INFINITY">
           <expression attribute="#uname" operation="eq" value="node02" id="prmStonithN2_hates_N2_expr"/>
         </rule>
       </rsc_location>

     </constraints>
   </configuration>
   <status/>
 </cib>
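
One way to load a fragment like the above is with cibadmin, roughly
as follows. The file names are only placeholders and the options are
the ones I remember from the 2.1.3-era tools, so please double-check
them against your version:

# cibadmin -o resources -U -x stonith-resources.xml
# cibadmin -o constraints -U -x stonith-constraints.xml

Each file would contain just the <primitive> or <rsc_location>
elements shown above, not the whole <cib> document.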


