[Pacemaker] stonith pacemaker problem

Shravan Mishra shravan.mishra at gmail.com
Sun Oct 10 21:20:58 UTC 2010


Andrew,

We were able to solve our problem. Obviously, if no one else is seeing
it, then it had to be something in our environment. It's just that the
time pressure and management pressure were driving us bonkers.

We had been struggling with this for the past four days.
Here is the story:

We had the following versions of the HA libraries on our appliance:

heartbeat=3.0.0
openais=1.0.0
pacemaker=1.0.9

When I started installing glue=1.0.3 on top of them, I got a bunch of
file conflicts, so I uninstalled heartbeat and openais and proceeded to
install the following, in this order:

1. glue=1.0.3
2. corosync=1.1.1
3. pacemaker=1.0.9
4. agents=1.0.3
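
For reference, that clean-reinstall attempt boiled down to roughly the
following conary commands (a sketch only; the exact trove names in our
repository may differ):

    # remove the old stack
    conary erase heartbeat openais
    # install the new stack in the order above
    conary update glue=1.0.3
    conary update corosync=1.1.1
    conary update pacemaker=1.0.9
    conary update agents=1.0.3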



And that's when we started seeing this problem.
After two days of going nowhere, we decided to leave the packages in
place and try installing with the --replace-files option.

We are using a build tool called conary, which has this option, rather
than the standard make/make install.

So we left the above heartbeat and openais packages as they were and
installed glue, corosync, and pacemaker on top of them with the
--replace-files option. This time there were no conflicts, and bingo,
it all works fine.
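
In conary terms, the sequence that finally worked was roughly this
(again a sketch; trove names are illustrative):

    # leave heartbeat and openais installed; overwrite the conflicting files
    conary update glue=1.0.3 --replace-files
    conary update corosync=1.1.1 --replace-files
    conary update pacemaker=1.0.9 --replace-files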

That left me somewhat confused as to why we still need heartbeat, given
the above four packages.
I understand that /usr/lib/ocf/resource.d/heartbeat contains the OCF
scripts provided by heartbeat, but those could be part of the "Reusable
cluster agents" subsystem.

Frankly, I thought that the way I had installed the system, by erasing
the old packages and installing fresh ones, should have worked.

But all said and done, I learned a lot about the cluster code by
stepping through it with gdb.
I'll be having a peaceful Thanksgiving.

Thanks, and happy Thanksgiving.
Shravan

On Sun, Oct 10, 2010 at 2:46 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
> Not enough information.
> We'd need more than just the lrmd's logs; they only show what happened, not why.
>
> On Thu, Oct 7, 2010 at 11:02 PM, Shravan Mishra
> <shravan.mishra at gmail.com> wrote:
>> Hi,
>>
>> Description of my environment:
>>   corosync=1.2.8
>>   pacemaker=1.1.3
>>   Linux= 2.6.29.6-0.6.smp.gcc4.1.x86_64 #1 SMP
>>
>>
>> We are having a problem with pacemaker, which is continuously
>> canceling the monitor operation of our stonith devices.
>>
>> We ran:
>>
>> stonith -d -t external/safe/ipmi hostname=ha2.itactics.com
>> ipaddr=192.168.2.7 userid=hellouser passwd=hello interface=lanplus -S
>>
>> Its output is attached as stonith.output.
>>
>> We have been trying to debug this issue for a few days now with no success.
>> We are hoping that someone can help us, as we are under immense
>> pressure to move to RCS unless we can solve this issue in a day or two,
>> which I personally don't want to do because we like the product.
>>
>> Any help will be greatly appreciated.
>>
>>
>> Here is an excerpt from the /var/log/messages:
>> =========================
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>> rsc:ha2.itactics.com-stonith:11155: start
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>> rsc:ha2.itactics.com-stonith:11156: monitor
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: cancel_op: operation
>> monitor[11156] on
>> stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584,
>> its parameters: CRM_meta_interval=[20000] target_role=[started]
>> ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000]
>> crm_feature_set=[3.0.2] CRM_meta_name=[monitor]
>> hostname=[ha2.itactics.com] passwd=[Ft01ST0pMF@]
>> userid=[safe_ipmi_admin]  cancelled
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11157: stop
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>> rsc:ha2.itactics.com-stonith:11158: start
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>> rsc:ha2.itactics.com-stonith:11159: monitor
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: cancel_op: operation
>> monitor[11159] on
>> stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584,
>> its parameters: CRM_meta_interval=[20000] target_role=[started]
>> ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000]
>> crm_feature_set=[3.0.2] CRM_meta_name=[monitor]
>> hostname=[ha2.itactics.com] passwd=[Ft01ST0pMF@]
>> userid=[safe_ipmi_admin]  cancelled
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11160: stop
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>> rsc:ha2.itactics.com-stonith:11161: start
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>> rsc:ha2.itactics.com-stonith:11162: monitor
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: cancel_op: operation
>> monitor[11162] on
>> stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584,
>> its parameters: CRM_meta_interval=[20000] target_role=[started]
>> ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000]
>> crm_feature_set=[3.0.2] CRM_meta_name=[monitor]
>> hostname=[ha2.itactics.com] passwd=[Ft01ST0pMF@]
>> userid=[safe_ipmi_admin]  cancelled
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11163: stop
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>> rsc:ha2.itactics.com-stonith:11164: start
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>> rsc:ha2.itactics.com-stonith:11165: monitor
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: cancel_op: operation
>> monitor[11165] on
>> stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584,
>> its parameters: CRM_meta_interval=[20000] target_role=[started]
>> ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000]
>> crm_feature_set=[3.0.2] CRM_meta_name=[monitor]
>> hostname=[ha2.itactics.com] passwd=[Ft01ST0pMF@]
>> userid=[safe_ipmi_admin]  cancelled
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11166: stop
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>> rsc:ha2.itactics.com-stonith:11167: start
>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>> rsc:ha2.itactics.com-stonith:11168: monitor
>> Oct  7 16:58:30 ha1 lrmd: [3581]: info: cancel_op: operation
>> monitor[11168] on
>> stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584,
>> its parameters: CRM_meta_interval=[20000] target_role=[started]
>> ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000]
>> crm_feature_set=[3.0.2] CRM_meta_name=[monitor]
>> hostname=[ha2.itactics.com] passwd=[Ft01ST0pMF@]
>> userid=[safe_ipmi_admin]  cancelled
>> Oct  7 16:58:30 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11169: stop
>> Oct  7 16:58:30 ha1 lrmd: [3581]: info:
>> rsc:ha2.itactics.com-stonith:11170: start
>> Oct  7 16:58:30 ha1 lrmd: [3581]: info: stonithRA plugin: got
>> metadata: <?xml version="1.0"?> <!DOCTYPE resource-agent SYSTEM
>> "ra-api-1.dtd"> <resource-agent name="external/safe/ipmi">
>> <version>1.0</version>   <longdesc lang="en"> ipmitool based power
>> management. Apparently, the power off method of ipmitool is
>> intercepted by ACPI which then makes a regular shutdown. If case of a
>> split brain on a two-node it may happen that no node survives. For
>> two-node clusters use only the reset method.    </longdesc>
>> <shortdesc lang="en">IPMI STONITH external device </shortdesc>
>> <parameters> <parameter name="hostname" unique="1"> <content
>> type="string" /> <shortdesc lang="en"> Hostname </shortdesc> <longdesc
>> lang="en"> The name of the host to be managed by this STONITH device.
>> </longdesc> </parameter>  <parameter name="ipaddr" unique="1">
>> <content type="string" /> <shortdesc lang="en"> IP Address
>> </shortdesc> <longdesc lang="en"> The IP address of the STONITH
>> device. </longdesc> </parameter>  <parameter name="userid" unique="1">
>> <content type="string" /> <shortdesc lang="en"> Login </shortdesc>
>> <longdesc lang="en"> The username used for logging in to the STONITH
>> device. </longdesc> </parameter>  <parameter name="passwd" unique="1">
>> <content type="string" /> <shortdesc lang="en"> Password </shortdesc>
>> <longdesc lang="en"> The password used for logging in to the STONITH
>> device. </longdesc> </parameter>  <parameter name="interface"
>> unique="1"> <content type="string" default="lan"/> <shortdesc
>> lang="en"> IPMI interface </shortdesc> <longdesc lang="en"> IPMI
>> interface to use, such as "lan" or "lanplus". </longdesc> </parameter>
>>  </parameters>    <actions>     <action name="start"   timeout="15" />
>>    <action name="stop"    timeout="15" />     <action name="status"
>> timeout="15" />     <action name="monitor" timeout="15" interval="15"
>> start-delay="15" />     <action name="meta-data"  timeout="15" />
>> </actions>   <special tag="heartbeat">     <version>2.0</version>
>> </special> </resource-agent>
>> Oct  7 16:58:30 ha1 lrmd: [3581]: info:
>> rsc:ha2.itactics.com-stonith:11171: monitor
>> Oct  7 16:58:30 ha1 lrmd: [3581]: info: cancel_op: operation
>> monitor[11171] on
>> stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584,
>> its parameters: CRM_meta_interval=[20000] target_role=[started]
>> ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000]
>> crm_feature_set=[3.0.2] CRM_meta_name=[monitor]
>> hostname=[ha2.itactics.com] passwd=[Ft01ST0pMF@]
>> userid=[safe_ipmi_admin]  cancelled
>> Oct  7 16:58:30 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11172: stop
>> Oct  7 16:58:30 ha1 lrmd: [3581]: info:
>> rsc:ha2.itactics.com-stonith:11173: start
>> Oct  7 16:58:30 ha1 lrmd: [3581]: info:
>> rsc:ha2.itactics.com-stonith:11174: monitor
>>
>> ==========================
>>
>> Thanks
>>
>> Shravan
>>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>


