[Pacemaker] failure handling on a cloned resource

Mon May 6 22:08:57 EDT 2013

I have a much clearer idea of the problem you're seeing now, thank you.

Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1?
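
For anyone following along: a saved policy engine input like this one can usually be replayed offline with crm_simulate, which is what makes the file useful for debugging. The snippet below is only a minimal sketch. The helper names (build_replay_command, replay) are hypothetical, and it assumes crm_simulate is installed and accepts the --simulate, --xml-file and --show-scores options.

```python
import shutil
import subprocess


def build_replay_command(pe_input_path, show_scores=False):
    """Build the argv for replaying a saved policy engine input file.

    Assumption: crm_simulate supports --simulate, --xml-file and
    --show-scores on the installed Pacemaker version.
    """
    cmd = ["crm_simulate", "--simulate", "--xml-file", pe_input_path]
    if show_scores:
        cmd.append("--show-scores")
    return cmd


def replay(pe_input_path):
    """Run crm_simulate on the file and return its stdout.

    Returns None when crm_simulate is not found on PATH.
    """
    if shutil.which("crm_simulate") is None:
        return None
    result = subprocess.run(
        build_replay_command(pe_input_path),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Path taken from the request above; nothing is executed here,
    # the command is only printed.
    path = "/var/lib/pacemaker/pengine/pe-input-1.bz2"
    print(" ".join(build_replay_command(path)))
```

Calling replay() on the node that holds the file should show what the policy engine decided for that transition, without touching the live cluster.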

On 03/05/2013, at 10:40 PM, Johan Huysmans <johan.huysmans at inuits.be> wrote:

> Hi,
> 
> Below you can see my setup and my test. They show that my cloned resource with on-fail=block does not recover automatically.
> 
> My Setup:
> 
> # rpm -aq | grep -i pacemaker
> pacemaker-libs-1.1.9-1512.el6.i686
> pacemaker-cluster-libs-1.1.9-1512.el6.i686
> pacemaker-cli-1.1.9-1512.el6.i686
> pacemaker-1.1.9-1512.el6.i686
> 
> # crm configure show
> node CSE-1
> node CSE-2
> primitive d_tomcat ocf:ntc:tomcat \
>    op monitor interval="15s" timeout="510s" on-fail="block" \
>    op start interval="0" timeout="510s" \
>    params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" monitor_timeout="120" \
>    meta migration-threshold="1"
> primitive ip_11 ocf:heartbeat:IPaddr2 \
>    op monitor interval="10s" \
>    params broadcast="172.16.11.31" ip="172.16.11.31" nic="bond0.111" iflabel="ha" \
>    meta migration-threshold="1" failure-timeout="10"
> primitive ip_19 ocf:heartbeat:IPaddr2 \
>    op monitor interval="10s" \
>    params broadcast="172.18.19.31" ip="172.18.19.31" nic="bond0.119" iflabel="ha" \
>    meta migration-threshold="1" failure-timeout="10"
> group svc-cse ip_19 ip_11
> clone cl_tomcat d_tomcat
> colocation colo_tomcat inf: svc-cse cl_tomcat
> order order_tomcat inf: cl_tomcat svc-cse
> property $id="cib-bootstrap-options" \
>    dc-version="1.1.9-1512.el6-2a917dd" \
>    cluster-infrastructure="cman" \
>    pe-warn-series-max="9" \
>    no-quorum-policy="ignore" \
>    stonith-enabled="false" \
>    pe-input-series-max="9" \
>    pe-error-series-max="9" \
>    last-lrm-refresh="1367582088"
> 
> Currently only 1 node is available, CSE-1.
> 
> 
> This is how I am currently testing my setup:
> 
> => Starting point: Everything up and running
> 
> # crm resource status
> Resource Group: svc-cse
>     ip_19    (ocf::heartbeat:IPaddr2):    Started
>     ip_11    (ocf::heartbeat:IPaddr2):    Started
> Clone Set: cl_tomcat [d_tomcat]
>     Started: [ CSE-1 ]
>     Stopped: [ d_tomcat:1 ]
> 
> => Causing failure: Change system so tomcat is running but has a failure (in attachment step_2.log)
> 
> # crm resource status
> Resource Group: svc-cse
>     ip_19    (ocf::heartbeat:IPaddr2):    Stopped
>     ip_11    (ocf::heartbeat:IPaddr2):    Stopped
> Clone Set: cl_tomcat [d_tomcat]
>     d_tomcat:0    (ocf::ntc:tomcat):    Started (unmanaged) FAILED
>     Stopped: [ d_tomcat:1 ]
> 
> => Fixing failure: Revert system so tomcat is running without failure (in attachment step_3.log)
> 
> # crm resource status
> Resource Group: svc-cse
>     ip_19    (ocf::heartbeat:IPaddr2):    Stopped
>     ip_11    (ocf::heartbeat:IPaddr2):    Stopped
> Clone Set: cl_tomcat [d_tomcat]
>     d_tomcat:0    (ocf::ntc:tomcat):    Started (unmanaged) FAILED
>     Stopped: [ d_tomcat:1 ]
> 
> As you can see in the logs, the OCF script no longer returns a failure. Pacemaker notices this,
> but it is not reflected in crm_mon and the dependent resources are not started.
> 
> Gr.
> Johan
> 
> On 2013-05-03 03:04, Andrew Beekhof wrote:
>> On 02/05/2013, at 5:45 PM, Johan Huysmans <johan.huysmans at inuits.be> wrote:
>> 
>>> On 2013-05-01 05:48, Andrew Beekhof wrote:
>>>> On 17/04/2013, at 9:54 PM, Johan Huysmans <johan.huysmans at inuits.be> wrote:
>>>> 
>>>>> Hi All,
>>>>> 
>>>>> I'm trying to set up a specific configuration in our cluster, but I'm struggling with it.
>>>>> 
>>>>> This is what I'm trying to achieve:
>>>>> On both nodes of the cluster a daemon must be running (tomcat).
>>>>> Some failover addresses are configured and must be running on the node with a correctly running tomcat.
>>>>> 
>>>>> I have achieved this with a cloned tomcat resource and a colocation between the cloned tomcat and the failover addresses.
>>>>> When I cause a failure in the tomcat on the node running the failover addresses, the failover addresses fail over to the other node as expected.
>>>>> crm_mon shows that this tomcat has a failure.
>>>>> When I configure the tomcat resource with failure-timeout=0, the failure alarm in crm_mon isn't cleared whenever the tomcat failure is fixed.
>>>> All sounds right so far.
>>> If my broken tomcat is automatically fixed, I expect pacemaker to notice this and that node to be able to run my failover addresses;
>>> however, I don't see this happening.
>> This is very hard to discuss without seeing logs.
>> 
>> So you created a tomcat error, waited for pacemaker to notice, fixed the error and observed that pacemaker did not re-notice?
>> How long did you wait? More than the 15s repeat interval I assume?  Did at least the resource agent notice?
>> 
>>>>> When I configure the tomcat resource with failure-timeout=30, the failure alarm in crm_mon is cleared after 30 seconds; however, the tomcat still has a failure.
>>>> Can you define "still having a failure"?
>>>> You mean it still shows up in crm_mon?
>>>> Have you read this link?
>>>>    http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-rules-recheck.html
>>> "Still having a failure" means that the tomcat is still broken and my OCF script reports it as a failure.
>>>>> What I expect is that pacemaker reports the failure for as long as it exists, and reports that everything is OK once everything is OK again.
>>>>> 
>>>>> Am I doing something wrong in my configuration?
>>>>> Or how can I achieve the setup I want?
>>>>> 
>>>>> Here is my configuration:
>>>>> 
>>>>> node CSE-1
>>>>> node CSE-2
>>>>> primitive d_tomcat ocf:custom:tomcat \
>>>>>    op monitor interval="15s" timeout="510s" on-fail="block" \
>>>>>    op start interval="0" timeout="510s" \
>>>>>    params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" monitor_timeout="120" \
>>>>>    meta migration-threshold="1" failure-timeout="0"
>>>>> primitive ip_1 ocf:heartbeat:IPaddr2 \
>>>>>    op monitor interval="10s" \
>>>>>    params nic="bond0" broadcast="10.1.1.1" iflabel="ha" ip="10.1.1.1"
>>>>> primitive ip_2 ocf:heartbeat:IPaddr2 \
>>>>>    op monitor interval="10s" \
>>>>>    params nic="bond0" broadcast="10.1.1.2" iflabel="ha" ip="10.1.1.2"
>>>>> group svc-cse ip_1 ip_2
>>>>> clone cl_tomcat d_tomcat
>>>>> colocation colo_tomcat inf: svc-cse cl_tomcat
>>>>> order order_tomcat inf: cl_tomcat svc-cse
>>>>> property $id="cib-bootstrap-options" \
>>>>>    dc-version="1.1.8-7.el6-394e906" \
>>>>>    cluster-infrastructure="cman" \
>>>>>    no-quorum-policy="ignore" \
>>>>>    stonith-enabled="false"
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> Greetings,
>>>>> Johan Huysmans
>>>>> 
>>>>> _______________________________________________
>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>> 
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org
> 
> <step_2.log><step_3.log>
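
The `crm resource status` listings quoted in this thread can also be checked mechanically, which may help when repeating the test. The following is a rough sketch, not an official tool: the function names are hypothetical, the sample text is copied from the quoted output (with the "> " quote prefix removed), and the regular expression assumes the line layout shown there.

```python
import re

# Sample copied from the `crm resource status` output quoted in this thread.
SAMPLE_STATUS = """\
Resource Group: svc-cse
    ip_19    (ocf::heartbeat:IPaddr2):    Stopped
    ip_11    (ocf::heartbeat:IPaddr2):    Stopped
Clone Set: cl_tomcat [d_tomcat]
    d_tomcat:0    (ocf::ntc:tomcat):    Started (unmanaged) FAILED
    Stopped: [ d_tomcat:1 ]
"""

# Matches indented per-resource lines of the form
#   "<name>    (<agent>):    <state>"
# Assumption: the layout matches the output shown in the thread.
RESOURCE_LINE = re.compile(
    r"^\s+(?P<name>\S+)\s+\((?P<agent>[^)]+)\):\s+(?P<state>.+?)\s*$"
)


def parse_resources(status_text):
    """Return (name, agent, state) tuples for each per-resource line."""
    rows = []
    for line in status_text.splitlines():
        match = RESOURCE_LINE.match(line)
        if match:
            rows.append(
                (match.group("name"), match.group("agent"), match.group("state"))
            )
    return rows


def blocked_resources(status_text):
    """Return names of resources shown as both unmanaged and FAILED."""
    return [
        name
        for name, _agent, state in parse_resources(status_text)
        if "unmanaged" in state and "FAILED" in state
    ]


if __name__ == "__main__":
    print(blocked_resources(SAMPLE_STATUS))
```

Running this against the status captured after the fix would show whether d_tomcat:0 is still listed as unmanaged and FAILED, which is the symptom described in the test above.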