[Pacemaker] failure handling on a cloned resource
Johan Huysmans
johan.huysmans at inuits.be
Fri May 3 12:40:45 UTC 2013
Hi,
Below you can see my setup and my test; this shows that my cloned
resource with on-fail="block" does not recover automatically.
My Setup:
# rpm -aq | grep -i pacemaker
pacemaker-libs-1.1.9-1512.el6.i686
pacemaker-cluster-libs-1.1.9-1512.el6.i686
pacemaker-cli-1.1.9-1512.el6.i686
pacemaker-1.1.9-1512.el6.i686
# crm configure show
node CSE-1
node CSE-2
primitive d_tomcat ocf:ntc:tomcat \
op monitor interval="15s" timeout="510s" on-fail="block" \
op start interval="0" timeout="510s" \
params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" monitor_timeout="120" \
meta migration-threshold="1"
primitive ip_11 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params broadcast="172.16.11.31" ip="172.16.11.31" nic="bond0.111"
iflabel="ha" \
meta migration-threshold="1" failure-timeout="10"
primitive ip_19 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params broadcast="172.18.19.31" ip="172.18.19.31" nic="bond0.119"
iflabel="ha" \
meta migration-threshold="1" failure-timeout="10"
group svc-cse ip_19 ip_11
clone cl_tomcat d_tomcat
colocation colo_tomcat inf: svc-cse cl_tomcat
order order_tomcat inf: cl_tomcat svc-cse
property $id="cib-bootstrap-options" \
dc-version="1.1.9-1512.el6-2a917dd" \
cluster-infrastructure="cman" \
pe-warn-series-max="9" \
no-quorum-policy="ignore" \
stonith-enabled="false" \
pe-input-series-max="9" \
pe-error-series-max="9" \
last-lrm-refresh="1367582088"
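(Side note on the failure-timeout="10" above: an expired failure-timeout is
only acted on the next time the policy engine runs, which by default is every
15 minutes (cluster-recheck-interval), as described in the s-rules-recheck
link quoted further down. A minimal sketch of lowering that interval, assuming
the default is still in effect on this cluster:

# crm configure property cluster-recheck-interval="60s"

With that set, expired failures should be re-evaluated within about a minute
instead of waiting for the next natural cluster event.)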
Currently only one node, CSE-1, is available.
This is how I am currently testing my setup:
=> Starting point: Everything up and running
# crm resource status
Resource Group: svc-cse
ip_19 (ocf::heartbeat:IPaddr2): Started
ip_11 (ocf::heartbeat:IPaddr2): Started
Clone Set: cl_tomcat [d_tomcat]
Started: [ CSE-1 ]
Stopped: [ d_tomcat:1 ]
=> Causing failure: change the system so that tomcat is running but reports a failure
(see attachment step_2.log)
# crm resource status
Resource Group: svc-cse
ip_19 (ocf::heartbeat:IPaddr2): Stopped
ip_11 (ocf::heartbeat:IPaddr2): Stopped
Clone Set: cl_tomcat [d_tomcat]
d_tomcat:0 (ocf::ntc:tomcat): Started (unmanaged) FAILED
Stopped: [ d_tomcat:1 ]
=> Fixing failure: revert the system so that tomcat is running without a failure
(see attachment step_3.log)
# crm resource status
Resource Group: svc-cse
ip_19 (ocf::heartbeat:IPaddr2): Stopped
ip_11 (ocf::heartbeat:IPaddr2): Stopped
Clone Set: cl_tomcat [d_tomcat]
d_tomcat:0 (ocf::ntc:tomcat): Started (unmanaged) FAILED
Stopped: [ d_tomcat:1 ]
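(For reference: with on-fail="block" the failed instance is left unmanaged,
so the usual manual way out of the state shown above is a cleanup, sketched
here against the resource and node names from this configuration:

# crm resource cleanup cl_tomcat CSE-1

which clears the failed monitor operation from the operation history and lets
pacemaker re-probe the resource. The open question is why this does not
happen automatically once the failure is gone.)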
As you can see in the logs, the OCF script no longer returns any failure.
Pacemaker notices this; however, it is not reflected in crm_mon and the
dependent resources are not started.
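(To double-check what pacemaker has actually recorded here, the fail counts
can be printed next to the status, e.g.:

# crm_mon -1 -f

where -1 prints the status once and -f adds the per-resource fail counts,
showing whether the old monitor failure is still counted against d_tomcat on
CSE-1.)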
Greetings,
Johan
On 2013-05-03 03:04, Andrew Beekhof wrote:
> On 02/05/2013, at 5:45 PM, Johan Huysmans <johan.huysmans at inuits.be> wrote:
>
>> On 2013-05-01 05:48, Andrew Beekhof wrote:
>>> On 17/04/2013, at 9:54 PM, Johan Huysmans <johan.huysmans at inuits.be> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I'm trying to set up a specific configuration in our cluster; however, I'm struggling with my configuration.
>>>>
>>>> This is what I'm trying to achieve:
>>>> On both nodes of the cluster a daemon must be running (tomcat).
>>>> Some failover addresses are configured and must be running on the node with a correctly running tomcat.
>>>>
>>>> I have achieved this with a cloned tomcat resource and a colocation between the cloned tomcat and the failover addresses.
>>>> When I cause a failure in the tomcat on the node running the failover addresses, the failover addresses will failover to the other node as expected.
>>>> crm_mon shows that this tomcat has a failure.
>>>> When I configure the tomcat resource with failure-timeout=0, the failure alarm in crm_mon isn't cleared when the tomcat failure is fixed.
>>> All sounds right so far.
>> If my broken tomcat is automatically fixed, I expect pacemaker to notice this and that node to become eligible to run my failover addresses again;
>> however, I don't see this happening.
> This is very hard to discuss without seeing logs.
>
> So you created a tomcat error, waited for pacemaker to notice, fixed the error, and observed that pacemaker did not re-notice?
> How long did you wait? More than the 15s repeat interval I assume? Did at least the resource agent notice?
>
>>>> When I configure the tomcat resource with failure-timeout=30, the failure alarm in crm_mon is cleared after 30 seconds; however, the tomcat still has a failure.
>>> Can you define "still having a failure"?
>>> You mean it still shows up in crm_mon?
>>> Have you read this link?
>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-rules-recheck.html
>> "Still having a failure" means that the tomcat is still broken and my OCF script reports it as a failure.
>>>> What I expect is that pacemaker reports the failure for as long as it exists, and reports that everything is OK once everything is back to normal.
>>>>
>>>> Do I do something wrong with my configuration?
>>>> Or how can I achieve my wanted setup?
>>>>
>>>> Here is my configuration:
>>>>
>>>> node CSE-1
>>>> node CSE-2
>>>> primitive d_tomcat ocf:custom:tomcat \
>>>> op monitor interval="15s" timeout="510s" on-fail="block" \
>>>> op start interval="0" timeout="510s" \
>>>> params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" monitor_timeout="120" \
>>>> meta migration-threshold="1" failure-timeout="0"
>>>> primitive ip_1 ocf:heartbeat:IPaddr2 \
>>>> op monitor interval="10s" \
>>>> params nic="bond0" broadcast="10.1.1.1" iflabel="ha" ip="10.1.1.1"
>>>> primitive ip_2 ocf:heartbeat:IPaddr2 \
>>>> op monitor interval="10s" \
>>>> params nic="bond0" broadcast="10.1.1.2" iflabel="ha" ip="10.1.1.2"
>>>> group svc-cse ip_1 ip_2
>>>> clone cl_tomcat d_tomcat
>>>> colocation colo_tomcat inf: svc-cse cl_tomcat
>>>> order order_tomcat inf: cl_tomcat svc-cse
>>>> property $id="cib-bootstrap-options" \
>>>> dc-version="1.1.8-7.el6-394e906" \
>>>> cluster-infrastructure="cman" \
>>>> no-quorum-policy="ignore" \
>>>> stonith-enabled="false"
>>>>
>>>> Thanks!
>>>>
>>>> Greetings,
>>>> Johan Huysmans
>>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: step_2.log
Type: text/x-log
Size: 7137 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20130503/9168a8b5/attachment-0008.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: step_3.log
Type: text/x-log
Size: 2475 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20130503/9168a8b5/attachment-0009.bin>