[Pacemaker] on-fail is not effective

Kazunori INOUE <inouekazu@intellilink.co.jp>
Mon Apr 9 11:56:51 CEST 2012


Hi,

(12.04.07 06:19), David Vossel wrote:
> ----- Original Message -----
>> From: "Kazunori INOUE" <inouekazu@intellilink.co.jp>
>> To: "pacemaker at oss" <pacemaker@oss.clusterlabs.org>
>> Cc: koichi@intellilink.co.jp
>> Sent: Thursday, April 5, 2012 10:08:44 PM
>> Subject: [Pacemaker]  on-fail is not effective
>>
>> Hi,
>>
>> I am using Pacemaker-1.1 (devel:
>> 7172b7323bb72c51999ce11c6fa5d3ff0a0a4b4f).
>> The "on-fail" setting does not take effect.
>> For example, the default action ("restart") is taken even when
>> "stop" is specified.
>
> The resource is stopping, but if there is nothing to prevent the resource from starting again it will start after the stop action has completed. This is probably why 'restart' and 'stop' appear to have the same behavior.
>
> -- Vossel
>
Is this the intended behavior?

I tested the same configuration on Pacemaker-1.0, and the resource
behaved differently there than on Pacemaker-1.1.

When the monitor operation (on-fail="stop") fails:
- Pacemaker-1.0:
   the resource stops and does not start anywhere else.
- Pacemaker-1.1:
   the resource stops and then starts again on a different node.
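
If David's explanation is right, Pacemaker-1.1 only keeps the resource
down while the stop action is in flight. The end state I expect from
on-fail="stop" can be produced by hand with the crm shell (shown only
for comparison, since 'stop' sets target-role="Stopped" and the
resource is then not started anywhere):

# crm resource stop prmDummy1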

---- ----
Configuration:
property no-quorum-policy="ignore" \
         stonith-enabled="false" \
         startup-fencing="false"
rsc_defaults resource-stickiness="INFINITY" \
         migration-threshold="1"
primitive prmDummy1 ocf:pacemaker:Dummy \
         op start timeout="90s" on-fail="restart" \
         op monitor interval="10s" timeout="60s" on-fail="stop" \
         op stop timeout="100s" on-fail="block"
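
To rule out a configuration problem, the operations as they are
actually stored in the CIB can be dumped with crm_resource (output
omitted here):

# crm_resource --resource prmDummy1 --query-xml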

---- ----
State of Pacemaker-1.0:

# crm_mon -rf1
============
Last updated: Mon Apr  9 11:35:02 2012
Stack: Heartbeat
Current DC: vm2 (f370d087-433e-462e-8b83-d4a6c13219fa) - partition with quorum
Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
2 Nodes configured, unknown expected votes
1 Resources configured.
============

Online: [ vm1 vm2 ]

Full list of resources:

  prmDummy1      (ocf::pacemaker:Dummy): Started vm1

Migration summary:
* Node vm1:
* Node vm2:

Next, I make the monitor operation fail:
# /bin/rm -f /var/run/Dummy-prmDummy1.state
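
(This works because the Dummy agent's monitor reports the resource as
running only while its state file exists; roughly:

dummy_monitor() {
    if [ -f "${OCF_RESKEY_state}" ]; then
        return $OCF_SUCCESS      # 0: running
    fi
    return $OCF_NOT_RUNNING      # 7: not running
}

so removing the file makes the next recurring monitor return rc=7.)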

# crm_mon -rf1
============
(snip)
Full list of resources:

  prmDummy1      (ocf::pacemaker:Dummy): Stopped

Migration summary:
* Node vm1:
    prmDummy1: migration-threshold=1 fail-count=1
* Node vm2:

Failed actions:
     prmDummy1_monitor_10000 (node=vm1, call=4, rc=7, status=complete): not running
#
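
Here rc=7 is OCF_NOT_RUNNING, and with migration-threshold=1 a single
failure is enough to ban the resource from vm1. Between tests the
failure record can be cleared with the crm shell:

# crm resource cleanup prmDummy1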

---- ----
State of Pacemaker-1.1:

# crm_mon -rf1
============
Last updated: Mon Apr  9 13:03:34 2012
Last change: Mon Apr  9 13:03:13 2012 via cibadmin on vm1
Stack: Heartbeat
Current DC: vm2 (f370d087-433e-462e-8b83-d4a6c13219fa) - partition with quorum
Version: 1.1.8-1.el6-0cff1b528574f280a28c030034acabee56004f0f
2 Nodes configured, unknown expected votes
1 Resources configured.
============

Online: [ vm2 vm1 ]

Full list of resources:

  prmDummy1      (ocf::pacemaker:Dummy): Started vm1

Migration summary:
* Node vm2:
* Node vm1:

# /bin/rm -f /var/run/Dummy-prmDummy1.state
# crm_mon -rf1
============
(snip)
Online: [ vm2 vm1 ]

Full list of resources:

  prmDummy1      (ocf::pacemaker:Dummy): Started vm2

Migration summary:
* Node vm2:
* Node vm1:
    prmDummy1: migration-threshold=1 fail-count=1

Failed actions:
     prmDummy1_monitor_10000 (node=vm1, call=4, rc=7, status=complete): not running
#
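
The 1.1 decision can also be reproduced offline with crm_simulate by
injecting the same monitor failure into a saved copy of the CIB
(cib.xml is just a hypothetical local file name):

# cibadmin --query > cib.xml
# crm_simulate --xml-file cib.xml --op-inject prmDummy1_monitor_10000@vm1=7 --simulate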

Best Regards,
Kazunori INOUE

>> [root@vm1 ~]# crm configure show | grep -A3 "primitive prmDummy1"
>> primitive prmDummy1 ocf:pacemaker:Dummy \
>>          op start interval="0" timeout="60s" on-fail="restart" \
>>          op monitor interval="10s" timeout="60s" on-fail="stop" \
>>          op stop interval="0" timeout="60s" on-fail="block"
>> [root@vm1 ~]#
>> [root@vm1 ~]# crm_mon -f1
>> ============
>> Last updated: Fri Apr  6 10:13:14 2012
>> Last change: Fri Apr  6 10:12:42 2012 via cibadmin on vm1
>> Stack: Heartbeat
>> Current DC: vm1 (87e0eef1-0d86-4e8a-adfe-51f444a4054f) - partition
>> with quorum
>> Version: 1.1.7-7172b73
>> 2 Nodes configured, unknown expected votes
>> 1 Resources configured.
>> ============
>>
>> Online: [ vm1 vm2 ]
>>
>>   prmDummy1      (ocf::pacemaker:Dummy): Started vm1
>>
>> Migration summary:
>> * Node vm1:
>> * Node vm2:
>> [root@vm1 ~]#
>> [root@vm1 ~]# rm -f /var/run/Dummy-prmDummy1.state
>> [root@vm1 ~]# crm_mon -f1
>> ============
>> Last updated: Fri Apr  6 10:13:33 2012
>> Last change: Fri Apr  6 10:12:42 2012 via cibadmin on vm1
>> Stack: Heartbeat
>> Current DC: vm1 (87e0eef1-0d86-4e8a-adfe-51f444a4054f) - partition
>> with quorum
>> Version: 1.1.7-7172b73
>> 2 Nodes configured, unknown expected votes
>> 1 Resources configured.
>> ============
>>
>> Online: [ vm1 vm2 ]
>>
>>   prmDummy1      (ocf::pacemaker:Dummy): Started vm2
>>
>> Migration summary:
>> * Node vm1:
>>     prmDummy1: migration-threshold=1 fail-count=1
>> * Node vm2:
>>
>> Failed actions:
>>      prmDummy1_monitor_10000 (node=vm1, call=4, rc=7,
>>      status=complete): not running
>> [root@vm1 ~]#
>>
>> The attached gdb_pengine.log is a gdb trace taken at the time of the
>> monitor failure.
>> Could the cause be that the 2nd argument (the 'key' variable) of the
>> find_rsc_op_entry() function is "prmDummy1_last_failure_0"?
>> If so, it seems that "on-fail" cannot be looked up. (L117~L205 of
>> the attached log)
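>>
>> For illustration: configured operations are keyed by
>> <resource>_<action>_<interval in ms>, e.g. "prmDummy1_monitor_10000",
>> so a lookup with "prmDummy1_last_failure_0" would match none of them
>> and on-fail would fall back to its default. The op entries in the
>> CIB can be listed with, for example (the grep pattern is only
>> approximate):
>>
>> # cibadmin -Q -o resources | grep '<op '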
>>
>> Best Regards,
>> Kazunori INOUE