[Pacemaker] Restart of resources

Andrew Beekhof andrew at beekhof.net
Thu Feb 6 22:52:52 EST 2014


On 3 Feb 2014, at 9:40 pm, Frank Brendel <Frank.Brendel at eurolog.com> wrote:

> I've solved the problem.
> 
> When I set cluster-recheck-interval to a value less than failure-timeout,
> it works.
> 
> Is this expected behavior?

Yes.

> 
> This is not documented anywhere.

It's somewhat implied in the description of cluster-recheck-interval:

       cluster-recheck-interval = time [15min]
           Polling interval for time based changes to options, resource parameters and constraints.

           The Cluster is primarily event driven, however the configuration can have elements that change based on time. To ensure these changes take effect, we can optionally poll the cluster's status for changes. Allowed values: Zero disables polling. Positive values are an interval in seconds (unless other SI units are specified.
           eg. 5min)

The failure-timeout doesn't generate any events on its own, so the reprocessing happens the next time the PE (policy engine) is kicked by the recheck timer.
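
For example, lowering the recheck interval below the failure-timeout makes the expiry get picked up promptly. A minimal sketch with pcs (the 60s/120s values are only illustrative):

    # pcs property set cluster-recheck-interval=60s
    # pcs resource meta resClamd failure-timeout=120s

With these values the PE is re-run at least once every 60 seconds, so an expired failure is cleared at most one recheck interval after the 120s failure-timeout has elapsed.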

> Neither here 
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html
> nor here 
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html
> 
> 
> Regards
> Frank
> 
> 
> Am 28.01.2014 14:44, schrieb Frank Brendel:
>> Does no one have an idea?
>> Or can someone tell me whether it is even possible?
>> 
>> 
>> Thanks
>> Frank
>> 
>> 
>> Am 23.01.2014 10:50, schrieb Frank Brendel:
>>> Hi list,
>>> 
>>> I'm having some trouble configuring a resource that is allowed to fail
>>> once every two minutes.
>>> The documentation states that I have to configure migration-threshold
>>> and failure-timeout to achieve this.
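>>> For reference, those meta attributes can be set with something like the
>>> following (a sketch; the exact pcs invocation may vary by version):
>>> 
>>> # pcs resource meta resClamd failure-timeout=120s migration-threshold=2
>>> 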
>>> Here is the configuration for the resource.
>>> 
>>> # pcs config
>>> Cluster Name: mycluster
>>> Corosync Nodes:
>>> 
>>> Pacemaker Nodes:
>>>  Node1 Node2 Node3
>>> 
>>> Resources:
>>>  Clone: resClamd-clone
>>>   Meta Attrs: clone-max=3 clone-node-max=1 interleave=true
>>>   Resource: resClamd (class=lsb type=clamd)
>>>    Meta Attrs: failure-timeout=120s migration-threshold=2
>>>    Operations: monitor on-fail=restart interval=60s
>>> (resClamd-monitor-on-fail-restart)
>>> 
>>> Stonith Devices:
>>> Fencing Levels:
>>> 
>>> Location Constraints:
>>> Ordering Constraints:
>>> Colocation Constraints:
>>> 
>>> Cluster Properties:
>>>  cluster-infrastructure: cman
>>>  dc-version: 1.1.10-14.el6_5.1-368c726
>>>  last-lrm-refresh: 1390468150
>>>  stonith-enabled: false
>>> 
>>> # pcs resource defaults
>>> resource-stickiness: INFINITY
>>> 
>>> # pcs status
>>> Cluster name: mycluster
>>> Last updated: Thu Jan 23 10:12:49 2014
>>> Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node2
>>> Stack: cman
>>> Current DC: Node2 - partition with quorum
>>> Version: 1.1.10-14.el6_5.1-368c726
>>> 3 Nodes configured
>>> 3 Resources configured
>>> 
>>> 
>>> Online: [ Node1 Node2 Node3 ]
>>> 
>>> Full list of resources:
>>> 
>>>  Clone Set: resClamd-clone [resClamd]
>>>      Started: [ Node1 Node2 Node3 ]
>>> 
>>> 
>>> Stopping the clamd daemon sets the failcount to 1 and the daemon is
>>> started again. Ok.
>>> 
>>> 
>>> # service clamd stop
>>> Stopping Clam AntiVirus Daemon:                            [  OK ]
>>> 
>>> /var/log/messages
>>> Jan 23 10:15:20 Node1 crmd[6075]:   notice: process_lrm_event:
>>> Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
>>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update
>>> relayed from Node2
>>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: fail-count-resClamd (1)
>>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_perform_update:
>>> Sent update 177: fail-count-resClamd=1
>>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update
>>> relayed from Node2
>>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: last-failure-resClamd (1390468520)
>>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_perform_update:
>>> Sent update 179: last-failure-resClamd=1390468520
>>> Jan 23 10:15:20 Node1 crmd[6075]:   notice: process_lrm_event:
>>> Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
>>> Jan 23 10:15:21 Node1 crmd[6075]:   notice: process_lrm_event: LRM
>>> operation resClamd_stop_0 (call=310, rc=0, cib-update=110,
>>> confirmed=true) ok
>>> Jan 23 10:15:30 Node1 crmd[6075]:   notice: process_lrm_event:
>>> LRM operation resClamd_start_0 (call=314, rc=0, cib-update=111,
>>> confirmed=true) ok
>>> Jan 23 10:15:30 Node1 crmd[6075]:   notice: process_lrm_event:
>>> LRM operation resClamd_monitor_60000 (call=317, rc=0, cib-update=112,
>>> confirmed=false) ok
>>> 
>>> # pcs status
>>> Cluster name: mycluster
>>> Last updated: Thu Jan 23 10:16:48 2014
>>> Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
>>> Stack: cman
>>> Current DC: Node2 - partition with quorum
>>> Version: 1.1.10-14.el6_5.1-368c726
>>> 3 Nodes configured
>>> 3 Resources configured
>>> 
>>> 
>>> Online: [ Node1 Node2 Node3 ]
>>> 
>>> Full list of resources:
>>> 
>>>  Clone Set: resClamd-clone [resClamd]
>>>      Started: [ Node1 Node2 Node3 ]
>>> 
>>> Failed actions:
>>>     resClamd_monitor_60000 on Node1 'not running' (7): call=305,
>>> status=complete, last-rc-change='Thu Jan 23 10:15:20 2014',
>>> queued=0ms, exec=0ms
>>> 
>>> # pcs resource failcount show resClamd
>>> Failcounts for resClamd
>>>  Node1: 1
>>> 
>>> 
>>> After 7 minutes I let it fail again, and as I understood it, it should
>>> have been restarted as well. But it wasn't.
>>> 
>>> 
>>> # service clamd stop
>>> Stopping Clam AntiVirus Daemon:                            [  OK ]
>>> 
>>> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event: LRM
>>> operation resClamd_monitor_60000 (call=317, rc=7, cib-update=113,
>>> confirmed=false) not running
>>> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event:
>>> Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
>>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update
>>> relayed from Node2
>>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: fail-count-resClamd (2)
>>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_perform_update:
>>> Sent update 181: fail-count-resClamd=2
>>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update
>>> relayed from Node2
>>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: last-failure-resClamd (1390468950)
>>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_perform_update:
>>> Sent update 183: last-failure-resClamd=1390468950
>>> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event:
>>> Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
>>> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event: LRM
>>> operation resClamd_stop_0 (call=322, rc=0, cib-update=114,
>>> confirmed=true) ok
>>> 
>>> # pcs status
>>> Cluster name: mycluster
>>> Last updated: Thu Jan 23 10:22:41 2014
>>> Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
>>> Stack: cman
>>> Current DC: Node2 - partition with quorum
>>> Version: 1.1.10-14.el6_5.1-368c726
>>> 3 Nodes configured
>>> 3 Resources configured
>>> 
>>> 
>>> Online: [ Node1 Node2 Node3 ]
>>> 
>>> Full list of resources:
>>> 
>>>  Clone Set: resClamd-clone [resClamd]
>>>      Started: [ Node2 Node3 ]
>>>      Stopped: [ Node1 ]
>>> 
>>> Failed actions:
>>>     resClamd_monitor_60000 on Node1 'not running' (7): call=317,
>>> status=complete, last-rc-change='Thu Jan 23 10:22:30 2014',
>>> queued=0ms, exec=0ms
>>> 
>>> 
>>> What's wrong with my configuration?
>>> 
>>> 
>>> Thanks in advance
>>> Frank
>>> 
>> 
> 
> 
