[Pacemaker] Restart of resources

Mon Feb 3 10:40:03 UTC 2014

I've solved the problem.

When I set cluster-recheck-interval to a value less than failure-timeout 
it works.
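
For reference, a minimal sketch of that change with pcs, the tool used in the configuration quoted below. The 60s value is only an illustrative assumption, not necessarily the value I ended up with; any interval shorter than the resource's failure-timeout (120s here) fits the description above.

```shell
# Let the cluster re-evaluate its state every 60s (example value, assumed),
# which is below failure-timeout=120s on resClamd.
pcs property set cluster-recheck-interval=60s

# Confirm that the property is set.
pcs property show cluster-recheck-interval
```

A shorter interval makes the cluster recompute its state more often, so it is probably best not to set it lower than necessary.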

Is this expected behavior?

I could not find this documented anywhere, neither in Clusters from Scratch
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html
nor in Pacemaker Explained
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html

Regards
Frank

On 28.01.2014 14:44, Frank Brendel wrote:
> No one with an idea?
> Or can someone tell me if it is even possible?
>
>
> Thanks
> Frank
>
>
> On 23.01.2014 10:50, Frank Brendel wrote:
>> Hi list,
>>
>> I have some trouble configuring a resource that is allowed to fail
>> once in two minutes.
>> The documentation states that I have to configure migration-threshold
>> and failure-timeout to achieve this.
>> Here is the configuration for the resource.
>>
>> # pcs config
>> Cluster Name: mycluster
>> Corosync Nodes:
>>
>> Pacemaker Nodes:
>>   Node1 Node2 Node3
>>
>> Resources:
>>   Clone: resClamd-clone
>>    Meta Attrs: clone-max=3 clone-node-max=1 interleave=true
>>    Resource: resClamd (class=lsb type=clamd)
>>     Meta Attrs: failure-timeout=120s migration-threshold=2
>>     Operations: monitor on-fail=restart interval=60s (resClamd-monitor-on-fail-restart)
>>
>> Stonith Devices:
>> Fencing Levels:
>>
>> Location Constraints:
>> Ordering Constraints:
>> Colocation Constraints:
>>
>> Cluster Properties:
>>   cluster-infrastructure: cman
>>   dc-version: 1.1.10-14.el6_5.1-368c726
>>   last-lrm-refresh: 1390468150
>>   stonith-enabled: false
>>
>> # pcs resource defaults
>> resource-stickiness: INFINITY
>>
>> # pcs status
>> Cluster name: mycluster
>> Last updated: Thu Jan 23 10:12:49 2014
>> Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node2
>> Stack: cman
>> Current DC: Node2 - partition with quorum
>> Version: 1.1.10-14.el6_5.1-368c726
>> 3 Nodes configured
>> 3 Resources configured
>>
>>
>> Online: [ Node1 Node2 Node3 ]
>>
>> Full list of resources:
>>
>>   Clone Set: resClamd-clone [resClamd]
>>       Started: [ Node1 Node2 Node3 ]
>>
>>
>> Stopping the clamd daemon sets the failcount to 1 and the daemon is
>> started again. Ok.
>>
>>
>> # service clamd stop
>> Stopping Clam AntiVirus Daemon:                            [  OK ]
>>
>> /var/log/messages
>> Jan 23 10:15:20 Node1 crmd[6075]:   notice: process_lrm_event:
>> Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update
>> relayed from Node2
>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_trigger_update:
>> Sending flush op to all hosts for: fail-count-resClamd (1)
>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_perform_update:
>> Sent update 177: fail-count-resClamd=1
>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update
>> relayed from Node2
>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_trigger_update:
>> Sending flush op to all hosts for: last-failure-resClamd (1390468520)
>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_perform_update:
>> Sent update 179: last-failure-resClamd=1390468520
>> Jan 23 10:15:20 Node1 crmd[6075]:   notice: process_lrm_event:
>> Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
>> Jan 23 10:15:21 Node1 crmd[6075]:   notice: process_lrm_event: LRM
>> operation resClamd_stop_0 (call=310, rc=0, cib-update=110,
>> confirmed=true) ok
>> Jan 23 10:15:30 elmailtst1 crmd[6075]:   notice: process_lrm_event:
>> LRM operation resClamd_start_0 (call=314, rc=0, cib-update=111,
>> confirmed=true) ok
>> Jan 23 10:15:30 elmailtst1 crmd[6075]:   notice: process_lrm_event:
>> LRM operation resClamd_monitor_60000 (call=317, rc=0, cib-update=112,
>> confirmed=false) ok
>>
>> # pcs status
>> Cluster name: mycluster
>> Last updated: Thu Jan 23 10:16:48 2014
>> Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
>> Stack: cman
>> Current DC: Node2 - partition with quorum
>> Version: 1.1.10-14.el6_5.1-368c726
>> 3 Nodes configured
>> 3 Resources configured
>>
>>
>> Online: [ Node1 Node2 Node3 ]
>>
>> Full list of resources:
>>
>>   Clone Set: resClamd-clone [resClamd]
>>       Started: [ Node1 Node2 Node3 ]
>>
>> Failed actions:
>>      resClamd_monitor_60000 on Node1 'not running' (7): call=305,
>> status=complete, last-rc-change='Thu Jan 23 10:15:20 2014',
>> queued=0ms, exec=0ms
>>
>> # pcs resource failcount show resClamd
>> Failcounts for resClamd
>>   Node1: 1
>>
>>
>> After 7 minutes I let it fail again. As I understand it, the failcount
>> should have expired by then and the daemon should be restarted as well.
>> But it isn't.
>>
>>
>> # service clamd stop
>> Stopping Clam AntiVirus Daemon:                            [  OK ]
>>
>> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event: LRM
>> operation resClamd_monitor_60000 (call=317, rc=7, cib-update=113,
>> confirmed=false) not running
>> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event:
>> Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update
>> relayed from Node2
>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_trigger_update:
>> Sending flush op to all hosts for: fail-count-resClamd (2)
>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_perform_update:
>> Sent update 181: fail-count-resClamd=2
>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update
>> relayed from Node2
>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_trigger_update:
>> Sending flush op to all hosts for: last-failure-resClamd (1390468950)
>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_perform_update:
>> Sent update 183: last-failure-resClamd=1390468950
>> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event:
>> Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
>> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event: LRM
>> operation resClamd_stop_0 (call=322, rc=0, cib-update=114,
>> confirmed=true) ok
>>
>> # pcs status
>> Cluster name: mycluster
>> Last updated: Thu Jan 23 10:22:41 2014
>> Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
>> Stack: cman
>> Current DC: Node2 - partition with quorum
>> Version: 1.1.10-14.el6_5.1-368c726
>> 3 Nodes configured
>> 3 Resources configured
>>
>>
>> Online: [ Node1 Node2 Node3 ]
>>
>> Full list of resources:
>>
>>   Clone Set: resClamd-clone [resClamd]
>>       Started: [ Node2 Node3 ]
>>       Stopped: [ Node1 ]
>>
>> Failed actions:
>>      resClamd_monitor_60000 on Node1 'not running' (7): call=317,
>> status=complete, last-rc-change='Thu Jan 23 10:22:30 2014',
>> queued=0ms, exec=0ms
>>
>>
>> What's wrong with my configuration?
>>
>>
>> Thanks in advance
>> Frank
>>