[Pacemaker] Restart of resources
Frank Brendel
Frank.Brendel at eurolog.com
Thu Jan 23 09:50:20 UTC 2014
Hi list,
I have some trouble configuring a resource that should be allowed to fail once within two minutes.
The documentation states that I have to configure migration-threshold
and failure-timeout to achieve this.
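For reference, I set the two meta attributes with pcs roughly like this (quoting from memory, so treat the exact syntax as a sketch):
# pcs resource update resClamd meta failure-timeout=120s migration-threshold=2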
Here is the resulting configuration for the resource.
# pcs config
Cluster Name: mycluster
Corosync Nodes:
Pacemaker Nodes:
Node1 Node2 Node3
Resources:
 Clone: resClamd-clone
  Meta Attrs: clone-max=3 clone-node-max=1 interleave=true
  Resource: resClamd (class=lsb type=clamd)
   Meta Attrs: failure-timeout=120s migration-threshold=2
   Operations: monitor on-fail=restart interval=60s (resClamd-monitor-on-fail-restart)
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Cluster Properties:
cluster-infrastructure: cman
dc-version: 1.1.10-14.el6_5.1-368c726
last-lrm-refresh: 1390468150
stonith-enabled: false
# pcs resource defaults
resource-stickiness: INFINITY
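For completeness, the default stickiness was set with something like:
# pcs resource defaults resource-stickiness=INFINITY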
# pcs status
Cluster name: mycluster
Last updated: Thu Jan 23 10:12:49 2014
Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node2
Stack: cman
Current DC: Node2 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
3 Nodes configured
3 Resources configured
Online: [ Node1 Node2 Node3 ]
Full list of resources:
Clone Set: resClamd-clone [resClamd]
Started: [ Node1 Node2 Node3 ]
Stopping the clamd daemon sets the failcount to 1 and the daemon is
restarted. OK so far.
# service clamd stop
Stopping Clam AntiVirus Daemon: [ OK ]
/var/log/messages
Jan 23 10:15:20 Node1 crmd[6075]: notice: process_lrm_event:
Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_cs_dispatch: Update
relayed from Node2
Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-resClamd (1)
Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_perform_update: Sent
update 177: fail-count-resClamd=1
Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_cs_dispatch: Update
relayed from Node2
Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-resClamd (1390468520)
Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_perform_update: Sent
update 179: last-failure-resClamd=1390468520
Jan 23 10:15:20 Node1 crmd[6075]: notice: process_lrm_event:
Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
Jan 23 10:15:21 Node1 crmd[6075]: notice: process_lrm_event: LRM
operation resClamd_stop_0 (call=310, rc=0, cib-update=110,
confirmed=true) ok
Jan 23 10:15:30 Node1 crmd[6075]: notice: process_lrm_event: LRM
operation resClamd_start_0 (call=314, rc=0, cib-update=111,
confirmed=true) ok
Jan 23 10:15:30 Node1 crmd[6075]: notice: process_lrm_event: LRM
operation resClamd_monitor_60000 (call=317, rc=0, cib-update=112,
confirmed=false) ok
# pcs status
Cluster name: mycluster
Last updated: Thu Jan 23 10:16:48 2014
Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
Stack: cman
Current DC: Node2 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
3 Nodes configured
3 Resources configured
Online: [ Node1 Node2 Node3 ]
Full list of resources:
Clone Set: resClamd-clone [resClamd]
Started: [ Node1 Node2 Node3 ]
Failed actions:
resClamd_monitor_60000 on Node1 'not running' (7): call=305,
status=complete, last-rc-change='Thu Jan 23 10:15:20 2014', queued=0ms,
exec=0ms
# pcs resource failcount show resClamd
Failcounts for resClamd
Node1: 1
After 7 minutes I let it fail again. As I understand it, the daemon
should simply be restarted this time too, because the failure-timeout of
120s should have expired the first failure long before. But it isn't.
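To spell out the timing: in the logs below, last-failure-resClamd goes
from 1390468520 to 1390468950, i.e. the two failures are 430 seconds
(just over 7 minutes) apart, far beyond the 120s failure-timeout. By my
reading the failcount should therefore have expired back to 0 before the
second failure, leaving it at 1 afterwards and still below the
migration-threshold of 2.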
# service clamd stop
Stopping Clam AntiVirus Daemon: [ OK ]
Jan 23 10:22:30 Node1 crmd[6075]: notice: process_lrm_event: LRM
operation resClamd_monitor_60000 (call=317, rc=7, cib-update=113,
confirmed=false) not running
Jan 23 10:22:30 Node1 crmd[6075]: notice: process_lrm_event:
Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_cs_dispatch: Update
relayed from Node2
Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-resClamd (2)
Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_perform_update: Sent
update 181: fail-count-resClamd=2
Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_cs_dispatch: Update
relayed from Node2
Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-resClamd (1390468950)
Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_perform_update: Sent
update 183: last-failure-resClamd=1390468950
Jan 23 10:22:30 Node1 crmd[6075]: notice: process_lrm_event:
Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
Jan 23 10:22:30 Node1 crmd[6075]: notice: process_lrm_event: LRM
operation resClamd_stop_0 (call=322, rc=0, cib-update=114,
confirmed=true) ok
# pcs status
Cluster name: mycluster
Last updated: Thu Jan 23 10:22:41 2014
Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
Stack: cman
Current DC: Node2 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
3 Nodes configured
3 Resources configured
Online: [ Node1 Node2 Node3 ]
Full list of resources:
Clone Set: resClamd-clone [resClamd]
Started: [ Node2 Node3 ]
Stopped: [ Node1 ]
Failed actions:
resClamd_monitor_60000 on Node1 'not running' (7): call=317,
status=complete, last-rc-change='Thu Jan 23 10:22:30 2014', queued=0ms,
exec=0ms
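(I know I can reset the failcount by hand, e.g. with
# pcs resource cleanup resClamd
but I expected the failure-timeout to take care of that automatically.)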
What's wrong with my configuration?
Thanks in advance
Frank