[Pacemaker] Restart of resources

Thu Jan 23 09:50:20 UTC 2014

Hi list,

I am having trouble configuring a resource that is allowed to fail once 
every two minutes.
According to the documentation, this is done by setting the 
migration-threshold and failure-timeout meta attributes.
Here is the configuration of the resource:

# pcs config
Cluster Name: mycluster
Corosync Nodes:

Pacemaker Nodes:
  Node1 Node2 Node3

Resources:
  Clone: resClamd-clone
   Meta Attrs: clone-max=3 clone-node-max=1 interleave=true
   Resource: resClamd (class=lsb type=clamd)
    Meta Attrs: failure-timeout=120s migration-threshold=2
    Operations: monitor on-fail=restart interval=60s (resClamd-monitor-on-fail-restart)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
  cluster-infrastructure: cman
  dc-version: 1.1.10-14.el6_5.1-368c726
  last-lrm-refresh: 1390468150
  stonith-enabled: false

# pcs resource defaults
resource-stickiness: INFINITY

# pcs status
Cluster name: mycluster
Last updated: Thu Jan 23 10:12:49 2014
Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node2
Stack: cman
Current DC: Node2 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
3 Nodes configured
3 Resources configured

Online: [ Node1 Node2 Node3 ]

Full list of resources:

  Clone Set: resClamd-clone [resClamd]
      Started: [ Node1 Node2 Node3 ]

Stopping the clamd daemon sets the failcount to 1, and the cluster 
starts the daemon again. So far this is what I expect.

# service clamd stop
Stopping Clam AntiVirus Daemon:                            [  OK  ]

/var/log/messages
Jan 23 10:15:20 Node1 crmd[6075]:   notice: process_lrm_event: Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update relayed from Node2
Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-resClamd (1)
Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_perform_update: Sent update 177: fail-count-resClamd=1
Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update relayed from Node2
Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-resClamd (1390468520)
Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_perform_update: Sent update 179: last-failure-resClamd=1390468520
Jan 23 10:15:20 Node1 crmd[6075]:   notice: process_lrm_event: Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
Jan 23 10:15:21 Node1 crmd[6075]:   notice: process_lrm_event: LRM operation resClamd_stop_0 (call=310, rc=0, cib-update=110, confirmed=true) ok
Jan 23 10:15:30 Node1 crmd[6075]:   notice: process_lrm_event: LRM operation resClamd_start_0 (call=314, rc=0, cib-update=111, confirmed=true) ok
Jan 23 10:15:30 Node1 crmd[6075]:   notice: process_lrm_event: LRM operation resClamd_monitor_60000 (call=317, rc=0, cib-update=112, confirmed=false) ok

# pcs status
Cluster name: mycluster
Last updated: Thu Jan 23 10:16:48 2014
Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
Stack: cman
Current DC: Node2 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
3 Nodes configured
3 Resources configured

Online: [ Node1 Node2 Node3 ]

Full list of resources:

  Clone Set: resClamd-clone [resClamd]
      Started: [ Node1 Node2 Node3 ]

Failed actions:
     resClamd_monitor_60000 on Node1 'not running' (7): call=305, status=complete, last-rc-change='Thu Jan 23 10:15:20 2014', queued=0ms, exec=0ms

# pcs resource failcount show resClamd
Failcounts for resClamd
  Node1: 1

After 7 minutes I let it fail again. As I understand it, the daemon 
should be restarted this time as well, but it is not.

# service clamd stop
Stopping Clam AntiVirus Daemon:                            [  OK  ]

Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event: LRM operation resClamd_monitor_60000 (call=317, rc=7, cib-update=113, confirmed=false) not running
Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event: Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update relayed from Node2
Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-resClamd (2)
Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_perform_update: Sent update 181: fail-count-resClamd=2
Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update relayed from Node2
Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-resClamd (1390468950)
Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_perform_update: Sent update 183: last-failure-resClamd=1390468950
Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event: Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event: LRM operation resClamd_stop_0 (call=322, rc=0, cib-update=114, confirmed=true) ok
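
The two last-failure-resClamd values in the logs are epoch timestamps, 
so the gap between the two failures can be checked against 
failure-timeout with plain shell arithmetic. This is only a sanity 
check on the numbers quoted above; the variable names are made up for 
the example and are not Pacemaker settings.

```shell
# Values copied from the two log excerpts (seconds since the epoch).
first_failure=1390468520
second_failure=1390468950

# failure-timeout=120s from the resource's meta attributes.
failure_timeout=120

# Seconds between the first and the second monitor failure.
elapsed=$((second_failure - first_failure))
echo "seconds between failures: ${elapsed}"

if [ "${elapsed}" -gt "${failure_timeout}" ]; then
    echo "the second failure occurred after failure-timeout had elapsed"
else
    echo "the second failure occurred within failure-timeout"
fi
```

The gap is 430 seconds (7 minutes 10 seconds), well beyond the 
configured 120 seconds, yet fail-count-resClamd still went from 1 to 2.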

# pcs status
Cluster name: mycluster
Last updated: Thu Jan 23 10:22:41 2014
Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
Stack: cman
Current DC: Node2 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
3 Nodes configured
3 Resources configured

Online: [ Node1 Node2 Node3 ]

Full list of resources:

  Clone Set: resClamd-clone [resClamd]
      Started: [ Node2 Node3 ]
      Stopped: [ Node1 ]

Failed actions:
     resClamd_monitor_60000 on Node1 'not running' (7): call=317, status=complete, last-rc-change='Thu Jan 23 10:22:30 2014', queued=0ms, exec=0ms
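
To make the expectation explicit, here is a small shell model of how 
migration-threshold and failure-timeout are assumed to interact in 
this post. It is a sketch of the expected behaviour only, not 
Pacemaker code, and the function names expected_failcount and 
may_restart are hypothetical.

```shell
# Settings taken from the resource's meta attributes.
migration_threshold=2
failure_timeout=120

# Prints the failcount expected after a new failure.
#   $1 = failcount before the new failure
#   $2 = seconds since the previous failure
# Assumption: earlier failures expire once failure-timeout has passed.
expected_failcount() {
    count=$1
    if [ "$2" -gt "$failure_timeout" ]; then
        count=0
    fi
    echo $((count + 1))
}

# Succeeds while the failcount is still below migration-threshold,
# i.e. while the node is expected to remain eligible for the resource.
may_restart() {
    [ "$1" -lt "$migration_threshold" ]
}

# Second failure: failcount was 1, and 430 s had passed since the first
# failure (1390468950 - 1390468520, from the last-failure log entries).
new_count=$(expected_failcount 1 430)
if may_restart "$new_count"; then
    echo "expected: failcount ${new_count}, restart on the same node"
else
    echo "expected: failcount ${new_count}, node no longer eligible"
fi
```

Under this model the second failure should leave the failcount at 1 
and trigger a restart on Node1. The cluster instead reports a 
failcount of 2 and leaves the clone instance on Node1 stopped.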

What's wrong with my configuration?

Thanks in advance
Frank