[ClusterLabs] Singleton resource not being migrated
Ken Gaillot
kgaillot at redhat.com
Fri Aug 5 14:21:56 UTC 2016
On 08/05/2016 03:48 AM, Andreas Kurz wrote:
> Hi,
>
> On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov <koshikov at gmail.com> wrote:
>
> Hello list,
>
> Can you please help me debug one resource that is not being started
> after a node failover?
>
> Here is the configuration that I'm testing:
> a 3-node (KVM VM) cluster, consisting of:
>
> node 10: aic-controller-58055.test.domain.local
> node 6: aic-controller-50186.test.domain.local
> node 9: aic-controller-12993.test.domain.local
> primitive cmha cmha \
>   params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad" \
>     pidfile="/var/run/cmha/cmha.pid" user=cmha \
>   meta failure-timeout=30 resource-stickiness=1 target-role=Started \
>     migration-threshold=3 \
>   op monitor interval=10 on-fail=restart timeout=20 \
>   op start interval=0 on-fail=restart timeout=60 \
>   op stop interval=0 on-fail=block timeout=90
>
>
> What is the output of crm_mon -1frA once a node is down ... any failed
> actions?
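For reference, those crm_mon flags are: -1 one-shot status, -f fail
counts, -r include inactive resources, -A show node attributes. A quick
way to capture the state while the node is down could be something like
the following; the output file name is only an example:

    # one-shot status with fail counts, inactive resources and node attributes
    crm_mon -1frA > /tmp/status-after-failover.txt

Failed actions, if any, are listed at the end of that output.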
>
>
> primitive sysinfo_aic-controller-12993.test.domain.local ocf:pacemaker:SysInfo \
>   params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>   op monitor interval=15s
> primitive sysinfo_aic-controller-50186.test.domain.local ocf:pacemaker:SysInfo \
>   params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>   op monitor interval=15s
> primitive sysinfo_aic-controller-58055.test.domain.local ocf:pacemaker:SysInfo \
>   params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>   op monitor interval=15s
>
>
> You can use a clone for this sysinfo resource and a symmetric cluster
> for a more compact configuration; then you can skip all these
> location constraints.
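Something along those lines could look like this (a rough, untested
sketch in crm shell syntax, reusing the parameters from the config
above; the resource and clone names are made up):

    primitive sysinfo ocf:pacemaker:SysInfo \
        params disk_unit=M disks="/ /var/log" min_disk_free=512M \
        op monitor interval=15s
    clone cl_sysinfo sysinfo

Together with symmetric-cluster=true, the clone runs on every node and
the three per-node sysinfo primitives plus their location constraints
can go away.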
>
>
> location cmha-on-aic-controller-12993.test.domain.local cmha \
>   100: aic-controller-12993.test.domain.local
> location cmha-on-aic-controller-50186.test.domain.local cmha \
>   100: aic-controller-50186.test.domain.local
> location cmha-on-aic-controller-58055.test.domain.local cmha \
>   100: aic-controller-58055.test.domain.local
> location sysinfo-on-aic-controller-12993.test.domain.local \
>   sysinfo_aic-controller-12993.test.domain.local inf: aic-controller-12993.test.domain.local
> location sysinfo-on-aic-controller-50186.test.domain.local \
>   sysinfo_aic-controller-50186.test.domain.local inf: aic-controller-50186.test.domain.local
> location sysinfo-on-aic-controller-58055.test.domain.local \
>   sysinfo_aic-controller-58055.test.domain.local inf: aic-controller-58055.test.domain.local
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.14-70404b0 \
> cluster-infrastructure=corosync \
> cluster-recheck-interval=15s \
>
>
> I've never tried such a low cluster-recheck-interval and wouldn't
> recommend it. I have seen setups with low intervals burn a lot of CPU
> cycles in bigger clusters and suffer side effects from aborted
> transitions. If you set it this low to "clean up" cluster state because
> you are seeing resource-agent errors, you should fix the resource agent
> instead.
Strongly agree -- your recheck interval is lower than the various action
timeouts. The only reason recheck interval should ever be set less than
about 5 minutes is if you have time-based rules that you want to trigger
with a finer granularity.
Your issue does not appear to be coming from recheck interval, otherwise
it would go away after the recheck interval passed.
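If finer-grained time-based rules are not needed here, leaving
cluster-recheck-interval at its 15-minute default is the safer choice;
in crmsh that would be something like:

    crm configure property cluster-recheck-interval=15min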
> Regards,
> Andreas
>
>
> no-quorum-policy=stop \
> stonith-enabled=false \
> start-failure-is-fatal=false \
> symmetric-cluster=false \
> node-health-strategy=migrate-on-red \
> last-lrm-refresh=1470334410
>
> When all 3 nodes are online, everything seems OK; this is the output of
> scoreshow.sh:
>
> Resource                                        Score      Node                                    Stickiness  #Fail  Migration-Threshold
> cmha                                            -INFINITY  aic-controller-12993.test.domain.local  1           0
> cmha                                            101        aic-controller-50186.test.domain.local  1           0
> cmha                                            -INFINITY  aic-controller-58055.test.domain.local  1           0
Everything is not OK; cmha has -INFINITY scores on two nodes, meaning it
won't be allowed to run on them. This is why it won't start after the
one allowed node goes down, and why cleanup gets it working again
(cleanup removes bans caused by resource failures).
It's likely the resource previously failed the maximum allowed times
(migration-threshold=3) on those two nodes.
The next step would be to figure out why the resource is failing. The
pacemaker logs will show any output from the resource agent.
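A quick way to confirm that before cleaning up (a sketch using crmsh
and the node names from your configuration):

    # fail counts that are compared against migration-threshold
    crm resource failcount cmha show aic-controller-12993.test.domain.local
    crm resource failcount cmha show aic-controller-58055.test.domain.local

    # allocation scores as the policy engine computes them
    crm_simulate -sL | grep -i cmha

If those fail counts are at 3, the -INFINITY scores are
migration-threshold bans, and the agent failures in the logs are what
need fixing.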
> sysinfo_aic-controller-12993.test.domain.local  INFINITY   aic-controller-12993.test.domain.local  0           0
> sysinfo_aic-controller-50186.test.domain.local  -INFINITY  aic-controller-50186.test.domain.local  0           0
> sysinfo_aic-controller-58055.test.domain.local  INFINITY   aic-controller-58055.test.domain.local  0           0
>
> The problem starts when one node (aic-controller-50186) goes offline.
> The resource cmha is stuck in the Stopped state.
> Here is the showscores output:
>
> Resource  Score      Node                                    Stickiness  #Fail  Migration-Threshold
> cmha      -INFINITY  aic-controller-12993.test.domain.local  1           0
> cmha      -INFINITY  aic-controller-50186.test.domain.local  1           0
> cmha      -INFINITY  aic-controller-58055.test.domain.local  1           0
>
> Even though it has target-role=Started, pacemaker is skipping this
> resource. And in the logs I see:
> pengine: info: native_print: cmha (ocf::heartbeat:cmha): Stopped
> pengine: info: native_color: Resource cmha cannot run anywhere
> pengine: info: LogActions: Leave cmha (Stopped)
>
> To recover the cmha resource I need to run either:
> 1) crm resource cleanup cmha
> 2) crm resource reprobe
>
> After either of the above commands, the resource is picked up by
> pacemaker again and I see valid scores:
> Resource  Score      Node                                    Stickiness  #Fail  Migration-Threshold
> cmha      100        aic-controller-58055.test.domain.local  1           0      3
> cmha      101        aic-controller-12993.test.domain.local  1           0      3
> cmha      -INFINITY  aic-controller-50186.test.domain.local  1           0      3
>
> So the questions here are: why doesn't the cluster recheck work, and
> should it do reprobing?
> How can I make migration work, or what did I miss in the configuration
> that prevents migration?
>
> corosync 2.3.4
> pacemaker 1.1.14