[ClusterLabs] Singleton resource not being migrated
Andreas Kurz
andreas.kurz at gmail.com
Fri Aug 5 08:48:45 UTC 2016
Hi,
On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov <koshikov at gmail.com> wrote:
> Hello list,
>
> Can you please help me debug a resource that is not being started after a
> node failover?
>
> Here is the configuration that I'm testing, on a 3-node (KVM VM) cluster
> with the following nodes and resources:
>
> node 10: aic-controller-58055.test.domain.local
> node 6: aic-controller-50186.test.domain.local
> node 9: aic-controller-12993.test.domain.local
> primitive cmha cmha \
>         params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad" \
>         pidfile="/var/run/cmha/cmha.pid" user=cmha \
>         meta failure-timeout=30 resource-stickiness=1 target-role=Started \
>         migration-threshold=3 \
>         op monitor interval=10 on-fail=restart timeout=20 \
>         op start interval=0 on-fail=restart timeout=60 \
>         op stop interval=0 on-fail=block timeout=90
>
What is the output of crm_mon -1frA once a node is down? Any failed
actions?
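If a failed start has pushed the fail-count for cmha up to the
migration-threshold on a node, that alone could produce the -INFINITY
scores you see below. Assuming crmsh is installed, something along these
lines should show it (the node name is just one of yours as an example):

  crm_mon -1frA
  crm resource failcount cmha show aic-controller-58055.test.domain.local
  crm resource cleanup cmha    # also clears the fail-counts for cmha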
> primitive sysinfo_aic-controller-12993.test.domain.local ocf:pacemaker:SysInfo \
>         params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>         op monitor interval=15s
> primitive sysinfo_aic-controller-50186.test.domain.local ocf:pacemaker:SysInfo \
>         params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>         op monitor interval=15s
> primitive sysinfo_aic-controller-58055.test.domain.local ocf:pacemaker:SysInfo \
>         params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>         op monitor interval=15s
>
You can use a clone for this sysinfo resource, together with a symmetric
cluster, for a more compact configuration; then you can skip all of these
location constraints.
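Untested sketch of what I mean ("sysinfo" and "sysinfo-clone" are just
example names, adjust to your naming scheme):

  primitive sysinfo ocf:pacemaker:SysInfo \
          params disk_unit=M disks="/ /var/log" min_disk_free=512M \
          op monitor interval=15s
  clone sysinfo-clone sysinfo
  property symmetric-cluster=true

With symmetric-cluster=true the clone may run on every node without any
per-node location constraint, and the 100-point preferences for cmha can
stay as they are.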
> location cmha-on-aic-controller-12993.test.domain.local cmha 100: aic-controller-12993.test.domain.local
> location cmha-on-aic-controller-50186.test.domain.local cmha 100: aic-controller-50186.test.domain.local
> location cmha-on-aic-controller-58055.test.domain.local cmha 100: aic-controller-58055.test.domain.local
> location sysinfo-on-aic-controller-12993.test.domain.local \
>         sysinfo_aic-controller-12993.test.domain.local inf: aic-controller-12993.test.domain.local
> location sysinfo-on-aic-controller-50186.test.domain.local \
>         sysinfo_aic-controller-50186.test.domain.local inf: aic-controller-50186.test.domain.local
> location sysinfo-on-aic-controller-58055.test.domain.local \
>         sysinfo_aic-controller-58055.test.domain.local inf: aic-controller-58055.test.domain.local
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.14-70404b0 \
> cluster-infrastructure=corosync \
> cluster-recheck-interval=15s \
>
I have never tried such a low cluster-recheck-interval and wouldn't
recommend it. I have seen setups with low intervals burning a lot of CPU
cycles in bigger clusters, plus side effects from aborted transitions. If
you use it to "clean up" the cluster state because you see resource-agent
errors, you should rather fix the resource agent.
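If the short interval is mainly there so the failure-timeout=30 on cmha
expires quickly, a much more conservative value should still do the job,
for example (value picked arbitrarily):

  crm configure property cluster-recheck-interval=5min

Expired failures are then forgotten the next time the policy engine runs,
at the latest after one recheck interval.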
Regards,
Andreas
> no-quorum-policy=stop \
> stonith-enabled=false \
> start-failure-is-fatal=false \
> symmetric-cluster=false \
> node-health-strategy=migrate-on-red \
> last-lrm-refresh=1470334410
>
> When all 3 nodes are online everything seems OK; this is the output of
> scoreshow.sh:
> Resource                                        Score      Node                                     Stickiness  #Fail  Migration-Threshold
> cmha                                            -INFINITY  aic-controller-12993.test.domain.local   1           0
> cmha                                            101        aic-controller-50186.test.domain.local   1           0
> cmha                                            -INFINITY  aic-controller-58055.test.domain.local   1           0
> sysinfo_aic-controller-12993.test.domain.local  INFINITY   aic-controller-12993.test.domain.local   0           0
> sysinfo_aic-controller-50186.test.domain.local  -INFINITY  aic-controller-50186.test.domain.local   0           0
> sysinfo_aic-controller-58055.test.domain.local  INFINITY   aic-controller-58055.test.domain.local   0           0
>
> The problem starts when one node (aic-controller-50186) goes offline. The
> cmha resource is stuck in the Stopped state.
> Here is the showscores output:
> Resource  Score      Node                                     Stickiness  #Fail  Migration-Threshold
> cmha      -INFINITY  aic-controller-12993.test.domain.local   1           0
> cmha      -INFINITY  aic-controller-50186.test.domain.local   1           0
> cmha      -INFINITY  aic-controller-58055.test.domain.local   1           0
>
> Even though it has target-role=Started, Pacemaker is skipping this
> resource, and in the logs I see:
> pengine: info: native_print: cmha (ocf::heartbeat:cmha): Stopped
> pengine: info: native_color: Resource cmha cannot run anywhere
> pengine: info: LogActions: Leave cmha (Stopped)
>
> To recover the cmha resource I need to run either:
> 1) crm resource cleanup cmha
> 2) crm resource reprobe
>
> After either of the above commands, the resource is picked up by Pacemaker
> again and I see valid scores:
> Resource  Score      Node                                     Stickiness  #Fail  Migration-Threshold
> cmha      100        aic-controller-58055.test.domain.local   1           0      3
> cmha      101        aic-controller-12993.test.domain.local   1           0      3
> cmha      -INFINITY  aic-controller-50186.test.domain.local   1           0      3
>
> So the questions here are: why doesn't the cluster recheck work, and
> should it do a reprobe?
> How do I make migration work, or what am I missing in the configuration
> that prevents migration?
>
> corosync 2.3.4
> pacemaker 1.1.14
>