[ClusterLabs] Singleton resource not being migrated
Andreas Kurz
andreas.kurz at gmail.com
Fri Aug 5 08:48:45 UTC 2016
Hi,
On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov <koshikov at gmail.com> wrote:
> Hello list,
>
> Can you please help me debug a resource that is not being started after a
> node failover?
>
> Here is the configuration that I'm testing, on a 3-node (KVM VM) cluster
> with the following nodes and resources:
>
> node 10: aic-controller-58055.test.domain.local
> node 6: aic-controller-50186.test.domain.local
> node 9: aic-controller-12993.test.domain.local
> primitive cmha cmha \
>         params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad" \
>         pidfile="/var/run/cmha/cmha.pid" user=cmha \
>         meta failure-timeout=30 resource-stickiness=1 target-role=Started \
>         migration-threshold=3 \
>         op monitor interval=10 on-fail=restart timeout=20 \
>         op start interval=0 on-fail=restart timeout=60 \
>         op stop interval=0 on-fail=block timeout=90
>
What is the output of crm_mon -1frA once a node is down? Any failed
actions?
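If a failed start has pushed the fail-count for cmha up to the
migration-threshold on a node, that alone could produce the -INFINITY
scores you see below. Assuming crmsh is installed, something along these
lines should show it (the node name is just one of yours as an example):

  crm_mon -1frA
  crm resource failcount cmha show aic-controller-58055.test.domain.local
  crm resource cleanup cmha    # also clears the fail-counts for cmha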
> primitive sysinfo_aic-controller-12993.test.domain.local ocf:pacemaker:SysInfo \
>         params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>         op monitor interval=15s
> primitive sysinfo_aic-controller-50186.test.domain.local ocf:pacemaker:SysInfo \
>         params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>         op monitor interval=15s
> primitive sysinfo_aic-controller-58055.test.domain.local ocf:pacemaker:SysInfo \
>         params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>         op monitor interval=15s
>
You can use a clone for this sysinfo resource, together with a symmetric
cluster, for a more compact configuration; then you can skip all of these
location constraints.
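Untested sketch of what I mean ("sysinfo" and "sysinfo-clone" are just
example names, adjust to your naming scheme):

  primitive sysinfo ocf:pacemaker:SysInfo \
          params disk_unit=M disks="/ /var/log" min_disk_free=512M \
          op monitor interval=15s
  clone sysinfo-clone sysinfo
  property symmetric-cluster=true

With symmetric-cluster=true the clone may run on every node without any
per-node location constraint, and the 100-point preferences for cmha can
stay as they are.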
> location cmha-on-aic-controller-12993.test.domain.local cmha 100: aic-controller-12993.test.domain.local
> location cmha-on-aic-controller-50186.test.domain.local cmha 100: aic-controller-50186.test.domain.local
> location cmha-on-aic-controller-58055.test.domain.local cmha 100: aic-controller-58055.test.domain.local
> location sysinfo-on-aic-controller-12993.test.domain.local \
>         sysinfo_aic-controller-12993.test.domain.local inf: aic-controller-12993.test.domain.local
> location sysinfo-on-aic-controller-50186.test.domain.local \
>         sysinfo_aic-controller-50186.test.domain.local inf: aic-controller-50186.test.domain.local
> location sysinfo-on-aic-controller-58055.test.domain.local \
>         sysinfo_aic-controller-58055.test.domain.local inf: aic-controller-58055.test.domain.local
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.14-70404b0 \
> cluster-infrastructure=corosync \
> cluster-recheck-interval=15s \
>
I have never tried such a low cluster-recheck-interval and wouldn't
recommend it. I have seen setups with low intervals burning a lot of CPU
cycles in bigger clusters, plus side effects from aborted transitions. If
you use it to "clean up" the cluster state because you see resource-agent
errors, you should rather fix the resource agent.
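If the short interval is mainly there so the failure-timeout=30 on cmha
expires quickly, a much more conservative value should still do the job,
for example (value picked arbitrarily):

  crm configure property cluster-recheck-interval=5min

Expired failures are then forgotten the next time the policy engine runs,
at the latest after one recheck interval.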
Regards,
Andreas
> no-quorum-policy=stop \
> stonith-enabled=false \
> start-failure-is-fatal=false \
> symmetric-cluster=false \
> node-health-strategy=migrate-on-red \
> last-lrm-refresh=1470334410
>
> When all 3 nodes are online everything seems OK; this is the output of
> scoreshow.sh:
> Resource                                        Score      Node                                     Stickiness  #Fail  Migration-Threshold
> cmha                                            -INFINITY  aic-controller-12993.test.domain.local   1           0
> cmha                                            101        aic-controller-50186.test.domain.local   1           0
> cmha                                            -INFINITY  aic-controller-58055.test.domain.local   1           0
> sysinfo_aic-controller-12993.test.domain.local  INFINITY   aic-controller-12993.test.domain.local   0           0
> sysinfo_aic-controller-50186.test.domain.local  -INFINITY  aic-controller-50186.test.domain.local   0           0
> sysinfo_aic-controller-58055.test.domain.local  INFINITY   aic-controller-58055.test.domain.local   0           0
>
> The problem starts when one node (aic-controller-50186) goes offline. The
> cmha resource is stuck in the Stopped state.
> Here is the showscores output:
> Resource  Score      Node                                     Stickiness  #Fail  Migration-Threshold
> cmha      -INFINITY  aic-controller-12993.test.domain.local   1           0
> cmha      -INFINITY  aic-controller-50186.test.domain.local   1           0
> cmha      -INFINITY  aic-controller-58055.test.domain.local   1           0
>
> Even though it has target-role=Started, Pacemaker is skipping this
> resource, and in the logs I see:
> pengine: info: native_print: cmha (ocf::heartbeat:cmha): Stopped
> pengine: info: native_color: Resource cmha cannot run anywhere
> pengine: info: LogActions: Leave cmha (Stopped)
>
> To recover the cmha resource I need to run either:
> 1) crm resource cleanup cmha
> 2) crm resource reprobe
>
> After either of the above commands, the resource is picked up by Pacemaker
> again and I see valid scores:
> Resource  Score      Node                                     Stickiness  #Fail  Migration-Threshold
> cmha      100        aic-controller-58055.test.domain.local   1           0      3
> cmha      101        aic-controller-12993.test.domain.local   1           0      3
> cmha      -INFINITY  aic-controller-50186.test.domain.local   1           0      3
>
> So the questions here are: why doesn't the cluster recheck work, and
> should it do a reprobe?
> How do I make migration work, or what am I missing in the configuration
> that prevents migration?
>
> corosync 2.3.4
> pacemaker 1.1.14
>