[ClusterLabs] Singleton resource not being migrated

Fri Aug 5 20:19:13 UTC 2016

Thanks for the reply, Andreas.

On Fri, Aug 5, 2016 at 1:48 AM, Andreas Kurz <andreas.kurz at gmail.com> wrote:

> Hi,
>
> On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov <koshikov at gmail.com>
> wrote:
>
>> Hello list,
>>
>> Could you please help me debug a resource that is not started after a
>> node failover?
>>
>> Here is the configuration that I'm testing, a cluster of 3 nodes (KVM
>> VMs):
>>
>> node 10: aic-controller-58055.test.domain.local
>> node 6: aic-controller-50186.test.domain.local
>> node 9: aic-controller-12993.test.domain.local
>> primitive cmha cmha \
>>         params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad" pidfile="/var/run/cmha/cmha.pid" user=cmha \
>>         meta failure-timeout=30 resource-stickiness=1 target-role=Started migration-threshold=3 \
>>         op monitor interval=10 on-fail=restart timeout=20 \
>>         op start interval=0 on-fail=restart timeout=60 \
>>         op stop interval=0 on-fail=block timeout=90
>>
>
> What is the output of crm_mon -1frA once a node is down ... any failed
> actions?
>

No errors or failed actions. This is a slightly different lab (the node
names have changed), but it shows the same effect:

root@aic-controller-57150:~# crm_mon -1frA
Last updated: Fri Aug  5 20:14:05 2016          Last change: Fri Aug  5 19:38:34 2016 by root via crm_attribute on aic-controller-44151.test.domain.local
Stack: corosync
Current DC: aic-controller-57150.test.domain.local (version 1.1.14-70404b0) - partition with quorum
3 nodes and 7 resources configured

Online: [ aic-controller-57150.test.domain.local aic-controller-58381.test.domain.local ]
OFFLINE: [ aic-controller-44151.test.domain.local ]

Full list of resources:

 sysinfo_aic-controller-44151.test.domain.local (ocf::pacemaker:SysInfo):       Stopped
 sysinfo_aic-controller-57150.test.domain.local (ocf::pacemaker:SysInfo):       Started aic-controller-57150.test.domain.local
 sysinfo_aic-controller-58381.test.domain.local (ocf::pacemaker:SysInfo):       Started aic-controller-58381.test.domain.local
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ aic-controller-57150.test.domain.local aic-controller-58381.test.domain.local ]
 cmha   (ocf::heartbeat:cmha):  Stopped

Node Attributes:
* Node aic-controller-57150.test.domain.local:
    + arch                              : x86_64
    + cpu_cores                         : 3
    + cpu_info                          : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
    + cpu_load                          : 1.04
    + cpu_speed                         : 4994.21
    + free_swap                         : 5150
    + os                                : Linux-3.13.0-85-generic
    + ram_free                          : 750
    + ram_total                         : 5000
    + root_free                         : 45932
    + var_log_free                      : 431543
* Node aic-controller-58381.test.domain.local:
    + arch                              : x86_64
    + cpu_cores                         : 3
    + cpu_info                          : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
    + cpu_load                          : 1.16
    + cpu_speed                         : 4994.21
    + free_swap                         : 5150
    + os                                : Linux-3.13.0-85-generic
    + ram_free                          : 750
    + ram_total                         : 5000
    + root_free                         : 45932
    + var_log_free                      : 431542

Migration Summary:
* Node aic-controller-57150.test.domain.local:
* Node aic-controller-58381.test.domain.local:

>
>> primitive sysinfo_aic-controller-12993.test.domain.local ocf:pacemaker:SysInfo \
>>         params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>>         op monitor interval=15s
>> primitive sysinfo_aic-controller-50186.test.domain.local ocf:pacemaker:SysInfo \
>>         params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>>         op monitor interval=15s
>> primitive sysinfo_aic-controller-58055.test.domain.local ocf:pacemaker:SysInfo \
>>         params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>>         op monitor interval=15s
>>
>
> You can use a clone for this sysinfo resource and a symmetric cluster for
> a more compact configuration .... then you can skip all these location
> constraints.
>
>
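
A minimal sketch of the more compact layout suggested above, assuming the
SysInfo parameters stay as posted. The names sysinfo and clone_sysinfo are
hypothetical. Note that with symmetric-cluster=true the cmha location
constraints with score 100 would act as preferences rather than as
permissions, so this would need testing in the lab first:

```
# Hypothetical names: sysinfo, clone_sysinfo (not in the original configuration)
primitive sysinfo ocf:pacemaker:SysInfo \
        params disk_unit=M disks="/ /var/log" min_disk_free=512M \
        op monitor interval=15s
# One clone instance per node replaces the three per-node primitives
clone clone_sysinfo sysinfo
# In a symmetric (opt-out) cluster resources may run anywhere by default,
# so the sysinfo-on-* location constraints are no longer needed
property cib-bootstrap-options: \
        symmetric-cluster=true
```
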
>> location cmha-on-aic-controller-12993.test.domain.local cmha 100: aic-controller-12993.test.domain.local
>> location cmha-on-aic-controller-50186.test.domain.local cmha 100: aic-controller-50186.test.domain.local
>> location cmha-on-aic-controller-58055.test.domain.local cmha 100: aic-controller-58055.test.domain.local
>> location sysinfo-on-aic-controller-12993.test.domain.local sysinfo_aic-controller-12993.test.domain.local inf: aic-controller-12993.test.domain.local
>> location sysinfo-on-aic-controller-50186.test.domain.local sysinfo_aic-controller-50186.test.domain.local inf: aic-controller-50186.test.domain.local
>> location sysinfo-on-aic-controller-58055.test.domain.local sysinfo_aic-controller-58055.test.domain.local inf: aic-controller-58055.test.domain.local
>> property cib-bootstrap-options: \
>>         have-watchdog=false \
>>         dc-version=1.1.14-70404b0 \
>>         cluster-infrastructure=corosync \
>>         cluster-recheck-interval=15s \
>>
>
> I have never tried such a low cluster-recheck-interval and wouldn't do
> that. I have seen low intervals burn a lot of CPU cycles in bigger cluster
> setups and cause side effects from aborted transitions. If you do this to
> "clean up" the cluster state because you see resource-agent errors, you
> had better fix the resource agent.
>

This small interval is a result of debugging the cmha resource issue.
Normally all our clusters use 190s, and because 15s didn't help, it will be
rolled back.
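
The rollback would be a one-line change to the cluster properties. A sketch,
assuming the 190s value mentioned above and the same crm shell property
syntax as in the posted configuration:

```
# Restore the usual recheck interval (190s is the value quoted in this thread)
property cib-bootstrap-options: \
        cluster-recheck-interval=190s
```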

>
> Regards,
> Andreas
>
>
>>         no-quorum-policy=stop \
>>         stonith-enabled=false \
>>         start-failure-is-fatal=false \
>>         symmetric-cluster=false \
>>         node-health-strategy=migrate-on-red \
>>         last-lrm-refresh=1470334410
>>
>> When all 3 nodes are online, everything seems OK. This is the output of
>> scoreshow.sh:
>> Resource                                        Score      Node                                    Stickiness #Fail Migration-Threshold
>> cmha                                            -INFINITY  aic-controller-12993.test.domain.local  1          0
>> cmha                                            101        aic-controller-50186.test.domain.local  1          0
>> cmha                                            -INFINITY  aic-controller-58055.test.domain.local  1          0
>> sysinfo_aic-controller-12993.test.domain.local  INFINITY   aic-controller-12993.test.domain.local  0          0
>> sysinfo_aic-controller-50186.test.domain.local  -INFINITY  aic-controller-50186.test.domain.local  0          0
>> sysinfo_aic-controller-58055.test.domain.local  INFINITY   aic-controller-58055.test.domain.local  0          0
>>
>> The problem starts when one node goes offline (aic-controller-50186). The
>> cmha resource is stuck in the Stopped state.
>> Here is the showscores output:
>> Resource  Score      Node                                    Stickiness #Fail Migration-Threshold
>> cmha      -INFINITY  aic-controller-12993.test.domain.local  1          0
>> cmha      -INFINITY  aic-controller-50186.test.domain.local  1          0
>> cmha      -INFINITY  aic-controller-58055.test.domain.local  1          0
>>
>> Even though it has target-role=Started, pacemaker skips this resource,
>> and in the logs I see:
>> pengine:     info: native_print:      cmha    (ocf::heartbeat:cmha):  Stopped
>> pengine:     info: native_color:      Resource cmha cannot run anywhere
>> pengine:     info: LogActions:        Leave   cmha    (Stopped)
>>
>> To recover the cmha resource I need to run either:
>> 1) crm resource cleanup cmha
>> 2) crm resource reprobe
>>
>> After either of the above commands, the resource is picked up by
>> pacemaker again and I see valid scores:
>> Resource  Score      Node                                    Stickiness #Fail Migration-Threshold
>> cmha      100        aic-controller-58055.test.domain.local  1          0     3
>> cmha      101        aic-controller-12993.test.domain.local  1          0     3
>> cmha      -INFINITY  aic-controller-50186.test.domain.local  1          0     3
>>
>> So the questions are: why doesn't cluster-recheck work, and should it
>> do reprobing?
>> How can I make migration work, or what did I miss in the configuration
>> that prevents migration?
>>
>> corosync  2.3.4
>> pacemaker 1.1.14
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>