[ClusterLabs] Pacemaker resources are not scheduled

Mon Apr 16 06:50:09 EDT 2018

Lkxjtu,

On 14/04/18 00:16 +0800, lkxjtu wrote:
> My cluster version:
> Corosync 2.4.0
> Pacemaker 1.1.16
> 
> There are many resource anomalies. Some resources are only monitored
> and not recovered. Some resources are not monitored or recovered.
> Only one resource of vnm is scheduled normally, but this resource
> cannot be started because other resources in the cluster are
> abnormal. Just like a deadlock. I have been plagued by this problem
> for a long time. I just want a stable and highly available resource
> with infinite recovery for everyone. Is my resource configure
> correct?

see below

> $ cat /etc/corosync/corosync.conf
> [co]mpatibility: whitetank
> 
> [...]
> 
> logging {
>   fileline:        off
>   to_stderr:       no
>   to_logfile:      yes
>   logfile:         /root/info/logs/pacemaker_cluster/corosync.log
>   to_syslog:       yes
>   syslog_facility: daemon
>   syslog_priority: info
>   debug:           off
>   function_name:   on
>   timestamp:       on
>   logger_subsys {
>     subsys: AMF
>     debug:  off
>     tags:   enter|leave|trace1|trace2|trace3|trace4|trace6
>   }
> }
> 
> amf {
>   mode: disabled
> }
> 
> aisexec {
>   user:  root
>   group: root
> }

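You are apparently mixing in configuration directives from older major
version(s) of corosync than the one you claim to be using (for
instance, the "compatibility" directive and the "amf" and "aisexec"
sections predate corosync 2.x).
See the corosync.conf(5) and votequorum(5) man pages for what you are
supposed to configure with your actual version.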
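
For orientation only, a minimal corosync 2.x layout could look roughly
like the sketch below. The cluster name, transport, node IDs and log
path are assumptions of mine, the two addresses are simply the ones
visible in your log, and your elided "[...]" parts may already cover
some of this, so check every value against the man pages:

```
totem {
    version: 2
    cluster_name: examplecluster    # hypothetical name
    transport: udpu                 # assumption: unicast UDP
}

nodelist {
    node {
        ring0_addr: 122.0.1.9
        nodeid: 1
    }
    node {
        ring0_addr: 122.0.1.10
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log    # hypothetical path
    to_syslog: yes
    timestamp: on
}
```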

Regarding your pacemaker configuration:

> $ crm configure show
> 
> [... reordered ... ]
> 
> property cib-bootstrap-options: \
>         have-watchdog=false \
>         dc-version=1.1.16-12.el7-94ff4df \
>         cluster-infrastructure=corosync \
>         stonith-enabled=false \
>         start-failure-is-fatal=false \
>         load-threshold="3200%"

You are urged to configure fencing (stonith-enabled=false turns it
off). Without it, expecting sane cluster behaviour (which you do) is
out of the question, unless you know precisely why you are not
configuring it.
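
As a rough sketch only, in the same crm syntax as your configuration:
the choice of fence agent (fence_ipmilan here), the resource name and
the upper-case placeholders are assumptions of mine and depend
entirely on your hardware; the node name is the one from your log.
You would need one such device per node (or one device covering
several nodes) before enabling the property:

```
primitive fence_node_9 stonith:fence_ipmilan \
        params pcmk_host_list=122.0.1.9 \
               ipaddr=BMC_ADDRESS login=BMC_USER passwd=BMC_PASSWORD \
        op monitor interval=60s
property cib-bootstrap-options: \
        stonith-enabled=true
```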

> 
> [... reordered ... ]
> 

Furthermore, you are using custom resource agents of undisclosed
quality and unverified compliance with the requirements:
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#ap-ocf
https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc

Since your resources come in isolated groups, I would go through them
one by one, trying to figure out why each group won't run as expected.

For instance:

> primitive inetmanager inetmanager \
>         op monitor interval=10s timeout=160 \
>         op stop interval=0 timeout=60s on-fail=restart \
>         op start interval=0 timeout=60s on-fail=restart \
>         meta migration-threshold=2 failure-timeout=60s resource-stickiness=100
> primitive inetmanager_vip IPaddr2 \
>         params ip=122.0.1.201 cidr_netmask=24 \
>         op start interval=0 timeout=20 \
>         op stop interval=0 timeout=20 \
>         op monitor timeout=20s interval=10s depth=0 \
>         meta migration-threshold=3 failure-timeout=60s
> [...]
> colocation inetmanager_col +inf: inetmanager_vip inetmanager
> order inetmanager_order Mandatory: inetmanager inetmanager_vip
> 
> [...]
> 
> $ crm status
> [...]
> Full list of resources:
> [...]
>  inetmanager_vip        (ocf::heartbeat:IPaddr2):       Stopped
>  inetmanager    (ocf::heartbeat:inetmanager):   Stopped
> 
> [...]
> 
> corosync.log of node 122.0.1.10
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:  warning: status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed (target: 7 vs. rc: 1): Error
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:     info: abort_transition_graph: Transition aborted by operation inetmanager_monitor_0 'modify' on 122.0.1.9: Event failed | magic=0:1;24:360:7:a7901eb1-462f-4259-a613-e0023ce8a6be cib=0.124.2400 source=match_graph_event:310 complete=false
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:     info: match_graph_event:      Action inetmanager_monitor_0 (24) confirmed on 122.0.1.9 (rc=1)
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:     info: process_graph_event:    Detected action (360.24) inetmanager_monitor_0.2152=unknown error: failed
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:  warning: status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed (target: 7 vs. rc: 1): Error
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:     info: abort_transition_graph: Transition aborted by operation inetmanager_monitor_0 'modify' on 122.0.1.9: Event failed | magic=0:1;24:360:7:a7901eb1-462f-4259-a613-e0023ce8a6be cib=0.124.2400 source=match_graph_event:310 complete=false
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:     info: match_graph_event:      Action inetmanager_monitor_0 (24) confirmed on 122.0.1.9 (rc=1)
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:     info: process_graph_event:    Detected action (360.24) inetmanager_monitor_0.2152=unknown error: failed

What causes the inetmanager agent to return 1 (OCF_ERR_GENERIC) when
7 (OCF_NOT_RUNNING) is expected?  The failing operation is the initial
probe (monitor_0), which expects a stopped resource to report "not
running".  It may be a trivial issue in the implementation of the
agent that makes the whole group, together with the "inetmanager_vip"
resource, fail (due to the respective constraints).

The same may apply to the other isolated sets of resources.
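
To illustrate what a probe expects, here is a minimal sketch of a
monitor action that reports a stopped service as 7 rather than 1.
I have not seen your agent, so the pidfile approach, the PIDFILE
variable and its default path are purely hypothetical:

```sh
#!/bin/sh
# Sketch of a probe-safe monitor action for a custom OCF agent.
# PIDFILE and its default path are hypothetical; adapt to the real service.

OCF_SUCCESS=0
OCF_NOT_RUNNING=7

PIDFILE="${PIDFILE:-/var/run/inetmanager.pid}"

inetmanager_monitor() {
    # No pidfile: the service is not started on this node (cleanly stopped).
    [ -f "$PIDFILE" ] || return $OCF_NOT_RUNNING

    # Empty or unreadable pidfile: nothing to check, treat as not running.
    pid=$(cat "$PIDFILE" 2>/dev/null)
    [ -n "$pid" ] || return $OCF_NOT_RUNNING

    # Signal 0 delivers nothing; it only checks that the process exists.
    if kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS
    fi

    # Stale pidfile: the process is gone, so report "not running".
    return $OCF_NOT_RUNNING
}
```

The point is that "not running" is a normal answer to a probe, so
return code 1 should be reserved for genuine errors.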

You may find the ocf-tester and ocft tools from the resource-agents
project useful for a basic sanity check of the custom agents:
https://github.com/ClusterLabs/resource-agents/tree/master/tools/ocft
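
Before reaching for those, you can also run the agent's monitor action
by hand on a node where the resource is stopped and look at the exit
code. A small helper along these lines would do; probe_agent is a
name I made up, and the default OCF_ROOT is the conventional
/usr/lib/ocf, which I assume applies to your installation:

```sh
#!/bin/sh
# probe_agent AGENT_PATH RESOURCE_NAME
# Runs the agent's "monitor" action once and interprets the exit code.
# Hypothetical helper, not part of pacemaker or resource-agents.
probe_agent() {
    agent=$1
    name=$2
    rc=0
    # The agent's own output goes to stderr so that only the verdict
    # is printed on stdout.
    OCF_ROOT="${OCF_ROOT:-/usr/lib/ocf}" OCF_RESOURCE_INSTANCE="$name" \
        "$agent" monitor >&2 || rc=$?
    case $rc in
        0) echo "running (rc=0)" ;;
        7) echo "not running (rc=7)" ;;
        *) echo "unexpected rc=$rc" ;;
    esac
    return $rc
}
```

Judging by your "crm status" output (ocf::heartbeat:inetmanager), the
call would presumably be
"probe_agent /usr/lib/ocf/resource.d/heartbeat/inetmanager inetmanager",
and ocf-tester takes a similar form:
"ocf-tester -n inetmanager /usr/lib/ocf/resource.d/heartbeat/inetmanager".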

Hope this helps

-- 
Poki