[ClusterLabs] Pacemaker resources are not scheduled
Jan Pokorný
jpokorny at redhat.com
Mon Apr 16 06:50:09 EDT 2018
Lkxjtu,
On 14/04/18 00:16 +0800, lkxjtu wrote:
> My cluster version:
> Corosync 2.4.0
> Pacemaker 1.1.16
>
> There are many resource anomalies. Some resources are only monitored
> and not recovered. Some resources are not monitored or recovered.
> Only one resource of vnm is scheduled normally, but this resource
> cannot be started because other resources in the cluster are
> abnormal. Just like a deadlock. I have been plagued by this problem
> for a long time. I just want a stable and highly available resource
> with infinite recovery for everyone. Is my resource configure
> correct?
See below.
> $ cat /etc/corosync/corosync.conf
> [co]mpatibility: whitetank
>
> [...]
>
> logging {
> fileline: off
> to_stderr: no
> to_logfile: yes
> logfile: /root/info/logs/pacemaker_cluster/corosync.log
> to_syslog: yes
> syslog_facility: daemon
> syslog_priority: info
> debug: off
> function_name: on
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> tags: enter|leave|trace1|trace2|trace3|trace4|trace6
> }
> }
>
> amf {
> mode: disabled
> }
>
> aisexec {
> user: root
> group: root
> }
You appear to be mixing in configuration directives from older major
version(s) of corosync than the one you say you are running.
See the corosync.conf(5) and votequorum(5) man pages for what you are
supposed to configure with the version you actually have.
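To spot such leftovers quickly, something along these lines may help.
It is only a sketch: the pattern list (compatibility, amf, aisexec) is
my assumption based on the file quoted above, not an official list of
removed directives, and find_legacy_directives is a made-up helper name.

```shell
# Print, with line numbers, directives that look like leftovers from
# corosync 1.x. The pattern list is an assumption, not an official list.
# $1: path to the corosync configuration file
find_legacy_directives() {
    grep -nE '^[[:space:]]*(compatibility[[:space:]]*:|(amf|aisexec)[[:space:]]*\{)' "$1"
}

# Usage (returns non-zero when nothing suspicious is found):
#   find_legacy_directives /etc/corosync/corosync.conf
```

Anything it prints is worth checking against corosync.conf(5) before
you keep it in the file.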
Regarding your pacemaker configuration:
> $ crm configure show
>
> [... reordered ... ]
>
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.16-12.el7-94ff4df \
> cluster-infrastructure=corosync \
> stonith-enabled=false \
> start-failure-is-fatal=false \
> load-threshold="3200%"
You are urged to configure fencing. Without it, expecting sane
cluster behaviour (which you do) is out of the question, unless
you know precisely why you are not configuring it.
>
> [... reordered ... ]
>
Furthermore, you are using custom resource agents of undisclosed
quality, and it is unclear whether they comply with the requirements:
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#ap-ocf
https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc
Since your resources come in isolated groups, I would go through them
one by one, trying to figure out why each group won't run as expected.
For instance:
> primitive inetmanager inetmanager \
> op monitor interval=10s timeout=160 \
> op stop interval=0 timeout=60s on-fail=restart \
> op start interval=0 timeout=60s on-fail=restart \
> meta migration-threshold=2 failure-timeout=60s resource-stickiness=100
> primitive inetmanager_vip IPaddr2 \
> params ip=122.0.1.201 cidr_netmask=24 \
> op start interval=0 timeout=20 \
> op stop interval=0 timeout=20 \
> op monitor timeout=20s interval=10s depth=0 \
> meta migration-threshold=3 failure-timeout=60s
> [...]
> colocation inetmanager_col +inf: inetmanager_vip inetmanager
> order inetmanager_order Mandatory: inetmanager inetmanager_vip
>
> [...]
>
> $ crm status
> [...]
> Full list of resources:
> [...]
> inetmanager_vip (ocf::heartbeat:IPaddr2): Stopped
> inetmanager (ocf::heartbeat:inetmanager): Stopped
>
> [...]
>
> corosync.log of node 122.0.1.10
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: warning: status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed (target: 7 vs. rc: 1): Error
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info: abort_transition_graph: Transition aborted by operation inetmanager_monitor_0 'modify' on 122.0.1.9: Event failed | magic=0:1;24:360:7:a7901eb1-462f-4259-a613-e0023ce8a6be cib=0.124.2400 source=match_graph_event:310 complete=false
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info: match_graph_event: Action inetmanager_monitor_0 (24) confirmed on 122.0.1.9 (rc=1)
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info: process_graph_event: Detected action (360.24) inetmanager_monitor_0.2152=unknown error: failed
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: warning: status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed (target: 7 vs. rc: 1): Error
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info: abort_transition_graph: Transition aborted by operation inetmanager_monitor_0 'modify' on 122.0.1.9: Event failed | magic=0:1;24:360:7:a7901eb1-462f-4259-a613-e0023ce8a6be cib=0.124.2400 source=match_graph_event:310 complete=false
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info: match_graph_event: Action inetmanager_monitor_0 (24) confirmed on 122.0.1.9 (rc=1)
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info: process_graph_event: Detected action (360.24) inetmanager_monitor_0.2152=unknown error: failed
What causes the inetmanager agent to return 1 (OCF_ERR_GENERIC) when
7 (OCF_NOT_RUNNING) is expected? monitor_0 is the initial probe, so on
a node where the resource is not running, 7 is the only clean answer.
It may be a trivial issue in the implementation of the agent that makes
"inetmanager" fail and, due to the respective constraints, keeps
"inetmanager_vip" stopped as well. Other isolated sets of resources may
be affected in a similar way.
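For illustration, a monitor action of a custom agent is expected to
behave roughly as sketched below. The pid file path and the function
body are assumptions (I have not seen the inetmanager agent); only the
exit code values come from the OCF resource agent API.

```shell
# OCF exit codes used here. A real agent would normally get these by
# sourcing ocf-shellfuncs rather than defining them itself.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical pid file location, overridable for testing
PIDFILE=${PIDFILE:-/var/run/inetmanager.pid}

inetmanager_monitor() {
    # No pid file: the service is cleanly stopped on this node.
    # This is what a probe on an inactive node has to report.
    [ -f "$PIDFILE" ] || return $OCF_NOT_RUNNING

    pid=$(cat "$PIDFILE" 2>/dev/null) || pid=""
    [ -n "$pid" ] || return $OCF_NOT_RUNNING

    # Signal 0 only checks that the process exists
    if kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS
    fi

    # Stale pid file: the process is gone, so it is not running
    return $OCF_NOT_RUNNING
}
```

The point is that "not running" is reported as 7, and 1 is kept for
cases where the agent cannot tell or something is really broken.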
You may find the ocf-tester script or the ocft test framework from the
resource-agents project useful for a basic sanity check of the custom
agents:
https://github.com/ClusterLabs/resource-agents/tree/master/tools/ocft
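Even before reaching for a test tool, you can run the monitor action by
hand on a node where the resource is stopped and look at the exit code.
A rough sketch follows; probe_agent is a made-up helper, and the agent
path in the usage comment is my guess derived from the
"ocf::heartbeat:inetmanager" line in your crm status output.

```shell
# Run the agent's monitor action by hand and report its exit code.
# $1: full path to the resource agent, $2: resource instance name
probe_agent() {
    rc=0
    # The agent's own output goes to stderr so that stdout carries
    # only the one-line summary below.
    OCF_ROOT="${OCF_ROOT:-/usr/lib/ocf}" OCF_RESOURCE_INSTANCE="$2" \
        "$1" monitor >&2 || rc=$?
    case $rc in
        0) echo "rc=0 (OCF_SUCCESS): running" ;;
        7) echo "rc=7 (OCF_NOT_RUNNING): cleanly stopped" ;;
        *) echo "rc=$rc: error, a probe of a stopped resource should return 7" ;;
    esac
}

# Usage (path is an assumption):
#   probe_agent /usr/lib/ocf/resource.d/heartbeat/inetmanager inetmanager
```

If this prints rc=1 on a node where inetmanager is not running, you
have reproduced what the log above shows, without the cluster involved.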
Hope this helps
--
Poki