[ClusterLabs] Pacemaker resources are not scheduled

lkxjtu lkxjtu at 163.com
Mon Apr 16 11:52:11 EDT 2018


> Lkxjtu,

> On 14/04/18 00:16 +0800, lkxjtu wrote:
>> My cluster version:
>> Corosync 2.4.0
>> Pacemaker 1.1.16
>>
>> There are many resource anomalies. Some resources are only monitored
>> and not recovered. Some resources are not monitored or recovered.
>> Only one resource of vnm is scheduled normally, but this resource
>> cannot be started because other resources in the cluster are
>> abnormal. Just like a deadlock. I have been plagued by this problem
>> for a long time. I just want a stable and highly available resource
>> with infinite recovery for everyone. Is my resource configure
>> correct?

> see below

>> $ cat /etc/corosync/corosync.conf
>> compatibility: whitetank
>>
>> [...]
>>
>> logging {
>>   fileline:        off
>>   to_stderr:       no
>>   to_logfile:      yes
>>   logfile:         /root/info/logs/pacemaker_cluster/corosync.log
>>   to_syslog:       yes
>>   syslog_facility: daemon
>>   syslog_priority: info
>>   debug:           off
>>   function_name:   on
>>   timestamp:       on
>>   logger_subsys {
>>     subsys: AMF
>>     debug:  off
>>     tags:   enter|leave|trace1|trace2|trace3|trace4|trace6
>>   }
>> }
>>
>> amf {
>>   mode: disabled
>> }
>>
>> aisexec {
>>   user:  root
>>   group: root
>> }

> You are apparently mixing configuration directives for older major
> version(s) of corosync than you claim to be using.
> See corosync_conf(5) + votequorum(5) man pages for what you are
> supposed to configure with the actual version.
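For reference, `compatibility: whitetank` and the `aisexec`/`amf` sections are OpenAIS/corosync 1.x era directives; under corosync 2.x they are ignored or invalid. A minimal quorum section for corosync 2.x looks roughly like this (a sketch; the vote count is a placeholder for your actual cluster size):

```
quorum {
    # corosync 2.x quorum is provided by votequorum, not the old quorum plugins
    provider: corosync_votequorum
    expected_votes: 3
}
```

See votequorum(5) for the related options (two_node, wait_for_all, etc.).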


Thank you for your detailed answer!
The corosync.conf is generated by our Ansible scripts, but corosync and pacemaker themselves are updated from the yum repository, which is how the version mismatch crept in. I will carefully compare the configuration directives between the old and new versions.


> Regarding your pacemaker configuration:

>> $ crm configure show
>>
>> [... reordered ... ]
>>
>> property cib-bootstrap-options: \
>>         have-watchdog=false \
>>         dc-version=1.1.16-12.el7-94ff4df \
>>         cluster-infrastructure=corosync \
>>         stonith-enabled=false \
>>         start-failure-is-fatal=false \
>>         load-threshold="3200%"

> You are urged to configure fencing, otherwise asking for sane
> cluster's behaviour (which you do) is out of question, unless
> you precisely know why you are not configuring it.


My environment is a virtual machine environment, and there is no physical fencing (STONITH) device. Can I still configure fencing? If so, how?
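For reference: virtual machines can usually be fenced through their hypervisor, e.g. with fence_virsh or fence_xvm on libvirt-based hosts, or with sbd plus a (software) watchdog. A hypothetical crm-shell sketch for fence_virsh follows; the hypervisor hostname, SSH key path, and VM domain name are all placeholders, not values from this thread:

```shell
# Hypothetical: fence node 122.0.1.9 by power-cycling its VM via the
# libvirt host over SSH (hostname, key, and domain name are placeholders).
crm configure primitive fence-node9 stonith:fence_virsh \
    params ipaddr=hypervisor.example.com login=root \
           identity_file=/root/.ssh/id_rsa \
           port=vm-node9 pcmk_host_list=122.0.1.9 \
    op monitor interval=60s
crm configure property stonith-enabled=true
```

One such stonith primitive per node (located away from the node it fences) is the usual layout.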


>>
>> [... reordered ... ]
>>
> Furthermore you are using custom resource agents of undisclosed
> quality and compatibility with the requirements:
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#ap-ocf
> https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc

> Since your resources come in isolated groups, I would go one
> by one, trying to figure out why the group won't run as expected.

> For instance:

>> primitive inetmanager inetmanager \
>>         op monitor interval=10s timeout=160 \
>>         op stop interval=0 timeout=60s on-fail=restart \
>>         op start interval=0 timeout=60s on-fail=restart \
>>         meta migration-threshold=2 failure-timeout=60s resource-stickiness=100
>> primitive inetmanager_vip IPaddr2 \
>>         params ip=122.0.1.201 cidr_netmask=24 \
>>         op start interval=0 timeout=20 \
>>         op stop interval=0 timeout=20 \
>>         op monitor timeout=20s interval=10s depth=0 \
>>         meta migration-threshold=3 failure-timeout=60s
>> [...]
>> colocation inetmanager_col +inf: inetmanager_vip inetmanager
>> order inetmanager_order Mandatory: inetmanager inetmanager_vip
>>
>> [...]
>>
>> $ crm status
>> [...]
>> Full list of resources:
>> [...]
>>  inetmanager_vip        (ocf::heartbeat:IPaddr2):       Stopped
>>  inetmanager    (ocf::heartbeat:inetmanager):   Stopped
>>
>> [...]
>>
>> corosync.log of node 122.0.1.10
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:  warning: status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed (target: 7 vs. rc: 1): Error
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:     info: abort_transition_graph: Transition aborted by operation inetmanager_monitor_0 'modify' on 122.0.1.9: Event failed | magic=0:1;24:360:7:a7901eb1-462f-4259-a613-e0023ce8a6be cib=0.124.2400 source=match_graph_event:310 complete=false
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:     info: match_graph_event:      Action inetmanager_monitor_0 (24) confirmed on 122.0.1.9 (rc=1)
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:     info: process_graph_event:    Detected action (360.24) inetmanager_monitor_0.2152=unknown error: failed
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:  warning: status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed (target: 7 vs. rc: 1): Error
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:     info: abort_transition_graph: Transition aborted by operation inetmanager_monitor_0 'modify' on 122.0.1.9: Event failed | magic=0:1;24:360:7:a7901eb1-462f-4259-a613-e0023ce8a6be cib=0.124.2400 source=match_graph_event:310 complete=false
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:     info: match_graph_event:      Action inetmanager_monitor_0 (24) confirmed on 122.0.1.9 (rc=1)
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10       crmd:     info: process_graph_event:    Detected action (360.24) inetmanager_monitor_0.2152=unknown error: failed
> What causes inetmanager agent to return 1 (OCF_ERR_GENERIC) when
> 7 (OCF_NOT_RUNNING) is expected?  It may be a trivial issue in the
> implementation of the agent, making the whole group together with
> "inetmanager_vip" resource fail (due to the respective constraints).
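> A frequent culprit: the initial probe (monitor_0) runs while the
> resource is stopped, and a monitor that exits non-zero for *any*
> failure reports OCF_ERR_GENERIC where OCF_NOT_RUNNING is required.
> A minimal sketch of a probe-safe, pidfile-based monitor (the pidfile
> path is hypothetical):

```shell
# Sketch of an OCF monitor action: distinguish "cleanly stopped" (rc 7)
# from "running" (rc 0). The pidfile path is a placeholder.
OCF_SUCCESS=0 OCF_ERR_GENERIC=1 OCF_NOT_RUNNING=7
pidfile=/var/run/inetmanager.pid

inetmanager_monitor() {
    # No pidfile: the probe on a node that never ran the service lands here.
    [ -f "$pidfile" ] || return $OCF_NOT_RUNNING
    pid=$(cat "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS       # process alive
    fi
    # Stale pidfile: the process died; still "not running", not an error.
    return $OCF_NOT_RUNNING
}
```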

> It may be similar with other isolated sets of resources.

> You may find ocf-tester (ocft) tool from resource-agents project useful
> to check a basic sanity of the custom agents:
> https://github.com/ClusterLabs/resource-agents/tree/master/tools/ocft
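> Invocation is straightforward, for example (the agent path and
> parameter name are placeholders for your actual agent):

```shell
# Run the resource-agents sanity checker against a custom OCF agent.
ocf-tester -n inetmanager_test \
    -o some_param=value \
    /usr/lib/ocf/resource.d/custom/inetmanager
```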

I did run ocf-tester and it passed. I then read the log carefully: when this error was printed, the latest operation Pacemaker had issued to the inetmanager RA was start, not stop. Why did Pacemaker expect the RA to return 7 at that point?
I think I may know the cause, and it lies in how my RA implements start and monitor. Because Pacemaker recovers one resource at a time, a resource must wait for other resources that are still starting. To break this dependency, my RA's start method only launches the instance and does not wait for the business logic to become functional; the monitor method then loops, waiting for the service to become healthy. This also filters out occasional monitor failures. However, in testing I found that the interval from start until the first successful monitor still blocks the scheduling of other resources, and inetmanager is the cause of the problem. Is my analysis correct? If so, how should I solve it?
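(Editorially, the usual fix is the opposite split: make start itself block until the service is actually functional, within the configured start timeout, and keep monitor a quick point-in-time check. A hedged sketch follows; the two helper functions are placeholders for the real launch and health-probe logic, not part of the original agent.)

```shell
# Sketch: "start" polls until the service is really usable, so ordered
# dependents only start once this resource works. Both helpers are
# placeholders for your real launch/health logic.
start_instance_in_background() { :; }        # placeholder: launch the daemon
service_is_healthy()          { return 0; }  # placeholder: business probe

inetmanager_start() {
    start_instance_in_background
    # Poll up to ~50s; keep this below the op start timeout (60s here).
    i=0
    while [ $i -lt 50 ]; do
        if service_is_healthy; then
            return 0              # OCF_SUCCESS: service is functional
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1                      # OCF_ERR_GENERIC: never became healthy
}
```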


> Hope this helps

> -- 
> Poki