[ClusterLabs] Pacemaker stopped monitoring the resource
Klaus Wenninger
kwenning at redhat.com
Sat Sep 2 05:52:34 EDT 2017
On 09/01/2017 11:45 PM, Ken Gaillot wrote:
> On Fri, 2017-09-01 at 15:06 +0530, Abhay B wrote:
>> Are you sure the monitor stopped? Pacemaker only logs recurring
>> monitors when the status changes. Any successful monitors after this
>> wouldn't be logged.
>>
>> Yes, since there were no logs which said "RecurringOp: Start
>> recurring monitor" on the node after it had failed.
>> Also there were no logs for any actions pertaining to it.
>> The problem was that even though one node was failing, the
>> resources were never moved to the other node (the node on which I
>> suspect monitoring had stopped).
>>
>>
>> There are a lot of resource action failures, so I'm not sure where
>> the issue is, but I'm guessing it has to do with
>> migration-threshold=1 -- once a resource has failed once on a node,
>> it won't be allowed back on that node until the failure is cleaned
>> up. Of course you also have failure-timeout=1s, which should clean it
>> up immediately, so I'm not sure.
>>
>>
>> migration-threshold=1
>> failure-timeout=1s
>>
>> cluster-recheck-interval=2s
>>
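For reference, migration-threshold and failure-timeout are resource
meta-attributes and cluster-recheck-interval is a cluster property; a
sketch of how such values are typically set with pcs (assuming the
resource id is SVSDEHA, as in the logs):

    # pcs resource meta SVSDEHA migration-threshold=1 failure-timeout=1s
    # pcs property set cluster-recheck-interval=2s
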
>>
>> first, set "two_node: 1" in corosync.conf and let no-quorum-policy
>> default in pacemaker
>>
>>
>> This is already configured.
>> # cat /etc/corosync/corosync.conf
>> totem {
>>     version: 2
>>     secauth: off
>>     cluster_name: SVSDEHA
>>     transport: udpu
>>     token: 5000
>> }
>>
>> nodelist {
>>     node {
>>         ring0_addr: 2.0.0.10
>>         nodeid: 1
>>     }
>>
>>     node {
>>         ring0_addr: 2.0.0.11
>>         nodeid: 2
>>     }
>> }
>>
>> quorum {
>>     provider: corosync_votequorum
>>     two_node: 1
>> }
>>
>> logging {
>>     to_logfile: yes
>>     logfile: /var/log/cluster/corosync.log
>>     to_syslog: yes
>> }
>>
>>
>> let no-quorum-policy default in pacemaker; then,
>> get stonith configured, tested, and enabled
>>
>>
>> By not configuring no-quorum-policy, would it ignore quorum for a 2
>> node cluster?
> With two_node, corosync always provides quorum to pacemaker, so
> pacemaker doesn't see any quorum loss. The only significant difference
> from ignoring quorum is that corosync won't form a cluster from a cold
> start unless both nodes can reach each other (a safety feature).
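
To confirm what corosync itself is reporting, running

    # corosync-quorumtool -s

on either node should show Quorate: Yes, and with two_node set the
Flags line should include 2Node (and, by default, WaitForAll); the
exact output varies by corosync version.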
>
>> For my use case I don't need stonith enabled. My intention is to have
>> a highly available system all the time.
> Stonith is the only way to recover from certain types of failure, such
> as a "split brain" scenario or a resource that fails to stop.
>
> If your nodes are physical machines with hardware watchdogs, you can set
> up sbd for fencing without needing any extra equipment.
Small caveat here:
If I understand it right, you have a 2-node setup. In that case the
watchdog-only sbd setup would not be usable, as it relies on 'real'
quorum. In 2-node setups sbd needs at least one shared disk.
For the single-disk sbd setup to work with two_node, you need the
patch from https://github.com/ClusterLabs/sbd/pull/23 in place.
(I saw you mention the RHEL documentation - RHEL 7.4 has had it
since GA.)
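
A rough sketch of the disk-based part (the device path below is only a
placeholder, and the exact pcs/stonith integration depends on the
distribution and sbd version): initialize and check the header on the
shared disk,

    # sbd -d /dev/disk/by-id/<shared-disk> create
    # sbd -d /dev/disk/by-id/<shared-disk> dump

point sbd at it in /etc/sysconfig/sbd on both nodes,

    SBD_DEVICE="/dev/disk/by-id/<shared-disk>"
    SBD_WATCHDOG_DEV=/dev/watchdog

then enable the sbd service on both nodes (e.g. via pcs stonith sbd
enable where available, or by enabling the sbd service directly) and
turn stonith-enabled back on in pacemaker. Note that "sbd ... create"
overwrites the header on that disk.
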
Regards,
Klaus
>
>> I will test my RA again as suggested with no-quorum-policy=default.
>>
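Letting it default just means removing the explicitly set value, e.g.
(depending on pcs version):

    # pcs property unset no-quorum-policy

or equivalently pcs property set no-quorum-policy= with an empty value.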
>>
>> One more doubt.
>> Why do we see this in 'pcs property'?
>> last-lrm-refresh: 1504090367
>>
>>
>>
>> Never seen this on a healthy cluster.
>> From RHEL documentation:
>> last-lrm-refresh
>>     Last refresh of the Local Resource Manager, given in units of
>>     seconds since epoch. Used for diagnostic purposes; not
>>     user-configurable.
>>
>>
>> Doesn't explain much.
> Whenever a cluster property changes, the cluster rechecks the current
> state to see if anything needs to be done. last-lrm-refresh is just a
> dummy property that the cluster uses to trigger that. It's set in
> certain rare circumstances when a resource cleanup is done. You should
> see a line in your logs like "Triggering a refresh after ... deleted ...
> from the LRM". That might give some idea of why.
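
To find what set it, grepping the attached log for the message
mentioned above should be enough:

    # grep "Triggering a refresh" /var/log/cluster/corosync.log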
>
>> Also, does average CPU load impact resource monitoring?
>>
>>
>> Regards,
>> Abhay
> Well, it could cause the monitor to take so long that it times out. The
> only direct effect of load on pacemaker is that the cluster might lower
> the number of agent actions that it can execute simultaneously.
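
If load-induced monitor timeouts are the concern, one option (a sketch
only; pick a timeout that matches what the agent really needs, and note
that a master/slave resource has separately configured monitor
intervals) is to give the monitor operation a more generous timeout,
e.g.:

    # pcs resource update SVSDEHA op monitor interval=2s timeout=30s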
>
>
>> On Thu, 31 Aug 2017 at 20:11 Ken Gaillot <kgaillot at redhat.com> wrote:
>>
>> On Thu, 2017-08-31 at 06:41 +0000, Abhay B wrote:
>> > Hi,
>> >
>> >
>> > I have a 2-node HA cluster configured on CentOS 7 with the pcs
>> > command.
>> >
>> >
>> > Below are the properties of the cluster :
>> >
>> >
>> > # pcs property
>> > Cluster Properties:
>> > cluster-infrastructure: corosync
>> > cluster-name: SVSDEHA
>> > cluster-recheck-interval: 2s
>> > dc-deadtime: 5
>> > dc-version: 1.1.15-11.el7_3.5-e174ec8
>> > have-watchdog: false
>> > last-lrm-refresh: 1504090367
>> > no-quorum-policy: ignore
>> > start-failure-is-fatal: false
>> > stonith-enabled: false
>> >
>> >
>> > PFA the cib.
>> > Also attached is the corosync.log around the time the below issue
>> > happened.
>> >
>> >
>> > After around 10 hrs and multiple failures, pacemaker stops
>> > monitoring the resource on one of the nodes in the cluster.
>> >
>> >
>> > So even though the resource on the other node fails, it is never
>> > migrated to the node on which the resource is not monitored.
>> >
>> >
>> > Wanted to know what could have triggered this and how to avoid
>> > getting into such scenarios.
>> > I am going through the logs and couldn't find why this happened.
>> >
>> >
>> > After this log the monitoring stopped.
>> >
>> > Aug 29 11:01:44 [16500] TPC-D12-10-002.phaedrus.sandvine.com
>> > crmd: info: process_lrm_event: Result of monitor
>> operation for
>> > SVSDEHA on TPC-D12-10-002.phaedrus.sandvine.com: 0 (ok) |
>> call=538
>> > key=SVSDEHA_monitor_2000 confirmed=false cib-update=50013
>>
>> Are you sure the monitor stopped? Pacemaker only logs recurring
>> monitors when the status changes. Any successful monitors after this
>> wouldn't be logged.
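
One way to see how often the result actually changed (as opposed to how
often the monitor ran) is to grep the attached log for the operation
key, e.g.:

    # grep "SVSDEHA_monitor_2000" /var/log/cluster/corosync.log | tail

Since only result changes of a recurring monitor are logged, a long
silence there does not by itself mean the monitor stopped.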
>>
>> > Below log says the resource is leaving the cluster.
>> > Aug 29 11:01:44 [16499] TPC-D12-10-002.phaedrus.sandvine.com
>> > pengine: info: LogActions: Leave SVSDEHA:0
>> (Slave
>> > TPC-D12-10-002.phaedrus.sandvine.com)
>>
>> This means that the cluster will leave the resource where it is
>> (i.e. it doesn't need a start, stop, move, demote, promote, etc.).
>>
>> > Let me know if anything more is needed.
>> >
>> >
>> > Regards,
>> > Abhay
>> >
>> >
>> > PS: 'pcs resource cleanup' brought the cluster back into a good
>> > state.
>>
>> There are a lot of resource action failures, so I'm not sure where
>> the issue is, but I'm guessing it has to do with
>> migration-threshold=1 -- once a resource has failed once on a node,
>> it won't be allowed back on that node until the failure is cleaned
>> up. Of course you also have failure-timeout=1s, which should clean it
>> up immediately, so I'm not sure.
>>
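For what it's worth, a quick way to check whether migration-threshold
is what is keeping a node out is to look at the failcounts; clearing
them is what a cleanup does (assuming the resource id is SVSDEHA, as in
the logs):

    # pcs resource failcount show SVSDEHA
    # pcs resource cleanup SVSDEHA
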
>> My gut feeling is that you're trying to do too many things at once.
>> I'd start over from scratch and proceed more slowly: first, set
>> "two_node: 1" in corosync.conf and let no-quorum-policy default in
>> pacemaker; then, get stonith configured, tested, and enabled; then,
>> test your resource agent manually on the command line to make sure it
>> conforms to the expected return values
>> ( http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf );
>> then add your resource to the cluster without migration-threshold or
>> failure-timeout, and work out any issues with frequent failures; then
>> finally set migration-threshold and failure-timeout to reflect how you
>> want recovery to proceed.
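
A sketch of exercising an agent by hand (provider, agent and parameter
names below are placeholders for whatever your RA actually uses):

    # export OCF_ROOT=/usr/lib/ocf
    # export OCF_RESKEY_<param>=<value>
    # /usr/lib/ocf/resource.d/<provider>/<agent> start; echo $?
    # /usr/lib/ocf/resource.d/<provider>/<agent> monitor; echo $?
    # /usr/lib/ocf/resource.d/<provider>/<agent> stop; echo $?

The exit codes should match the table behind the link above (0 =
success, 7 = not running; a multistate agent's monitor should return 8
when running as master). The ocf-tester utility from the
resource-agents package automates much of this.
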
>> --
>> Ken Gaillot <kgaillot at redhat.com>
>>
>>
>>
>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org