[ClusterLabs] Pacemaker stopped monitoring the resource
Ken Gaillot
kgaillot at redhat.com
Fri Sep 1 17:45:34 EDT 2017
On Fri, 2017-09-01 at 15:06 +0530, Abhay B wrote:
> Are you sure the monitor stopped? Pacemaker only logs
> recurring monitors
> when the status changes. Any successful monitors after this
> wouldn't be
> logged.
>
> Yes, since there were no logs which said "RecurringOp: Start
> recurring monitor" on the node after it had failed.
> Also there were no logs for any actions pertaining to
> The problem was that even though one node was failing, the
> resources were never moved to the other node (the node on which I
> suspect monitoring had stopped).
>
>
> There are a lot of resource action failures, so I'm not sure
> where the
> issue is, but I'm guessing it has to do with
> migration-threshold=1 --
> once a resource has failed once on a node, it won't be allowed
> back on
> that node until the failure is cleaned up. Of course you also
> have
> failure-timeout=1s, which should clean it up immediately, so
> I'm not
> sure.
>
>
> migration-threshold=1
> failure-timeout=1s
>
> cluster-recheck-interval=2s
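For reference, migration-threshold and failure-timeout are resource
meta-attributes and cluster-recheck-interval is a cluster property, so
they can be adjusted later with something like the following (the values
shown are only an illustration, not a recommendation):

# pcs resource meta SVSDEHA migration-threshold=3 failure-timeout=60s
# pcs property set cluster-recheck-interval=1min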
>
>
> first, set "two_node:
> 1" in corosync.conf and let no-quorum-policy default in
> pacemaker
>
>
> This is already configured.
> # cat /etc/corosync/corosync.conf
> totem {
>     version: 2
>     secauth: off
>     cluster_name: SVSDEHA
>     transport: udpu
>     token: 5000
> }
>
> nodelist {
>     node {
>         ring0_addr: 2.0.0.10
>         nodeid: 1
>     }
>
>     node {
>         ring0_addr: 2.0.0.11
>         nodeid: 2
>     }
> }
>
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
> }
>
> logging {
>     to_logfile: yes
>     logfile: /var/log/cluster/corosync.log
>     to_syslog: yes
> }
>
>
> let no-quorum-policy default in pacemaker; then,
> get stonith configured, tested, and enabled
>
>
> By not configuring no-quorum-policy, would it ignore quorum for a 2
> node cluster?
With two_node, corosync always provides quorum to pacemaker, so
pacemaker doesn't see any quorum loss. The only significant difference
from ignoring quorum is that corosync won't form a cluster from a cold
start unless both nodes can reach each other (a safety feature).
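If you want no-quorum-policy to fall back to its default, you can remove
the explicit setting; something like this should do it (exact syntax may
vary with your pcs version):

# pcs property set no-quorum-policy=
or equivalently:
# crm_attribute --name no-quorum-policy --delete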
> For my use case I don't need stonith enabled. My intention is to have
> a highly available system all the time.
Stonith is the only way to recover from certain types of failure, such
as a "split brain" scenario or a resource that fails to stop.
If your nodes are physical machines with hardware watchdogs, you can set
up sbd for fencing without needing any extra equipment.
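Very roughly, the watchdog-based sbd setup looks something like the
following; treat it as a sketch and check the sbd documentation for your
distribution (the cluster needs a full restart for sbd to take effect,
and the timeout value here is only an example):

# yum install sbd
# pcs stonith sbd enable
# pcs property set stonith-watchdog-timeout=10s
# pcs property set stonith-enabled=true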
> I will test my RA again as suggested with no-quorum-policy=default.
>
>
> One more doubt.
> Why do we see this in 'pcs property'?
> last-lrm-refresh: 1504090367
>
>
>
> Never seen this on a healthy cluster.
> From RHEL documentation:
> last-lrm-refresh: Last refresh of the Local Resource Manager, given in
> units of seconds since epoch. Used for diagnostic purposes; not
> user-configurable.
>
>
> Doesn't explain much.
Whenever a cluster property changes, the cluster rechecks the current
state to see if anything needs to be done. last-lrm-refresh is just a
dummy property that the cluster uses to trigger that. It's set in
certain rare circumstances when a resource cleanup is done. You should
see a line in your logs like "Triggering a refresh after ... deleted ...
from the LRM". That might give some idea of why.
> Also, does average CPU load impact resource monitoring?
>
>
> Regards,
> Abhay
Well, it could cause the monitor to take so long that it times out. The
only direct effect of load on pacemaker is that the cluster might lower
the number of agent actions that it can execute simultaneously.
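If load spikes are a possibility, you can give the monitor more headroom
with a longer timeout, e.g. something along these lines (keep whatever
interval you already use; the timeout is just an example):

# pcs resource update SVSDEHA op monitor interval=2s timeout=30s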
> On Thu, 31 Aug 2017 at 20:11 Ken Gaillot <kgaillot at redhat.com> wrote:
>
> On Thu, 2017-08-31 at 06:41 +0000, Abhay B wrote:
> > Hi,
> >
> >
> > I have a 2-node HA cluster configured on CentOS 7 with the pcs
> > command.
> >
> >
> > Below are the properties of the cluster :
> >
> >
> > # pcs property
> > Cluster Properties:
> > cluster-infrastructure: corosync
> > cluster-name: SVSDEHA
> > cluster-recheck-interval: 2s
> > dc-deadtime: 5
> > dc-version: 1.1.15-11.el7_3.5-e174ec8
> > have-watchdog: false
> > last-lrm-refresh: 1504090367
> > no-quorum-policy: ignore
> > start-failure-is-fatal: false
> > stonith-enabled: false
> >
> >
> > PFA the cib.
> > Also attached is the corosync.log around the time the below
> > issue
> > happened.
> >
> >
> > After around 10 hrs and multiple failures, pacemaker stops
> > monitoring the resource on one of the nodes in the cluster.
> >
> >
> > So even though the resource on the other node fails, it is never
> > migrated to the node on which the resource is not monitored.
> >
> >
> > Wanted to know what could have triggered this and how to
> > avoid getting
> > into such scenarios.
> > I went through the logs and couldn't find why this
> > happened.
> >
> >
> > After this log the monitoring stopped.
> >
> > Aug 29 11:01:44 [16500] TPC-D12-10-002.phaedrus.sandvine.com
> > crmd: info: process_lrm_event: Result of monitor operation for
> > SVSDEHA on TPC-D12-10-002.phaedrus.sandvine.com: 0 (ok) | call=538
> > key=SVSDEHA_monitor_2000 confirmed=false cib-update=50013
>
> Are you sure the monitor stopped? Pacemaker only logs
> recurring monitors
> when the status changes. Any successful monitors after this
> wouldn't be
> logged.
>
> > Below log says the resource is leaving the cluster.
> > Aug 29 11:01:44 [16499] TPC-D12-10-002.phaedrus.sandvine.com
> > pengine: info: LogActions: Leave SVSDEHA:0 (Slave
> > TPC-D12-10-002.phaedrus.sandvine.com)
>
> This means that the cluster will leave the resource where it
> is (i.e. it
> doesn't need a start, stop, move, demote, promote, etc.).
>
> > Let me know if anything more is needed.
> >
> >
> > Regards,
> > Abhay
> >
> >
> > PS: 'pcs resource cleanup' brought the cluster back into good
> > state.
>
> There are a lot of resource action failures, so I'm not sure
> where the
> issue is, but I'm guessing it has to do with
> migration-threshold=1 --
> once a resource has failed once on a node, it won't be allowed
> back on
> that node until the failure is cleaned up. Of course you also
> have
> failure-timeout=1s, which should clean it up immediately, so
> I'm not
> sure.
>
> My gut feeling is that you're trying to do too many things at
> once. I'd
> start over from scratch and proceed more slowly: first, set
> "two_node:
> 1" in corosync.conf and let no-quorum-policy default in
> pacemaker; then,
> get stonith configured, tested, and enabled; then, test your
> resource
> agent manually on the command line to make sure it conforms to
> the
> expected return values
> ( http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf );
> then add your resource to the cluster without migration-threshold or
> failure-timeout, and work out any issues with frequent failures; then
> finally set migration-threshold and failure-timeout to reflect how you
> want recovery to proceed.
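On testing the agent manually: you can exercise it by hand and check
that it returns the documented exit codes (0 for success, 7 for "not
running", and so on). A rough sketch, with the agent path and parameter
as placeholders for your own (ocf-tester ships with the resource-agents
package):

# OCF_ROOT=/usr/lib/ocf OCF_RESKEY_<param>=<value> \
    /usr/lib/ocf/resource.d/<provider>/<agent> monitor; echo $?
# ocf-tester -n SVSDEHA -o <param>=<value> \
    /usr/lib/ocf/resource.d/<provider>/<agent>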
> --
> Ken Gaillot <kgaillot at redhat.com>
>
>
>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
--
Ken Gaillot <kgaillot at redhat.com>