[ClusterLabs] Pacemaker stopped monitoring the resource

Tue Sep 5 02:54:37 EDT 2017

Ken,

I have another set of logs:

Sep 01 09:10:05 [1328] TPC-F9-26.phaedrus.sandvine.com       crmd:
info: do_lrm_rsc_op: Performing
key=5:50864:0:86160921-abd7-4e14-94d4-f53cee278858 op=SVSDEHA_monitor_2000
SvsdeStateful(SVSDEHA)[6174]:   2017/09/01_09:10:06 ERROR: Resource is in
failed state
Sep 01 09:10:06 [1328] TPC-F9-26.phaedrus.sandvine.com       crmd:
info: action_synced_wait:    Managed SvsdeStateful_meta-data_0 process 6274
exited with rc=4
Sep 01 09:10:06 [1328] TPC-F9-26.phaedrus.sandvine.com       crmd:
error: generic_get_metadata:  Failed to receive meta-data for
ocf:pacemaker:SvsdeStateful
Sep 01 09:10:06 [1328] TPC-F9-26.phaedrus.sandvine.com       crmd:
error: build_operation_update:    No metadata for
ocf::pacemaker:SvsdeStateful
Sep 01 09:10:06 [1328] TPC-F9-26.phaedrus.sandvine.com       crmd:
info: process_lrm_event: Result of monitor operation for SVSDEHA on
TPC-F9-26.phaedrus.sandvine.com: 0 (ok) | call=939 key=SVSDEHA_monitor_2000
confirmed=false cib-update=476
Sep 01 09:10:06 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_process_request:   Forwarding cib_modify operation for section
status to all (origin=local/crmd/476)
Sep 01 09:10:06 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_perform_op:    Diff: --- 0.37.4054 2
Sep 01 09:10:06 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_perform_op:    Diff: +++ 0.37.4055 (null)
Sep 01 09:10:06 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_perform_op:    +  /cib:  @num_updates=4055
Sep 01 09:10:06 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_perform_op:    ++
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='SVSDEHA']:
<lrm_rsc_op id="SVSDEHA_monitor_2000" operation_key="SVSDEHA_monitor_2000"
operation="monitor" crm-debug-origin="do_update_resource"
crm_feature_set="3.0.10"
transition-key="5:50864:0:86160921-abd7-4e14-94d4-f53cee278858"
transition-magic="0:0;5:50864:0:86160921-abd7-4e14-94d4-f53cee278858"
on_node="TPC-F9-26.phaedrus.sandvi
Sep 01 09:10:06 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_process_request:   Completed cib_modify operation for section
status: OK (rc=0, origin=TPC-F9-26.phaedrus.sandvine.com/crmd/476,
version=0.37.4055)

*Sep 01 09:10:12 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_process_ping:  Reporting our current digest to
TPC-E9-23.phaedrus.sandvine.com: 74bbb7e9f35fabfdb624300891e32018 for
0.37.4055 (0x7f5719954560 0)
Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_perform_op:    Diff: --- 0.37.4055 2*
Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_perform_op:    Diff: +++ 0.37.4056 (null)
Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_perform_op:    +  /cib:  @num_updates=4056
Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_perform_op:    ++
/cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='SVSDEHA']:
<lrm_rsc_op id="SVSDEHA_last_failure_0"
operation_key="SVSDEHA_monitor_1000" operation="monitor"
crm-debug-origin="do_update_resource" crm_feature_set="3.0.10"
transition-key="7:50662:8:86160921-abd7-4e14-94d4-f53cee278858"
transition-magic="2:1;7:50662:8:86160921-abd7-4e14-94d4-f53cee278858"
on_node="TPC-E9-23.phaedrus.sand
Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_process_request:   Completed cib_modify operation for section
status: OK (rc=0, origin=TPC-E9-23.phaedrus.sandvine.com/crmd/53508,
version=0.37.4056)
Sep 01 09:15:33 [1327] TPC-F9-26.phaedrus.sandvine.com      attrd:
info: attrd_peer_update: Setting fail-count-SVSDEHA[
TPC-E9-23.phaedrus.sandvine.com]: (null) -> 1 from
TPC-E9-23.phaedrus.sandvine.com
Sep 01 09:15:33 [1327] TPC-F9-26.phaedrus.sandvine.com      attrd:
info: attrd_peer_update: Setting last-failure-SVSDEHA[
TPC-E9-23.phaedrus.sandvine.com]: (null) -> 1504271733 from
TPC-E9-23.phaedrus.sandvine.com
Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_perform_op:    Diff: --- 0.37.4056 2
Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_perform_op:    Diff: +++ 0.37.4057 (null)
Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_perform_op:    +  /cib:  @num_updates=4057
Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:
info: cib_perform_op:    ++
/cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']:
<nvpair id="status-2-fail-count-SVSDEHA" name="fail-count-SVSDEHA"
value="1"/>
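
As a side note, the last-failure value in the attrd line above looks like
seconds since the epoch. A minimal Python sketch to convert it, assuming
that interpretation (the helper name epoch_to_utc is made up for
illustration and is not part of any Pacemaker tool):

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Value taken from the "Setting last-failure-SVSDEHA" log line.
print(epoch_to_utc(1504271733).isoformat())
```

This gives 13:15:33 UTC on 2017-09-01, i.e. 09:15:33 EDT, which matches
the timestamp of the log entry itself.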

I suspect the highlighted part of the logs above (marked with asterisks).
After 09:10:12 the next log entry is at 09:15:33. During this time the
resource on the other node failed several times but was never migrated here.
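
To spot such silent periods without reading the whole file, something
like the following rough Python sketch could be used. It assumes every
entry starts with a "Mon DD HH:MM:SS" stamp as in the excerpt above, that
month names parse under the default C/English locale, and that the year
(which the log does not record) is passed in by hand. The function name
find_gaps and the 60-second threshold are arbitrary choices for
illustration:

```python
import re
from datetime import datetime, timedelta

# Leading "Sep 01 09:10:05" stamp of an entry; an optional "*" allows for
# the highlight marker used in the excerpt.
STAMP = re.compile(r"^\*?([A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2}) ")

def find_gaps(lines, threshold=timedelta(seconds=60), year=2017):
    """Return (previous, current) timestamp pairs more than `threshold` apart."""
    gaps = []
    previous = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # continuation line of a wrapped entry
        current = datetime.strptime(f"{year} {match.group(1)}",
                                    "%Y %b %d %H:%M:%S")
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps

sample = [
    "Sep 01 09:10:06 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:",
    "info: cib_process_request:   Completed cib_modify operation for section",
    "Sep 01 09:10:12 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:",
    "Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:",
]
for before, after in find_gaps(sample):
    print(before.time(), "->", after.time(), "gap:", after - before)
```

On the sample lines this reports the one gap between 09:10:12 and
09:15:33. A gap by itself does not prove that monitoring stopped, since
successful recurring monitors are not logged, but it narrows down where
to look.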

I have yet to test sbd fencing with the patch shared by Klaus.
I am on CentOS.

# cat /etc/centos-release
CentOS Linux release 7.3.1611 (Core)

Regards,
Abhay

On Sat, 2 Sep 2017 at 15:23 Klaus Wenninger <kwenning at redhat.com> wrote:

> On 09/01/2017 11:45 PM, Ken Gaillot wrote:
> > On Fri, 2017-09-01 at 15:06 +0530, Abhay B wrote:
> >>         Are you sure the monitor stopped? Pacemaker only logs
> >>         recurring monitors
> >>         when the status changes. Any successful monitors after this
> >>         wouldn't be
> >>         logged.
> >>
> >> Yes. Since there  were no logs which said "RecurringOp:  Start
> >> recurring monitor" on the node after it had failed.
> >> Also there were no logs for any actions pertaining to
> >> The problem was that even though the one node was failing, the
> >> resources were never moved to the other node(the node on which I
> >> suspect monitoring had stopped).
> >>
> >>
> >>         There are a lot of resource action failures, so I'm not sure
> >>         where the
> >>         issue is, but I'm guessing it has to do with
> >>         migration-threshold=1 --
> >>         once a resource has failed once on a node, it won't be allowed
> >>         back on
> >>         that node until the failure is cleaned up. Of course you also
> >>         have
> >>         failure-timeout=1s, which should clean it up immediately, so
> >>         I'm not
> >>         sure.
> >>
> >>
> >> migration-threshold=1
> >> failure-timeout=1s
> >>
> >> cluster-recheck-interval=2s
> >>
> >>
> >>         first, set "two_node:
> >>         1" in corosync.conf and let no-quorum-policy default in
> >>         pacemaker
> >>
> >>
> >> This is already configured.
> >> # cat /etc/corosync/corosync.conf
> >> totem {
> >>     version: 2
> >>     secauth: off
> >>     cluster_name: SVSDEHA
> >>     transport: udpu
> >>     token: 5000
> >> }
> >>
> >>
> >> nodelist {
> >>     node {
> >>         ring0_addr: 2.0.0.10
> >>         nodeid: 1
> >>     }
> >>
> >>
> >>     node {
> >>         ring0_addr: 2.0.0.11
> >>         nodeid: 2
> >>     }
> >> }
> >>
> >>
> >> quorum {
> >>     provider: corosync_votequorum
> >>     two_node: 1
> >> }
> >>
> >>
> >> logging {
> >>     to_logfile: yes
> >>     logfile: /var/log/cluster/corosync.log
> >>     to_syslog: yes
> >> }
> >>
> >>
> >>         let no-quorum-policy default in pacemaker; then,
> >>         get stonith configured, tested, and enabled
> >>
> >>
> >> By not configuring no-quorum-policy, would it ignore quorum for a 2
> >> node cluster?
> > With two_node, corosync always provides quorum to pacemaker, so
> > pacemaker doesn't see any quorum loss. The only significant difference
> > from ignoring quorum is that corosync won't form a cluster from a cold
> > start unless both nodes can reach each other (a safety feature).
> >
> >> For my use case I don't need stonith enabled. My intention is to have
> >> a highly available system all the time.
> > Stonith is the only way to recover from certain types of failure, such
> > as the "split brain" scenario, and a resource that fails to stop.
> >
> > If your nodes are physical machines with hardware watchdogs, you can set
> > up sbd for fencing without needing any extra equipment.
>
> Small caveat here:
> If I get it right you have a 2-node-setup. In this case the watchdog-only
> sbd-setup would not be usable as it relies on 'real' quorum.
> In 2-node-setups sbd needs at least a single shared disk.
> For the sbd-single-disk-setup working with 2-node
> you need the patch from https://github.com/ClusterLabs/sbd/pull/23
> in place. (Saw you mentioning RHEL documentation - RHEL-7.4 has
> it in since GA)
>
> Regards,
> Klaus
>
> >
> >> I will test my RA again as suggested with no-quorum-policy=default.
> >>
> >>
> >> One more doubt.
> >> Why do we see this in 'pcs property'?
> >> last-lrm-refresh: 1504090367
> >>
> >>
> >>
> >> Never seen this on a healthy cluster.
> >> From RHEL documentation:
> >> last-lrm-refresh
> >>
> >> Last refresh of the
> >> Local Resource Manager,
> >> given in units of
> >> seconds since epoch.
> >> Used for diagnostic
> >> purposes; not
> >> user-configurable.
> >>
> >>
> >> Doesn't explain much.
> > Whenever a cluster property changes, the cluster rechecks the current
> > state to see if anything needs to be done. last-lrm-refresh is just a
> > dummy property that the cluster uses to trigger that. It's set in
> > certain rare circumstances when a resource cleanup is done. You should
> > see a line in your logs like "Triggering a refresh after ... deleted ...
> > from the LRM". That might give some idea of why.
> >
> >> Also, does average CPU load impact resource monitoring?
> >>
> >>
> >> Regards,
> >> Abhay
> > Well, it could cause the monitor to take so long that it times out. The
> > only direct effect of load on pacemaker is that the cluster might lower
> > the number of agent actions that it can execute simultaneously.
> >
> >
> >> On Thu, 31 Aug 2017 at 20:11 Ken Gaillot <kgaillot at redhat.com> wrote:
> >>
> >>         On Thu, 2017-08-31 at 06:41 +0000, Abhay B wrote:
> >>         > Hi,
> >>         >
> >>         >
> >>         > I have a 2 node HA cluster configured on CentOS 7 with pcs
> >>         command.
> >>         >
> >>         >
> >>         > Below are the properties of the cluster :
> >>         >
> >>         >
> >>         > # pcs property
> >>         > Cluster Properties:
> >>         >  cluster-infrastructure: corosync
> >>         >  cluster-name: SVSDEHA
> >>         >  cluster-recheck-interval: 2s
> >>         >  dc-deadtime: 5
> >>         >  dc-version: 1.1.15-11.el7_3.5-e174ec8
> >>         >  have-watchdog: false
> >>         >  last-lrm-refresh: 1504090367
> >>         >  no-quorum-policy: ignore
> >>         >  start-failure-is-fatal: false
> >>         >  stonith-enabled: false
> >>         >
> >>         >
> >>         > PFA the cib.
> >>         > Also attached is the corosync.log around the time the below
> >>         issue
> >>         > happened.
> >>         >
> >>         >
> >>         > After around 10 hrs and multiple failures, pacemaker stops
> >>         monitoring
> >>         > resource on one of the nodes in the cluster.
> >>         >
> >>         >
> >>         > So even though the resource on other node fails, it is never
> >>         migrated
> >>         > to the node on which the resource is not monitored.
> >>         >
> >>         >
> >>         > Wanted to know what could have triggered this and how to
> >>         avoid getting
> >>         > into such scenarios.
> >>         > I am going through the logs and couldn't find why this
> >>         happened.
> >>         >
> >>         >
> >>         > After this log the monitoring stopped.
> >>         >
> >>         > Aug 29 11:01:44 [16500] TPC-D12-10-002.phaedrus.sandvine.com
> >>         > crmd:     info: process_lrm_event:   Result of monitor
> >>         operation for
> >>         > SVSDEHA on TPC-D12-10-002.phaedrus.sandvine.com: 0 (ok) |
> >>         call=538
> >>         > key=SVSDEHA_monitor_2000 confirmed=false cib-update=50013
> >>
> >>         Are you sure the monitor stopped? Pacemaker only logs
> >>         recurring monitors
> >>         when the status changes. Any successful monitors after this
> >>         wouldn't be
> >>         logged.
> >>
> >>         > Below log says the resource is leaving the cluster.
> >>         > Aug 29 11:01:44 [16499] TPC-D12-10-002.phaedrus.sandvine.com
> >>         > pengine:     info: LogActions:  Leave   SVSDEHA:0
> >>          (Slave
> >>         > TPC-D12-10-002.phaedrus.sandvine.com)
> >>
> >>         This means that the cluster will leave the resource where it
> >>         is (i.e. it
> >>         doesn't need a start, stop, move, demote, promote, etc.).
> >>
> >>         > Let me know if anything more is needed.
> >>         >
> >>         >
> >>         > Regards,
> >>         > Abhay
> >>         >
> >>         >
> >>         > PS: 'pcs resource cleanup' brought the cluster back into a
> >>         good state.
> >>
> >>         There are a lot of resource action failures, so I'm not sure
> >>         where the
> >>         issue is, but I'm guessing it has to do with
> >>         migration-threshold=1 --
> >>         once a resource has failed once on a node, it won't be allowed
> >>         back on
> >>         that node until the failure is cleaned up. Of course you also
> >>         have
> >>         failure-timeout=1s, which should clean it up immediately, so
> >>         I'm not
> >>         sure.
> >>
> >>         My gut feeling is that you're trying to do too many things at
> >>         once. I'd
> >>         start over from scratch and proceed more slowly: first, set
> >>         "two_node:
> >>         1" in corosync.conf and let no-quorum-policy default in
> >>         pacemaker; then,
> >>         get stonith configured, tested, and enabled; then, test your
> >>         resource
> >>         agent manually on the command line to make sure it conforms to
> >>         the
> >>         expected return values
> >>         (http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf);
> >>         then add your resource to the cluster without
> >>         migration-threshold or failure-timeout, and work out any
> >>         issues with frequent failures; then finally set
> >>         migration-threshold and failure-timeout to reflect how you
> >>         want recovery to proceed.
> >>         --
> >>         Ken Gaillot <kgaillot at redhat.com>
> >>
> >>
> >>
> >>
> >>
> >>         _______________________________________________
> >>         Users mailing list: Users at clusterlabs.org
> >>         http://lists.clusterlabs.org/mailman/listinfo/users
> >>
> >>         Project Home: http://www.clusterlabs.org
> >>         Getting started:
> >>         http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>         Bugs: http://bugs.clusterlabs.org