[ClusterLabs] Pacemaker stopped monitoring the resource

Tue Sep 5 08:27:25 EDT 2017

On 09/05/2017 08:54 AM, Abhay B wrote:
> Ken,
>
> I have another set of logs:
>
> Sep 01 09:10:05 [1328] TPC-F9-26.phaedrus.sandvine.com       crmd:     info: do_lrm_rsc_op: Performing key=5:50864:0:86160921-abd7-4e14-94d4-f53cee278858 op=SVSDEHA_monitor_2000
> SvsdeStateful(SVSDEHA)[6174]:   2017/09/01_09:10:06 ERROR: Resource is in failed state
> Sep 01 09:10:06 [1328] TPC-F9-26.phaedrus.sandvine.com       crmd:     info: action_synced_wait:    Managed SvsdeStateful_meta-data_0 process 6274 exited with rc=4
> Sep 01 09:10:06 [1328] TPC-F9-26.phaedrus.sandvine.com       crmd:    error: generic_get_metadata:  Failed to receive meta-data for ocf:pacemaker:SvsdeStateful
> Sep 01 09:10:06 [1328] TPC-F9-26.phaedrus.sandvine.com       crmd:    error: build_operation_update:    No metadata for ocf::pacemaker:SvsdeStateful
> Sep 01 09:10:06 [1328] TPC-F9-26.phaedrus.sandvine.com       crmd:     info: process_lrm_event: Result of monitor operation for SVSDEHA on TPC-F9-26.phaedrus.sandvine.com: 0 (ok) | call=939 key=SVSDEHA_monitor_2000 confirmed=false cib-update=476
> Sep 01 09:10:06 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_process_request:   Forwarding cib_modify operation for section status to all (origin=local/crmd/476)
> Sep 01 09:10:06 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_perform_op:    Diff: --- 0.37.4054 2
> Sep 01 09:10:06 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_perform_op:    Diff: +++ 0.37.4055 (null)
> Sep 01 09:10:06 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_perform_op:    +  /cib:  @num_updates=4055
> Sep 01 09:10:06 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_perform_op:    ++ /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='SVSDEHA']:  <lrm_rsc_op id="SVSDEHA_monitor_2000" operation_key="SVSDEHA_monitor_2000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="5:50864:0:86160921-abd7-4e14-94d4-f53cee278858" transition-magic="0:0;5:50864:0:86160921-abd7-4e14-94d4-f53cee278858" on_node="TPC-F9-26.phaedrus.sandvi
> Sep 01 09:10:06 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_process_request:   Completed cib_modify operation for section status: OK (rc=0, origin=TPC-F9-26.phaedrus.sandvine.com/crmd/476, version=0.37.4055)
> *Sep 01 09:10:12 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_process_ping:  Reporting our current digest to TPC-E9-23.phaedrus.sandvine.com: 74bbb7e9f35fabfdb624300891e32018 for 0.37.4055 (0x7f5719954560 0)
> Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_perform_op:    Diff: --- 0.37.4055 2*
> Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_perform_op:    Diff: +++ 0.37.4056 (null)
> Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_perform_op:    +  /cib:  @num_updates=4056
> Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_perform_op:    ++ /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='SVSDEHA']:  <lrm_rsc_op id="SVSDEHA_last_failure_0" operation_key="SVSDEHA_monitor_1000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="7:50662:8:86160921-abd7-4e14-94d4-f53cee278858" transition-magic="2:1;7:50662:8:86160921-abd7-4e14-94d4-f53cee278858" on_node="TPC-E9-23.phaedrus.sand
> Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_process_request:   Completed cib_modify operation for section status: OK (rc=0, origin=TPC-E9-23.phaedrus.sandvine.com/crmd/53508, version=0.37.4056)
> Sep 01 09:15:33 [1327] TPC-F9-26.phaedrus.sandvine.com      attrd:     info: attrd_peer_update: Setting fail-count-SVSDEHA[TPC-E9-23.phaedrus.sandvine.com]: (null) -> 1 from TPC-E9-23.phaedrus.sandvine.com
> Sep 01 09:15:33 [1327] TPC-F9-26.phaedrus.sandvine.com      attrd:     info: attrd_peer_update: Setting last-failure-SVSDEHA[TPC-E9-23.phaedrus.sandvine.com]: (null) -> 1504271733 from TPC-E9-23.phaedrus.sandvine.com
> Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_perform_op:    Diff: --- 0.37.4056 2
> Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_perform_op:    Diff: +++ 0.37.4057 (null)
> Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_perform_op:    +  /cib:  @num_updates=4057
> Sep 01 09:15:33 [1325] TPC-F9-26.phaedrus.sandvine.com        cib:     info: cib_perform_op:    ++ /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']:  <nvpair id="status-2-fail-count-SVSDEHA" name="fail-count-SVSDEHA" value="1"/>
>
> I suspect the problem lies around the highlighted parts of the logs above.
> After 09:10:12 the next log entry is at 09:15:33. During this time the other
> node failed several times, but the resource was not migrated here.
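
In case it helps when going through longer logs, the size of such a gap
can be checked mechanically. This is only a minimal Python sketch: it
assumes the "Mon DD HH:MM:SS" prefix seen above and a year of 2017 (the
log lines carry no year), the sample lines are abbreviated from the log
above, and the names find_gaps and parse_stamp plus the 60-second
threshold are made up for illustration.

```python
from datetime import datetime, timedelta

def parse_stamp(line, year=2017):
    """Parse the leading 'Mon DD HH:MM:SS' of a log line (the year is an assumption)."""
    stamp = " ".join(line.lstrip("*> ").split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def find_gaps(lines, threshold=timedelta(seconds=60)):
    """Return (previous, current, gap) for consecutive entries further apart than threshold."""
    stamps = [parse_stamp(line) for line in lines]
    return [(a, b, b - a) for a, b in zip(stamps, stamps[1:]) if b - a > threshold]

# Abbreviated copies of three consecutive entries from the log above.
sample = [
    "Sep 01 09:10:06 [1325] ... cib_process_request: Completed cib_modify ...",
    "Sep 01 09:10:12 [1325] ... cib_process_ping: Reporting our current digest ...",
    "Sep 01 09:15:33 [1325] ... cib_perform_op: Diff: --- 0.37.4055 2",
]
for prev, cur, gap in find_gaps(sample):
    print(f"{prev:%H:%M:%S} -> {cur:%H:%M:%S}: {gap}")
```

Lines that do not start with a timestamp (such as the resource agent's
own ERROR line) would have to be filtered out first. Also keep in mind
that successful recurring monitors are not logged, so a quiet stretch
on its own does not prove that monitoring stopped.
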
>
> I have yet to test sbd fencing with the patch shared by Klaus.
> I am on CentOS. 
>
> # cat /etc/centos-release
> CentOS Linux release 7.3.1611 (Core)

I would expect CentOS Linux release 7.4.1708 to include the patch
mentioned.
I'm currently on a train with a slow and flaky internet connection, so
checking would probably be a pain at the moment ...
IIRC the RHEL 7.4 package worked fine on RHEL 7.3, so you might be
lucky just taking sbd from there.

Regards,
Klaus

> Regards,
> Abhay
>
>
>
>
> On Sat, 2 Sep 2017 at 15:23 Klaus Wenninger <kwenning at redhat.com> wrote:
>
>     On 09/01/2017 11:45 PM, Ken Gaillot wrote:
>     > On Fri, 2017-09-01 at 15:06 +0530, Abhay B wrote:
>     >>         Are you sure the monitor stopped? Pacemaker only logs
>     >>         recurring monitors when the status changes. Any successful
>     >>         monitors after this wouldn't be logged.
>     >>
>     >> Yes, since there were no logs which said "RecurringOp:  Start
>     >> recurring monitor" on the node after it had failed.
>     >> Also there were no logs for any actions pertaining to
>     >> The problem was that even though the one node was failing, the
>     >> resources were never moved to the other node (the node on which I
>     >> suspect monitoring had stopped).
>     >>
>     >>
>     >>         There are a lot of resource action failures, so I'm not sure
>     >>         where the issue is, but I'm guessing it has to do with
>     >>         migration-threshold=1 -- once a resource has failed once on a
>     >>         node, it won't be allowed back on that node until the failure
>     >>         is cleaned up. Of course you also have failure-timeout=1s,
>     >>         which should clean it up immediately, so I'm not sure.
>     >>
>     >>
>     >> migration-threshold=1
>     >> failure-timeout=1s
>     >>
>     >> cluster-recheck-interval=2s
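
To make the interplay of those three values a bit more concrete, here
is a toy Python model. It is not Pacemaker code and the function name
is made up. It assumes the documented behaviour that a node is excluded
once fail-count reaches migration-threshold, and that an expired
failure is only acted on at the next recheck, which in the absence of
other cluster events happens no more often than
cluster-recheck-interval.

```python
import math

def banned_window(fail_time, failure_timeout, recheck_interval):
    """Toy model: (start, end) of the period a node stays excluded after
    a single failure with migration-threshold=1.

    Assumption: expiry is only noticed at periodic rechecks firing at
    multiples of recheck_interval; rechecks triggered by other cluster
    events are ignored here.
    """
    expires = fail_time + failure_timeout
    # First periodic recheck at or after the expiry time.
    next_recheck = math.ceil(expires / recheck_interval) * recheck_interval
    return fail_time, next_recheck

# Values from this thread: failure-timeout=1s, cluster-recheck-interval=2s.
start, end = banned_window(fail_time=10.5, failure_timeout=1, recheck_interval=2)
print(f"node excluded from t={start}s until t={end}s")
```

In this simplified picture a failure at t=10.5s keeps the node excluded
until the recheck at t=12s, so the effective window can be longer than
failure-timeout itself.
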
>     >>
>     >>
>     >>         first, set "two_node: 1" in corosync.conf and let
>     >>         no-quorum-policy default in pacemaker
>     >>
>     >>
>     >> This is already configured.
>     >> # cat /etc/corosync/corosync.conf
>     >> totem {
>     >>     version: 2
>     >>     secauth: off
>     >>     cluster_name: SVSDEHA
>     >>     transport: udpu
>     >>     token: 5000
>     >> }
>     >>
>     >>
>     >> nodelist {
>     >>     node {
>     >>         ring0_addr: 2.0.0.10
>     >>         nodeid: 1
>     >>     }
>     >>
>     >>
>     >>     node {
>     >>         ring0_addr: 2.0.0.11
>     >>         nodeid: 2
>     >>     }
>     >> }
>     >>
>     >>
>     >> quorum {
>     >>     provider: corosync_votequorum
>     >>     two_node: 1
>     >> }
>     >>
>     >>
>     >> logging {
>     >>     to_logfile: yes
>     >>     logfile: /var/log/cluster/corosync.log
>     >>     to_syslog: yes
>     >> }
>     >>
>     >>
>     >>         let no-quorum-policy default in pacemaker; then,
>     >>         get stonith configured, tested, and enabled
>     >>
>     >>
>     >> By not configuring no-quorum-policy, would it ignore quorum for a 2
>     >> node cluster?
>     > With two_node, corosync always provides quorum to pacemaker, so
>     > pacemaker doesn't see any quorum loss. The only significant difference
>     > from ignoring quorum is that corosync won't form a cluster from a cold
>     > start unless both nodes can reach each other (a safety feature).
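
The behaviour Ken describes can be sketched as a small truth table. The
following Python function is a toy model only, not corosync code; the
function and parameter names are invented, and it assumes one vote per
node and that two_node implies wait_for_all, as the votequorum
documentation describes.

```python
def is_quorate(nodes_seen_now, all_seen_once, two_node=True, expected_votes=2):
    """Toy model of corosync votequorum for a 2-node cluster (one vote per node).

    Assumption: with two_node set, wait_for_all is in effect, so quorum is
    withheld until both nodes have been visible at the same time at least
    once since startup.
    """
    if two_node:
        if not all_seen_once:
            return False            # cold start: wait for the peer
        return nodes_seen_now >= 1  # afterwards a single survivor keeps quorum
    # Ordinary majority rule without two_node.
    return nodes_seen_now > expected_votes // 2

print(is_quorate(1, all_seen_once=False))                 # cold start, peer unreachable
print(is_quorate(1, all_seen_once=True))                  # peer lost after cluster formed
print(is_quorate(1, all_seen_once=True, two_node=False))  # plain majority rule
```

The second case is why pacemaker never sees a quorum loss here, and it
is also why fencing matters: both nodes can consider themselves quorate
at the same time after a split.
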
>     >
>     >> For my use case I don't need stonith enabled. My intention is to have
>     >> a highly available system all the time.
>     > Stonith is the only way to recover from certain types of failure, such
>     > as the "split brain" scenario, and a resource that fails to stop.
>     >
>     > If your nodes are physical machines with hardware watchdogs, you can set
>     > up sbd for fencing without needing any extra equipment.
>
>     Small caveat here:
>     If I understand correctly, you have a 2-node setup. In that case the
>     watchdog-only sbd setup would not be usable, as it relies on 'real'
>     quorum. In 2-node setups sbd needs at least a single shared disk.
>     For the single-disk sbd setup to work with 2 nodes, you need the patch
>     from https://github.com/ClusterLabs/sbd/pull/23 in place. (I saw you
>     mention the RHEL documentation: RHEL 7.4 has included it since GA.)
>
>     Regards,
>     Klaus
>
>     >
>     >> I will test my RA again as suggested with no-quorum-policy=default.
>     >>
>     >>
>     >> One more doubt.
>     >> Why do we see this in 'pcs property'?
>     >> last-lrm-refresh: 1504090367
>     >>
>     >>
>     >>
>     >> Never seen this on a healthy cluster.
>     >> From RHEL documentation:
>     >> last-lrm-refresh
>     >>
>     >> Last refresh of the Local Resource Manager, given in units of
>     >> seconds since epoch. Used for diagnostic purposes; not
>     >> user-configurable.
>     >>
>     >>
>     >> Doesn't explain much.
>     > Whenever a cluster property changes, the cluster rechecks the current
>     > state to see if anything needs to be done. last-lrm-refresh is just a
>     > dummy property that the cluster uses to trigger that. It's set in
>     > certain rare circumstances when a resource cleanup is done. You should
>     > see a line in your logs like "Triggering a refresh after ... deleted ...
>     > from the LRM". That might give some idea of why.
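
Since the value is plain seconds since the epoch, converting it tells
you when to look in the logs for that "Triggering a refresh" line. A
minimal Python sketch (the helper name epoch_to_utc is made up; the
result is in UTC, so it has to be shifted to the local time zone of the
node's logs):

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# last-lrm-refresh value from the 'pcs property' output in this thread
print(epoch_to_utc(1504090367).isoformat())
```

For 1504090367 this gives 2017-08-30T10:52:47+00:00.
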
>     >
>     >> Also, does average CPU load impact resource monitoring?
>     >>
>     >>
>     >> Regards,
>     >> Abhay
>     > Well, it could cause the monitor to take so long that it times out. The
>     > only direct effect of load on pacemaker is that the cluster might lower
>     > the number of agent actions that it can execute simultaneously.
>     >
>     >
>     >> On Thu, 31 Aug 2017 at 20:11 Ken Gaillot <kgaillot at redhat.com> wrote:
>     >>
>     >>         On Thu, 2017-08-31 at 06:41 +0000, Abhay B wrote:
>     >>         > Hi,
>     >>         >
>     >>         >
>     >>         > I have a 2 node HA cluster configured on CentOS 7 with the
>     >>         > pcs command.
>     >>         >
>     >>         >
>     >>         > Below are the properties of the cluster :
>     >>         >
>     >>         >
>     >>         > # pcs property
>     >>         > Cluster Properties:
>     >>         >  cluster-infrastructure: corosync
>     >>         >  cluster-name: SVSDEHA
>     >>         >  cluster-recheck-interval: 2s
>     >>         >  dc-deadtime: 5
>     >>         >  dc-version: 1.1.15-11.el7_3.5-e174ec8
>     >>         >  have-watchdog: false
>     >>         >  last-lrm-refresh: 1504090367
>     >>         >  no-quorum-policy: ignore
>     >>         >  start-failure-is-fatal: false
>     >>         >  stonith-enabled: false
>     >>         >
>     >>         >
>     >>         > PFA the cib.
>     >>         > Also attached is the corosync.log around the time the below
>     >>         > issue happened.
>     >>         >
>     >>         >
>     >>         > After around 10 hrs and multiple failures, pacemaker stops
>     >>         > monitoring the resource on one of the nodes in the cluster.
>     >>         >
>     >>         >
>     >>         > So even though the resource on the other node fails, it is
>     >>         > never migrated to the node on which the resource is not
>     >>         > monitored.
>     >>         >
>     >>         >
>     >>         > Wanted to know what could have triggered this and how to avoid
>     >>         > getting into such scenarios.
>     >>         > I am going through the logs and couldn't find why this
>     >>         > happened.
>     >>         >
>     >>         >
>     >>         > After this log the monitoring stopped.
>     >>         >
>     >>         > Aug 29 11:01:44 [16500] TPC-D12-10-002.phaedrus.sandvine.com
>     >>         > crmd:     info: process_lrm_event:   Result of monitor
>     >>         > operation for SVSDEHA on TPC-D12-10-002.phaedrus.sandvine.com:
>     >>         > 0 (ok) | call=538 key=SVSDEHA_monitor_2000 confirmed=false
>     >>         > cib-update=50013
>     >>
>     >>         Are you sure the monitor stopped? Pacemaker only logs
>     >>         recurring monitors when the status changes. Any successful
>     >>         monitors after this wouldn't be logged.
>     >>
>     >>         > Below log says the resource is leaving the cluster.
>     >>         > Aug 29 11:01:44 [16499] TPC-D12-10-002.phaedrus.sandvine.com
>     >>         > pengine:     info: LogActions:  Leave   SVSDEHA:0  (Slave
>     >>         > TPC-D12-10-002.phaedrus.sandvine.com)
>     >>
>     >>         This means that the cluster will leave the resource where it
>     >>         is (i.e. it doesn't need a start, stop, move, demote, promote,
>     >>         etc.).
>     >>
>     >>         > Let me know if anything more is needed.
>     >>         >
>     >>         >
>     >>         > Regards,
>     >>         > Abhay
>     >>         >
>     >>         >
>     >>         > PS: 'pcs resource cleanup' brought the cluster back into a
>     >>         > good state.
>     >>
>     >>         There are a lot of resource action failures, so I'm not sure
>     >>         where the issue is, but I'm guessing it has to do with
>     >>         migration-threshold=1 -- once a resource has failed once on a
>     >>         node, it won't be allowed back on that node until the failure
>     >>         is cleaned up. Of course you also have failure-timeout=1s,
>     >>         which should clean it up immediately, so I'm not sure.
>     >>
>     >>         My gut feeling is that you're trying to do too many things at
>     >>         once. I'd start over from scratch and proceed more slowly:
>     >>         first, set "two_node: 1" in corosync.conf and let
>     >>         no-quorum-policy default in pacemaker; then, get stonith
>     >>         configured, tested, and enabled; then, test your resource
>     >>         agent manually on the command line to make sure it conforms to
>     >>         the expected return values
>     >>         (http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf);
>     >>         then add your resource to the cluster without
>     >>         migration-threshold or failure-timeout, and work out any issues
>     >>         with frequent failures; then finally set migration-threshold
>     >>         and failure-timeout to reflect how you want recovery to proceed.
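
When checking the agent's exit statuses by hand, a lookup table of the
OCF return codes can be handy. The codes below are the standard OCF
ones; the Python helper describe_rc is just a made-up convenience for
illustration.

```python
# Return codes defined by the OCF resource agent API, as used by Pacemaker.
OCF_RETURN_CODES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
    8: "OCF_RUNNING_MASTER",
    9: "OCF_FAILED_MASTER",
}

def describe_rc(rc):
    """Map an agent exit status to its OCF name (hypothetical helper)."""
    return OCF_RETURN_CODES.get(rc, f"unknown ({rc})")

# The meta-data action in the newer logs exited with rc=4.
print(describe_rc(4))
```

A meta-data action is expected to return 0, so the rc=4 seen in the
newer logs for SvsdeStateful_meta-data_0 looks like one of the things
worth checking in the agent.
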
>     >>         --
>     >>         Ken Gaillot <kgaillot at redhat.com>
>     >>
>     >>
>     >>
>     >>
>     >>
>     >>         _______________________________________________
>     >>         Users mailing list: Users at clusterlabs.org
>     >>         http://lists.clusterlabs.org/mailman/listinfo/users
>     >>
>     >>         Project Home: http://www.clusterlabs.org
>     >>         Getting started:
>     >>         http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>     >>         Bugs: http://bugs.clusterlabs.org
