[ClusterLabs] cron-suitable cluster status check
Ken Gaillot
kgaillot at redhat.com
Mon Feb 29 16:52:38 CET 2016
On 02/27/2016 03:56 PM, Devin Reade wrote:
> Right now in a test cluster on CentOS 7 I'm occasionally seeing
> resource monitoring failures and, just today, a failure to start
> a fencing agent. While I need to track those problems down, the
> issue I want to discuss here is being notified when there is a
> problem with the cluster, where there is not a nagios-type monitoring
> system in place.
>
> On an older CentOS 5 cluster I have a cron job that periodically runs
> 'crm_verify -LV'. If the return code is non-zero, the output of
> that command (and some other info) is mailed to the operator. That
> mechanism has been working well for years.
>
> However on CentOS 7, when the cluster gets into this state 'crm_verify -LV'
> returns zero, and its output claims there is no problem. However in
> 'crm_mon -f' I can see that I've got resource failures and nonzero
> failcounts.
>
> I tried 'pcs cluster status', however when the cluster is properly
> working (no failures), that command still has a return code of '1',
> probably because I get the 'Error: no nodes found in corosync.conf'
> which is an ignorable condition per
> <https://access.redhat.com/solutions/663283>.
>
> Is there a command that I can run from cron in the current cluster
> tools to tell me the simple answer of whether there is *anything*
> failed in the cluster, preferably based on its return code?
I'm not sure about the CentOS 5 days, but at least now, crm_verify is
intended to verify the syntax of a cluster's configuration rather than
its status.
The simplest method is "crm_mon -s", which gives a one-line
nagios-compatible output with return code 0=success and 1=problem.
However, it returns 1 when the cluster is not running, there is no DC,
or nodes are offline.
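
As a rough sketch of how that could be wired into cron (not a tested
tool): the Python below runs "crm_mon -s", stays silent when the return
code is 0, and prints the one-line summary otherwise. It assumes cron is
set up to mail whatever a job prints. The function names and the "run"
parameter are made up for illustration; only the crm_mon invocation
comes from the discussion above.

```python
import subprocess


def simple_status(run=subprocess.run):
    """Run `crm_mon -s` and return (exit_code, one_line_summary).

    `run` is injectable so the logic can be exercised without a cluster.
    """
    proc = run(
        ["crm_mon", "-s"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout.strip()


def main(run=subprocess.run):
    """Print nothing when healthy; print the summary and return nonzero otherwise."""
    code, summary = simple_status(run)
    if code != 0:
        # Assumption: cron mails any output, so printing acts as the alert.
        print("cluster check failed (crm_mon -s exit {}): {}".format(code, summary))
    return code
```

To use it from a crontab entry, the script would end with something like
"import sys; sys.exit(main())" so the job's exit status mirrors crm_mon's.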
Back in the day, I used check_crm with nagios/icinga. It's a perl script
that parses the output of crm_mon -1rf and crm configure show. It's
trivial to use such a check outside a monitoring system, and it could be
modified to work with pcs and current crm_mon output, so maybe it could
help:
https://exchange.nagios.org/directory/Plugins/Clustering-and-High-2DAvailability/Check-CRM/details
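
For a rough idea of what that kind of output parsing involves, here is a
minimal Python sketch (not check_crm itself, and not a port of it) that
scans the text of "crm_mon -1rf" for signs of trouble. The two patterns
are assumptions about what the text output contains; the format varies
between Pacemaker versions, so they would need checking against real
output on the cluster in question. The function name is hypothetical.

```python
import re

# Assumed patterns; verify against your own crm_mon output before relying on them.
FAILCOUNT_RE = re.compile(r"fail-count=(\S+)")
FAILED_ACTIONS_RE = re.compile(r"^\s*Failed actions:", re.IGNORECASE)


def find_problems(crm_mon_text):
    """Return a list of human-readable problems found in crm_mon text output."""
    problems = []
    for line in crm_mon_text.splitlines():
        match = FAILCOUNT_RE.search(line)
        if match and match.group(1) != "0":
            # Any value other than 0 (including INFINITY) counts as a failure.
            problems.append("nonzero failcount: " + line.strip())
        elif FAILED_ACTIONS_RE.match(line):
            problems.append("failed actions are listed")
    return problems
```

An empty list would mean nothing matched, which is only as trustworthy
as the patterns; a cron job could mail the list whenever it is non-empty.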
> The CentOS 7 cluster is running:
> corosync 2.3.4
> pacemaker 1.1.13
>
> The CentOS 5 cluster is running:
> corosync 1.2.7
> pacemaker 1.0.12
>
> The corosync.conf is included below:
>
> --------- cut here and be careful of pointy scissors ---------
> totem {
>     version: 2
>     #secauth: off
>     cluster_name: somecluster
>     #transport: udpu
>     rrp_mode: passive
>     crypto_hash: sha256
>     clear_node_high_bit: yes
>
>     interface {
>         ringnumber: 0
>         bindnetaddr: 192.168.1.0
>         mcastaddr: 239.192.0.5
>         mcastport: 5406
>     }
>     interface {
>         ringnumber: 1
>         bindnetaddr: 192.168.2.0
>         mcastaddr: 239.192.0.6
>         mcastport: 5408
>     }
> }
>
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
>     expected_votes: 2
> }
>
> logging {
>     to_syslog: yes
> }
>
> --------- cut here and be careful of pointy scissors ---------
>
> Devin