[Pacemaker] asymmetric clusters, remote nodes, and monitor operations

Andrew Beekhof andrew at beekhof.net
Tue Nov 12 19:31:48 EST 2013


On 12 Sep 2013, at 3:44 am, Lindsay Todd <rltodd.ml1 at gmail.com> wrote:

> What I am seeing in the syslog are messages like:
> 
> Sep 11 13:19:52 db02 pacemaker_remoted[1736]:   notice: operation_finished: p-mysql_monitor_20000:19398:stderr [ 2013/09/11_13:19:52 INFO: MySQL monitor succeeded ]
> Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql-slurm on cvmh02: not installed (5)
> Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: unpack_rsc_op: Preventing p-mysql-slurm from re-starting on cvmh02: operation monitor failed 'not installed' (5)
> Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql-slurm on cvmh03: not installed (5)
> Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: unpack_rsc_op: Preventing p-mysql-slurm from re-starting on cvmh03: operation monitor failed 'not installed' (5)
> Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql-slurm on cvmh01: not installed (5)
> Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: unpack_rsc_op: Preventing p-mysql-slurm from re-starting on cvmh01: operation monitor failed 'not installed' (5)
> Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql-slurm on cvmh02: not installed (5)
> Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: unpack_rsc_op: Preventing p-mysql-slurm from re-starting on cvmh02: operation monitor failed 'not installed' (5)
> Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql-slurm on cvmh03: not installed (5)
> Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: unpack_rsc_op: Preventing p-mysql-slurm from re-starting on cvmh03: operation monitor failed 'not installed' (5)
> Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql-slurm on cvmh01: not installed (5)
> Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: unpack_rsc_op: Preventing p-mysql-slurm from re-starting on cvmh01: operation monitor failed 'not installed' (5)
> Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: LogActions: Start   p-mysql#011(db02)
> Sep 11 13:20:08 cvmh03 crmd[4833]:   notice: te_rsc_command: Initiating action 48: monitor p-mysql_monitor_0 on cvmh03 (local)
> Sep 11 13:20:08 cvmh03 crmd[4833]:   notice: te_rsc_command: Initiating action 46: monitor p-mysql_monitor_0 on cvmh02
> Sep 11 13:20:08 cvmh03 crmd[4833]:   notice: te_rsc_command: Initiating action 44: monitor p-mysql_monitor_0 on cvmh01
> Sep 11 13:20:08 cvmh03 mysql(p-mysql)[12476]: ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
> Sep 11 13:20:08 cvmh03 crmd[4833]:   notice: process_lrm_event: LRM operation p-mysql_monitor_0 (call=907, rc=5, cib-update=701, confirmed=true) not installed
> Sep 11 13:20:08 cvmh02 mysql(p-mysql)[17158]: ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
> Sep 11 13:20:08 cvmh01 mysql(p-mysql)[5968]: ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
> Sep 11 13:20:08 cvmh02 crmd[5081]:   notice: process_lrm_event: LRM operation p-mysql_monitor_0 (call=332, rc=5, cib-update=164, confirmed=true) not installed
> Sep 11 13:20:08 cvmh01 crmd[5169]:   notice: process_lrm_event: LRM operation p-mysql_monitor_0 (call=319, rc=5, cib-update=188, confirmed=true) not installed
> Sep 11 13:20:08 cvmh03 crmd[4833]:  warning: status_from_rc: Action 48 (p-mysql_monitor_0) on cvmh03 failed (target: 7 vs. rc: 5): Error
> Sep 11 13:20:08 cvmh03 crmd[4833]:  warning: status_from_rc: Action 46 (p-mysql_monitor_0) on cvmh02 failed (target: 7 vs. rc: 5): Error
> Sep 11 13:20:08 cvmh03 crmd[4833]:  warning: status_from_rc: Action 44 (p-mysql_monitor_0) on cvmh01 failed (target: 7 vs. rc: 5): Error
> Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql-slurm on cvmh02: not installed (5)
> Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: unpack_rsc_op: Preventing p-mysql-slurm from re-starting on cvmh02: operation monitor failed 'not installed' (5)
> Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql on cvmh02: not installed (5)
> ...
> Sep 11 13:20:08 cvmh03 crmd[4833]:   notice: te_rsc_command: Initiating action 150: start p-mysql_start_0 on db02
> Sep 11 13:20:08 db02 pacemaker_remoted[1736]:   notice: operation_finished: p-mysql_start_0:19427:stderr [ 2013/09/11_13:20:08 INFO: MySQL already running ]
> Sep 11 13:20:08 cvmh02 crmd[5081]:   notice: process_lrm_event: LRM operation p-mysql_start_0 (call=2600, rc=0, cib-update=165, confirmed=true) ok
> Sep 11 13:20:08 cvmh03 crmd[4833]:   notice: te_rsc_command: Initiating action 151: monitor p-mysql_monitor_20000 on db02
> Sep 11 13:20:09 db02 pacemaker_remoted[1736]:   notice: operation_finished: p-mysql_monitor_20000:19454:stderr [ 2013/09/11_13:20:09 INFO: MySQL monitor succeeded ]
> 
> So I guess they aren't "errors", but rather warnings, which is what we see in unpack_rsc_op_failure, and I do see that it makes OCF_NOT_INSTALLED a special case when the cluster is asymmetric -- after logging the warning.  Should the test move earlier in this function, and maybe return in that case?

I've moved that message further down into a block that is conditional on OCF_NOT_INSTALLED and pe_flag_symmetric_cluster:

   https://github.com/beekhof/pacemaker/commit/4b6def9
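
In case it helps, here is a rough, standalone sketch of the idea -- my own paraphrase for illustration, not the actual patch; handle_probe_result() and its arguments are made-up names: a probe that comes back "not installed" is only worth reporting when the cluster is symmetric, since on an opt-in cluster the agent was never expected on that node anyway.

   /* Toy illustration only -- not Pacemaker source.  It models the check
    * described above: a failed probe returning the OCF "not installed"
    * code (rc 5) is ignored on an asymmetric (opt-in) cluster and only
    * reported when symmetric-cluster=true. */
   #include <stdbool.h>
   #include <stdio.h>

   #define OCF_NOT_INSTALLED 5   /* OCF_ERR_INSTALLED, the rc=5 seen in the logs */

   static void
   handle_probe_result(const char *rsc, const char *node, int rc,
                       bool symmetric_cluster)
   {
       if (rc == OCF_NOT_INSTALLED && !symmetric_cluster) {
           /* Opt-in cluster: the agent is not expected here, so a probe
            * finding it missing is unremarkable -- stay quiet. */
           printf("debug: %s not installed on %s (expected on an "
                  "asymmetric cluster)\n", rsc, node);
           return;
       }
       printf("warning: Processing failed op monitor for %s on %s: "
              "not installed (%d)\n", rsc, node, rc);
   }

   int
   main(void)
   {
       handle_probe_result("p-mysql", "cvmh02", OCF_NOT_INSTALLED, false);
       handle_probe_result("p-mysql", "cvmh02", OCF_NOT_INSTALLED, true);
       return 0;
   }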

>  Also crm_mon reports errors:

The latest in git appears to have resolved this.
I'm reasonably sure it was this commit:

   https://github.com/beekhof/pacemaker/commit/a32474b

> 
> Failed actions:
>     p-mysql-slurm_monitor_0 on cvmh02 'not installed' (5): call=69, status=complete, last-rc-change='Tue Sep 10 15:52:57 2013', queued=31ms, exec=0ms
>     s-ldap_monitor_0 on cvmh02 'not installed' (5): call=289, status=Not installed, last-rc-change='Tue Sep 10 16:15:19 2013', queued=0ms, exec=0ms
>     p-mysql_monitor_0 on cvmh02 'not installed' (5): call=332, status=complete, last-rc-change='Wed Sep 11 13:20:08 2013', queued=40ms, exec=0ms
>     p-mysql-slurm_monitor_0 on cvmh03 'not installed' (5): call=325, status=complete, last-rc-change='Wed Sep  4 13:44:15 2013', queued=35ms, exec=0ms
>     s-ldap_monitor_0 on cvmh03 'not installed' (5): call=869, status=Not installed, last-rc-change='Tue Sep 10 16:15:19 2013', queued=0ms, exec=0ms
>     p-mysql_monitor_0 on cvmh03 'not installed' (5): call=907, status=complete, last-rc-change='Wed Sep 11 13:20:08 2013', queued=36ms, exec=0ms
>     p-mysql-slurm_monitor_0 on cvmh01 'not installed' (5): call=95, status=complete, last-rc-change='Tue Sep 10 15:48:15 2013', queued=95ms, exec=0ms
>     fence-cvmh02_start_0 on (null) 'unknown error' (1): call=-1, status=Timed Out, last-rc-change='Tue Sep 10 15:49:38 2013', queued=0ms, exec=0ms
>     fence-cvmh02_start_0 on cvmh01 'unknown error' (1): call=-1, status=Timed Out, last-rc-change='Tue Sep 10 15:49:38 2013', queued=0ms, exec=0ms
>     s-ldap_monitor_0 on cvmh01 'not installed' (5): call=279, status=Not installed, last-rc-change='Tue Sep 10 16:15:19 2013', queued=0ms, exec=0ms
>     p-mysql_monitor_0 on cvmh01 'not installed' (5): call=319, status=complete, last-rc-change='Wed Sep 11 13:20:08 2013', queued=42ms, exec=0ms
> 
> Almost all of these are instances of resources being probed on nodes they shouldn't be running on and aren't installed on, so they aren't really errors.  (I assume the crm_report has captured the location rules, as well as confirmed that the symmetric-cluster property is false.)  The resources do also start up on the nodes they should run on.
> 
> Previously I'd noticed that LSB resources probed on nodes that don't have the associated init script would fail; it looks like that is also getting reported as OCF_NOT_INSTALLED, so it is perhaps the same problem.
> 
> 
> On Wed, Sep 4, 2013 at 12:49 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
> 
> On 04/09/2013, at 6:18 AM, Lindsay Todd <rltodd.ml1 at gmail.com> wrote:
> 
> > We've been attempting to set up an asymmetric pacemaker cluster using remote cluster nodes, with pacemaker 1.1.10 (actually, building from git lately, currently at a4eb44f).  We use location constraints to enable resources to start on nodes they should start on, and rely on asymmetry to otherwise keep resources from starting.
> 
> You set symmetric-cluster=false, or assumed that was the default?
> 
> >
> > But we get many monitor operation failures.
> >
> > Resource monitor operations run on the real physical hosts, and frequently fail because not all the components are present on those hosts.  For instance, the mysql resource agent's monitor operation fails as "not installed", since mysql isn't installed on those systems, so the validate operation, which most (if not every) path through that agent runs, always fails.  I don't see failures on the remote nodes, even ones without mysql installed.
> >
> > Previously I'd noticed LSB resources had failed monitor operations on systems that didn't have the LSB init script installed.
> >
> > Presumably these monitor operations are happening to ensure the resource is NOT running where it should not be???
> 
> Correct. Although with symmetric-cluster=false it shouldn't show up as an error.
> Logs? crm_mon output?
> 
> >  There doesn't seem to be a way to set up location constraints to prevent this from happening, at least none that I've found.  I wrote an OCF wrapper RA to help me with LSB init scripts, but I'm not sure what to do about other RAs like mysql, short of maintaining my own version, unless there is a way to tune where "monitor" runs.  Or more likely I'm missing something ...
> >
> > It would seem to me that a "not installed" failure, OCF_ERR_INSTALLED, would not really be an error on a node that shouldn't be running that resource agent anyway, and is probably a pretty good indication that it isn't running.
> >
> > /Lindsay
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
