[Pacemaker] Resource attempting to start on unauthorized node causing running master/slave to shut down

Thu Jun 21 02:07:54 UTC 2012

On Fri, Jun 15, 2012 at 12:15 AM, Mike Roest <msroest at gmail.com> wrote:
> Hey everyone,
>     We had an interesting issue happen the other night on one of our
> clusters.  A resource attempted to start on an unauthorized node (and
> failed),

Just from the logs below, that does not seem to be the case.
What I see is pacemaker attempting to determine the state of
Postgres-Server and Postgres-IP-1 on dbquorum.example.com and those
operations failing:

> WARN: unpack_rsc_op: Processing failed op Postgres-Server-1:0_monitor_0
> on dbquorum.example.com: unknown error (1)
> WARN: unpack_rsc_op: Processing failed op Postgres-IP-1_monitor_0
> on dbquorum.example.com: unknown error (1)

Under such conditions, pacemaker cannot tell whether the resources are
running there, so it must assume that they are active and initiate
recovery.
Your real question should be: why did the monitor op fail with rc=1
(generic error) instead of rc=7 (not running) for those two resources
on dbquorum?
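
To make the distinction concrete, here is a rough sketch of how a probe
(monitor_0) result ends up being read. The numeric codes are the standard
OCF return codes; the function name and the state labels are made up for
illustration, and this is a simplified model, not pacemaker's actual code.

```python
# Simplified model of how a probe (monitor_0) result is interpreted.
# The numeric values are the standard OCF return codes; the function
# and the state labels are illustrative only, not pacemaker code.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7
OCF_RUNNING_MASTER = 8


def interpret_probe(rc):
    """Map a probe return code to the state the cluster has to assume."""
    if rc == OCF_NOT_RUNNING:
        return "stopped"  # cleanly not running, nothing to recover
    if rc == OCF_SUCCESS:
        return "running"
    if rc == OCF_RUNNING_MASTER:
        return "running (master)"
    # Any other code is a failure: the real state is unknown, so the
    # resource has to be treated as possibly active and recovered.
    return "failed, assumed active"


for rc in (OCF_NOT_RUNNING, OCF_SUCCESS, OCF_RUNNING_MASTER, OCF_ERR_GENERIC):
    print(rc, "->", interpret_probe(rc))
```

In other words, an agent that returns 1 from monitor on a node where the
service simply is not present ends up in the last branch rather than the
first one.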

> which caused the real resource, already running on a different
> node, to become orphaned and subsequently shut down.
>
> Some background:
> We're running pacemaker 1.0.12, corosync 1.2.7 on CentOS 5.8 x64
>
> The cluster has 3 members:
> pgsql1c & pgsql1d are physical machines running dual Xeon X5650s with 32
> GB of RAM
> dbquorum which is a VM running on VMware ESX server on HP Blade hardware.
>
> The 2 physical machines are configured to be master/slave postgres servers,
> the vm machine is only there for quorum - it should never run any resources.
>   The full crm configuration is available in this zip (as a link to allow
> the email to post correctly)
> https://www.dropbox.com/s/tc4hz22lw738y06/Jun7Logs.zip
>
> On the dbquorum VM we got the following log message:
> Jun 07 03:11:10 corosync [TOTEM ] Process pause detected for 598 ms,
> flushing membership messages.
>
> After this it appears that, even though the Cluster-Postgres-Server-1 and
> Postgres-IP-1 resources are only set up to run on pgsql1c/d, the dbquorum
> box tried to start them up:
>
> WARN: unpack_rsc_op: Processing failed op Postgres-Server-1:0_monitor_0
> on dbquorum.example.com: unknown error (1)
> WARN: unpack_rsc_op: Processing failed op Postgres-IP-1_monitor_0
> on dbquorum.example.com: unknown error (1)
> info: find_clone: Internally renamed Postgres-Server-1:0
> on pgsql1c.example.com to Postgres-Server-1:1
> info: find_clone: Internally renamed Postgres-Server-1:1
> on pgsql1d.example.com to Postgres-Server-1:2 (ORPHAN)
> WARN: process_rsc_state: Detected active orphan Postgres-Server-1:2 running
> on pgsql1d.example.com
> ERROR: native_add_running: Resource ocf::IPaddr2:Postgres-IP-1 appears to be
> active on 2 nodes.
> WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
> information.
> notice: native_print: Postgres-IP-1 (ocf::heartbeat:IPaddr2) Started  FAILED
> notice: native_print:  0 : dbquorum.example.com
> notice: native_print:  1 : pgsql1d.example.com
> notice: clone_print:  Master/Slave Set: Cluster-Postgres-Server-1
> notice: native_print:      Postgres-Server-1:0 (ocf::custom:pgsql):
> Slave dbquorum.example.com FAILED
> notice: native_print:      Postgres-Server-1:2 (ocf::custom:pgsql):
>  ORPHANED Master pgsql1d.example.com
> notice: short_print:      Slaves: [ pgsql1c.example.com ]
> ERROR: common_apply_stickiness: Postgres-Server-1:0[pgsql1c.example.com] =
> 100
> ERROR: common_apply_stickiness: Postgres-Server-1:0[pgsql1d.example.com] =
> 100
> ERROR: clone_color: Postgres-Server-1:0 is running
> on dbquorum.example.com which isn't allowed
> info: native_color: Stopping orphan resource Postgres-Server-1:2
>
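
When going through logs like the ones quoted above, it can help to pull
the resource, operation, interval, node and return code out of each
"Processing failed op" line. The sketch below does that with a regular
expression written against the message text shown in this thread; it
assumes each message has been rejoined onto a single line (the mail client
wrapped them), and the function name is hypothetical.

```python
import re

# Pattern written against the "Processing failed op" messages quoted in
# this thread, assuming each message is on one line (unwrapped).
FAILED_OP_RE = re.compile(
    r"Processing failed op (?P<rsc>\S+)_(?P<op>[a-z]+)_(?P<interval>\d+)"
    r" on (?P<node>\S+): (?P<reason>.+) \((?P<rc>\d+)\)"
)


def parse_failed_op(line):
    """Return a dict describing the failed operation, or None if no match."""
    m = FAILED_OP_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    info["interval"] = int(info["interval"])
    info["rc"] = int(info["rc"])
    # A monitor with interval 0 is a probe, not a recurring health check.
    info["is_probe"] = info["op"] == "monitor" and info["interval"] == 0
    return info


lines = [
    "WARN: unpack_rsc_op: Processing failed op Postgres-Server-1:0_monitor_0"
    " on dbquorum.example.com: unknown error (1)",
    "WARN: unpack_rsc_op: Processing failed op Postgres-IP-1_monitor_0"
    " on dbquorum.example.com: unknown error (1)",
]
parsed = [parse_failed_op(line) for line in lines]
for p in parsed:
    print(p["rsc"], p["node"], p["rc"], p["is_probe"])
```

Both entries come out as monitor operations with interval 0 on dbquorum
with rc=1, i.e. failed state checks rather than start attempts.
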
> The stopping of the orphaned resource caused our master to stop; luckily
> the slave correctly got promoted to master and we had no outage.
>
> There seem to be several things that went wrong here:
> 1. The VM pause - doing some searching I found some posts about the pause
> message and VMs.  We've upped the priority of our dbquorum box on the VM
> host.  The other posts talk about the token configuration option in totem,
> but we haven't set that, so it should be at the default of 1000ms; it
> doesn't seem likely that changing this setting would have made any
> difference in this situation.  We looked at the VM host and couldn't see
> anything on the physical host at the time that would cause this pause.
> 2. The quorum machine tried to start resources it is not authorized for -
> symmetric-cluster is set to false and there is no location entry for that
> node/resource...why would it try to start it?
> 3. The 2 machines that stayed up got corrupted when the 3rd came back - the
> 2 primary machines never lost quorum, so when the 3rd machine came back and
> told them it was now the postgres master, why would they believe it, and
> then shut down the proper master that they should know full well is the
> true master?  I would have expected the dbquorum machine's changes to have
> been rejected by the other 2 that had quorum.
>
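
On point 1, the comparison between the reported pause and the token value
can be checked mechanically. The sketch below extracts the pause duration
from the corosync message quoted earlier and compares it with the 1000ms
default mentioned in the report. It is only that comparison: it does not
model which threshold corosync uses to emit the message, and the helper
name is hypothetical.

```python
import re

# Matches the corosync TOTEM message quoted in the report.
PAUSE_RE = re.compile(r"Process pause detected for (\d+) ms")


def pause_ms(line):
    """Return the pause length in milliseconds, or None if not a pause line."""
    m = PAUSE_RE.search(line)
    return int(m.group(1)) if m else None


line = (
    "Jun 07 03:11:10 corosync [TOTEM ] Process pause detected for 598 ms, "
    "flushing membership messages."
)
token_ms = 1000  # default token value as stated in the report (assumption)

pause = pause_ms(line)
print("pause:", pause, "ms; below token:", pause < token_ms)
```

The pause is shorter than the token value given in the report, which is
consistent with the reporter's doubt that tuning token alone would have
changed the outcome.
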
> The logs and config are in this zip:
> https://www.dropbox.com/s/tc4hz22lw738y06/Jun7Logs.zip
> pgsql1d was the DC at the time of the issue.
>
> If anyone has any ideas as to why this happened and/or changes we can make
> to our config to prevent it happening again that would be great.
>
> Thanks!