[Pacemaker] Resource attempting to start on unauthorized node causing running master/slave to shut down

Mike Roest msroest at gmail.com
Thu Jun 14 10:15:49 EDT 2012


Hey everyone,
    We had an interesting issue the other night on one of our
clusters.  A resource attempted to start on an unauthorized node (and
failed), which caused the real instance, already running on a different
node, to be treated as an orphan and subsequently shut down.

Some background:
We're running Pacemaker 1.0.12 and Corosync 1.2.7 on CentOS 5.8 x86_64.

The cluster has 3 members:
pgsql1c & pgsql1d are physical machines, each with dual Xeon X5650s and 32
GB of RAM.
dbquorum is a VM running under VMware ESX on HP blade hardware.

The two physical machines are configured as master/slave Postgres servers;
the VM is only there for quorum and should never run any resources.  The
full crm configuration is available in this zip (as a link so the email
posts correctly):
https://www.dropbox.com/s/tc4hz22lw738y06/Jun7Logs.zip
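
For reference, the relevant pieces of the config look roughly like the
sketch below (crm shell syntax). The resource and node names match the logs;
the scores, operation intervals, agent parameters and the placeholder IP are
just illustrative - the real values are in the zip above.

# asymmetric cluster: resources only run where a location rule allows them
property symmetric-cluster="false"

# master/slave Postgres pair plus its service IP (params are placeholders)
primitive Postgres-IP-1 ocf:heartbeat:IPaddr2 \
    params ip="192.0.2.10" cidr_netmask="24" \
    op monitor interval="30s"
primitive Postgres-Server-1 ocf:custom:pgsql \
    op monitor interval="30s" role="Slave"
ms Cluster-Postgres-Server-1 Postgres-Server-1 \
    meta master-max="1" clone-max="2" notify="true"

# only the two physical nodes may run the Postgres resources
location pg-on-c Cluster-Postgres-Server-1 100: pgsql1c.example.com
location pg-on-d Cluster-Postgres-Server-1 100: pgsql1d.example.com
location ip-on-c Postgres-IP-1 100: pgsql1c.example.com
location ip-on-d Postgres-IP-1 100: pgsql1d.example.com
colocation ip-with-master inf: Postgres-IP-1 Cluster-Postgres-Server-1:Master
order promote-then-ip inf: Cluster-Postgres-Server-1:promote Postgres-IP-1:start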

On the dbquorum VM we got the following log message:
Jun 07 03:11:10 corosync [TOTEM ] Process pause detected for 598 ms,
flushing membership messages.

After this, it appears that the dbquorum box tried to start the
Cluster-Postgres-Server-1 and Postgres-IP-1 resources, even though they are
only set up to run on pgsql1c/d:

WARN: unpack_rsc_op: Processing failed op Postgres-Server-1:0_monitor_0 on
dbquorum.example.com: unknown error (1)
WARN: unpack_rsc_op: Processing failed op Postgres-IP-1_monitor_0 on
dbquorum.example.com: unknown error (1)
info: find_clone: Internally renamed Postgres-Server-1:0 on
pgsql1c.example.com to Postgres-Server-1:1
info: find_clone: Internally renamed Postgres-Server-1:1 on
pgsql1d.example.com to Postgres-Server-1:2 (ORPHAN)
WARN: process_rsc_state: Detected active orphan Postgres-Server-1:2 running
on pgsql1d.example.com
ERROR: native_add_running: Resource ocf::IPaddr2:Postgres-IP-1 appears to
be active on 2 nodes.
WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
information.
notice: native_print: Postgres-IP-1 (ocf::heartbeat:IPaddr2) Started  FAILED
notice: native_print:  0 : dbquorum.example.com
notice: native_print:  1 : pgsql1d.example.com
notice: clone_print:  Master/Slave Set: Cluster-Postgres-Server-1
notice: native_print:      Postgres-Server-1:0 (ocf::custom:pgsql): Slave
dbquorum.example.com FAILED
notice: native_print:      Postgres-Server-1:2 (ocf::custom:pgsql):  ORPHANED
Master pgsql1d.example.com
notice: short_print:      Slaves: [ pgsql1c.example.com ]
ERROR: common_apply_stickiness: Postgres-Server-1:0[pgsql1c.example.com] =
100
ERROR: common_apply_stickiness: Postgres-Server-1:0[pgsql1d.example.com] =
100
ERROR: clone_color: Postgres-Server-1:0 is running on
dbquorum.example.com which
isn't allowed
info: native_color: Stopping orphan resource Postgres-Server-1:2

Stopping the orphaned resource caused our master to stop; luckily the slave
was correctly promoted to master and we had no outage.

There seem to be several things that went wrong here:
1. The VM pause - searching around, I found some posts relating the pause
message to VMs.  We've raised the priority of our dbquorum box on the VM
host.  The other posts talk about the token option in the totem
configuration, but we haven't set it, so it should be at the default of
1000 ms, and it doesn't seem likely that changing it would have made any
difference in this situation (I've put a sketch of the setting after this
list in case I'm misreading how it works).  We also looked at the physical
VM host and couldn't see anything at the time that would explain the pause.
2. The quorum machine tried to start resources it is not authorized to run -
symmetric-cluster is set to false and there is no location entry for that
node/resource combination, so why would it try to start them?  (An explicit
ban is sketched after this list.)
3. The two machines that stayed up got corrupted when the third came back -
the two primary machines never lost quorum, so when the third machine came
back and told them it was now the Postgres master, why would they believe
it, and then shut down the proper master that they should know full well is
the true master?  I would have expected the dbquorum machine's changes to be
rejected by the other two nodes, which had quorum.
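
For completeness, the token setting mentioned in point 1 lives in the totem
section of corosync.conf. A minimal sketch of what raising it would look
like (the 3000 ms value is just an example, not something we've tested):

totem {
        version: 2
        # Token loss timeout in milliseconds; the corosync default is 1000.
        # A larger value gives a briefly paused VM more headroom before the
        # other nodes declare it dead.
        token: 3000

        # (existing interface/secauth settings left unchanged)
}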
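
And for point 2, an explicit -inf ban on the quorum node would look
something like this (constraint names made up).  As I understand it,
symmetric-cluster=false should already keep these resources off dbquorum,
but belt and braces:

location ban-pg-on-quorum Cluster-Postgres-Server-1 -inf: dbquorum.example.com
location ban-ip-on-quorum Postgres-IP-1 -inf: dbquorum.example.com

If there's a better way to keep these resources away from the quorum node
entirely, I'd love to hear it.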

The logs and config are in this zip:
https://www.dropbox.com/s/tc4hz22lw738y06/Jun7Logs.zip. pgsql1d was the DC
at the time of the issue.

If anyone has any ideas about why this happened, and/or changes we can make
to our config to prevent it from happening again, that would be great.

Thanks!