[Pacemaker] Pacemaker delays (long posting)

Andrew Beekhof andrew at beekhof.net
Tue Mar 5 20:15:46 EST 2013


On Wed, Mar 6, 2013 at 2:01 AM, Michael Powell
<Michael.Powell at harmonicinc.com> wrote:

> I have recently assumed the responsibility for maintaining code on one of
> my company’s products that uses Pacemaker/Heartbeat.  I’m still coming up
> to speed on this code, and would like to solicit comments about some
> particular behavior.  For reference, the Pacemaker version is 1.0.9.1, and
> Heartbeat is version 3.0.3.
>
>
> This product uses two host systems, each of which supports several disk
> enclosures, operating in an “active/passive” mode.  The two hosts are
> connected by redundant, dedicated 10Gb Ethernet links, which are used for
> messaging between them.  The disks in each enclosure are controlled by an
> instance of an application called SS.  If an “active” host’s SS application
> fails for some reason, then the corresponding application on the “passive”
> host will assume control of the disks.  Each application is assigned a
> Pacemaker resource, and the resource agent communicates with the
> appropriate SS instance.  For reference, here’s a sample crm_mon output:
>
> ============
> Last updated: Tue Mar  5 06:10:22 2013
> Stack: Heartbeat
> Current DC: mgraid-12241530rn01433-0
> (f4e5e15c-d06b-4e37-89b9-4621af05128f) - partition with quorum
> Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
> 2 Nodes configured, unknown expected votes
> 9 Resources configured.
> ============
>
> Online: [ mgraid-12241530rn01433-0 mgraid-12241530rn01433-1 ]
>
> Clone Set: Fencing
>      Started: [ mgraid-12241530rn01433-0 mgraid-12241530rn01433-1 ]
> Clone Set: cloneIcms
>      Started: [ mgraid-12241530rn01433-0 mgraid-12241530rn01433-1 ]
> Clone Set: cloneOmserver
>      Started: [ mgraid-12241530rn01433-0 mgraid-12241530rn01433-1 ]
> Master/Slave Set: ms-SS11451532RN01389
>      Masters: [ mgraid-12241530rn01433-1 ]
>      Slaves: [ mgraid-12241530rn01433-0 ]
> Master/Slave Set: ms-SS11481532RN01465
>      Masters: [ mgraid-12241530rn01433-0 ]
>      Slaves: [ mgraid-12241530rn01433-1 ]
> Master/Slave Set: ms-SS12171532RN01613
>      Masters: [ mgraid-12241530rn01433-0 ]
>      Slaves: [ mgraid-12241530rn01433-1 ]
> Master/Slave Set: ms-SS12241530RN01433
>      Masters: [ mgraid-12241530rn01433-0 ]
>      Slaves: [ mgraid-12241530rn01433-1 ]
> Master/Slave Set: ms-SS12391532RN01768
>      Masters: [ mgraid-12241530rn01433-0 ]
>      Slaves: [ mgraid-12241530rn01433-1 ]
> Master/Slave Set: ms-SS12391532RN01772
>      Masters: [ mgraid-12241530rn01433-0 ]
>      Slaves: [ mgraid-12241530rn01433-1 ]
>
> I’ve been investigating the system’s behavior when one or more master SS
> instances crash, simulated by a kill command.  I’ve noticed two
> behaviors of interest.
>
> First, in a simple case, where one master SS is killed, it takes about
> 10-12 seconds for the slave to complete the failover.  From the log files,
> the DC issues the following notifications to the slave SS:
>
> - Pre_notify_demote
> - Post_notify_demote
> - Pre_notify_stop
> - Post_notify_stop
> - Pre_notify_promote
> - Promote
> - Post_notify_promote
> - Monitor_3000
> - Pre_notify_start
> - Post_notify_start
>
> These notifications and their confirmations appear to take about 1-2
> seconds each, raising the following questions:
>
> - Is this sequence of notifications expected?
>

Yes, it looks correct (if sub-optimal) to me.
A more recent version might provide a better experience.


>
> - Is the 10-12 second timeframe expected?
>

It's really dependent on what the RA (resource agent) does with the
notification (and therefore how long it takes).
Do you need the notifications turned on?  Some agents like drbd do need it,
but without knowing which agents you're using it's hard to say.
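
For reference, the pre/post steps only exist because the master/slave
resource was configured with notifications enabled.  A rough crmsh sketch
of where that knob lives (the resource and agent names here are invented,
not taken from your configuration):

    # Hypothetical ms resource; removing notify="true" eliminates the
    # pre/post notify calls from every transition
    crm configure primitive SS-A ocf:vendor:ss \
            op monitor interval="10s" role="Master" \
            op monitor interval="11s" role="Slave"
    crm configure ms ms-SS-A SS-A \
            meta master-max="1" clone-max="2" notify="true"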


>
> Second, in a more complex case, where the master SS for each instance is
> assigned to the same can, and each SS is in turn killed with an
> approximate 10-second delay between kill commands, there appear to be
> very long delays in processing the notifications.  These delays appear to
> be associated with these factors:
>
> - After an SS instance is killed, there’s a 10-second monitor
> notification which causes a new SS instance to be launched to replace the
> missing SS instance.
>

Whoa... monitor restarts the service if it detects a failure?
That is rarely a good idea.
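
The usual division of labour is that monitor only reports state and the
cluster decides how to recover.  A minimal sketch of that pattern,
assuming your agent is an OCF shell script (ss_running and ss_is_master
are invented placeholders for whatever checks your agent really does):

    ss_monitor() {
        # Report state only; never try to repair it from here.
        # Recovery (restart or failover) is Pacemaker's decision.
        if ss_running; then                 # placeholder process check
            if ss_is_master; then           # placeholder role check
                return $OCF_RUNNING_MASTER  # 8: running as master
            fi
            return $OCF_SUCCESS             # 0: running as slave
        fi
        return $OCF_NOT_RUNNING             # 7: let the cluster recover it
    }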

>
> - It takes about 30 seconds for an SS instance to complete
> the startup process.  The resource agent waits for that startup to complete
> before returning to crmd.
>

Right, agents shouldn't say "done" until they really are.
Returning too soon usually just leads to people needing to insert
delays/sleeps into anything that depends on it.
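
In shell-agent terms, something like this, with the readiness probe doing
the real work (launch_ss_daemon and ss_ready are invented placeholders):

    ss_start() {
        launch_ss_daemon                    # placeholder for the real launch
        # Only report success once the service is genuinely usable;
        # size the start op timeout above the ~30 second worst case.
        while ! ss_ready; do                # placeholder readiness probe
            sleep 1
        done
        return $OCF_SUCCESS
    }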


>
> - Until the resource agent returns, crmd does not process
> notifications for any other SS/resource.
>
> The net effect of these delays varies from one SS instance to another.  In
> some cases, the “normal” failover occurs, taking 10-12 seconds.  In other
> cases, there is no failover to the other host’s SS instance, and there is
> no master/active SS instance for 1-2 *minutes* (until an SS instance is
> re-launched following the kill), depending upon the number of disk
> enclosures and thus the number of SS instances.
>
>
> My first question in this case is simply whether the serialization of
> notifications among the various SS resources is expected?
>

They happen before and after each operation on the master/slave resource.
So first we tell all the instances of X that one or more instances are
about to be $action'd, then we perform $action, then we tell any active
instances of X that we did it.
Then we move on to the next action (demote -> stop -> start -> promote).
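
Each of those steps reaches the agent as a "notify" action, with the
details in the standard OCF_RESKEY_CRM_meta_notify_* environment
variables.  A sketch of the dispatch, with logging added purely to show
where the 1-2 seconds per notification go (assumes the agent sources the
OCF shell functions; the function name is illustrative):

    ss_notify() {
        local type="$OCF_RESKEY_CRM_meta_notify_type"       # pre | post
        local op="$OCF_RESKEY_CRM_meta_notify_operation"    # start/stop/promote/demote
        # Timestamped logging makes the per-notification cost visible
        ocf_log info "notify: ${type}-${op} at $(date +%s.%N)"
        return $OCF_SUCCESS
    }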

> In other words, transition notifications for one resource are delayed
> until earlier notifications are completed.  Is this the expected behavior?
>

Only if you created ordering constraints between the two master/slave
resources.
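
For illustration, a constraint like the following (resource names
invented) would force that serialization; without one, the two sets'
transitions can run in parallel:

    # ms-SS-B cannot be promoted until the promote on ms-SS-A completes
    crm configure order promote-A-before-B inf: ms-SS-A:promote ms-SS-B:promote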


> Secondly, once the SS instance has been restarted,
>

By whom? The agent or Pacemaker?


> there’s apparently no attempt to complete the failover; the new SS
> instance assumes the active/master role.
>
> Finally, a couple of general questions:
>
> - Is there any reason to believe that a later version of
> Pacemaker would behave differently?
>

Quite likely.


>
> - Is there a mechanism by which the crmd (and lrmd) debug
> levels can be increased at run time (allowing more debug messages in the
> log output)?
>

IIRC, killall -USR1 crmd lrmd

In later versions we support killall -TRAP to dump the current buffer of
trace logging to disk (we keep the last 5MB in memory in case an error
occurs).
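
As concrete commands (the USR2 behaviour is from memory, so treat it as
an assumption and verify on your 1.0.x build):

    # Each USR1 raises the daemon's log verbosity one level at runtime
    killall -USR1 crmd lrmd
    # From memory: USR2 steps the verbosity back down
    killall -USR2 crmd lrmd
    # Later versions only: dump the in-memory trace buffer (~5MB) to disk
    killall -TRAP crmd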


>
> Thanks very much for your help,
>
> Michael Powell
>
>     Michael Powell
>     Staff Engineer
>
>     15220 NW Greenbrier Pkwy
>         Suite 290
>     Beaverton, OR   97006
>     T 503-372-7327    M 503-789-3019   H 503-625-5332
>
>     www.harmonicinc.com
>
>