[Pacemaker] Call cib_query failed (-41): Remote node did not respond

Tue Jul 3 16:26:44 EDT 2012

----- Original Message -----
> From: "Brian J. Murrell" <brian at interlinx.bc.ca>
> To: pacemaker at clusterlabs.org
> Sent: Tuesday, July 3, 2012 2:15:09 PM
> Subject: Re: [Pacemaker] Call cib_query failed (-41): Remote node did not respond
> 
> On 12-06-27 11:30 PM, Andrew Beekhof wrote:
> > 
> > The updates from you aren't the problem.  It's the number of resource
> > operations (that need to be stored in the CIB) that result from your
> > changes that might be causing the problem.
> 
> Just to follow this up for anyone currently following or anyone finding
> this thread in the future...
> 
> It turns out that the problem is simply the size of the HA cluster that
> I want to create.  The details are in the bug I filed at
> http://bugs.clusterlabs.org/show_bug.cgi?id=5076 but the short story is
> that I can add the number of resources and constraints I want to add
> (i.e. 32-34 of each, as previously described in this thread),
> concurrently even, so long as there are no more than 4 nodes per
> corosync/pacemaker cluster.
> 
> Even adding 4 passive nodes (nodes that do no CIB operations of their
> own) made pacemaker crumble.  I only tried a total of 8 nodes, not
> values between 4 and 8, so the tipping point might be somewhere in
> between.
>
> 
> So the summary seems to be that pacemaker cannot scale to more than a
> handful of nodes, even when the nodes are big: 12-core Xeon nodes with
> gobs of memory.

This is not definitive.  You may be seeing this because of the pacemaker version you are running and the torture test you are running with all those parallel commands, but I wouldn't go as far as to say pacemaker cannot scale to more than a handful of nodes.  It depends entirely on the situation.  16 nodes with 32 resources might work... 3 nodes with 100 resources might not.  There is a limit to how far deployments can scale, but it is not easy to quantify values that hold true across all deployments.  I'm sure you know this; I just wanted to be explicit so that nobody is misled into using your example as a concrete metric.

> 
> I can only guess that everybody is using pacemaker in "pair" (or not
> much bigger) type configurations currently.  Is that accurate?
>

From the deployments I've seen on the mailing list and bug reports, the most common clusters appear to be around the 2-6 node mark.

> Perhaps there is some tuning that can be done to scale somewhat, but
> realistically, I am looking for pacemaker clusters in the tens, if not
> into the hundreds of nodes.  However, I really wonder if any amount

The messaging involved in keeping all the local resource operations in the CIB synced across that many nodes is pretty insane.  If you are set on using pacemaker, the best approach to scaling in your situation would probably be to figure out how to break the nodes into smaller clusters that are easier to manage.  I have not heard of a single deployment as large as the one you are thinking of.

-- Vossel

> of tuning could be done to achieve clusters that large given the small
> number of nodes supported with the default tuning values.
>
> 
> Thoughts?
> 
> b.
> 
> 