[Pacemaker] Call cib_query failed (-41): Remote node did not respond

Andrew Beekhof andrew at beekhof.net
Wed Jun 27 23:29:21 EDT 2012


On Wed, Jun 27, 2012 at 11:30 PM, Brian J. Murrell
<brian at interlinx.bc.ca> wrote:
> On 12-06-26 09:54 PM, Andrew Beekhof wrote:
>>
>> The DC, possibly you didn't have one at that moment in time.
>
> It was the DC in fact.  I restarted corosync on that node and the
> timeouts went away.  But note I "re"started, not started.  It was
> running at the time, just not properly, apparently.
>
>> Were there (m)any membership events occurring at the time?
>
> I'm not sure.
>
> I do seem to be able to reproduce this situation though with some
> software I have that's driving pacemaker configuration building.
>
> I essentially have 34 resources across 17 nodes that I need to populate
> pacemaker with, complete with location constraints.  This populating is
> done with a pair of cibadmin commands, one for the resource and one for
> the constraint.  These pairs of commands are being run for each resource
> on the nodes on which they will run.
>
> So, that's 17 pairs of cibadmin commands being run, one pair on each
> node, concurrently -- so yes, lots of thrashing of the CIB.  Is the CIB
> and/or cibadmin not up to this kind of thrashing?
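
(For concreteness, a resource/constraint pair of that sort might look
roughly like the following; the resource name, agent and node are
invented purely for illustration:)

    # create one primitive resource...
    cibadmin -o resources -C -X '<primitive id="res1" class="ocf" provider="heartbeat" type="Dummy"/>'
    # ...and pin it to the node it should run on
    cibadmin -o constraints -C -X '<rsc_location id="res1-loc" rsc="res1" node="node01" score="INFINITY"/>'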

As with any interesting question, "it depends".

Newer versions have become more efficient over time, so an upgrade may help.
Even something as trivial as switching our logging library made a
significant difference.
For 1.1.8 we've also switched the IPC library, which should bring
another performance boost.

If the services you're adding are clones, that can also have a big
impact, as the number of probe operations is clones * clone-max *
nodes.
That's a lot of updates hitting the CIB when the cluster first starts up.
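
Purely for illustration (assuming your 34 resources were clones with
clone-max equal to the node count, which may well not match your
setup):

    34 clones x clone-max=17 x 17 nodes = 9,826 probe operations
    34 primitives x 17 nodes            =   578 probe operations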

To mitigate the thrashing, try setting the batch-limit parameter in
the CIB (man pengine).
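
A sketch of how that might be set (the value 30 here is an arbitrary
starting point, not a recommendation):

    # cap how many actions the transition engine fires off in parallel
    crm_attribute --type crm_config --name batch-limit --update 30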

>
> Typically while this is happening some number of cibadmin commands will
> start failing with:
>
> Call cib_create failed (-41): Remote node did not respond
>
> and then calls to (say) "cibadmin -Q" on every node except the DC will
> start failing with:
>
> Call cib_query failed (-41): Remote node did not respond
>
> After restarting corosync on the DC, (most if not all of) the non-DC
> nodes are now able to return from "cibadmin -Q" but they have differing
> CIB contents.  That state doesn't seem to last long and all nodes except
> the (typically new/different) DC node again suffer "Remote node did not
> respond".  A restart of that new DC again yields some/most of the nodes
> able to complete queries again, but again with differing CIB content.

Really doesn't sound good.
Could you check CPU usage of the various cluster processes with top
while this is occurring?
Otherwise perhaps the traffic is making corosync twitchy and you're hitting:
   http://bugzilla.redhat.com/show_bug.cgi?id=820821
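
If you want a quick snapshot, something along these lines would do
(the process names below are the usual 1.1.x daemons; adjust to
whatever is actually running on your nodes):

    # one-shot view of the cluster daemons' CPU usage
    top -b -n 1 | egrep 'corosync|cib|crmd|pengine|attrd|lrmd|stonith'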

> I am using corosync-1.4.1-4.el6_2.3 and pacemaker-1.1.6-3.el6 on these
> nodes.
>
> Any ideas?  Am I really pushing the CIB too hard with all of the
> concurrent modifications?
>
> b.
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



