[Pacemaker] Call cib_query failed (-41): Remote node did not respond

Andrew Beekhof andrew at beekhof.net
Wed Jun 27 23:30:39 EDT 2012


On Thu, Jun 28, 2012 at 1:29 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
> On Wed, Jun 27, 2012 at 11:30 PM, Brian J. Murrell
> <brian at interlinx.bc.ca> wrote:
>> On 12-06-26 09:54 PM, Andrew Beekhof wrote:
>>>
>>> The DC, possibly you didn't have one at that moment in time.
>>
>> It was the DC in fact.  I restarted corosync on that node and the
>> timeouts went away.  But note I "re"started, not started.  It was
>> running at the time, just not properly, apparently.
>>
>>> Were there (m)any membership events occurring at the time?
>>
>> I'm not sure.
>>
>> I do seem to be able to reproduce this situation though with some
>> software I have that's driving pacemaker configuration building.
>>
>> I essentially have 34 resources across 17 nodes that I need to populate
>> pacemaker with, complete with location constraints.  This populating is
>> done with a pair of cibadmin commands, one for the resource and one for
>> the constraint.  These pairs of commands are being run for each resource
>> on the nodes on which they will run.
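>>
>> Roughly, each pair looks something like this (the resource and node
>> names here are just placeholders):
>>
>>   cibadmin -o resources -C -X '<primitive id="res1" class="ocf" provider="heartbeat" type="Dummy"/>'
>>   cibadmin -o constraints -C -X '<rsc_location id="res1-loc" rsc="res1" node="node01" score="INFINITY"/>'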
>>
>> So, that's 17 pairs of cibadmin commands being run, one pair on each
>> node, concurrently -- so yes, lots of thrashing of the CIB.  Is the CIB
>> and/or cibadmin not up to this kind of thrashing?
>
> As with any interesting question, "it depends".
>
> Newer versions have become more efficient over time, so an upgrade may help.
> Even something as trivial as switching our logging library made a
> significant difference.
> For 1.1.8 we've also switched the IPC library, which should bring
> another performance boost.
>
> If the services you're adding are clones, that can also have a big
> impact as the number of probe operations is clones * clone-max *
> nodes.
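> (For instance, 5 clones with clone-max=17 across 17 nodes would be
> 5 * 17 * 17 = 1445 probes.)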
> That's a lot of updates hitting the CIB when the cluster first starts up.
>
> To mitigate the thrashing, try setting the batch-limit parameter in
> the cib (man pengine).
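> For example, something along these lines sets it to 30 (an arbitrary
> value; the exact crm_attribute option names vary a little between
> versions):
>
>   crm_attribute -n batch-limit -v 30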
>
>>
>> Typically while this is happening some number of cibadmin commands will
>> start failing with:
>>
>> Call cib_create failed (-41): Remote node did not respond
>>
>> and then calls to (say) "cibadmin -Q" on every node except the DC will
>> start failing with:
>>
>> Call cib_query failed (-41): Remote node did not respond
>>
>> After restarting corosync on the DC, (most if not all of) the non-DC
>> nodes are now able to return from "cibadmin -Q" but they have differing
>> CIB contents.  That state doesn't seem to last long and all nodes except
>> the (typically new/different) DC node again suffer "Remote node did not
>> respond".  A restart of that new DC again yields some/most of the nodes
>> able to complete queries again but, again, with differing CIB content.
>
> Really doesn't sound good.
> Could you check CPU usage of the various cluster processes with top
> while this is occurring?
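> Something like this gives a quick snapshot:
>
>   top -b -n 1 | egrep 'corosync|cib|crmd|lrmd|pengine|stonithd'
>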
> Otherwise perhaps the traffic is making corosync twitchy and you're hitting:
>   http://bugzilla.redhat.com/show_bug.cgi?id=820821
>
>> I am using corosync-1.4.1-4.el6_2.3 and pacemaker-1.1.6-3.el6 on these
>> nodes.
>>
>> Any ideas?  Am I really pushing the CIB too hard with all of the
>> concurrent modifications?

The updates from you aren't the problem.  It's the number of resource
operations (which need to be stored in the CIB) resulting from your
changes that might be causing the problem.
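
A rough way to gauge that is to count the operation history entries
recorded in the status section, e.g.:

  cibadmin -Q -o status | grep -c '<lrm_rsc_op'

(each <lrm_rsc_op> element is one stored operation result).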



