[Pacemaker] Call cib_query failed (-41): Remote node did not respond

Brian J. Murrell brian at interlinx.bc.ca
Wed Jun 27 09:30:00 EDT 2012


On 12-06-26 09:54 PM, Andrew Beekhof wrote:
> 
> The DC, possibly you didn't have one at that moment in time.

It was the DC in fact.  I restarted corosync on that node and the
timeouts went away.  But note I "re"started, not started.  It was
running at the time, just not properly, apparently.

> Were there (m)any membership events occurring at the time?

I'm not sure.

I do seem to be able to reproduce this situation though with some
software I have that's driving pacemaker configuration building.

I essentially have 34 resources across 17 nodes that I need to populate
pacemaker with, complete with location constraints.  This populating is
done with a pair of cibadmin commands, one for the resource and one for
the constraint.  These pairs of commands are being run for each resource
on the nodes on which they will run.
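
For illustration, one such pair looks roughly like the sketch below. The
resource id, the ocf:heartbeat:Dummy agent, the score of 100 and the helper
name add_resource_pair are placeholders, not what the driving software
actually generates; only the cibadmin options (-C, -o, -X) are the real ones.

```shell
#!/bin/sh
# Sketch: add one resource and its location constraint with a pair of
# cibadmin calls.  All ids, the Dummy agent and the score are placeholders.

# Allow the command to be overridden (e.g. CIBADMIN=echo for a dry run).
: "${CIBADMIN:=cibadmin}"

add_resource_pair() {
    rsc="$1"    # resource id (hypothetical)
    node="$2"   # node the resource should prefer

    # First command: create the primitive in the resources section.
    "$CIBADMIN" -C -o resources -X \
        "<primitive id=\"${rsc}\" class=\"ocf\" provider=\"heartbeat\" type=\"Dummy\"/>" \
        || return 1

    # Second command: create the location constraint for that resource.
    "$CIBADMIN" -C -o constraints -X \
        "<rsc_location id=\"${rsc}-on-${node}\" rsc=\"${rsc}\" node=\"${node}\" score=\"100\"/>"
}
```

Calling "add_resource_pair rsc1 node1" on node1 would then issue the two
commands back to back, stopping if the first one fails.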

So, that's 17 pairs of cibadmin commands being run, one pair on each
node, concurrently -- so yes, lots of thrashing of the CIB.  Is the CIB
and/or cibadmin not up to this kind of thrashing?

Typically while this is happening some number of cibadmin commands will
start failing with:

Call cib_create failed (-41): Remote node did not respond

and then calls to (say) "cibadmin -Q" on every node except the DC will
start failing with:

Call cib_query failed (-41): Remote node did not respond

After restarting corosync on the DC, (most if not all of) the non-DC
nodes are now able to return from "cibadmin -Q" but they have differing
CIB contents.  That state doesn't seem to last long and all nodes except
the (typically new/different) DC node again suffer "Remote node did not
respond".  A restart of that new DC again yields some/most of the nodes
able to complete queries again, but again, with differing CIB content.
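
One rough way to see that divergence is to checksum each node's
"cibadmin -Q" output and compare. This is only a sketch: it assumes
password-less ssh to every node and that md5sum is installed, and the
helper name cib_checksums and the REMOTE override are made up for the
example. A differing checksum only says the text differs, not where.

```shell
#!/bin/sh
# Sketch: print one checksum per node for its "cibadmin -Q" output so that
# diverging copies of the CIB stand out.  Assumes password-less ssh.

# Command used to reach a node; overridable for testing (hypothetical knob).
: "${REMOTE:=ssh}"

cib_checksums() {
    for node in "$@"; do
        # A node whose query fails (e.g. with error -41) is reported as such.
        if out=$("$REMOTE" "$node" cibadmin -Q 2>/dev/null); then
            sum=$(printf '%s\n' "$out" | md5sum | cut -d' ' -f1)
        else
            sum="query-failed"
        fi
        printf '%s %s\n' "$node" "$sum"
    done
}
```

For example, "cib_checksums node1 node2 node3" prints one "node checksum"
line per node; piping that through awk '{print $2}' | sort | uniq -c
shows how many distinct copies exist.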

I am using corosync-1.4.1-4.el6_2.3 and pacemaker-1.1.6-3.el6 on these
nodes.

Any ideas?  Am I really pushing the CIB too hard with all of the
concurrent modifications?

b.
