[Pacemaker] Call cib_query failed (-41): Remote node did not respond

Wed Aug 15 00:16:10 EDT 2012

On 05/07/2012, at 2:51 AM, Brian J. Murrell <brian at interlinx.bc.ca> wrote:

> On 12-07-04 04:27 AM, Andreas Kurz wrote:
>> 
>> beside increasing the batch limit to a higher value ... did you also
>> tune corosync totem timings?
> 
> Not yet.
> 
> But a closer look at the logs reveals a bunch of these:
> 
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] ERROR: send_cluster_msg_raw: Child 25046 spawned to record non-fatal assertion failure line 1594: rc == 0
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] ERROR: send_cluster_msg_raw: Message not sent (-1): <copy t="cib" cib_op="cib_replace" cib_delegated_from="node-4.lab.example.com"
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] WARN: route_ais_message: Sending message to node-4.lab.example.com.cib failed: cluster delivery failed (rc=-1)
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] ERROR: send_cluster_msg_raw: Child 25048 spawned to record non-fatal assertion failure line 1594: rc == 0
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] ERROR: send_cluster_msg_raw: Message not sent (-1): <copy t="cib" cib_op="cib_replace" cib_delegated_from="node-6.lab.example.com"
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] WARN: route_ais_message: Sending message to node-6.lab.example.com.cib failed: cluster delivery failed (rc=-1)
> Jun 28 14:56:56 node-2 abrt[25049]: not dumping repeating crash in '/usr/sbin/corosync'
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] ERROR: send_cluster_msg_raw: Child 25050 spawned to record non-fatal assertion failure line 1594: rc == 0
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] ERROR: send_cluster_msg_raw: Message not sent (-1): <copy t="cib" cib_op="cib_replace" cib_delegated_from="node-10.lab.example.com
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] WARN: route_ais_message: Sending message to node-10.lab.example.com.cib failed: cluster delivery failed (rc=-1)
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] ERROR: send_cluster_msg_raw: Child 25051 spawned to record non-fatal assertion failure line 1594: rc == 0
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] ERROR: send_cluster_msg_raw: Message not sent (-1): <copy t="cib" cib_op="cib_replace" cib_delegated_from="node-7.lab.example.com"
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] WARN: route_ais_message: Sending message to node-7.lab.example.com.cib failed: cluster delivery failed (rc=-1)
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] ERROR: send_cluster_msg_raw: Child 25052 spawned to record non-fatal assertion failure line 1594: rc == 0
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] ERROR: send_cluster_msg_raw: Message not sent (-1): <copy t="cib" cib_op="cib_replace" cib_delegated_from="node-4.lab.example.com"
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] WARN: route_ais_message: Sending message to node-4.lab.example.com.cib failed: cluster delivery failed (rc=-1)
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] ERROR: send_cluster_msg_raw: Child 25053 spawned to record non-fatal assertion failure line 1594: rc == 0
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] ERROR: send_cluster_msg_raw: Message not sent (-1): <copy t="cib" cib_op="cib_replace" cib_delegated_from="node-6.lab.example.com"
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] WARN: route_ais_message: Sending message to node-6.lab.example.com.cib failed: cluster delivery failed (rc=-1)
> Jun 28 14:56:56 node-2 corosync[30497]:   [pcmk  ] ERROR: send_cluster_msg_raw: Child 25054 spawned to record non-fatal assertion failure line 1594: rc == 0
> 
> Google could not seem to turn up anything about the assertion message.
> 
> I also saw these after setting the batch-limit to 1 and repeating my 8
> node (4 active, 4 idle) experiment today.

That's unusual.

> 
> But surely, it is easy to understand why pacemaker would have problems
> if corosync is aborting on a failed assertion.

Technically that's still Pacemaker: the assertion comes from the plugin code that gets loaded into corosync.

> 
> Any clues what this one is about?  This is corosync-1.4.1-4.el6_2.3.x86_64.

Looking at the source code, it seems totem (corosync's messaging layer) is refusing to send the message to the other nodes.
I don't know under what conditions it would do this, though.

Possibly it is saying "try again", but the plugin has no logic in place to do so.
You might have more success with Pacemaker in cman mode; there I know we do have retry logic in place for when sending a cluster message fails.
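
To illustrate what "retry logic" means here, below is a minimal sketch in Python. It is not the actual plugin or cman code (that is C). The names send_with_retry, SEND_OK and SEND_TRY_AGAIN are hypothetical, and I am assuming the send call returns a distinct "try again" status, which I have not confirmed for totem.

```python
import time

# Hypothetical status codes for illustration only; the real values
# returned by corosync/totem are not shown in this thread.
SEND_OK = 0
SEND_TRY_AGAIN = 1


def send_with_retry(send_fn, message, max_attempts=5, base_delay=0.1,
                    sleep=time.sleep):
    """Call send_fn(message) until it stops reporting SEND_TRY_AGAIN.

    send_fn is a caller-supplied function returning a status code.
    Returns a (status, attempts) tuple so the caller can log both.
    """
    status = SEND_TRY_AGAIN
    for attempt in range(1, max_attempts + 1):
        status = send_fn(message)
        if status != SEND_TRY_AGAIN:
            # Success or a hard failure: retrying would not help.
            return status, attempt
        if attempt < max_attempts:
            # Back off exponentially before the next attempt.
            sleep(base_delay * (2 ** (attempt - 1)))
    # Still "try again" after the last attempt: give up and report it.
    return status, max_attempts
```

Only the "try again" status is retried; any other return value, including a hard failure such as rc=-1, is handed straight back to the caller.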

> 
> Cheers,
> b.