[Pacemaker] [Problem]Cib cannot update an attribute by 16 node constitution.
Andrew Beekhof
andrew at beekhof.net
Wed Aug 4 09:12:47 UTC 2010
2010/8/3 nozawat <nozawat at gmail.com>:
> Hi Andrew,
>
> I changed cluster option to batch-limit=3,I re-tried it.
> However, similar time-out occurs.
>
> I measured processing just before the time-out(120s) in systemtap.
> The following only the function long time.
> -----
> probe start! ---------------------------------
> cib_process_request [call-count:179][117,540,173,155 nsec]
> cib_process_command [call:179] [116,471,047,275 nsec]
> cib_process_command call function ---
> cib_config_changed [call:179] [101,169,909,572 nsec]
> cib_config_changed call function ---
> calculate_xml_digest [call:179] [ 68,820,560,745 nsec]
> create_xml_node [call:3012263] [ 19,855,469,976 nsec]※
But, but, create_xml_node() barely does _anything_.
It's hard to imagine how this could take 19s:
xmlNode*
create_xml_node(xmlNode *parent, const char *name)
{
xmlDoc *doc = NULL;
xmlNode *node = NULL;
if (name == NULL || name[0] == 0) {
return NULL;
}
if(parent == NULL) {
doc = xmlNewDoc((const xmlChar*)"1.0");
node = xmlNewDocRawNode(doc, NULL, (const xmlChar*)name, NULL);
xmlDocSetRootElement(doc, node);
} else {
doc = getDocPtr(parent);
node = xmlNewDocRawNode(doc, NULL, (const xmlChar*)name, NULL);
xmlAddChild(parent, node);
}
return node;
}
Is this data all from the CIB process?
What was the load from the other processes like? (Ie. using top)
> xpath_search [call:179] [ 145,030,232 nsec]
> diff_xml_object [call:179] [ 32,677,359,476 nsec]※
> calculate_xml_digest call function ---
> sorted_xml [call:1505799] [ 52,512,465,838 nsec]※
Yikes, these are also really high.
Could you send me the CIB so that we can profile these operations on it?
> copy_xml [call:179] [ 3,692,232,073 nsec]
> dump_xml [call:536] [ 6,177,606,232 nsec]
> -----
> Is there the method to make these processing early?
You mean faster?
Maybe oprofile can suggest something.
>
>
> 2010/6/14 <renayama19661014 at ybb.ne.jp>
>>
>> Hi Andrew,
>>
>> Thank you for comment.
>>
>> > More likely of the underlying messaging infrastructure, but I'll take a
>> > look.
>> > Perhaps the default cib operation timeouts are too low for larger
>> > clusters.
>> >
>> > >
>> > > The log attached it to next Bugzilla.
>> > > �*
>> > > http://developerbugs.linux-foundation.org/show_bug.cgi?id=2443
>> >
>> > Ok, I'll follow up there.
>>
>> If it is necessary for us to work for the solution of the problem, please
>> order it.
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>> --- Andrew Beekhof <andrew at beekhof.net> wrote:
>>
>> > On Mon, Jun 14, 2010 at 4:46 AM, <renayama19661014 at ybb.ne.jp> wrote:
>> > > We tested 16 node constitution (15+1).
>> > >
>> > > We carried out the next procedure.
>> > >
>> > > Step1) Start 16 nodes.
>> > > Step2) Send cib after a DC node was decided.
>> > >
>> > > An error occurs by the update of the attribute of pingd after Probe
>> > > processing was over.
>> > >
>> > >
>> >
>>
>> ----------------------------------------------------------------------------------------------------------------------------------------
>> > > Jun 14 10:58:03 hb0102 pingd: [2465]: info: ping_read: Retrying...
>> > > Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update
>> > > 337 for
>> > default_ping_set=1600
>> > > failed: Remote node did not respond
>> > > Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update
>> > > 340 for
>> > default_ping_set=1600
>> > > failed: Remote node did not respond
>> > > Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update
>> > > 343 for
>> > default_ping_set=1600
>> > > failed: Remote node did not respond
>> > > Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update
>> > > 346 for
>> > default_ping_set=1600
>> > > failed: Remote node did not respond
>> > > Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update
>> > > 349 for
>> > default_ping_set=1600
>> > > failed: Remote node did not respond
>> > >
>> >
>>
>> ----------------------------------------------------------------------------------------------------------------------------------------
>> > >
>> > > In the middle of this error, I carried out a cibadmin(-Q optin)
>> > > command, but time-out
>> > occurred.
>> > > In addition, cib of the DC node seemed to move by the top command very
>> > > busily.
>> > >
>> > >
>> > > In addition, a communication error with cib occurs in the DC node, and
>> > > crmd reboots.
>> > >
>> > >
>> >
>>
>> ----------------------------------------------------------------------------------------------------------------------------------------
>> > > Jun 14 10:58:09 hb0101 attrd: [2278]: WARN: xmlfromIPC: No message
>> > > received in the required
>> > interval
>> > > (120s)
>> > > Jun 14 10:58:09 hb0101 attrd: [2278]: info: attrd_perform_update: Sent
>> > > update -41:
>> > > default_ping_set=1600
>> > > (snip)
>> > > Jun 14 10:59:07 hb0101 crmd: [2280]: info: do_exit: [crmd] stopped (2)
>> > > Jun 14 10:59:07 hb0101 corosync[2269]: � [pcmk �]
>> > > plugin.c:858 info: pcmk_ipc_exit:
>> Client
>> > crmd
>> > > (conn=0x106a2bf0, async-conn=0x106a2bf0) left
>> > > Jun 14 10:59:08 hb0101 corosync[2269]: � [pcmk �]
>> > > plugin.c:481 ERROR:
>> pcmk_wait_dispatch:
>> > Child
>> > > process crmd exited (pid=2280, rc=2)
>> > > Jun 14 10:59:08 hb0101 corosync[2269]: � [pcmk �]
>> > > plugin.c:498 notice:
>> pcmk_wait_dispatch:
>> > Respawning
>> > > failed child process: crmd
>> > > Jun 14 10:59:08 hb0101 corosync[2269]: � [pcmk �]
>> > > utils.c:131 info: spawn_child:
>> Forked child
>> > 2680 for
>> > > process crmd
>> > > Jun 14 10:59:08 hb0101 crmd: [2680]: info: Invoked:
>> > > /usr/lib64/heartbeat/crmd
>> > > Jun 14 10:59:08 hb0101 crmd: [2680]: info: main: CRM Hg Version:
>> > > 9f04fa88cfd3da553e977cc79983d1c494c8b502
>> > > Jun 14 10:59:08 hb0101 crmd: [2680]: info: crmd_init: Starting crmd
>> > > Jun 14 10:59:08 hb0101 crmd: [2680]: info: G_main_add_SignalHandler:
>> > > Added signal handler for
>> > signal
>> > > 17
>> > >
>> >
>>
>> ----------------------------------------------------------------------------------------------------------------------------------------
>> > >
>> > > There seems to be a problem in cib of the DC node somehow or other.
>> > > We hope that an attribute change is completed in 16 nodes definitely.
>> > > �* Is this phenomenon a limit of the current cib process?
>> >
>> > More likely of the underlying messaging infrastructure, but I'll take a
>> > look.
>> > Perhaps the default cib operation timeouts are too low for larger
>> > clusters.
>> >
>> > >
>> > > The log attached it to next Bugzilla.
>> > > �*
>> > > http://developerbugs.linux-foundation.org/show_bug.cgi?id=2443
>> >
>> > Ok, I'll follow up there.
>> >
>> > >
>> > > Best Regards,
>> > > Hideo Yamauchi.
>> > >
>> > >
>> > >
>> > > _______________________________________________
>> > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> > >
>> > > Project Home: http://www.clusterlabs.org
>> > > Getting started:
>> > > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > > Bugs:
>> > > http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>> > >
>> >
>> > _______________________________________________
>> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >
>> > Project Home: http://www.clusterlabs.org
>> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > Bugs:
>> > http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>> >
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs:
>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>
>
More information about the Pacemaker
mailing list