[Pacemaker] [Problem]Cib cannot update an attribute by 16 node constitution.

Thu Aug 5 06:55:13 UTC 2010

Hi Andrew,

 I took oprofile of 120 seconds before Time-out occurs.
 However, sorted_xml function seems not to be called it so much as far as I
watch log of Oprofile.

 I took Oprofile by a sample program.
 But, the contents of Oprofile did not agree with the contents of the sample
program.
 Therefore I used systemtap because I was not able to do internal bottleneck
analysis from information of this Oprofile.

2010年8月4日18:56 nozawat <nozawat at gmail.com>:

> Hi Andrew,
>
>  I send the cib file of the point in time when environmental construction
> was completed.
>
>
> >Maybe oprofile can suggest something.
> OK, I measure even Oprofile.
>
> --
> Regards,
> Tomo
>
>
> 2010/8/4 Andrew Beekhof <andrew at beekhof.net>
>
> 2010/8/3 nozawat <nozawat at gmail.com>:
>> > Hi Andrew,
>> >
>> > I changed cluster option to batch-limit=3,I re-tried it.
>> > However, similar time-out occurs.
>> >
>> > I measured processing just before the time-out(120s) in systemtap.
>> > The following only the function long time.
>> > -----
>> > probe start! ---------------------------------
>> >   cib_process_request  [call-count:179][117,540,173,155 nsec]
>> >   cib_process_command  [call:179]      [116,471,047,275 nsec]
>> > cib_process_command  call function ---
>> >   cib_config_changed   [call:179]      [101,169,909,572 nsec]
>> > cib_config_changed   call function ---
>> >   calculate_xml_digest [call:179]      [ 68,820,560,745 nsec]
>> >   create_xml_node      [call:3012263]  [ 19,855,469,976 nsec]※
>>
>> But, but, create_xml_node() barely does _anything_.
>>
>> It's hard to imagine how this could take 19s:
>>
>> xmlNode*
>> create_xml_node(xmlNode *parent, const char *name)
>> {
>>    xmlDoc *doc = NULL;
>>    xmlNode *node = NULL;
>>
>>    if (name == NULL || name[0] == 0) {
>>        return NULL;
>>    }
>>
>>    if(parent == NULL) {
>>        doc = xmlNewDoc((const xmlChar*)"1.0");
>>        node = xmlNewDocRawNode(doc, NULL, (const xmlChar*)name, NULL);
>>        xmlDocSetRootElement(doc, node);
>>
>>    } else {
>>        doc = getDocPtr(parent);
>>        node = xmlNewDocRawNode(doc, NULL, (const xmlChar*)name, NULL);
>>        xmlAddChild(parent, node);
>>    }
>>    return node;
>> }
>>
>> Is this data all from the CIB process?
>> What was the load from the other processes like? (Ie. using top)
>>
>> >   xpath_search         [call:179]      [    145,030,232 nsec]
>> >   diff_xml_object      [call:179]      [ 32,677,359,476 nsec]※
>> > calculate_xml_digest call function ---
>> >   sorted_xml           [call:1505799]  [ 52,512,465,838 nsec]※
>>
>> Yikes, these are also really high.
>> Could you send me the CIB so that we can profile these operations on it?
>>
>> >   copy_xml             [call:179]      [  3,692,232,073 nsec]
>> >   dump_xml             [call:536]      [  6,177,606,232 nsec]
>> > -----
>> > Is there the method to make these processing early?
>>
>> You mean faster?
>> Maybe oprofile can suggest something.
>>
>> >
>> >
>> > 2010/6/14 <renayama19661014 at ybb.ne.jp>
>> >>
>> >> Hi Andrew,
>> >>
>> >> Thank you for comment.
>> >>
>> >> > More likely of the underlying messaging infrastructure, but I'll take
>> a
>> >> > look.
>> >> > Perhaps the default cib operation timeouts are too low for larger
>> >> > clusters.
>> >> >
>> >> > >
>> >> > > The log attached it to next Bugzilla.
>> >> > > �*
>> >> > > http://developerbugs.linux-foundation.org/show_bug.cgi?id=2443
>> >> >
>> >> > Ok, I'll follow up there.
>> >>
>> >> If it is necessary for us to work for the solution of the problem,
>> please
>> >> order it.
>> >>
>> >> Best Regards,
>> >> Hideo Yamauchi.
>> >>
>> >> --- Andrew Beekhof <andrew at beekhof.net> wrote:
>> >>
>> >> > On Mon, Jun 14, 2010 at 4:46 AM,  <renayama19661014 at ybb.ne.jp>
>> wrote:
>> >> > > We tested 16 node constitution (15+1).
>> >> > >
>> >> > > We carried out the next procedure.
>> >> > >
>> >> > > Step1) Start 16 nodes.
>> >> > > Step2) Send cib after a DC node was decided.
>> >> > >
>> >> > > An error occurs by the update of the attribute of pingd after Probe
>> >> > > processing was over.
>> >> > >
>> >> > >
>> >> >
>> >>
>> >>
>> ----------------------------------------------------------------------------------------------------------------------------------------
>> >> > > Jun 14 10:58:03 hb0102 pingd: [2465]: info: ping_read: Retrying...
>> >> > > Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback:
>> Update
>> >> > > 337 for
>> >> > default_ping_set=1600
>> >> > > failed: Remote node did not respond
>> >> > > Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback:
>> Update
>> >> > > 340 for
>> >> > default_ping_set=1600
>> >> > > failed: Remote node did not respond
>> >> > > Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback:
>> Update
>> >> > > 343 for
>> >> > default_ping_set=1600
>> >> > > failed: Remote node did not respond
>> >> > > Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback:
>> Update
>> >> > > 346 for
>> >> > default_ping_set=1600
>> >> > > failed: Remote node did not respond
>> >> > > Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback:
>> Update
>> >> > > 349 for
>> >> > default_ping_set=1600
>> >> > > failed: Remote node did not respond
>> >> > >
>> >> >
>> >>
>> >>
>> ----------------------------------------------------------------------------------------------------------------------------------------
>> >> > >
>> >> > > In the middle of this error, I carried out a cibadmin(-Q optin)
>> >> > > command, but time-out
>> >> > occurred.
>> >> > > In addition, cib of the DC node seemed to move by the top command
>> very
>> >> > > busily.
>> >> > >
>> >> > >
>> >> > > In addition, a communication error with cib occurs in the DC node,
>> and
>> >> > > crmd reboots.
>> >> > >
>> >> > >
>> >> >
>> >>
>> >>
>> ----------------------------------------------------------------------------------------------------------------------------------------
>> >> > > Jun 14 10:58:09 hb0101 attrd: [2278]: WARN: xmlfromIPC: No message
>> >> > > received in the required
>> >> > interval
>> >> > > (120s)
>> >> > > Jun 14 10:58:09 hb0101 attrd: [2278]: info: attrd_perform_update:
>> Sent
>> >> > > update -41:
>> >> > > default_ping_set=1600
>> >> > > (snip)
>> >> > > Jun 14 10:59:07 hb0101 crmd: [2280]: info: do_exit: [crmd] stopped
>> (2)
>> >> > > Jun 14 10:59:07 hb0101 corosync[2269]: � [pcmk �]
>> >> > > plugin.c:858 info: pcmk_ipc_exit:
>> >> Client
>> >> > crmd
>> >> > > (conn=0x106a2bf0, async-conn=0x106a2bf0) left
>> >> > > Jun 14 10:59:08 hb0101 corosync[2269]: � [pcmk �]
>> >> > > plugin.c:481 ERROR:
>> >> pcmk_wait_dispatch:
>> >> > Child
>> >> > > process crmd exited (pid=2280, rc=2)
>> >> > > Jun 14 10:59:08 hb0101 corosync[2269]: � [pcmk �]
>> >> > > plugin.c:498 notice:
>> >> pcmk_wait_dispatch:
>> >> > Respawning
>> >> > > failed child process: crmd
>> >> > > Jun 14 10:59:08 hb0101 corosync[2269]: � [pcmk �]
>> >> > > utils.c:131 info: spawn_child:
>> >> Forked child
>> >> > 2680 for
>> >> > > process crmd
>> >> > > Jun 14 10:59:08 hb0101 crmd: [2680]: info: Invoked:
>> >> > > /usr/lib64/heartbeat/crmd
>> >> > > Jun 14 10:59:08 hb0101 crmd: [2680]: info: main: CRM Hg Version:
>> >> > > 9f04fa88cfd3da553e977cc79983d1c494c8b502
>> >> > > Jun 14 10:59:08 hb0101 crmd: [2680]: info: crmd_init: Starting crmd
>> >> > > Jun 14 10:59:08 hb0101 crmd: [2680]: info:
>> G_main_add_SignalHandler:
>> >> > > Added signal handler for
>> >> > signal
>> >> > > 17
>> >> > >
>> >> >
>> >>
>> >>
>> ----------------------------------------------------------------------------------------------------------------------------------------
>> >> > >
>> >> > > There seems to be a problem in cib of the DC node somehow or other.
>> >> > > We hope that an attribute change is completed in 16 nodes
>> definitely.
>> >> > > �* Is this phenomenon a limit of the current cib process?
>> >> >
>> >> > More likely of the underlying messaging infrastructure, but I'll take
>> a
>> >> > look.
>> >> > Perhaps the default cib operation timeouts are too low for larger
>> >> > clusters.
>> >> >
>> >> > >
>> >> > > The log attached it to next Bugzilla.
>> >> > > �*
>> >> > > http://developerbugs.linux-foundation.org/show_bug.cgi?id=2443
>> >> >
>> >> > Ok, I'll follow up there.
>> >> >
>> >> > >
>> >> > > Best Regards,
>> >> > > Hideo Yamauchi.
>> >> > >
>> >> > >
>> >> > >
>> >> > > _______________________________________________
>> >> > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> >> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >> > >
>> >> > > Project Home: http://www.clusterlabs.org
>> >> > > Getting started:
>> >> > > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> >> > > Bugs:
>> >> > >
>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>> >> > >
>> >> >
>> >> > _______________________________________________
>> >> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> >> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >> >
>> >> > Project Home: http://www.clusterlabs.org
>> >> > Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> >> > Bugs:
>> >> >
>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>> >> >
>> >>
>> >>
>> >> _______________________________________________
>> >> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> >> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >>
>> >> Project Home: http://www.clusterlabs.org
>> >> Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> >> Bugs:
>> >>
>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>> >
>> >
>> > _______________________________________________
>> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >
>> > Project Home: http://www.clusterlabs.org
>> > Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > Bugs:
>> >
>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>> >
>> >
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs:
>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>
>
>
>
> --
> Regards,
> Tomo
>
>

-- 
Regards,
Tomo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100805/abd9e6db/attachment-0002.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: oprofile-log.log
Type: application/octet-stream
Size: 35965 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100805/abd9e6db/attachment-0004.obj>