[Pacemaker] [Problem]Cib cannot update an attribute by 16 node constitution.

Mon Jun 14 02:46:42 UTC 2010

We tested a 16-node configuration (15+1).

We carried out the following procedure.

Step1) Start 16 nodes.
Step2) Send the cib after the DC node was elected.

After probe processing finished, updates of the pingd attribute failed with the following error.

----------------------------------------------------------------------------------------------------------------------------------------
Jun 14 10:58:03 hb0102 pingd: [2465]: info: ping_read: Retrying...
Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 337 for default_ping_set=1600 failed: Remote node did not respond
Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 340 for default_ping_set=1600 failed: Remote node did not respond
Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 343 for default_ping_set=1600 failed: Remote node did not respond
Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 346 for default_ping_set=1600 failed: Remote node did not respond
Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 349 for default_ping_set=1600 failed: Remote node did not respond
----------------------------------------------------------------------------------------------------------------------------------------
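For anyone who wants to count how many attribute updates failed on each node, the following is a minimal sketch and not part of the original test procedure. It assumes each log entry is on a single line (entries wrapped by the mail client must be rejoined first). The names FAILED and failed_updates are hypothetical helpers, and the sample text is copied from the excerpt above.

```python
import re
from collections import Counter

# Sample entries copied from the excerpt above, with wrapped lines rejoined.
LOG = """\
Jun 14 10:58:03 hb0102 pingd: [2465]: info: ping_read: Retrying...
Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 337 for default_ping_set=1600 failed: Remote node did not respond
Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 340 for default_ping_set=1600 failed: Remote node did not respond
"""

# Hypothetical pattern for the attrd failure message seen in this report.
FAILED = re.compile(
    r"^(?P<ts>\w{3} +\d+ [\d:]+) (?P<host>\S+) attrd: .*attrd_cib_callback: "
    r"Update (?P<id>-?\d+) for (?P<attr>[^=\s]+)=(?P<value>\S+) "
    r"failed: (?P<reason>.+)$"
)


def failed_updates(text):
    """Return one dict per failed attrd update found in text."""
    return [m.groupdict() for line in text.splitlines()
            if (m := FAILED.match(line))]


failures = failed_updates(LOG)
per_host = Counter(f["host"] for f in failures)

for host, count in per_host.items():
    print(f"{host}: {count} failed update(s)")
```

Running the same function over the full logs of all 16 nodes would show whether the failures are limited to some nodes or occur everywhere.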

While this error was occurring, I ran the cibadmin command with the -Q option, but it timed out.
In addition, the top command showed that the cib process on the DC node was very busy.
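The responsiveness check described above can be repeated in a controlled way by running the query under an explicit time limit. This is only a sketch: probe is a hypothetical helper, the time limits are arbitrary, and the two stand-in commands are there so that it can be tried on a machine without a cluster.

```python
import subprocess
import sys


def probe(cmd, timeout_s):
    """Run cmd and return 'ok', 'failed', 'timeout', or 'missing'."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except FileNotFoundError:
        return "missing"  # the command is not installed on this machine
    return "ok" if result.returncode == 0 else "failed"


# Stand-ins for a hung query and a healthy query.
hung = probe([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)
quick = probe([sys.executable, "-c", "pass"], timeout_s=30)
print(hung, quick)
```

On a cluster node, the call would be probe(["cibadmin", "-Q"], timeout_s=30). A result of "timeout" corresponds to the behavior reported here.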

Furthermore, a communication error with cib occurred on the DC node, and crmd was restarted.

----------------------------------------------------------------------------------------------------------------------------------------
Jun 14 10:58:09 hb0101 attrd: [2278]: WARN: xmlfromIPC: No message received in the required interval (120s)
Jun 14 10:58:09 hb0101 attrd: [2278]: info: attrd_perform_update: Sent update -41: default_ping_set=1600
(snip)
Jun 14 10:59:07 hb0101 crmd: [2280]: info: do_exit: [crmd] stopped (2)
Jun 14 10:59:07 hb0101 corosync[2269]:   [pcmk  ] plugin.c:858 info: pcmk_ipc_exit: Client crmd (conn=0x106a2bf0, async-conn=0x106a2bf0) left
Jun 14 10:59:08 hb0101 corosync[2269]:   [pcmk  ] plugin.c:481 ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=2280, rc=2)
Jun 14 10:59:08 hb0101 corosync[2269]:   [pcmk  ] plugin.c:498 notice: pcmk_wait_dispatch: Respawning failed child process: crmd
Jun 14 10:59:08 hb0101 corosync[2269]:   [pcmk  ] utils.c:131 info: spawn_child: Forked child 2680 for process crmd
Jun 14 10:59:08 hb0101 crmd: [2680]: info: Invoked: /usr/lib64/heartbeat/crmd
Jun 14 10:59:08 hb0101 crmd: [2680]: info: main: CRM Hg Version: 9f04fa88cfd3da553e977cc79983d1c494c8b502
Jun 14 10:59:08 hb0101 crmd: [2680]: info: crmd_init: Starting crmd
Jun 14 10:59:08 hb0101 crmd: [2680]: info: G_main_add_SignalHandler: Added signal handler for signal 17
----------------------------------------------------------------------------------------------------------------------------------------

There seems to be some problem with cib on the DC node.
We would like attribute changes to complete reliably in a 16-node cluster.
 * Is this behavior a limitation of the current cib process?

I attached the logs to the following Bugzilla entry.
 * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2443

Best Regards,
Hideo Yamauchi.