[Pacemaker] If 256 resources are loaded, crmd will reboot.

Thu May 29 23:35:11 UTC 2014

On 29 May 2014, at 8:43 pm, Yusuke Iida <yusk.iida at gmail.com> wrote:

> Hi, Andrew
> 
> 2014-05-29 15:30 GMT+09:00 Andrew Beekhof <andrew at beekhof.net>:
>> 
>> On 29 May 2014, at 3:40 pm, Yusuke Iida <yusk.iida at gmail.com> wrote:
>> 
>>> Hi, Andrew
>>> 
>>> 2014-05-29 14:00 GMT+09:00 Andrew Beekhof <andrew at beekhof.net>:
>>>> 
>>>> On 29 May 2014, at 12:28 pm, Yusuke Iida <yusk.iida at gmail.com> wrote:
>>>> 
>>>>> Hi, Andrew
>>>>> 
>>>>> I'm sorry.
>>>>> It seems that syslog wrote the node names in a different notation.
>>>>> To avoid any misunderstanding, I have collected a new report.
>>>>> I think the symptoms appear in vm02/ha-log.
>>>> 
>>>> Got it :)
>>>> 
>>>> Ok, step 1 - stop logging debug.
>>>> Debug is accounting for 30% of the logs and all that writing to disk would be adding significantly to the cluster's workload.
>>> I understand.
>>> 
>>>> 
>>>> Question:  How have you got logging configured? Anything in /etc/sysconfig/pacemaker ?
>>>> 
>>>> I ask because pacemaker.log appears to have a jumble of syslog and regular file output:
>>>> 
>>>> May 29 10:45:26 vm02 cib[25603]:     info: cib_perform_op: +  /cib:  @num_updates=1295
>>>> May 29 10:45:26 [25603] vm02        cib:     info: cib_perform_op:      +  /cib:  @num_updates=1295
>>> The position of the pid differs, although I had rarely paid attention to it.
>>> I have attached the /etc/sysconfig/pacemaker from my environment.
>> 
>> The format isn't a problem; it just indicates that there are two mechanisms logging to the same place.
>> So it's redundant.
>> 
>> The question is... how? Your configs look fine to me :-/
> This was a configuration mistake on my side.
> syslog was set up to write "local1.*" to "/var/log/pacemaker.log".
> I am sorry for causing confusion.

That's OK. It won't hurt the cluster, but you might as well turn off file logging in such a scenario.
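
The two entry layouts quoted further down (one with "daemon[pid]:" after the host name, one with "[pid]" before it) are what reveal the double logging. As a minimal sketch, assuming every entry starts with a 15-character "Mon DD HH:MM:SS" timestamp as in the quoted lines, a line could be attributed to one writer or the other like this. The names classify_log_line and log_origin are made up for illustration and are not part of Pacemaker:

```c
#include <assert.h>
#include <string.h>

#define TIMESTAMP_LEN 15 /* e.g. "May 29 10:45:26" */

/* Hypothetical labels for the two writers seen in pacemaker.log. */
enum log_origin { ORIGIN_UNKNOWN, ORIGIN_SYSLOG, ORIGIN_FILE };

/*
 * Guess which mechanism wrote a log line, based only on its layout:
 *   syslog:       "<timestamp> host daemon[pid]: ..."
 *   file logging: "<timestamp> [pid] host daemon: ..."
 */
enum log_origin classify_log_line(const char *line)
{
    const char *rest;
    const char *colon;
    const char *bracket;

    assert(line != NULL);

    /* Need a timestamp, a separating space, and at least one more byte. */
    if (strlen(line) <= TIMESTAMP_LEN + 1 || line[TIMESTAMP_LEN] != ' ') {
        return ORIGIN_UNKNOWN;
    }
    rest = line + TIMESTAMP_LEN + 1;

    /* File logging puts "[pid]" directly after the timestamp. */
    if (rest[0] == '[') {
        return ORIGIN_FILE;
    }

    /* syslog puts "[pid]" just before the first colon of the tag. */
    colon = strchr(rest, ':');
    bracket = strchr(rest, '[');
    if (colon != NULL && bracket != NULL && bracket < colon) {
        return ORIGIN_SYSLOG;
    }
    return ORIGIN_UNKNOWN;
}
```

Running each line of the file through classify_log_line() and seeing both ORIGIN_SYSLOG and ORIGIN_FILE results would indicate that two mechanisms are writing to the same file.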

>>>> Step 2 - can you try this patch:
>>>> 
>>>> diff --git a/crmd/te_callbacks.c b/crmd/te_callbacks.c
>>>> index 4d330a6..eba5f11 100644
>>>> --- a/crmd/te_callbacks.c
>>>> +++ b/crmd/te_callbacks.c
>>>> @@ -381,12 +381,15 @@ te_update_diff(const char *event, xmlNode * msg)
>>>> 
>>>>        } else if(strstr(xpath, "/cib/configuration")) {
>>>>            abort_transition(INFINITY, tg_restart, "Non-status change", change);
>>>> +            break; /* Wont be packaged with any resource operations we may be waiting for */
>>>> 
>>>>        } else if(strstr(xpath, "/"XML_CIB_TAG_TICKETS) || safe_str_eq(name, XML_CIB_TAG_TICKETS)) {
>>>>            abort_transition(INFINITY, tg_restart, "Ticket attribute change", change);
>>>> +            break; /* Wont be packaged with any resource operations we may be waiting for */
>>>> 
>>>>        } else if(strstr(xpath, "/"XML_TAG_TRANSIENT_NODEATTRS"[") || safe_str_eq(name, XML_TAG_TRANSIENT_NODEATTRS)) {
>>>>            abort_transition(INFINITY, tg_restart, "Transient attribute change", change);
>>>> +            break; /* Wont be packaged with any resource operations we may be waiting for */
>>>> 
>>>>        } else if(strstr(xpath, "/"XML_LRM_TAG_RSC_OP"[") && safe_str_eq(op, "delete")) {
>>>>            crm_action_t *cancel = NULL;
>>> 
>>> Thank you for the patch.
>>> I will check how it behaves and reply.
>> 
>> Do you mean it works now?
> I think the patch works without any problems.
> When a configuration is loaded, abort_transition() is now called
> only once.
> I would like this fix to be included in Pacemaker-1.1.12.

Done :)
Thanks for helping to track it down!
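
For readers following along, here is a minimal sketch of why the added "break" statements matter. It assumes the patched branches sit inside a loop over the individual changes in a CIB update, as the breaks suggest. It is a simulation, not Pacemaker code: struct change, count_aborts() and aborts_for_bulk_load() are hypothetical names, and a counter stands in for abort_transition().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BULK_RESOURCES 256

/* Hypothetical stand-in for one entry in a CIB diff; not a Pacemaker type. */
struct change {
    const char *xpath;
};

/*
 * Walk a list of changes and count how often the transition would be
 * aborted. With stop_on_abort set, the loop exits after the first
 * configuration change, mirroring the "break" added by the patch.
 */
int count_aborts(const struct change *changes, size_t n, int stop_on_abort)
{
    int aborts = 0;
    size_t i;

    assert(changes != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (strstr(changes[i].xpath, "/cib/configuration") != NULL) {
            aborts++; /* stands in for abort_transition() */
            if (stop_on_abort) {
                break;
            }
        }
    }
    return aborts;
}

/* Simulate one update that carries BULK_RESOURCES configuration changes. */
int aborts_for_bulk_load(int stop_on_abort)
{
    struct change changes[BULK_RESOURCES];
    size_t i;

    for (i = 0; i < BULK_RESOURCES; i++) {
        changes[i].xpath = "/cib/configuration/resources";
    }
    return count_aborts(changes, BULK_RESOURCES, stop_on_abort);
}
```

In this simulation aborts_for_bulk_load(0) counts one abort per change, while aborts_for_bulk_load(1) counts a single abort, which matches the behaviour reported with the patch applied.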

> 
> A report taken with the patch applied is attached.
> https://drive.google.com/file/d/0BwMFJItoO-fVWWV0VmxqclMzT2M/edit?usp=sharing
> 
> 
> Regards,
> Yusuke
>> 
>>> 
>>> Regards,
>>> Yusuke
>>>> 
>>>> 
>>>>> 
>>>>> May 29 10:43:37 vm02 crmd[25608]:    error: config_query_callback: Local CIB query resulted in an error: Timer expired
>>>>> May 29 10:43:37 vm02 crmd[25608]:     info: register_fsa_error_adv: Resetting the current action list
>>>>> May 29 10:43:37 vm02 crmd[25608]:    error: do_log: FSA: Input I_ERROR from config_query_callback() received in state S_POLICY_ENGINE
>>>>> May 29 10:43:37 vm02 crmd[25608]:  warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=config_query_callback ]
>>>>> May 29 10:43:37 vm02 crmd[25608]:  warning: do_recover: Fast-tracking shutdown in response to errors
>>>>> May 29 10:43:37 vm02 crmd[25608]:  warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
>>>>> 
>>>>> https://drive.google.com/file/d/0BwMFJItoO-fVSEd2MkRiOGxkelk/edit?usp=sharing
>>>>> 
>>>>> Regards,
>>>>> Yusuke
>>>>> 
>>>>> 2014-05-29 10:26 GMT+09:00 Andrew Beekhof <andrew at beekhof.net>:
>>>>>> 
>>>>>> On 28 May 2014, at 6:42 pm, Yusuke Iida <yusk.iida at gmail.com> wrote:
>>>>>> 
>>>>>>> Hi, Andrew
>>>>>>> 
>>>>>>> I used crmsh to load a configuration that starts 256 resources into the cluster.
>>>>>>> At that point, crmd changed to the S_RECOVERY state and restarted.
>>>>>>> 
>>>>>>> May 28 17:08:00 [14194] vm02       crmd:    error: config_query_callback: Local CIB query resulted in an error: Timer expired
>>>>>>> May 28 17:08:00 [14194] vm02       crmd:     info: register_fsa_error_adv: Resetting the current action list
>>>>>>> May 28 17:08:00 [14194] vm02       crmd:    error: do_log: FSA: Input I_ERROR from config_query_callback() received in state S_POLICY_ENGINE
>>>>>>> May 28 17:08:00 [14194] vm02       crmd:  warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=config_query_callback ]
>>>>>>> May 28 17:08:00 [14194] vm02       crmd:  warning: do_recover: Fast-tracking shutdown in response to errors
>>>>>>> May 28 17:08:00 [14194] vm02       crmd:  warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
>>>>>>> 
>>>>>>> I think crmd cannot process the large number of queries that are issued.
>>>>>>> Before cib_performance was implemented, abort_transition() was called only once.
>>>>>>> 
>>>>>>> Is this corrected?
>>>>>>> 
>>>>>>> A report taken when the problem occurred is attached.
>>>>>>> https://drive.google.com/file/d/0BwMFJItoO-fVX0gxM1ptcE52WWs/edit?usp=sharing
>>>>>> 
>>>>>> That doesn't appear to match the symptoms above.
>>>>>> 
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Yusuke
>>>>>>> --
>>>>>>> ----------------------------------------
>>>>>>> METRO SYSTEMS CO., LTD
>>>>>>> 
>>>>>>> Yusuke Iida
>>>>>>> Mail: yusk.iida at gmail.com
>>>>>>> ----------------------------------------
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>> 
>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>> Bugs: http://bugs.clusterlabs.org
