[Pacemaker] What is the reason which the node in which failure has not occurred carries out "lost"?

Fri Mar 7 02:43:20 UTC 2014

On 26 Feb 2014, at 5:25 pm, yusuke iida <yusk.iida at gmail.com> wrote:

> Hi, Andrew
> 
> 2014-02-21 10:47 GMT+09:00 Andrew Beekhof <andrew at beekhof.net>:
>> 
>> On 20 Feb 2014, at 8:39 pm, yusuke iida <yusk.iida at gmail.com> wrote:
>> 
>>> Hi, Andrew
>>> 
>>> 2014-02-20 17:28 GMT+09:00 Andrew Beekhof <andrew at beekhof.net>:
>>>> Who was pid 16243?
>>>> Doesn't look like a pacemaker daemon.
>>> pid 16243 is crm_mon.
>> 
>> That means that the state displayed by crm_mon was > 500 updates behind.
>> At that point, what it's displaying is horribly out of date and evicting it seems like a pretty good idea.
>> 
>>> In vm01, crm_mon was started and the state was checked.
>>> 
>>> If any other information is needed for the analysis, I will get it.
>> 
>> Some idea of what crm_mon is doing would be a good start.
>> Adding a few -V options in addition to --disable-ncurses might be the best approach.
> I ran the following command to get a log from crm_mon:
> crm_mon -VVVV --disable-ncurses >crm_mon.log 2>&1
> I have attached it.
> 
> BTW,
> I tested with the following patch of yours applied:
> https://github.com/beekhof/pacemaker/commit/4002e4ab6a50ceb44e484613f2abd33e490492a7
> 
> The load on stonithd fell and the queue stopped overflowing.
> This patch looks very effective.
> 
> Is it possible to implement similar processing in crm_mon?

I don't understand... crm_mon doesn't look for changes to resources or constraints and it should already be using the new faster diff format.

[/me reads attachment]

Ah, but perhaps I do understand after all :-)

This is repeated over and over:

  notice: crm_diff_update: 	[cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
  notice: xml_patch_version_check: 	Current num_updates is too high (885 > 67)

That would certainly drive up CPU usage and cause crm_mon to get left behind.
Happily the fix for that should be: https://github.com/beekhof/pacemaker/commit/6c33820
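
To illustrate why that second message keeps repeating, here is a minimal Python sketch of a diff source-version check. It assumes only that a CIB version is the triple (admin_epoch, epoch, num_updates) and that a diff can be applied only when the local copy is at exactly the version the diff was generated from. The function name and the admin_epoch/epoch values are made up for the example; this is not the code behind xml_patch_version_check.

```python
from typing import NamedTuple, Optional


class CibVersion(NamedTuple):
    """CIB version triple, compared field by field in this order."""
    admin_epoch: int
    epoch: int
    num_updates: int


def check_patch_source(current: CibVersion, source: CibVersion) -> Optional[str]:
    """Hypothetical helper: return None if a diff generated from `source`
    can be applied to a copy at `current`, else a reason string."""
    for field in CibVersion._fields:
        have = getattr(current, field)
        want = getattr(source, field)
        if have > want:
            return f"Current {field} is too high ({have} > {want})"
        if have < want:
            return f"Current {field} is too low ({have} < {want})"
    return None


if __name__ == "__main__":
    # admin_epoch and epoch are invented; only 885 and 67 come from the log.
    local = CibVersion(admin_epoch=0, epoch=10, num_updates=885)
    diff_source = CibVersion(admin_epoch=0, epoch=10, num_updates=67)
    print(check_patch_source(local, diff_source))
```

With the numbers from the log, every incoming diff is rejected because the local copy is already past the version the diff expects, so the client does the work of a failed patch on each notification without ever catching up.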

> 
> Regards,
> Yusuke
>> 
>>> 
>>> Regards,
>>> Yusuke
>>>> 
>>>>> 
>>>>> On vm09, the queue between cib and stonithd overflowed:
>>>>> Feb 20 14:20:22 [15519] vm09        cib: (       ipc.c:506   ) trace: crm_ipcs_flush_events:  Sent 36 events (530 remaining) for 0x105ec10[15520]: Resource temporarily unavailable (-11)
>>>>> Feb 20 14:20:22 [15519] vm09        cib: (       ipc.c:515   ) error: crm_ipcs_flush_events:  Evicting slow client 0x105ec10[15520]: event queue reached 530 entries
>>>>> 
>>>>> I looked at the relevant code, but I could not work out how to
>>>>> solve this.
>>>>> 
>>>>> Is sending 100 messages at a time too few?
>>>>> Is there a problem with how the wait time after sending messages is calculated?
>>>>> Could the threshold of 500 be too low?
>>>> 
>>>> Being 500 behind is really quite a long way.
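
As a rough illustration of the eviction rule being discussed (not the actual ipc.c code), the Python sketch below models a per-client event queue. The batch size of 100 and the threshold of 500 are the figures mentioned in this thread. The class and method names, and the choice of "strictly greater than" for the comparison, are assumptions made for the example.

```python
from collections import deque

MAX_BATCH = 100        # figure from the thread: messages sent per flush
EVICT_THRESHOLD = 500  # figure from the thread: tolerated backlog


class SlowClientEvicted(Exception):
    """Raised when a client's backlog exceeds the threshold."""


class ClientQueue:
    """Hypothetical model of one client's pending event queue."""

    def __init__(self):
        self.events = deque()

    def enqueue(self, event):
        self.events.append(event)

    def flush(self, can_accept):
        """Send up to MAX_BATCH events, limited by what the client will
        take right now, then evict if the backlog is still too long."""
        limit = min(MAX_BATCH, can_accept)
        sent = 0
        while self.events and sent < limit:
            self.events.popleft()
            sent += 1
        if len(self.events) > EVICT_THRESHOLD:
            raise SlowClientEvicted(
                f"event queue reached {len(self.events)} entries")
        return sent


if __name__ == "__main__":
    q = ClientQueue()
    for i in range(566):
        q.enqueue(i)
    try:
        q.flush(can_accept=36)  # the client only takes 36, as in the log
    except SlowClientEvicted as err:
        print(f"Evicting slow client: {err}")
```

In this model, raising the threshold only delays the eviction: a client that consumes events more slowly than they are produced will reach any fixed limit eventually.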
>>> 
>>> 
>>> 
>>> 
>>> --
>>> ----------------------------------------
>>> METRO SYSTEMS CO., LTD
>>> 
>>> Yusuke Iida
>>> Mail: yusk.iida at gmail.com
>>> ----------------------------------------
>>> 
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>> 
>> 
> 
> 
> 
> <crm_mon.log>
