[Pacemaker] What is the reason which the node in which failure has not occurred carries out "lost"?

Wed Feb 26 01:25:52 EST 2014

Hi, Andrew

2014-02-21 10:47 GMT+09:00 Andrew Beekhof <andrew at beekhof.net>:
>
> On 20 Feb 2014, at 8:39 pm, yusuke iida <yusk.iida at gmail.com> wrote:
>
>> Hi, Andrew
>>
>> 2014-02-20 17:28 GMT+09:00 Andrew Beekhof <andrew at beekhof.net>:
>>> Who was pid 16243?
>>> Doesn't look like a pacemaker daemon.
>> pid 16243 is crm_mon.
>
> That means that the state displayed by crm_mon was more than 500 updates behind.
> At that point, what it's displaying is horribly out of date and evicting it seems like a pretty good idea.
>
>> crm_mon was started on vm01 to check the cluster state.
>>
>> If any other information is needed for analysis, I can get it.
>
> Some idea of what crm_mon is doing would be a good start.
> Adding a few -V options in addition to --disable-ncurses might be the best approach.
I ran the following command to capture a log from crm_mon:
crm_mon -VVVV --disable-ncurses >crm_mon.log 2>&1
The log is attached.
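
When reading the cib-side logs for this problem, a small script can help pick out the evictions. The following is a minimal Python sketch, assuming the message text matches the "Evicting slow client" line quoted further down in this mail and that wrapped log lines have already been joined. The name `find_evictions` is hypothetical and is not part of Pacemaker.

```python
import re

# Pattern modelled on the eviction message quoted in this thread.
# The exact log format is an assumption and may differ between versions.
EVICT_RE = re.compile(
    r"Evicting slow client (?P<client>0x[0-9a-fA-F]+)\[(?P<pid>\d+)\]:\s*"
    r"event queue reached (?P<entries>\d+) entries"
)


def find_evictions(text):
    """Return (client, pid, entries) for every eviction message in text."""
    return [
        (m.group("client"), int(m.group("pid")), int(m.group("entries")))
        for m in EVICT_RE.finditer(text)
    ]
```

Passing the contents of a log file to `find_evictions` gives one tuple per evicted client, so the pid can be matched against the process list to see which client was too slow.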

BTW,
I tested with the following patch of yours applied:
https://github.com/beekhof/pacemaker/commit/4002e4ab6a50ceb44e484613f2abd33e490492a7

The load on stonithd dropped and the queue no longer overflows.
This patch looks very effective.

Would it be possible to implement similar handling in crm_mon?
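
To make the behaviour under discussion concrete, here is a rough Python model of a per-client event queue that evicts a client once its backlog passes a threshold. It is only an illustrative sketch: it is not Pacemaker's actual code and not the patch above. The class name `ClientQueue`, its methods, and the default of 500 (taken from the figure mentioned in this thread) are all assumptions.

```python
from collections import deque

# Backlog limit; 500 is the figure discussed in this thread (assumed here).
MAX_BACKLOG = 500


class ClientQueue:
    """Hypothetical model of one IPC client's pending event queue."""

    def __init__(self, max_backlog=MAX_BACKLOG):
        self.events = deque()
        self.max_backlog = max_backlog
        self.evicted = False

    def push(self, event):
        """Queue an event; evict the client if the backlog exceeds the limit."""
        if self.evicted:
            return False
        self.events.append(event)
        if len(self.events) > self.max_backlog:
            # The client is too far behind: drop it and its pending events.
            self.evicted = True
            self.events.clear()
            return False
        return True

    def flush(self, can_accept):
        """Deliver at most can_accept queued events and return them."""
        sent = []
        while self.events and len(sent) < can_accept:
            sent.append(self.events.popleft())
        return sent
```

In this model a client that drains its queue more slowly than events arrive will eventually cross the limit, which is the situation the cib log reports for a slow client.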

Regards,
Yusuke
>
>>
>> Regards,
>> Yusuke
>>>
>>>>
>>>> On vm09, the event queue between cib and stonithd overflowed.
>>>> Feb 20 14:20:22 [15519] vm09        cib: (       ipc.c:506   ) trace: crm_ipcs_flush_events:  Sent 36 events (530 remaining) for 0x105ec10[15520]: Resource temporarily unavailable (-11)
>>>> Feb 20 14:20:22 [15519] vm09        cib: (       ipc.c:515   ) error: crm_ipcs_flush_events:  Evicting slow client 0x105ec10[15520]: event queue reached 530 entries
>>>>
>>>> I looked at the relevant code, but could not work out how this
>>>> should be fixed.
>>>>
>>>> Is sending at most 100 messages at a time too few?
>>>> Is there a problem with how the wait time after sending messages is calculated?
>>>> Or is the threshold of 500 too low?
>>>
>>> Being 500 behind is really quite a long way.
>>
>>
>>
>>
>> --
>> ----------------------------------------
>> METRO SYSTEMS CO., LTD
>>
>> Yusuke Iida
>> Mail: yusk.iida at gmail.com
>> ----------------------------------------
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>

-- 
----------------------------------------
METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.iida at gmail.com
----------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: crm_mon.log
Type: application/octet-stream
Size: 51907 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20140226/f46e82fa/attachment-0003.obj>