[Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue
Andrew Beekhof
andrew at beekhof.net
Wed Oct 5 00:19:10 UTC 2011
On Mon, Oct 3, 2011 at 5:50 PM, Proskurin Kirill
<k.proskurin at corp.mail.ru> wrote:
> On 10/03/2011 05:32 AM, Andrew Beekhof wrote:
>>>
>>> corosync-1.4.1
>>> pacemaker-1.1.5
>>> pacemaker runs with "ver: 1"
>
>>> 2)
>>> This one is scary.
>>> Twice I have run into a situation where Pacemaker thinks a resource
>>> is started, but it is not.
>>
>> RA is misbehaving. Pacemaker will only consider a resource running if
>> the RA tells us it is (running or in a failed state).
>
> But as you can see below, the agent returns "7".
It's still broken: not one of the stop actions succeeds.
Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 4082) timed out (try 1). Killing with
signal SIGTERM (15).
Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 21859) timed out (try 1). Killing
with signal SIGTERM (15).
Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 24576) timed out (try 1). Killing
with signal SIGTERM (15).
/That/ is why Pacemaker thinks it's still running.
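For context on why those timeouts matter: a stop action must not return until the process is actually gone, and should escalate to SIGKILL itself before lrmd's timeout fires. A minimal sketch in shell (the pidfile path, wait loop, and function name are illustrative assumptions, not taken from the agent in this thread):

```shell
#!/bin/sh
# Minimal OCF-style stop action: only report success once the process is gone.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Pidfile location is an assumption for this sketch.
PIDFILE="${OCF_RESKEY_pidfile:-/var/run/demo.pid}"

stop() {
    # No pidfile: treat as already stopped (stop must be idempotent).
    [ -f "$PIDFILE" ] || return $OCF_SUCCESS
    pid=$(cat "$PIDFILE")
    kill -TERM "$pid" 2>/dev/null
    # Wait up to ~5s for a graceful exit.
    for i in 1 2 3 4 5; do
        if ! kill -0 "$pid" 2>/dev/null; then
            rm -f "$PIDFILE"
            return $OCF_SUCCESS
        fi
        sleep 1
    done
    # Escalate ourselves rather than letting lrmd time the action out.
    kill -KILL "$pid" 2>/dev/null
    rm -f "$PIDFILE"
    return $OCF_SUCCESS
}
```

The key point is the final escalation: if the agent returns only after the process is verifiably dead, Pacemaker never records a failed stop for a resource that could in fact be stopped.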
>
>>> We use a slightly modified version of the "anything" agent for our
>>> scripts, but they are aware of OCF return codes and the like.
>>>
>>> I ran a monitor with our agent from the console:
>>> # env -i OCF_ROOT=/usr/lib/ocf \
>>>     OCF_RESKEY_binfile=/usr/local/mpop/bin/my/dialogues_notify.pl \
>>>     /usr/lib/ocf/resource.d/mail.ru/generic monitor
>>> # generic[14992]: DEBUG: default monitor : 7
>>>
>>> So our agent says it is not running, but Pacemaker still thinks it
>>> is. It ran like that for 2 days until I was forced to clean it up,
>>> after which it noticed within seconds that the resource was not
>>> running.
>>
>> Did you configure a recurring monitor operation?
>
> Of course. My primitive configuration is in the original letter; it contains:
> op monitor interval="30" timeout="300" on-fail="restart" \
>
> I hit this a third time, and this time I found in the logs:
> Oct 01 02:00:12 mysender34.mail.ru pengine: [26301]: notice:
> unpack_rsc_op: Ignoring expired failure tranprocessor_stop_0 (rc=-2,
> magic=2:-2;121:690:0:4c16dc39-1fd3-41f2-b582-0236f6b6eccc) on
> mysender34.mail.ru
>
> The resource name differs because these logs are from the third
> occurrence, but the problem is the same.
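The manual cleanup mentioned above is done through the crm shell; the resource and node names below are taken from the quoted logs:

```
# Clear the stored failure/operation history for one resource on one node,
# forcing Pacemaker to re-probe the resource's real state:
crm resource cleanup tranprocessor mysender34.mail.ru
```

After a cleanup Pacemaker schedules fresh probes, which is why the "not running" state was discovered within seconds.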
>
>
>>> 3)
>>> This one is confusing and dangerous.
>>>
>>> I use failure-timeout on most resources to wipe out temporary warning
>>> messages from crm_verify -LV, which I use for monitoring the cluster.
>>> All works well, but I found this:
>>>
>>> 1) A resource can't start on a node and migrates to the next one.
>>> 2) It can't start there either, nor on any other node.
>>> 3) It gives up and stops. There are many errors about all of this in
>>> crm_verify -LV, which is good.
>>> 4) failure-timeout kicks in and... wipes out all the errors.
>>> 5) We are left with a stopped resource and all errors wiped, so we
>>> don't know whether it was stopped by an admin's hand or because of
>>> errors.
>
>>> I think failure-timeout should not apply to a stopped resource.
>>> Any chance to avoid this?
No.
>
>> Not sure why you think this is dangerous; the cluster is doing exactly
>> what you told it to.
>> If you want resources to stay stopped either set failure-timeout=0
>> (disabled) or set the target-role to Stopped.
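The two knobs mentioned above look like this in crm shell syntax (the resource name and parameters are hypothetical, for illustration only):

```
# Disable automatic failure expiry so errors are never wiped:
crm configure primitive myapp ocf:mail.ru:generic \
    params binfile=/usr/local/bin/app \
    meta failure-timeout=0 \
    op monitor interval=30 timeout=300 on-fail=restart

# Or pin the resource down explicitly:
crm resource stop myapp        # sets meta target-role=Stopped
```

With target-role=Stopped, an expiring failure-timeout cannot restart the resource, which also makes an admin-initiated stop distinguishable from a failure-driven one.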
>
> No, I want to use failure-timeout, but not to have it wipe out errors
> when the resource was already stopped by Pacemaker because of those
> errors rather than by an admin's hand.
>
> --
> Best regards,
> Proskurin Kirill
>