[Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue
Proskurin Kirill
k.proskurin at corp.mail.ru
Wed Oct 5 15:47:00 UTC 2011
On 10/05/2011 04:19 AM, Andrew Beekhof wrote:
> On Mon, Oct 3, 2011 at 5:50 PM, Proskurin Kirill
> <k.proskurin at corp.mail.ru> wrote:
>> On 10/03/2011 05:32 AM, Andrew Beekhof wrote:
>>>>
>>>> corosync-1.4.1
>>>> pacemaker-1.1.5
>>>> pacemaker runs with "ver: 1"
>>
>>>> 2)
>>>> This one is scary.
>>>> I twice run on situation then pacemaker thinks what resource is started
>>>> but
>>>> it is not.
>>>
>>> RA is misbehaving. Pacemaker will only consider a resource running if
>>> the RA tells us it is (running or in a failed state).
>>
>> But you can see below, what agent return "7".
>
> Its still broken. Not one stop action succeeds.
>
> Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN:
> tranprocessor:stop process (PID 4082) timed out (try 1). Killing with
> signal SIGTERM (15).
> Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN:
> tranprocessor:stop process (PID 21859) timed out (try 1). Killing
> with signal SIGTERM (15).
> Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN:
> tranprocessor:stop process (PID 24576) timed out (try 1). Killing
> with signal SIGTERM (15).
>
> /That/ is why pacemaker thinks its still running.
I ran an experiment.
I wrote a script that does not die on SIGTERM:
#!/usr/bin/perl
$SIG{TERM} = "IGNORE"; sleep 1 while 1
And ran it under Pacemaker.
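The behaviour the script relies on can also be checked outside the cluster. The sketch below is only an illustration: it uses a small shell loop as a stand-in for test-kill-15.pl (an assumption, so that it runs without Perl), sends it SIGTERM, which is the signal lrmd reports using in the logs quoted above, and then SIGKILL. The variable names are made up for the example.

```shell
#!/bin/sh
# Stand-in for test-kill-15.pl: a child shell that ignores SIGTERM.
sh -c 'trap "" TERM; while :; do sleep 1; done' &
pid=$!
sleep 1                          # give the child time to install the trap

kill -TERM "$pid"                # polite termination request
sleep 2
if kill -0 "$pid" 2>/dev/null; then
    after_term="alive"           # SIGTERM was ignored
else
    after_term="dead"
fi
echo "after SIGTERM: $after_term"

kill -KILL "$pid"                # SIGKILL cannot be caught or ignored
wait "$pid" 2>/dev/null || true  # reap the child; ignore its exit status
if kill -0 "$pid" 2>/dev/null; then
    after_kill="alive"
else
    after_kill="dead"
fi
echo "after SIGKILL: $after_kill"
```

A process like this should survive the first signal and disappear only after the second, so any stop action that relies on SIGTERM alone will keep waiting until its timeout expires.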
I ran 3 tests:
1) primitive test-kill-15.pl ocf:mail.ru:generic \
       op monitor interval="20" timeout="5" on-fail="restart" \
       params binfile="/tmp/test-kill-15.pl" external_pidfile="1"
2) The same, but with on-fail="block".
3) The same, but with meatware stonith configured.
Each time I ran:
crm resource stop test-kill-15.pl
In cases 1 and 2 the resource became "unmanaged".
In case 3 stonith was triggered.
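One possible reading of these results, offered as an assumption rather than a diagnosis: the on-fail setting in the tests is attached to the monitor operation, so it does not decide what happens when stop fails. As far as the Pacemaker documentation describes it, the stop operation has its own on-fail, which defaults to fence when stonith is enabled and to block otherwise. That would match cases 1 and 2 as well as case 3. A sketch that makes the stop handling explicit follows; the stop timeout of 30 is an arbitrary value chosen for illustration, and the command needs a live cluster to run.

```shell
# Sketch only: same resource as in test 1, plus an explicit stop operation.
# timeout="30" is an illustrative assumption, not a recommended value.
crm configure primitive test-kill-15.pl ocf:mail.ru:generic \
    op monitor interval="20" timeout="5" on-fail="restart" \
    op stop interval="0" timeout="30" on-fail="block" \
    params binfile="/tmp/test-kill-15.pl" external_pidfile="1"
```

With on-fail="block" on stop, a failed stop should leave the resource where it is and perform no further operations on it, which is the "unmanaged" state seen above.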
From IRC:
(12:20:44 PM) beekhof: Oloremo: what the hell is the cluster supposed to
do if stop fails and you dont want fencing? it cant start it anywhere
because its still active in the original location
(12:30:09 PM) Oloremo: I get the point, really. But may be it should
make it unmanaged?
And it does.
So can I assume that my monitoring problem is still unexplained? In my
case the resource does not become "unmanaged": Pacemaker just thinks the
resource is started when it is not.
--
Best regards,
Proskurin Kirill