[Pacemaker] Best way to check if PM is alive

Thu Dec 9 11:58:33 UTC 2010

On Thu, Dec 9, 2010 at 2:32 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
> On Thu, Dec 9, 2010 at 12:14 PM, Evgeniy Ivanov <lolkaantimat at gmail.com> wrote:
>> Hi,
>>
>> What is the best way to check if PM is still alive?
>
> "ps axf | grep crmd" is one approach

That only shows that the crmd process exists; it says nothing about
its state. In theory crmd could be stuck in some internal logic (an
endless loop, for example). So we need a way to ask "Hey, PM! Are
your brains still OK?".
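
A rough sketch of what I mean, as opposed to just grepping the process
list: run some query that crmd itself has to answer, and put a hard
time limit on it so that a hung daemon counts as a failure. The helper
name "probe" is made up for illustration, and timeout(1) from GNU
coreutils is assumed to be available. Using crmadmin -S as the query
is only a suggestion; whether its exit status really reflects a
missing reply needs to be checked on the installed version.

```shell
#!/bin/sh
# probe LIMIT CMD [ARGS...]   (hypothetical helper, not part of Pacemaker)
# Runs CMD with a time limit of LIMIT seconds, discarding its output.
# Returns 0 only if CMD finished in time and exited 0. A hung CMD is
# killed by timeout(1), which then exits with status 124.
probe() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
}

# Possible usage (assumes crmadmin is installed and that its -S option
# asks the crmd on the named node for its status):
#   if probe 10 crmadmin -S "$(uname -n)"; then
#       echo "crmd answered"
#   else
#       echo "crmd did not answer within 10s"
#   fi
```

Unlike "ps axf | grep crmd", this fails when the process exists but no
longer responds, which is the case we care about.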

>> We tried the following approach: there is a softdog timer (max value
>> is 300s + an extra 60s to give PM another chance) that is initially
>> started and checked by a third party. A clone named HA_alive fails in
>> monitor (except the first time); the monitor interval is 200s.
>> HA_alive:start should reset that softdog timer. It looks like
>> sometimes PM doesn't restart the failed resource within those 360s
>> for no apparent reason: the system is almost idle.
>
> Strange.  Should work. Details?

It's a dual-node cluster based on openais-0.80.3-26.1 and
pacemaker-1.0.3-4.1. The solution I described worked fine on my
cluster, but regularly failed for no apparent reason on some other
clusters. The logs (/var/log/messages) say that PM noticed a failure
in monitor, but afterwards it didn't restart the HA_alive resource
(no stop and no start), so after 360s the system died. I didn't
notice anything else in the logs...
I will be able to share some /var/log/messages if I get access to
the failed clusters.

-- 
Evgeniy Ivanov