[ClusterLabs] Delayed first monitoring

Wed Aug 12 16:20:22 UTC 2015

On 08/12/2015 10:45 AM, Miloš Kozák wrote:
> Thank you for your answer, but:
> 
> 1) This sounds OK, but in other words it means that a delayed first
> check is not possible.
> 
> 2) Start of the init script? I use the LSB scripts from the distribution,
> so there is no way to change them (I can change them, but the changes
> will be lost on package upgrades). This is quite a typical approach; how
> can I do HA for Atlassian, for example? Jira takes 5 minutes to load.

I think your situation involves multiple issues which are worth
separating for clarity:

1. As Alexander mentioned, Pacemaker will do a monitor BEFORE trying to
start a service, to make sure it's not already running. That initial
monitor doesn't need any delay and is expected to "fail".

2. Resource agents MUST NOT return success for "start" until the service
is fully up and running, so the next monitor should succeed, again
without needing any delay. If that's not the case, it's a bug in the agent.

3. It's generally better to use OCF resource agents whenever available,
as they have better integration with pacemaker than lsb/systemd/upstart.
In this case, take a look at ocf:heartbeat:apache.

4. You can configure the timeout used with each action (start, stop,
monitor) on a given resource. The default is 20 seconds. For example,
if a "start" action is expected to take 5 minutes, you would define a
start operation on the resource with timeout=300s. How you do that
depends on your management tool (pcs, crmsh, or cibadmin).
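
As a rough sketch with pcs (assuming the resource is named httpd, as in
your original command, and using the 5-minute figure from above; exact
syntax may differ between pcs versions):

```shell
# Set a 5-minute start timeout when creating the resource ...
pcs resource create httpd lsb:httpd op start timeout=300s

# ... or add it to a resource that already exists.
pcs resource update httpd op start timeout=300s
```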

Bottom line: you should never need a delay on the monitor. Instead, set
appropriate timeouts for each action, and make sure that the agent does
not return from "start" until the service is fully up.
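
To illustrate the "start must block" rule, here is a minimal sketch of
an agent-style start function. The start_daemon and monitor functions
are hypothetical placeholders (here they just simulate a service that
needs three polls before it reports healthy), and the one-second poll
interval is an arbitrary choice; OCF_SUCCESS (0) and OCF_NOT_RUNNING (7)
are the standard OCF return codes.

```shell
#!/bin/sh
# Sketch only: start_daemon and monitor stand in for agent-specific code.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

polls=0

# Placeholder: a real agent would launch the service here.
start_daemon() {
    polls=0
}

# Placeholder health check: pretends the service is ready on the third poll.
monitor() {
    polls=$((polls + 1))
    if [ "$polls" -ge 3 ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

start() {
    start_daemon
    # Poll until the monitor passes. The loop needs no deadline of its own:
    # the cluster aborts the action once the configured start timeout expires.
    while ! monitor; do
        sleep 1
    done
    return $OCF_SUCCESS
}
```

With this shape, the start timeout is the only knob that needs tuning
for a slow service; the recurring monitor can run immediately after
start returns.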

> On 12.8.2015 at 16:14, Nekrasov, Alexander wrote:
>> 1. Pacemaker will/may call a monitor before starting a resource, in
>> which case it expects a NOT_RUNNING response. It's just checking
>> assumptions at that point.
>>
>> 2. A resource::start must only return when resource::monitor is
>> successful. Basically the logic of a start() must follow this:
>>
>> start() {
>>     start_daemon            # launch the service (agent-specific)
>>     while ! monitor; do     # poll until the monitor succeeds
>>         sleep 1
>>     done
>>     return $OCF_SUCCESS
>> }
>>
>>> -----Original Message-----
>>> From: Miloš Kozák [mailto:milos.kozak at lejmr.com]
>>> Sent: Wednesday, August 12, 2015 10:03 AM
>>> To: users at clusterlabs.org
>>> Subject: [ClusterLabs] Delayed first monitoring
>>>
>>> Hi,
>>>
>>> I have set up Corosync+CMAN+Pacemaker on CentOS 6.5 in order to
>>> provide high availability for OpenNebula. However, I am facing a
>>> strange problem which arises from my lack of knowledge.
>>>
>>> In the log I can see that when I create a resource based on an init
>>> script, typically:
>>>
>>> pcs resource create httpd lsb:httpd
>>>
>>> The httpd daemon gets started, but a monitor is initiated at the same
>>> time and the resource is identified as not running. This behaviour
>>> makes sense, given that starting the daemon takes some time. In this
>>> particular case, I get error code 2, which means that the process is
>>> running but the environment is not locked. The effect is that the httpd
>>> resource gets restarted.
>>>
>>> My workaround is an extra sleep in the status function of the init
>>> script, but I don't like this solution at all! Do you have any idea how
>>> to tackle this problem in a proper way? I expected an op attribute that
>>> would specify a delay between service start and the first monitor, but
>>> I could not find it.
>>>
>>> Thank you, Milos