[Pacemaker] long time to start

Mon Apr 19 02:42:03 EDT 2010

On Fri, Apr 16, 2010 at 9:28 PM, Schaefer, Diane E
<diane.schaefer at unisys.com> wrote:
> Hi,
>
>   I have a resource (my own OCF agent) that can sometimes take 10 minutes
> to start after a failure, because log records need to be synced.  I noticed
> that while its start action is being performed, if other resources in my
> cluster report “not running”, no restart is attempted until the start of the
> long-running resource returns.  Meanwhile, crm_mon reports those resources as
> “started” even though they are not running, and may not be for many minutes.

Does your RA return from the start action immediately or after the
sync is complete and the service is truly started?
It _must_ do only the latter.
Doing the former would explain what you're seeing.
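
To illustrate what "only the latter" means, here is a minimal sketch of a
blocking start action. It is not taken from your RA: the launch and monitor
callables, the 600 second timeout, and the 1 second poll interval are all
assumptions made up for the example. Only the numeric OCF return codes are
standard. The idea is that start launches the service, then polls its own
monitor check, and reports success only once that check passes.

```python
import time

# Standard OCF return codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def start(launch, monitor, timeout_s=600.0, poll_s=1.0,
          sleep=time.sleep, clock=time.monotonic):
    """Hypothetical blocking start action.

    launch  -- callable that kicks off the service (assumed, not a real API)
    monitor -- callable returning an OCF code for the service's current state
    """
    # Start should be idempotent: if the service is already up, say so.
    if monitor() == OCF_SUCCESS:
        return OCF_SUCCESS

    launch()

    # Do not return until the service is truly started (sync finished).
    deadline = clock() + timeout_s
    while clock() < deadline:
        if monitor() == OCF_SUCCESS:
            return OCF_SUCCESS
        sleep(poll_s)

    # Still not up when our own deadline passed: report a failed start.
    return OCF_ERR_GENERIC
```

If you do something like this, the start timeout configured for the resource
in the cluster has to be at least as long as the worst-case sync, otherwise
the start will be treated as failed before it can finish.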

> Is the lrm process single-threaded?  Would running my resource's start
> action asynchronously be a better strategy?  I am concerned that other
> critical resources will not be restarted if they fail while the
> long-starting one is being restarted.  Is the resource state (started, not
> running, or failed) determined by the result of start rather than monitor?
>
>
>
> Thanks for any information.
>
> Diane Schaefer