[Pacemaker] resource starts but then fails right away

Brian J. Murrell brian at interlinx.bc.ca
Fri May 10 11:23:57 UTC 2013


On 13-05-09 09:53 PM, Andrew Beekhof wrote:
> 
> May  7 02:36:16 node1 crmd[16836]:     info: delete_resource: Removing resource testfs-resource1 for 18002_crm_resource (internal) on node1
> May  7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running, flush delayed
> May  7 02:36:16 node1 crmd[16836]:     info: lrm_remove_deleted_op: Removing op testfs-resource1_monitor_0:8 for deleted resource testfs-resource1
> 
> So apparently a badly timed cleanup was run.

:-(  I didn't know there could be such timing problems.  I might have to
change my process a bit then.

> Did you do that or was it the crm shell?

That was "me" doing a "crm resource cleanup" (soon to become
"crm_resource -r ... --cleanup").  The process is typically:

- create resource
- start resource
- wait for resource to start

where "start resource" is:
- "clean it to start with a known clean resource"
  (crm resource cleanup)
- "start resource"
  (crm_resource -r ... -p target-role -m -v Started)

and "wait for resource" is a loop of "crm resource status ..." (soon to
be "crm_resource -r ... --locate")
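
To make the sequence concrete, here is a rough sketch of the "start
resource" and "wait for resource" steps as I described them.  It is not
my actual script: the function names, the timeout and the polling
interval are made up for illustration, and the check for "is running on"
in the --locate output is an assumption about what crm_resource prints.
It also does nothing to avoid the cleanup timing problem discussed
above; it only shows the order of operations.

```python
import subprocess
import time


def run_command(cmd):
    """Run a command and return (exit code, stdout)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout


def clean_and_start(name, run=run_command):
    """The "start resource" step: clean first, then ask for Started."""
    # "clean it to start with a known clean resource"
    run(["crm_resource", "-r", name, "--cleanup"])
    # set the target-role meta attribute to Started
    run(["crm_resource", "-r", name, "-p", "target-role", "-m", "-v", "Started"])


def wait_for_resource(name, run=run_command, timeout=60.0, interval=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """The "wait for resource" step: poll --locate until running or timed out."""
    deadline = clock() + timeout
    while True:
        rc, out = run(["crm_resource", "-r", name, "--locate"])
        # Assumption: a running resource is reported as "... is running on: <node>"
        if rc == 0 and "is running on" in out:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The create step would call clean_and_start("some-resource") right after
the cibadmin create and then wait_for_resource("some-resource").  The
run argument is only there so the command runner can be swapped out
when trying the logic without a cluster.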

So the create, clean, start operations happen in quite quick succession
(i.e. scripted).  Is that pathological?  Is a clean between create and
start known to be problematic?

FWIW, the reason for the clean before the start, even right after
creating the resource, is that "clean" and "start" are lumped together
into a function that is called after create but can also be called at
other times during the life-cycle, when a clean might really be needed
before trying to start the resource.  I was hoping that cleaning a
just-created resource would be effectively a NOOP.

I guess for completeness, I should add here that creating the resource
is a "cibadmin -o resources -C ..." operation.

> If the machine is heavily loaded, or just very busy with file I/O, that can still take quite a long time.

Yeah, not very loaded at all, especially at this point.  This is all
happening before anything really gets started on the machine... this is
the process of getting the resources up and running and the machine is
dedicated to running the tasks associated with these resources.

Cheers,
b.

