[Pacemaker] resource starts but then fails right away

Fri May 10 01:53:07 UTC 2013

On 10/05/2013, at 12:26 AM, Brian J. Murrell <brian at interlinx.bc.ca> wrote:

> I do see the:
> 
> May  7 02:37:32 node1 crmd[16836]:    error: print_elem: Aborting transition, action lost: [Action 5]: In-flight (id: testfs-resource1_monitor_0, loc: node1, priority: 0)
> 
> in the log.  Is that the root cause of the problem?  

Ordinarily I'd have said yes, but I also see:

May  7 02:36:16 node1 crmd[16836]:     info: delete_resource: Removing resource testfs-resource1 for 18002_crm_resource (internal) on node1
May  7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running, flush delayed
May  7 02:36:16 node1 crmd[16836]:     info: lrm_remove_deleted_op: Removing op testfs-resource1_monitor_0:8 for deleted resource testfs-resource1

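So apparently a badly timed cleanup was run: the resource was removed while its initial monitor (testfs-resource1_monitor_0) was still running, which would explain the later "action lost" error.  Did you do that, or was it the crm shell?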
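
To check a longer log for the same pattern, something along these lines might help. It is only a rough sketch: the two regular expressions are derived from the log lines quoted in this thread, so other Pacemaker versions may word the messages differently, and the function name find_cleanup_races is made up for illustration. It also assumes the lines are fed in chronological order.

```python
import re

# Patterns derived only from the log lines quoted in this thread (assumption:
# other Pacemaker versions may format these messages differently).
DELETE_RE = re.compile(r"delete_resource: Removing resource (\S+) for ")
LOST_RE = re.compile(r"action lost: \[Action \d+\]: In-flight \(id: (\S+)_([a-z]+)_(\d+),")


def find_cleanup_races(lines):
    """Return (resource, delete_line, lost_line) tuples for every
    'action lost' error that follows a delete of the same resource.

    Hypothetical helper; lines must be in chronological order.
    """
    deleted = {}  # resource name -> most recent delete_resource line
    races = []
    for line in lines:
        m = DELETE_RE.search(line)
        if m:
            deleted[m.group(1)] = line
            continue
        m = LOST_RE.search(line)
        if m and m.group(1) in deleted:
            races.append((m.group(1), deleted[m.group(1)], line))
    return races


# The two relevant lines from this thread, in the order they were logged.
sample = [
    "May  7 02:36:16 node1 crmd[16836]:     info: delete_resource: "
    "Removing resource testfs-resource1 for 18002_crm_resource (internal) on node1",
    "May  7 02:37:32 node1 crmd[16836]:    error: print_elem: Aborting transition, "
    "action lost: [Action 5]: In-flight (id: testfs-resource1_monitor_0, loc: node1, priority: 0)",
]

for resource, delete_line, lost_line in find_cleanup_races(sample):
    print("possible cleanup race on", resource)
    print("  deleted:", delete_line[:15])
    print("  lost:   ", lost_line[:15])
```

A match only says that a delete preceded a lost action for the same resource; it does not say who ran the cleanup.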

> If so, what's that
> trying to tell me, exactly?  If not, what is the cause of the problem?
> 
> It really can't be the RA timing out since I give the monitor operation
> a 60 second timeout and the status action of the RA only takes a few
> seconds at most to run and is not really an operation that can get
> blocked on anything.  It's effectively the grepping of a file.

If the machine is heavily loaded, or just very busy with file I/O, that can still take quite a long time.
I've seen IPaddr monitor actions take over a minute, for example.
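
One way to test that theory is to time the status command by hand while the node is under its usual load and compare against the 60 second timeout. A minimal sketch, assuming you can run the equivalent check outside the cluster; time_command and exceeds_timeout are hypothetical helper names, and the command in the demo is only a stand-in that should be replaced with whatever the RA's status action really runs.

```python
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Run cmd `runs` times and return the wall-clock duration of each run
    in seconds. Hypothetical helper; output of the command is discarded."""
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append(time.monotonic() - start)
    return durations


def exceeds_timeout(durations, timeout_s):
    """Return the durations that are longer than the operation timeout."""
    return [d for d in durations if d > timeout_s]


# Stand-in command (assumption): substitute the RA's real status check.
durations = time_command([sys.executable, "-c", "pass"], runs=3)
print("slowest run: %.3f s" % max(durations))
print("runs over a 60 s timeout:", len(exceeds_timeout(durations, 60)))
```

Running it on an idle machine proves little; the interesting numbers are the ones taken while the node is busy.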