[Pacemaker] resource starts but then fails right away
Andrew Beekhof
andrew at beekhof.net
Fri May 10 01:53:07 UTC 2013
On 10/05/2013, at 12:26 AM, Brian J. Murrell <brian at interlinx.bc.ca> wrote:
> I do see the:
>
> May 7 02:37:32 node1 crmd[16836]: error: print_elem: Aborting transition, action lost: [Action 5]: In-flight (id: testfs-resource1_monitor_0, loc: node1, priority: 0)
>
> in the log. Is that the root cause of the problem?
Ordinarily I'd have said yes, but I also see:
May 7 02:36:16 node1 crmd[16836]: info: delete_resource: Removing resource testfs-resource1 for 18002_crm_resource (internal) on node1
May 7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running, flush delayed
May 7 02:36:16 node1 crmd[16836]: info: lrm_remove_deleted_op: Removing op testfs-resource1_monitor_0:8 for deleted resource testfs-resource1
So apparently a badly timed cleanup was run. Did you do that or was it the crm shell?
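To see how the cleanup lines up with the lost monitor action, it can help to pull just the relevant entries for one resource out of the log. A minimal sketch in Python, assuming syslog-style lines like the ones quoted above; `find_events` and `SAMPLE_LOG` are hypothetical names used for illustration, and the set of function names to match is only what appears in this thread:

```python
# Function names from the log excerpts in this thread that mark either a
# resource cleanup or the resulting aborted transition.
FUNCS = ("delete_resource", "lrm_remove_deleted_op", "print_elem")

# Sample input: the log lines quoted in this thread, in chronological order.
SAMPLE_LOG = [
    "May 7 02:36:16 node1 crmd[16836]: info: delete_resource: Removing resource testfs-resource1 for 18002_crm_resource (internal) on node1",
    "May 7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running, flush delayed",
    "May 7 02:36:16 node1 crmd[16836]: info: lrm_remove_deleted_op: Removing op testfs-resource1_monitor_0:8 for deleted resource testfs-resource1",
    "May 7 02:37:32 node1 crmd[16836]: error: print_elem: Aborting transition, action lost: [Action 5]: In-flight (id: testfs-resource1_monitor_0, loc: node1, priority: 0)",
]


def find_events(lines, resource):
    """Return (timestamp, function) pairs for lines that mention `resource`
    and were logged by one of the functions in FUNCS."""
    events = []
    for line in lines:
        if resource not in line:
            continue
        for func in FUNCS:
            if f" {func}: " in line:
                # The first three whitespace-separated fields are the timestamp.
                timestamp = " ".join(line.split()[:3])
                events.append((timestamp, func))
                break
    return events


if __name__ == "__main__":
    for timestamp, func in find_events(SAMPLE_LOG, "testfs-resource1"):
        print(timestamp, func)
```

A `delete_resource` entry shortly before the `print_elem` abort for the same resource points at a cleanup that ran while the probe was still in flight.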
> If so, what's that
> trying to tell me, exactly? If not, what is the cause of the problem?
>
> It really can't be the RA timing out since I give the monitor operation
> a 60 second timeout and the status action of the RA only takes a few
> seconds at most to run and is not really an operation that can get
> blocked on anything. It's effectively the grepping of a file.
If the machine is heavily loaded, or just very busy with file I/O, that can still take quite a long time.
I've seen IPaddr monitor actions take over a minute, for example.
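One way to check whether load is pushing the monitor past its timeout is to time the action by hand on the affected node. A minimal sketch in Python, assuming the agent can be run directly as a command; `time_command` is a hypothetical helper, and the stand-in command below must be replaced with the real agent invocation (OCF agents normally take the action as their first argument and read their parameters from OCF_RESKEY_* environment variables):

```python
import subprocess
import sys
import time


def time_command(argv, timeout_s=60.0):
    """Run argv and return (return_code, elapsed_seconds).

    return_code is None if the command did not finish within timeout_s.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            argv,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return_code = proc.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this.
        return_code = None
    return return_code, time.monotonic() - start


if __name__ == "__main__":
    # Stand-in command only; substitute the agent's monitor invocation.
    rc, elapsed = time_command([sys.executable, "-c", "pass"], timeout_s=60.0)
    print(f"rc={rc} elapsed={elapsed:.2f}s")
```

Running this a number of times while the node is under its usual load gives a rough idea of how close the monitor gets to the 60 second limit.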