[Pacemaker] detect/cleanup failed resource

Thu Oct 21 10:42:40 UTC 2010

Hi all,

is there a better way to detect a failed resource than to run "crm_mon -1 -r"?

For example, I just 'created' a failed resource:

crm_mon -1 -r

Failed actions:
    ost_janlus_27_start_0 (node=vm3, call=108, rc=2, status=complete): invalid parameter

This cannot easily be parsed using 'grep', as "Failed actions:" heads a
whole multi-line section. Well, using a Python or Perl script it still
wouldn't be too difficult. But how can I figure out the resource name there?
I cannot run "crm resource cleanup ost_janlus_27_start_0", as this is
obviously not the resource name. I also cannot simply cut off "start_0",
as there are also other actions that might fail.
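
A minimal Python sketch of such a parser follows. It assumes the entries
under "Failed actions:" are named <resource>_<operation>_<interval> and that
the operation is one of the names in KNOWN_OPS; both are assumptions based on
the single sample line above, and the function name parse_failed_actions is
made up for illustration. The name is split from the right, so underscores
and digits inside the resource name do no harm.

```python
import re

# Operation names assumed for this sketch. The two migrate_* names contain
# an underscore themselves, which is why a plain split on "_" is not enough.
KNOWN_OPS = ("migrate_to", "migrate_from", "start", "stop", "monitor",
             "promote", "demote", "notify")

# <resource>_<operation>_<interval> (node=<node>, ..., rc=<rc>, ...)
FAILED_LINE = re.compile(
    r"^\s*(?P<rsc>\S+)_(?P<op>" + "|".join(KNOWN_OPS) + r")_(?P<interval>\d+)"
    r"\s+\(node=(?P<node>[^,\s]+),.*?rc=(?P<rc>\d+)"
)


def parse_failed_actions(crm_mon_output):
    """Return (resource, operation, node, rc) tuples from 'Failed actions:'."""
    failures = []
    in_section = False
    for line in crm_mon_output.splitlines():
        if line.strip().lower().startswith("failed actions:"):
            in_section = True
            continue
        if not in_section or not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line is taken to start the next section.
            in_section = False
            continue
        match = FAILED_LINE.match(line)
        if match:
            failures.append((match.group("rsc"), match.group("op"),
                             match.group("node"), int(match.group("rc"))))
    return failures
```

Fed with the output of "crm_mon -1 -r" shown above, this would yield the
resource ost_janlus_27, the operation start and the node vm3.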

In fact, the crm_mon output is already annoying here for a human being:
for a clean-up, a simple copy-and-paste using mouse clicks does not work,
as I always have to cut off the action...

Cleaning up dozens to hundreds of resources manually is not an option, so
we have a script that goes over all resources and does that. However, in
larger clusters that can easily take up to 90 minutes.

Some timings on a small cluster:

[root@vm3 ~]# time crm resource cleanup ost_janlus_27
Cleaning up ost_janlus_27 on vm6
Cleaning up ost_janlus_27 on vm7
Cleaning up ost_janlus_27 on vm8
Cleaning up ost_janlus_27 on vm1
Cleaning up ost_janlus_27 on vm2
Cleaning up ost_janlus_27 on vm3

real    0m7.129s
user    0m0.471s
sys     0m0.106s

[root@vm3 ~]# time crm resource cleanup ost_janlus_27 vm6
Cleaning up ost_janlus_27 on vm6

real    0m1.348s
user    0m0.203s
sys     0m0.071s

[root@vm3 ~]# time cluster_resources cleanup
resource: mds-janlus-grp
Cleaning up vg_janlus on vm6
Cleaning up mgs on vm6
Cleaning up mdt_janlus on vm6
Cleaning up vg_janlus on vm7
Cleaning up mgs on vm7
[...]
real    3m35.463s
user    0m13.704s
sys     0m3.440s

(cluster_resources is a small front end that runs crm
for all of our resources)

So about 1.35 s per resource and node. No problem to do that for a few
resources on all nodes of a 3-node system. But it already takes an annoying
3+ minutes for 28 resources and 6 nodes on our default small systems.
And it is definitely not an option anymore on an 18-node cluster with 230
resources
(calculated time: 230 resources * 18 nodes * 1.35 s = 5589 s = about 1.5 *hours*).
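
The estimate can be reproduced with a few lines of Python. This is only a
back-of-the-envelope sketch: it assumes the 1.35 s measured above for one
resource on one node stays constant and that the calls run strictly one
after another. The function name is made up for illustration.

```python
def estimated_cleanup_seconds(resources, nodes, seconds_per_call=1.35):
    """Estimate a full sequential cleanup: one call per resource and node."""
    return resources * nodes * seconds_per_call


# 28 resources on 6 nodes (the small system measured above)
small = estimated_cleanup_seconds(28, 6)
# 230 resources on 18 nodes (the large cluster)
large = estimated_cleanup_seconds(230, 18)

print("small: %.0f s, large: %.0f s (%.2f h)" % (small, large, large / 3600))
```

For the small system this gives roughly 227 s, which is close to the
3m35s actually measured, so the linear estimate looks plausible.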

And cleaning up 230 resources manually when something went wrong on the
cluster is no fun either, and not really fast.

So I'm looking for *any* sane way to clean up resources or at least
for a good parse-able way to get failed resources and the corresponding
node.
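
To illustrate what such a list would be good for: given (resource, node)
pairs for the failed actions only, a script could issue one targeted
"crm resource cleanup <resource> <node>" per pair, the same form as the
vm6 call timed above, instead of walking over every resource on every node.
A minimal Python sketch, with the function name and the input format being
assumptions for illustration; it only builds the command lines and does not
run anything:

```python
def cleanup_commands(failed_pairs):
    """Build one 'crm resource cleanup' argument list per failed pair."""
    seen = set()
    commands = []
    for resource, node in failed_pairs:
        if (resource, node) in seen:
            # Several failed actions of one resource on one node
            # need only a single cleanup.
            continue
        seen.add((resource, node))
        commands.append(["crm", "resource", "cleanup", resource, node])
    return commands


for cmd in cleanup_commands([("ost_janlus_27", "vm3")]):
    print(" ".join(cmd))
```

Each list could be handed to subprocess.run() as is. With only a handful of
failed pairs, the run time would then depend on the number of failures
rather than on resources * nodes.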

Thanks,
Bernd

-- 
Bernd Schubert
DataDirect Networks