[Pacemaker] fencing to recover from failed resources

Thu Jan 13 14:30:48 UTC 2011

Bart Coninckx wrote:
> By the way: things seem better when I change the monitor timeout to 30 
> seconds instead of 10 seconds. Very strange though, because the resource 
> agent basically does an "xm list --long" while monitoring, which takes less 
> than half a second in a console.

I think sometimes xend hangs for a while. 30 seconds should be good.

What is your stop action timeout? I use 90 seconds.
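
As a rough sketch only, the monitor and stop timeouts mentioned above might 
be set like this in the crm shell (crm configure). The resource name vm1, 
the xmfile path, the monitor interval and the start timeout are placeholders 
assumed for illustration, not values from this thread:

```
# Hypothetical Xen resource; only the 30s monitor timeout and the
# 90s stop timeout come from the discussion above.
primitive vm1 ocf:heartbeat:Xen \
    params xmfile="/etc/xen/vm/vm1" \
    op monitor interval="10s" timeout="30s" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="90s"
```

The stop timeout is worth setting generously: if a stop operation times out, 
the cluster treats the stop as failed and, with fencing enabled, will 
normally fence the node.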

There are some old bugs related to spurious resource stops; make sure 
you're running at least cluster-glue 1.0.6 and pacemaker 1.1.2.1. I 
don't see these versions in the SLES11 HAE SP1 updates repo, but maybe 
the fixes were backported to other versions. I've been using the 
packages from here:

http://download.opensuse.org/repositories/network:/ha-clustering/SLE_11_SP1/x86_64/

http://developerbugs.linux-foundation.org/show_bug.cgi?id=2482
http://developerbugs.linux-foundation.org/show_bug.cgi?id=2479

Mike