[Pacemaker] external/ssh stonith and repeated reboots

Tue Oct 16 06:00:08 CEST 2012

On Sun, Oct 14, 2012 at 5:04 PM, James Harper
<james.harper at bendigoit.com.au> wrote:
> I'm using external/ssh in my test cluster (a bunch of VMs), and for some reason the cluster has tried to terminate a node but failed, like:

Try fence_xvm instead.  It's actually reliable.
You'd need the fence-virtd package on the host and guests, and I've had
plenty of success with the config file below on the host.
Make sure key_file exists everywhere, start fence-virtd, and test with
"fence_xvm -o list" on the guest(s).

ssh-based fencing isn't just "not for production", it's a flat-out
terrible idea.  Even with much handwaving it is barely usable for
testing, as it requires the target to be alive, reachable and behaving,
which is exactly what a node that needs fencing usually is not.

# cat /etc/fence_virt.conf

fence_virtd {
	listener = "multicast";
	backend = "libvirt";
}

listeners {
	multicast {
		key_file = "/etc/cluster/fence_xvm.key";
		address = "225.0.0.12";
		family = "ipv4";
		port = "1229";

		# Needed on Fedora systems
		interface = "virbr0";
	}
}

backends {
	libvirt {
		uri = "qemu:///system";
	}
}
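
The "make sure key_file exists everywhere" step could look something like
the sketch below. The make_key helper name and the 4096-byte key size are
assumptions for illustration, not something fence-virtd mandates; the path
you pass in should match key_file in the config above.

```shell
#!/bin/sh
# Hypothetical helper: create a random shared key readable only by its owner.
# $1 is the key path, e.g. the key_file value from /etc/fence_virt.conf.
make_key() {
    mkdir -p "$(dirname "$1")" || return 1
    # umask in a subshell so the file is never created world-readable
    ( umask 077 && dd if=/dev/urandom of="$1" bs=4096 count=1 2>/dev/null ) || return 1
    chmod 600 "$1"
}

# Remaining steps depend on your distro, so they are left as comments:
#   1. copy the key to the same path on the host and every guest
#   2. start fence-virtd on the host
#   3. on a guest, run: fence_xvm -o list
```

On the host you would run it once as root, e.g.
"make_key /etc/cluster/fence_xvm.key", then copy the result to the guests.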

>
> Oct 14 15:54:45 ctest0 stonith-ng: [2006]: info: call_remote_stonith: Requesting that ctest2 perform op off ctest1
> Oct 14 15:54:45 ctest0 stonith-ng: [2006]: info: call_remote_stonith: No remaining peers capable of terminating ctest1
> Oct 14 15:54:45 ctest0 stonith-ng: [2006]: ERROR: remote_op_done: Already sent notifications for 'off of ctest1 by (null)' (op=6ca32814-1272-482f-bb67-f0b46daef78b, for=d49c9501-bcd3-4563-87a0-303f1b6d4c22, state=1): Operation timed out
> Oct 14 15:54:46 ctest0 stonith-ng: [2006]: info: call_remote_stonith: Requesting that ctest2 perform op off ctest1
> Oct 14 15:54:46 ctest0 stonith-ng: [2006]: info: call_remote_stonith: No remaining peers capable of terminating ctest1
> Oct 14 15:54:46 ctest0 stonith-ng: [2006]: ERROR: remote_op_done: Already sent notifications for 'off of ctest1 by (null)' (op=6ca32814-1272-482f-bb67-f0b46daef78b, for=d49c9501-bcd3-4563-87a0-303f1b6d4c22, state=1): Operation timed out
> Oct 14 15:54:47 ctest0 stonith-ng: [2006]: info: call_remote_stonith: Requesting that ctest2 perform op off ctest1
> Oct 14 15:54:47 ctest0 stonith-ng: [2006]: info: call_remote_stonith: No remaining peers capable of terminating ctest1
> Oct 14 15:54:47 ctest0 stonith-ng: [2006]: ERROR: remote_op_done: Already sent notifications for 'off of ctest1 by (null)' (op=6ca32814-1272-482f-bb67-f0b46daef78b, for=d49c9501-bcd3-4563-87a0-303f1b6d4c22, state=1): Operation timed out
> Oct 14 15:54:48 ctest0 stonith-ng: [2006]: info: call_remote_stonith: Requesting that ctest2 perform op off ctest1
> Oct 14 15:54:48 ctest0 stonith-ng: [2006]: info: call_remote_stonith: No remaining peers capable of terminating ctest1
> Oct 14 15:54:48 ctest0 stonith-ng: [2006]: ERROR: remote_op_done: Already sent notifications for 'off of ctest1 by (null)' (op=6ca32814-1272-482f-bb67-f0b46daef78b, for=d49c9501-bcd3-4563-87a0-303f1b6d4c22, state=1): Operation timed out
>
> And I'll look into why that is, but the result is that there are 20 'at' jobs in the queue and every time the machine starts up it shuts down again. Easy enough to fix but it probably shouldn't happen (even for external/ssh which is advertised as 'not for production').
>
> Is there a way to schedule an at job in such a way that it cancels a currently scheduled job? I can't see one...
>
> James