[Pacemaker] external/ssh stonith and repeated reboots

Sun Oct 14 06:04:25 UTC 2012

I'm using external/ssh in my test cluster (a bunch of VMs), and for some reason the cluster has tried to terminate one of the nodes (ctest1) but failed, like:

Oct 14 15:54:45 ctest0 stonith-ng: [2006]: info: call_remote_stonith: Requesting that ctest2 perform op off ctest1
Oct 14 15:54:45 ctest0 stonith-ng: [2006]: info: call_remote_stonith: No remaining peers capable of terminating ctest1
Oct 14 15:54:45 ctest0 stonith-ng: [2006]: ERROR: remote_op_done: Already sent notifications for 'off of ctest1 by (null)' (op=6ca32814-1272-482f-bb67-f0b46daef78b, for=d49c9501-bcd3-4563-87a0-303f1b6d4c22, state=1): Operation timed out
Oct 14 15:54:46 ctest0 stonith-ng: [2006]: info: call_remote_stonith: Requesting that ctest2 perform op off ctest1
Oct 14 15:54:46 ctest0 stonith-ng: [2006]: info: call_remote_stonith: No remaining peers capable of terminating ctest1
Oct 14 15:54:46 ctest0 stonith-ng: [2006]: ERROR: remote_op_done: Already sent notifications for 'off of ctest1 by (null)' (op=6ca32814-1272-482f-bb67-f0b46daef78b, for=d49c9501-bcd3-4563-87a0-303f1b6d4c22, state=1): Operation timed out
Oct 14 15:54:47 ctest0 stonith-ng: [2006]: info: call_remote_stonith: Requesting that ctest2 perform op off ctest1
Oct 14 15:54:47 ctest0 stonith-ng: [2006]: info: call_remote_stonith: No remaining peers capable of terminating ctest1
Oct 14 15:54:47 ctest0 stonith-ng: [2006]: ERROR: remote_op_done: Already sent notifications for 'off of ctest1 by (null)' (op=6ca32814-1272-482f-bb67-f0b46daef78b, for=d49c9501-bcd3-4563-87a0-303f1b6d4c22, state=1): Operation timed out
Oct 14 15:54:48 ctest0 stonith-ng: [2006]: info: call_remote_stonith: Requesting that ctest2 perform op off ctest1
Oct 14 15:54:48 ctest0 stonith-ng: [2006]: info: call_remote_stonith: No remaining peers capable of terminating ctest1
Oct 14 15:54:48 ctest0 stonith-ng: [2006]: ERROR: remote_op_done: Already sent notifications for 'off of ctest1 by (null)' (op=6ca32814-1272-482f-bb67-f0b46daef78b, for=d49c9501-bcd3-4563-87a0-303f1b6d4c22, state=1): Operation timed out

I'll look into why that is, but the result is that there are 20 'at' jobs in the queue, and every time the machine starts up it shuts down again. That's easy enough to fix, but it probably shouldn't happen (even for external/ssh, which is advertised as 'not for production').

Is there a way to schedule an at job in such a way that it cancels a currently scheduled job? I can't see one...
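
For illustration, here is a rough sketch of one possible workaround, not something external/ssh is known to do: put the fencing jobs in a dedicated at queue, and remove whatever is still pending in that queue before scheduling a new job. The queue letter "s", the helper names and the default delay are all assumptions made up for this example. It relies only on the standard atq, atrm and at commands, and on atq printing the job number as the first field of each line.

```python
import subprocess

# Hypothetical: a queue letter reserved for fencing jobs, so that
# unrelated at jobs owned by the same user are left alone.
QUEUE = "s"


def parse_job_ids(atq_output):
    """Return the job numbers (first field of each line) from atq output."""
    ids = []
    for line in atq_output.splitlines():
        fields = line.split()
        # Skip blank or unexpected lines; job numbers are plain integers.
        if fields and fields[0].isdigit():
            ids.append(fields[0])
    return ids


def replace_pending_job(command, when="now + 1 minute", queue=QUEUE):
    """Remove every job pending in `queue`, then schedule `command` there."""
    listing = subprocess.run(
        ["atq", "-q", queue],
        capture_output=True, text=True, check=True,
    ).stdout
    pending = parse_job_ids(listing)
    if pending:
        # atrm accepts several job numbers in one call.
        subprocess.run(["atrm", *pending], check=True)
    # at reads the commands to run from standard input.
    subprocess.run(
        ["at", "-q", queue, *when.split()],
        input=command + "\n", text=True, check=True,
    )
```

A call such as replace_pending_job("/sbin/reboot") would then leave at most one job in that queue, however many times it is invoked. There is a small window between the atrm and the at calls, so this is not atomic, and it does nothing for jobs that were queued some other way.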

James