[Pacemaker] prevent the resource's start if it has "stop NG" history on the other node
Junko IKEDA
tsukishima.ha at gmail.com
Wed Feb 29 07:32:59 UTC 2012
Hi,
I'm running the following simple configuration with Pacemaker 1.1.6,
and trying the test case "resource stop NG and shutdown Pacemaker".
property \
    no-quorum-policy="ignore" \
    stonith-enabled="false" \
    crmd-transition-delay="2s"
rsc_defaults \
    resource-stickiness="INFINITY" \
    migration-threshold="1"
primitive dummy01 ocf:heartbeat:Dummy-stop-NG \
    op start timeout="60s" interval="0s" on-fail="restart" \
    op monitor timeout="60s" interval="7s" on-fail="restart" \
    op stop timeout="60s" interval="0s" on-fail="block"
The "Dummy-stop-NG" RA simply returns an error ("stop NG") to Pacemaker from its stop action.
# diff -urNp Dummy Dummy-stop-NG
--- Dummy 2011-06-30 17:43:37.000000000 +0900
+++ Dummy-stop-NG 2012-02-28 19:11:12.850207767 +0900
@@ -108,6 +108,8 @@ dummy_start() {
}
dummy_stop() {
+ exit $OCF_ERR_GENERIC
+
dummy_monitor
if [ $? = $OCF_SUCCESS ]; then
rm ${OCF_RESKEY_state}
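To make the effect of this change concrete, here is a minimal sketch (not the actual RA) of what the cluster receives when the patched stop action runs. The return code values follow the OCF convention (OCF_SUCCESS=0, OCF_ERR_GENERIC=1). The function body is reduced to the added line, and the subshell standing in for the RA process is an assumption of this sketch.

```shell
#!/bin/sh
# OCF return codes as defined by the OCF resource agent convention.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Reduced stand-in for the patched dummy_stop(): it exits with a generic
# error before any of the original stop logic can run.
dummy_stop() {
    exit $OCF_ERR_GENERIC
}

# Run the stop in a subshell (standing in for the RA process) and capture
# the return code that the cluster would receive.
rc=0
( dummy_stop ) || rc=$?

if [ "$rc" -eq "$OCF_SUCCESS" ]; then
    echo "stop OK (rc=$rc)"
else
    echo "stop NG (rc=$rc)"
fi
```

Because the exit comes first, the state file is never removed, so the stop always reports a failure regardless of the resource's real state.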
Before the test, the resource is running on "bl460g6a".
# crm_simulate -S -x pe-input-1.bz2
Current cluster status:
Online: [ bl460g6a bl460g6b ]
dummy01 (ocf::heartbeat:Dummy-stop-NG): Stopped
Transition Summary:
crm_simulate[14195]: 2012/02/29_15:46:57 notice: LogActions: Start dummy01 (bl460g6a)
Executing cluster transition:
* Executing action 6: dummy01_monitor_0 on bl460g6b
* Executing action 4: dummy01_monitor_0 on bl460g6a
* Executing action 7: dummy01_start_0 on bl460g6a
* Executing action 8: dummy01_monitor_7000 on bl460g6a
Revised cluster status:
Online: [ bl460g6a bl460g6b ]
dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
Stop Pacemaker on "bl460g6a".
# service heartbeat stop
At first, Pacemaker tries to stop the resource and move it to "bl460g6b",
# crm_simulate -S -x pe-input-2.bz2
Current cluster status:
Online: [ bl460g6a bl460g6b ]
dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
Transition Summary:
crm_simulate[12195]: 2012/02/29_15:35:02 notice: LogActions: Move dummy01 (Started bl460g6a -> bl460g6b)
Executing cluster transition:
* Executing action 6: dummy01_stop_0 on bl460g6a
* Executing action 7: dummy01_start_0 on bl460g6b
* Executing action 8: dummy01_monitor_7000 on bl460g6b
Revised cluster status:
Online: [ bl460g6a bl460g6b ]
dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6b
but the stop action fails, which puts the resource into the unmanaged state.
# crm_simulate -S -x pe-input-3.bz2
Current cluster status:
Online: [ bl460g6a bl460g6b ]
dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a (unmanaged) FAILED
Transition Summary:
Executing cluster transition:
Revised cluster status:
Online: [ bl460g6a bl460g6b ]
dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a (unmanaged) FAILED
The Pacemaker shutdown on "bl460g6a" then completes successfully,
so it seems that the following patch works well.
https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c
At this point, the resource on "bl460g6a" (where Pacemaker has already shut down)
might still be running, because its stop failed.
In fact, the resource did not start on "bl460g6b" after the stop NG and
the shutdown of "bl460g6a", which is the expected behavior,
but I was able to start it on "bl460g6b" with the crm command.
This creates the potential for an unexpected active/active state.
Is it possible to prevent its start in this situation?
For example,
(1) Dummy runs on node-a
(2) Shut down Pacemaker on node-a, and the Dummy stop fails (stop NG)
(3) Dummy cannot run on other nodes
(4) * clean up the unmanaged status of Dummy after checking it manually
on node-a
(5) * start Dummy on other nodes
This would be the safe way.
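The manual steps (4) and (5) could be sketched as the following shell helper. This is only an illustration under stated assumptions: `run` and `recover` are hypothetical helper names, the resource and node default to the ones used in this test, and the "yes" argument stands in for the operator's manual check on the failed node. The `crm resource cleanup` and `crm resource start` subcommands are the ordinary crm shell commands. DRY_RUN=1 (the default here) only prints them.

```shell
#!/bin/sh
# Sketch of manual steps (4) and (5). Names default to the ones used in
# this test; DRY_RUN=1 (the default here) only prints the commands.
RSC="${RSC:-dummy01}"
FAILED_NODE="${FAILED_NODE:-bl460g6a}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical helper covering steps (4) and (5).
# $1 must be "yes": the operator has confirmed by hand that the resource
# is really stopped on the failed node.
recover() {
    if [ "$1" != "yes" ]; then
        echo "refusing: confirm on $FAILED_NODE that $RSC is stopped" >&2
        return 1
    fi
    run crm resource cleanup "$RSC" "$FAILED_NODE"   # step (4)
    run crm resource start "$RSC"                    # step (5)
}
```

Calling `recover yes` prints the two commands in order. This only describes the operator side of the procedure; it does not by itself enforce step (3), which is the behavior I am asking about.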
See attached hb_report.
Thanks,
Junko IKEDA
NTT DATA INTELLILINK CORPORATION
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hb_report.tar.bz2
Type: application/x-bzip2
Size: 58698 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20120229/8f970435/attachment-0003.bz2>