[Pacemaker] prevent the resource's start if it has "stop NG" history on the other node

Wed Feb 29 07:32:59 UTC 2012

Hi,

I'm running the following simple configuration with Pacemaker 1.1.6
and trying this test case: the resource fails to stop ("stop NG") while Pacemaker is being shut down.

property \
    no-quorum-policy="ignore" \
    stonith-enabled="false" \
    crmd-transition-delay="2s"

rsc_defaults \
    resource-stickiness="INFINITY" \
    migration-threshold="1"

primitive dummy01 ocf:heartbeat:Dummy-stop-NG \
    op start   timeout="60s" interval="0s"  on-fail="restart" \
    op monitor timeout="60s" interval="7s"  on-fail="restart" \
    op stop    timeout="60s" interval="0s"  on-fail="block"


The "Dummy-stop-NG" RA is the Dummy RA modified so that every stop operation returns an error ("stop NG") to Pacemaker.

# diff -urNp Dummy Dummy-stop-NG

--- Dummy       2011-06-30 17:43:37.000000000 +0900
+++ Dummy-stop-NG       2012-02-28 19:11:12.850207767 +0900
@@ -108,6 +108,8 @@ dummy_start() {
 }

 dummy_stop() {
+    exit $OCF_ERR_GENERIC
+
     dummy_monitor
     if [ $? =  $OCF_SUCCESS ]; then
        rm ${OCF_RESKEY_state}
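
For reference, the effect of this change can be reproduced outside the cluster
with a minimal sketch like the one below. It assumes OCF_ERR_GENERIC is 1, as
ocf-shellfuncs defines it; stop_rc is a hypothetical helper used only for this
illustration and is not part of the RA.

```shell
#!/bin/sh
# Standalone sketch of the modified stop action.
# OCF_ERR_GENERIC normally comes from ocf-shellfuncs; it is set here
# (assumed value 1) so that the snippet runs without the OCF environment.
: "${OCF_ERR_GENERIC:=1}"

dummy_stop() {
    # Fail unconditionally, before the state file is touched.
    exit $OCF_ERR_GENERIC
}

# Hypothetical helper: run the stop action in a subshell so that "exit"
# does not end the caller, then print the return code the cluster would see.
stop_rc() {
    rc=0
    ( dummy_stop ) || rc=$?
    echo "$rc"
}
```

Calling stop_rc prints the return code of the failing stop, which is what
triggers on-fail="block" for the stop operation in the configuration above.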



Before the test, the resource is running on "bl460g6a".

# crm_simulate -S -x pe-input-1.bz2

Current cluster status:
Online: [ bl460g6a bl460g6b ]

 dummy01        (ocf::heartbeat:Dummy-stop-NG): Stopped

Transition Summary:
crm_simulate[14195]: 2012/02/29_15:46:57 notice: LogActions: Start dummy01    (bl460g6a)

Executing cluster transition:
 * Executing action 6: dummy01_monitor_0 on bl460g6b
 * Executing action 4: dummy01_monitor_0 on bl460g6a
 * Executing action 7: dummy01_start_0 on bl460g6a
 * Executing action 8: dummy01_monitor_7000 on bl460g6a

Revised cluster status:
Online: [ bl460g6a bl460g6b ]

 dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a



Stop Pacemaker on "bl460g6a".
# service heartbeat stop

At first, Pacemaker tries to stop the resource and move it to "bl460g6b",
# crm_simulate -S -x pe-input-2.bz2

Current cluster status:
Online: [ bl460g6a bl460g6b ]

 dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a

Transition Summary:
crm_simulate[12195]: 2012/02/29_15:35:02 notice: LogActions: Move dummy01    (Started bl460g6a -> bl460g6b)

Executing cluster transition:
 * Executing action 6: dummy01_stop_0 on bl460g6a
 * Executing action 7: dummy01_start_0 on bl460g6b
 * Executing action 8: dummy01_monitor_7000 on bl460g6b

Revised cluster status:
Online: [ bl460g6a bl460g6b ]

 dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6b



but the stop action fails, so the resource goes into the unmanaged state.
# crm_simulate -S -x pe-input-3.bz2

Current cluster status:
Online: [ bl460g6a bl460g6b ]

 dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a (unmanaged) FAILED

Transition Summary:

Executing cluster transition:

Revised cluster status:
Online: [ bl460g6a bl460g6b ]

 dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a (unmanaged) FAILED



The Pacemaker shutdown on "bl460g6a" then completes successfully,
so the following patch seems to work well.
https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c

At this point, the resource on "bl460g6a" (where Pacemaker has already shut down)
might still be running, because its stop failed.
In fact, the resource did not start on "bl460g6b" after the failed stop and
the shutdown of "bl460g6a", which is the expected behavior,
but I was able to start it on "bl460g6b" with the crm command.
This could lead to an unintended active/active state.
Is it possible to prevent its start in this situation?
For example:
(1) Dummy runs on node-a
(2) Pacemaker is shut down on node-a, and the stop of Dummy fails ("stop NG")
(3) Dummy cannot run on the other nodes
(4) * clean up the unmanaged status of Dummy after manually checking it on node-a
(5) * start Dummy on the other nodes
This would be the safe way.
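
The manual steps (4) and (5) could look roughly like the sketch below. It is
only an illustration of the operator's side, not existing Pacemaker behavior:
the run and recover helpers and the DRY_RUN variable are hypothetical, and the
script only prints the crm commands unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Operator-side sketch of steps (4) and (5).
# Dry run by default: commands are printed, not executed.
RSC="${RSC:-dummy01}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover() {
    # (4) Clear the failed-stop record, only after the operator has
    #     confirmed on node-a that the resource is really stopped.
    run crm resource cleanup "$RSC"
    # (5) Allow the resource to start on the remaining node(s).
    run crm resource start "$RSC"
}
```

Running recover in the default dry-run mode lists the two crm commands in
order, so the sequence can be reviewed before anything is changed.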

See attached hb_report.

Thanks,
Junko IKEDA

NTT DATA INTELLILINK CORPORATION
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hb_report.tar.bz2
Type: application/x-bzip2
Size: 58698 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20120229/8f970435/attachment-0003.bz2>