[Pacemaker] prevent the resource's start if it has "stop NG" history on the other node
Andrew Beekhof
andrew at beekhof.net
Mon Mar 5 01:35:21 UTC 2012
On Fri, Mar 2, 2012 at 5:07 PM, Junko IKEDA <tsukishima.ha at gmail.com> wrote:
> Hi,
>
> OK, we have to set up STONITH to handle this.
> By the way, I ran the same test with a group resource.
>
> crm configuration:
>
> property \
>     no-quorum-policy="ignore" \
>     stonith-enabled="false" \
>     crmd-transition-delay="2s" \
>     cluster-recheck-interval="60s"
>
> rsc_defaults \
>     resource-stickiness="INFINITY" \
>     migration-threshold="1"
>
> primitive dummy01 ocf:heartbeat:Dummy \
>     op start timeout="60s" interval="0s" on-fail="restart" \
>     op monitor timeout="60s" interval="7s" on-fail="restart" \
>     op stop timeout="60s" interval="0s" on-fail="block"
>
> primitive dummy02 ocf:heartbeat:Dummy-stop-NG \
>     op start timeout="60s" interval="0s" on-fail="restart" \
>     op monitor timeout="60s" interval="7s" on-fail="restart" \
>     op stop timeout="60s" interval="0s" on-fail="block"
>
> group dummy-g dummy01 dummy02
>
>
> In this case, dummy02 returns stop NG.
> dummy02 goes into the unmanaged state,
> and after that, Pacemaker shutdown hangs,
On the one hand, the admin is saying "always stop A before B", but on the
other is also asking for "stop B" while preventing "stop A".
So the admin is making incompatible demands. Which one do you want us to ignore?
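
To make the conflict concrete, here is a toy shell model. It is not Pacemaker code, and the function name plan_group_stop is made up for this sketch. It assumes only two things: members of a group stop in the reverse of their listed order, and a member whose stop failed with on-fail="block" stays unmanaged and is never stopped.

```shell
#!/bin/sh
# Toy model only, not Pacemaker code.
# plan_group_stop BLOCKED MEMBER...
#   BLOCKED is the (hypothetical) member whose stop fails with on-fail="block".
#   Prints one line per member, walking the group from last to first.
plan_group_stop() {
    blocked=$1
    shift

    # Groups stop in reverse order, so reverse the member list.
    reversed=""
    for m in "$@"; do
        reversed="$m $reversed"
    done

    waiting_on=""
    for m in $reversed; do
        if [ -n "$waiting_on" ]; then
            # An earlier stop in the chain never completed.
            echo "$m: cannot stop, waiting on $waiting_on"
        elif [ "$m" = "$blocked" ]; then
            echo "$m: blocked (unmanaged)"
            waiting_on=$m
        else
            echo "$m: stopped"
        fi
    done
}

# Group "dummy-g" = dummy01 dummy02, stop NG on dummy02:
plan_group_stop dummy02 dummy01 dummy02

# Same group, stop NG on dummy01 instead:
plan_group_stop dummy01 dummy01 dummy02
```

In the first call dummy01 is left waiting on the blocked dummy02, which matches the hanging shutdown. In the second call dummy02 stops cleanly and dummy01 is merely blocked, so no other member is left waiting on it.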
> it seems that Pacemaker is waiting for some cleanup operation on the
> unmanaged resources.
> If dummy01 returns stop NG instead, Pacemaker shutdown works well.
> See attached hb_report.
>
> Thanks,
> Junko
>
> 2012/3/1 Andrew Beekhof <andrew at beekhof.net>:
>> On Wed, Feb 29, 2012 at 6:32 PM, Junko IKEDA <tsukishima.ha at gmail.com> wrote:
>>> Hi,
>>>
>>> I'm running the following simple configuration with Pacemaker 1.1.6,
>>> and trying the test case "resource stop NG and shutdown Pacemaker".
>>>
>>> property \
>>>     no-quorum-policy="ignore" \
>>>     stonith-enabled="false" \
>>>     crmd-transition-delay="2s"
>>>
>>> rsc_defaults \
>>>     resource-stickiness="INFINITY" \
>>>     migration-threshold="1"
>>>
>>> primitive dummy01 ocf:heartbeat:Dummy-stop-NG \
>>>     op start timeout="60s" interval="0s" on-fail="restart" \
>>>     op monitor timeout="60s" interval="7s" on-fail="restart" \
>>>     op stop timeout="60s" interval="0s" on-fail="block"
>>>
>>>
>>> "Dummy-stop-NG" RA just sends "stop NG" to Pacemaker.
>>>
>>> # diff -urNp Dummy Dummy-stop-NG
>>> --- Dummy 2011-06-30 17:43:37.000000000 +0900
>>> +++ Dummy-stop-NG 2012-02-28 19:11:12.850207767 +0900
>>> @@ -108,6 +108,8 @@ dummy_start() {
>>> }
>>>
>>> dummy_stop() {
>>> + exit $OCF_ERR_GENERIC
>>> +
>>> dummy_monitor
>>> if [ $? = $OCF_SUCCESS ]; then
>>> rm ${OCF_RESKEY_state}
>>>
>>>
>>>
>>> Before the test, the resource is running on "bl460g6a".
>>>
>>> # crm_simulate -S -x pe-input-1.bz2
>>>
>>> Current cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Stopped
>>>
>>> Transition Summary:
>>> crm_simulate[14195]: 2012/02/29_15:46:57 notice: LogActions: Start dummy01 (bl460g6a)
>>>
>>> Executing cluster transition:
>>> * Executing action 6: dummy01_monitor_0 on bl460g6b
>>> * Executing action 4: dummy01_monitor_0 on bl460g6a
>>> * Executing action 7: dummy01_start_0 on bl460g6a
>>> * Executing action 8: dummy01_monitor_7000 on bl460g6a
>>>
>>> Revised cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>>>
>>>
>>>
>>> Stop Pacemaker on "bl460g6a".
>>> # service heartbeat stop
>>>
>>> At first, Pacemaker tries to stop the resource and move it to "bl460g6b",
>>> # crm_simulate -S -x pe-input-2.bz2
>>>
>>> Current cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>>>
>>> Transition Summary:
>>> crm_simulate[12195]: 2012/02/29_15:35:02 notice: LogActions: Move dummy01 (Started bl460g6a -> bl460g6b)
>>>
>>> Executing cluster transition:
>>> * Executing action 6: dummy01_stop_0 on bl460g6a
>>> * Executing action 7: dummy01_start_0 on bl460g6b
>>> * Executing action 8: dummy01_monitor_7000 on bl460g6b
>>>
>>> Revised cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6b
>>>
>>>
>>>
>>> but the stop action fails, which means the resource goes into the unmanaged state.
>>> # crm_simulate -S -x pe-input-3.bz2
>>>
>>> Current cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a (unmanaged) FAILED
>>>
>>> Transition Summary:
>>>
>>> Executing cluster transition:
>>>
>>> Revised cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a (unmanaged) FAILED
>>>
>>>
>>>
>>> Pacemaker shutdown on "bl460g6a" succeeds;
>>> it seems that the following patch works well.
>>> https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c
>>>
>>> At this point, the resource on "bl460g6a" (where Pacemaker has already shut down)
>>> might still be running because its stop failed.
>>
>> This is because we ignore the status section of any offline nodes when
>> stonith-enabled=false.
>>
>>> In fact, the resource didn't start on "bl460g6b" after its stop NG and
>>> the shutdown of "bl460g6a", which is the expected behavior,
>>> but I could start it on "bl460g6b" with the crm command.
>>> This creates the potential for an unexpected active/active state.
>>> Is it possible to prevent its start in this situation?
>>
>> Only by disabling the logic in
>> https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c
>> when stonith is disabled.
>>
>>> For example:
>>> (1) Dummy runs on node-a
>>> (2) Shut down Pacemaker on node-a, and Dummy returns stop NG
>>> (3) Dummy cannot run on other nodes
>>> (4) * clean up the unmanaged status of Dummy after manually checking
>>> it on node-a
>>> (5) * start Dummy on other nodes
>>> This would be the safe way.
>>>
>>> See attached hb_report.
>>>
>>> Thanks,
>>> Junko IKEDA
>>>
>>> NTT DATA INTELLILINK CORPORATION
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>
>