[Pacemaker] prevent the resource's start if it has "stop NG" history on the other node
Junko IKEDA
tsukishima.ha at gmail.com
Wed Feb 29 09:30:11 UTC 2012
Hi,
sorry for mailing again.
I checked the latest code, and it reads:
} else if (wrapper->action->rsc
           && wrapper->action->rsc != action->rsc
           && is_set(wrapper->action->rsc->flags, pe_rsc_failed)
           && is_not_set(wrapper->action->rsc->flags, pe_rsc_managed)
           && strstr(wrapper->action->uuid, "_stop_0")
           && action->rsc && action->rsc->variant >= pe_clone) {
    crm_warn("Ignoring requirement that %s comeplete before %s:"
             " unmanaged failed resources cannot prevent clone shutdown",
             wrapper->action->uuid, action->uuid);
    return FALSE;
It seems that lf#1959 addresses the clone resource case;
the behavior I reported is a different one.
Under the current specification, is a failed stop ("stop NG") supposed to prevent Pacemaker shutdown?
Thanks,
Junko
2012/2/29 Junko IKEDA <tsukishima.ha at gmail.com>:
> Hi,
>
> Additional information:
> (1) the resource is running on the DC
> (2) shut down Pacemaker on the DC; the resource's stop fails ("stop NG") and it becomes unmanaged
> (3) the other node becomes DC
> (4) the resource starts on the new DC
> (even though it still has unmanaged status on the old DC...)
>
> See the other attached hb_report.
>
> By the way, this patch means that even if there are unmanaged resources,
> a "Pacemaker shutdown" operation still completes successfully, right?
>
> High: PE: Bug lf#1959 - Fail unmanaged resources should not prevent
> other services from shutting down
> https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c
>
> I don't know the details of lf#1959, and it would be better to set up
> STONITH to handle a resource left unmanaged by a failed "stop",
> but just in case, a failed stop ("stop NG") should not allow Pacemaker
> to shut itself down.
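>
> For example, a rough sketch of that idea (only the changed parts are
> shown; the actual stonith resource definition is omitted and depends on
> the hardware):
>
> property \
> stonith-enabled="true"
>
> primitive dummy01 ocf:heartbeat:Dummy-stop-NG \
> op stop timeout="60s" interval="0s" on-fail="fence"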
>
> Thanks,
> Junko
>
> 2012/2/29 Junko IKEDA <tsukishima.ha at gmail.com>:
>> Hi,
>>
>> I'm running the following simple configuration with Pacemaker 1.1.6
>> and trying the test case "resource stop fails (stop NG), then shut down Pacemaker".
>>
>> property \
>> no-quorum-policy="ignore" \
>> stonith-enabled="false" \
>> crmd-transition-delay="2s"
>>
>> rsc_defaults \
>> resource-stickiness="INFINITY" \
>> migration-threshold="1"
>>
>> primitive dummy01 ocf:heartbeat:Dummy-stop-NG \
>> op start timeout="60s" interval="0s" on-fail="restart" \
>> op monitor timeout="60s" interval="7s" on-fail="restart" \
>> op stop timeout="60s" interval="0s" on-fail="block"
>>
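>> For reference, a configuration like this can usually be loaded and
>> checked with the crm shell (the file name below is just an example):
>>
>> # crm configure load update test.crm
>> # crm configure show
>> # crm_verify -L -V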
>>
>> "Dummy-stop-NG" RA just sends "stop NG" to Pacemaker.
>>
>> # diff -urNp Dummy Dummy-stop-NG
>> --- Dummy 2011-06-30 17:43:37.000000000 +0900
>> +++ Dummy-stop-NG 2012-02-28 19:11:12.850207767 +0900
>> @@ -108,6 +108,8 @@ dummy_start() {
>> }
>>
>> dummy_stop() {
>> + exit $OCF_ERR_GENERIC
>> +
>> dummy_monitor
>> if [ $? = $OCF_SUCCESS ]; then
>> rm ${OCF_RESKEY_state}
>>
>>
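>> For completeness, this is roughly how dummy_stop() reads after the diff
>> above is applied (assuming the rest of the agent is the stock Dummy):
>>
>> dummy_stop() {
>>     # always report a stop failure, so the resource ends up
>>     # failed/unmanaged because of on-fail="block"
>>     exit $OCF_ERR_GENERIC
>>
>>     # never reached: the original Dummy stop logic
>>     dummy_monitor
>>     if [ $? = $OCF_SUCCESS ]; then
>>         rm ${OCF_RESKEY_state}
>>     fi
>>     return $OCF_SUCCESS
>> }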
>>
>> Before the test, the resource is running on "bl460g6a".
>>
>> # crm_simulate -S -x pe-input-1.bz2
>>
>> Current cluster status:
>> Online: [ bl460g6a bl460g6b ]
>>
>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Stopped
>>
>> Transition Summary:
>> crm_simulate[14195]: 2012/02/29_15:46:57 notice: LogActions: Start
>> dummy01 (bl460g6a)
>>
>> Executing cluster transition:
>> * Executing action 6: dummy01_monitor_0 on bl460g6b
>> * Executing action 4: dummy01_monitor_0 on bl460g6a
>> * Executing action 7: dummy01_start_0 on bl460g6a
>> * Executing action 8: dummy01_monitor_7000 on bl460g6a
>>
>> Revised cluster status:
>> Online: [ bl460g6a bl460g6b ]
>>
>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>>
>>
>>
>> Stop Pacemaker on "bl460g6a".
>> # service heartbeat stop
>>
>> Pacemaker first tries to stop the resource and move it to "bl460g6b":
>> # crm_simulate -S -x pe-input-2.bz2
>>
>> Current cluster status:
>> Online: [ bl460g6a bl460g6b ]
>>
>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>>
>> Transition Summary:
>> crm_simulate[12195]: 2012/02/29_15:35:02 notice: LogActions: Move
>> dummy01 (Started bl460g6a -> bl460g6b)
>>
>> Executing cluster transition:
>> * Executing action 6: dummy01_stop_0 on bl460g6a
>> * Executing action 7: dummy01_start_0 on bl460g6b
>> * Executing action 8: dummy01_monitor_7000 on bl460g6b
>>
>> Revised cluster status:
>> Online: [ bl460g6a bl460g6b ]
>>
>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6b
>>
>>
>>
>> but this stop action fails, which means the resource goes into the unmanaged state.
>> # crm_simulate -S -x pe-input-3.bz2
>>
>> Current cluster status:
>> Online: [ bl460g6a bl460g6b ]
>>
>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>> (unmanaged) FAILED
>>
>> Transition Summary:
>>
>> Executing cluster transition:
>>
>> Revised cluster status:
>> Online: [ bl460g6a bl460g6b ]
>>
>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>> (unmanaged) FAILED
>>
>>
>>
>> Pacemaker shutdown on "bl460g6a" then completes successfully,
>> so the following patch seems to work as intended.
>> https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c
>>
>> At this point the resource on "bl460g6a" (where Pacemaker has already
>> shut down) may still be running, because its stop failed.
>> In fact, the resource did not start on "bl460g6b" after the failed stop
>> and "bl460g6a"'s shutdown, which is the expected behavior,
>> but I could still start it on "bl460g6b" with the crm command.
>> This leaves the potential for an unexpected active/active situation.
>> Is it possible to prevent the resource from starting in this case?
>> For example:
>> (1) Dummy runs on node-a
>> (2) shut down Pacemaker on node-a, and Dummy's stop fails (stop NG)
>> (3) Dummy cannot run on other nodes
>> (4) * clean up Dummy's unmanaged status after manually confirming its
>> state on node-a (see the sketch below)
>> (5) * start Dummy on another node
>> This would be the safe way.
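>>
>> A rough sketch of steps (4) and (5) with the crm shell (assuming the
>> dummy01 resource from the configuration above; commands may differ
>> slightly by crmsh version):
>>
>> After manually confirming on node-a that the resource is really stopped:
>> # crm resource cleanup dummy01
>> Then let it start on another node:
>> # crm resource start dummy01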
>>
>> See attached hb_report.
>>
>> Thanks,
>> Junko IKEDA
>>
>> NTT DATA INTELLILINK CORPORATION