[Pacemaker] prevent the resource's start if it has "stop NG" history on the other node
Junko IKEDA
tsukishima.ha at gmail.com
Wed Feb 29 09:30:11 UTC 2012
Hi,
sorry for mailing again.
I checked the latest code, and it reads:
} else if (wrapper->action->rsc
           && wrapper->action->rsc != action->rsc
           && is_set(wrapper->action->rsc->flags, pe_rsc_failed)
           && is_not_set(wrapper->action->rsc->flags, pe_rsc_managed)
           && strstr(wrapper->action->uuid, "_stop_0")
           && action->rsc && action->rsc->variant >= pe_clone) {
    crm_warn("Ignoring requirement that %s comeplete before %s:"
             " unmanaged failed resources cannot prevent clone shutdown",
             wrapper->action->uuid, action->uuid);
    return FALSE;
It seems that lf#1959 addresses the clone resource case;
the behavior I reported is a different one.
Under the current specification, is a failed stop ("stop NG") supposed to prevent Pacemaker shutdown?
Thanks,
Junko
2012/2/29 Junko IKEDA <tsukishima.ha at gmail.com>:
> Hi,
>
> Additional information:
> (1) the resource is running on the DC
> (2) shut down Pacemaker on the DC; the resource's stop fails ("stop NG") and it becomes unmanaged
> (3) the other node becomes DC
> (4) the resource starts on the new DC
> (even though it still has unmanaged status on the old DC...)
>
> See the other attached hb_report.
>
> By the way, this patch means that even if there are unmanaged resources,
> a "Pacemaker shutdown" operation still completes successfully, right?
>
> High: PE: Bug lf#1959 - Fail unmanaged resources should not prevent
> other services from shutting down
> https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c
>
> I don't know the details of lf#1959, and it would be better to set up
> STONITH to handle a resource left unmanaged by a failed "stop",
> but just in case, a failed stop ("stop NG") should not allow Pacemaker
> to shut itself down.
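>
> For example, a rough sketch of that idea (only the changed parts are
> shown; the actual stonith resource definition is omitted and depends on
> the hardware):
>
> property \
> stonith-enabled="true"
>
> primitive dummy01 ocf:heartbeat:Dummy-stop-NG \
> op stop timeout="60s" interval="0s" on-fail="fence"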
>
> Thanks,
> Junko
>
> 2012/2/29 Junko IKEDA <tsukishima.ha at gmail.com>:
>> Hi,
>>
>> I'm running the following simple configuration with Pacemaker 1.1.6
>> and trying the test case "resource stop fails (stop NG), then shut down Pacemaker".
>>
>> property \
>> no-quorum-policy="ignore" \
>> stonith-enabled="false" \
>> crmd-transition-delay="2s"
>>
>> rsc_defaults \
>> resource-stickiness="INFINITY" \
>> migration-threshold="1"
>>
>> primitive dummy01 ocf:heartbeat:Dummy-stop-NG \
>> op start timeout="60s" interval="0s" on-fail="restart" \
>> op monitor timeout="60s" interval="7s" on-fail="restart" \
>> op stop timeout="60s" interval="0s" on-fail="block"
>>
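>> For reference, a configuration like this can usually be loaded and
>> checked with the crm shell (the file name below is just an example):
>>
>> # crm configure load update test.crm
>> # crm configure show
>> # crm_verify -L -V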
>>
>> "Dummy-stop-NG" RA just sends "stop NG" to Pacemaker.
>>
>> # diff -urNp Dummy Dummy-stop-NG
>> --- Dummy 2011-06-30 17:43:37.000000000 +0900
>> +++ Dummy-stop-NG 2012-02-28 19:11:12.850207767 +0900
>> @@ -108,6 +108,8 @@ dummy_start() {
>> }
>>
>> dummy_stop() {
>> + exit $OCF_ERR_GENERIC
>> +
>> dummy_monitor
>> if [ $? = $OCF_SUCCESS ]; then
>> rm ${OCF_RESKEY_state}
>>
>>
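>> For completeness, this is roughly how dummy_stop() reads after the diff
>> above is applied (assuming the rest of the agent is the stock Dummy):
>>
>> dummy_stop() {
>>     # always report a stop failure, so the resource ends up
>>     # failed/unmanaged because of on-fail="block"
>>     exit $OCF_ERR_GENERIC
>>
>>     # never reached: the original Dummy stop logic
>>     dummy_monitor
>>     if [ $? = $OCF_SUCCESS ]; then
>>         rm ${OCF_RESKEY_state}
>>     fi
>>     return $OCF_SUCCESS
>> }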
>>
>> Before the test, the resource is running on "bl460g6a".
>>
>> # crm_simulate -S -x pe-input-1.bz2
>>
>> Current cluster status:
>> Online: [ bl460g6a bl460g6b ]
>>
>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Stopped
>>
>> Transition Summary:
>> crm_simulate[14195]: 2012/02/29_15:46:57 notice: LogActions: Start
>> dummy01 (bl460g6a)
>>
>> Executing cluster transition:
>> * Executing action 6: dummy01_monitor_0 on bl460g6b
>> * Executing action 4: dummy01_monitor_0 on bl460g6a
>> * Executing action 7: dummy01_start_0 on bl460g6a
>> * Executing action 8: dummy01_monitor_7000 on bl460g6a
>>
>> Revised cluster status:
>> Online: [ bl460g6a bl460g6b ]
>>
>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>>
>>
>>
>> Stop Pacemaker on "bl460g6a".
>> # service heartbeat stop
>>
>> Pacemaker first tries to stop the resource and move it to "bl460g6b":
>> # crm_simulate -S -x pe-input-2.bz2
>>
>> Current cluster status:
>> Online: [ bl460g6a bl460g6b ]
>>
>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>>
>> Transition Summary:
>> crm_simulate[12195]: 2012/02/29_15:35:02 notice: LogActions: Move
>> dummy01 (Started bl460g6a -> bl460g6b)
>>
>> Executing cluster transition:
>> * Executing action 6: dummy01_stop_0 on bl460g6a
>> * Executing action 7: dummy01_start_0 on bl460g6b
>> * Executing action 8: dummy01_monitor_7000 on bl460g6b
>>
>> Revised cluster status:
>> Online: [ bl460g6a bl460g6b ]
>>
>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6b
>>
>>
>>
>> but this stop action fails, which means the resource goes into the unmanaged state.
>> # crm_simulate -S -x pe-input-3.bz2
>>
>> Current cluster status:
>> Online: [ bl460g6a bl460g6b ]
>>
>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>> (unmanaged) FAILED
>>
>> Transition Summary:
>>
>> Executing cluster transition:
>>
>> Revised cluster status:
>> Online: [ bl460g6a bl460g6b ]
>>
>> dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>> (unmanaged) FAILED
>>
>>
>>
>> Pacemaker shutdown on "bl460g6a" then completes successfully,
>> so the following patch seems to work as intended.
>> https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c
>>
>> At this point the resource on "bl460g6a" (where Pacemaker has already
>> shut down) may still be running, because its stop failed.
>> In fact, the resource did not start on "bl460g6b" after the failed stop
>> and "bl460g6a"'s shutdown, which is the expected behavior,
>> but I could still start it on "bl460g6b" with the crm command.
>> This leaves the potential for an unexpected active/active situation.
>> Is it possible to prevent the resource from starting in this case?
>> For example:
>> (1) Dummy runs on node-a
>> (2) shut down Pacemaker on node-a, and Dummy's stop fails (stop NG)
>> (3) Dummy cannot run on other nodes
>> (4) * clean up Dummy's unmanaged status after manually confirming its
>> state on node-a (see the sketch below)
>> (5) * start Dummy on another node
>> This would be the safe way.
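>>
>> A rough sketch of steps (4) and (5) with the crm shell (assuming the
>> dummy01 resource from the configuration above; commands may differ
>> slightly by crmsh version):
>>
>> After manually confirming on node-a that the resource is really stopped:
>> # crm resource cleanup dummy01
>> Then let it start on another node:
>> # crm resource start dummy01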
>>
>> See attached hb_report.
>>
>> Thanks,
>> Junko IKEDA
>>
>> NTT DATA INTELLILINK CORPORATION