[Pacemaker] Can't failover Master/Slave with group(primitive x3) setting

Fri Oct 7 00:39:01 UTC 2011

On Fri, Sep 30, 2011 at 3:44 PM, Junko IKEDA <tsukishima.ha at gmail.com> wrote:
> Hi,
>
> sorry for the confusion.
>
> Pacemaker 1.0.10 OK (group resource can fail over)
> Pacemaker 1.0.11 NG (group resource just stops, cannot fail over)
> Pacemaker 1.1 <- the latest hg (group resource just stops, cannot fail over)

We've actually moved 1.1 over to git:
   http://www.clusterlabs.org/wiki/Contributing_Patches

I should mark that somehow in the HG tree.

>
> By the way, your simulation showed dummy01 restart on bl460g1n13 again,
> but dummy01 failed on bl460g1n13, so dummy01 should move to bl460g1n14.

Hmmm. True.  I'll take another look.
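
For comparing builds against the same pe-input file, the check described above (a resource that hit its failure limit should be moved, not restarted in place or merely stopped) can be sketched as a small filter over the policy engine's LogActions lines. This is only a rough sketch: check_failover is a made-up helper name, not part of Pacemaker, and it assumes the unwrapped one-line-per-action output that ptest and crm_simulate print, not the mail-wrapped form quoted in this thread.

```shell
# Hypothetical helper (not part of Pacemaker): reads policy-engine output
# on stdin and reports what is planned for the named resource.
check_failover() {
    rsc="$1"
    awk -v rsc="$rsc" '
        # Only look at LogActions lines that mention the resource.
        /LogActions:/ && $0 ~ rsc {
            if ($0 ~ /Move/)            { print "moved";              found = 1; exit }
            if ($0 ~ /Recover|Restart/) { print "restarted in place"; found = 1; exit }
            if ($0 ~ /Stop/)            { print "stopped only";       found = 1; exit }
        }
        # Reached after exit as well, so only print when nothing matched.
        END { if (!found) print "no action" }
    '
}
```

Assuming crm_simulate is installed and the pe-input file is in the current directory, something like `crm_simulate -S -x pe-input-7.bz2 2>&1 | check_failover dummy01` should print "moved" on a build that fails over correctly. The resource name is matched as a plain pattern, so a name that is a prefix of another resource would need a stricter match.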

> Current cluster status:
> Online: [ bl460g1n13 bl460g1n14 ]
>
>  Resource Group: grpDRBD
>     dummy01    (ocf::pacemaker:Dummy): Started bl460g1n13 FAILED
>     dummy02    (ocf::pacemaker:Dummy): Started bl460g1n13
>     dummy03    (ocf::pacemaker:Dummy): Started bl460g1n13
>  Master/Slave Set: msDRBD [prmDRBD]
>     Masters: [ bl460g1n13 ]
>     Slaves: [ bl460g1n14 ]
>
>  Transition Summary:
>  crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Recover dummy01 (Started bl460g1n13)
>  crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Restart dummy02 (Started bl460g1n13)
>  crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Restart dummy03 (Started bl460g1n13)
>  crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Leave prmDRBD:0       (Master bl460g1n13)
>  crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Leave prmDRBD:1       (Slave bl460g1n14)
>
>  Executing cluster transition:
>  * Executing action 14: dummy03_stop_0 on bl460g1n13
>  * Executing action 12: dummy02_stop_0 on bl460g1n13
>  * Executing action 2: dummy01_stop_0 on bl460g1n13
>  * Executing action 11: dummy01_start_0 on bl460g1n13
>  * Executing action 1: dummy01_monitor_10000 on bl460g1n13
>  * Executing action 13: dummy02_start_0 on bl460g1n13
>  * Executing action 3: dummy02_monitor_10000 on bl460g1n13
>  * Executing action 15: dummy03_start_0 on bl460g1n13
>  * Executing action 4: dummy03_monitor_10000 on bl460g1n13
>
>
> Thanks,
> Junko
>
>
>
> 2011/9/29 Andrew Beekhof <andrew at beekhof.net>:
>> On Tue, Sep 27, 2011 at 2:31 PM, Junko IKEDA <tsukishima.ha at gmail.com> wrote:
>>> Hi,
>>>
>>>> Which version did you check?
>>>
>>> Pacemaker 1.0.11.
>>
>> I meant of 1.1 since you said:
>>
>>  "Pacemaker 1.1 shows the same behavior."
>>
>>>
>>>> The latest from git seems to work fine:
>>>>
>>>> Current cluster status:
>>>> Online: [ bl460g1n13 bl460g1n14 ]
>>>>
>>>>  Resource Group: grpDRBD
>>>>     dummy01    (ocf::pacemaker:Dummy): Started bl460g1n13 FAILED
>>>>     dummy02    (ocf::pacemaker:Dummy): Started bl460g1n13
>>>>     dummy03    (ocf::pacemaker:Dummy): Started bl460g1n13
>>>>  Master/Slave Set: msDRBD [prmDRBD]
>>>>     Masters: [ bl460g1n13 ]
>>>>     Slaves: [ bl460g1n14 ]
>>>>
>>>> Transition Summary:
>>>> crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Recover dummy01 (Started bl460g1n13)
>>>> crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Restart dummy02 (Started bl460g1n13)
>>>> crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Restart dummy03 (Started bl460g1n13)
>>>> crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Leave prmDRBD:0       (Master bl460g1n13)
>>>> crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Leave prmDRBD:1       (Slave bl460g1n14)
>>>>
>>>> Executing cluster transition:
>>>>  * Executing action 14: dummy03_stop_0 on bl460g1n13
>>>>  * Executing action 12: dummy02_stop_0 on bl460g1n13
>>>>  * Executing action 2: dummy01_stop_0 on bl460g1n13
>>>>  * Executing action 11: dummy01_start_0 on bl460g1n13
>>>>  * Executing action 1: dummy01_monitor_10000 on bl460g1n13
>>>>  * Executing action 13: dummy02_start_0 on bl460g1n13
>>>>  * Executing action 3: dummy02_monitor_10000 on bl460g1n13
>>>>  * Executing action 15: dummy03_start_0 on bl460g1n13
>>>>  * Executing action 4: dummy03_monitor_10000 on bl460g1n13
>>>
>>> dummy01 has a fail-count,
>>> so dummy01 should move from bl460g1n13 to bl460g1n14.
>>> Why does it restart on the node where it failed?
>>>
>>> I got the latest changeset from hg;
>>>
>>> # hg log | head -n 7
>>> changeset:   15777:a15ead49e20f
>>> branch:      stable-1.0
>>> tag:         tip
>>> user:        Andrew Beekhof <andrew at beekhof.net>
>>> date:        Thu Aug 25 16:49:59 2011 +1000
>>> summary:     changeset: 15775:fe18a1ad46f8
>>>
>>> # crm
>>> crm(live)# cib import pe-input-7.bz2
>>> crm(pe-input-7)# configure ptest vvv
>>> ptest[19194]: 2011/09/27_11:53:45 notice: unpack_config: On loss of CCM Quorum: Ignore
>>> ptest[19194]: 2011/09/27_11:53:45 WARN: unpack_nodes: Blind faith: not fencing unseen nodes
>>> ptest[19194]: 2011/09/27_11:53:45 notice: group_print:  Resource Group: grpDRBD
>>> ptest[19194]: 2011/09/27_11:53:45 notice: native_print:      dummy01 (ocf::pacemaker:Dummy): Started bl460g1n13
>>> ptest[19194]: 2011/09/27_11:53:45 notice: native_print:      dummy02 (ocf::pacemaker:Dummy): Started bl460g1n13
>>> ptest[19194]: 2011/09/27_11:53:45 notice: native_print:      dummy03 (ocf::pacemaker:Dummy): Started bl460g1n13
>>> ptest[19194]: 2011/09/27_11:53:45 notice: clone_print:  Master/Slave Set: msDRBD
>>> ptest[19194]: 2011/09/27_11:53:45 notice: short_print:      Masters: [ bl460g1n13 ]
>>> ptest[19194]: 2011/09/27_11:53:45 notice: short_print:      Slaves: [ bl460g1n14 ]
>>> ptest[19194]: 2011/09/27_11:53:45 WARN: common_apply_stickiness: Forcing dummy01 away from bl460g1n13 after 1 failures (max=1)
>>> ptest[19194]: 2011/09/27_11:53:45 notice: LogActions: Stop    resource dummy01  (bl460g1n13)
>>> ptest[19194]: 2011/09/27_11:53:45 notice: LogActions: Stop    resource dummy02  (bl460g1n13)
>>> ptest[19194]: 2011/09/27_11:53:45 notice: LogActions: Stop    resource dummy03  (bl460g1n13)
>>> ptest[19194]: 2011/09/27_11:53:45 notice: LogActions: Leave   resource prmDRBD:0        (Master bl460g1n13)
>>> ptest[19194]: 2011/09/27_11:53:45 notice: LogActions: Leave   resource prmDRBD:1        (Slave bl460g1n14)
>>> INFO: install graphviz to see a transition graph
>>> crm(pe-input-7)# quit
>>>
>>>
>>> reverts to Pacemaker 1.0.11,
>>>
>>> # hg revert -a -r b2e39d318fda
>>> # make install
>>>
>>> # crm
>>> crm(live)# cib import pe-input-7.bz2
>>> crm(pe-input-7)# configure ptest vvv
>>> ptest[751]: 2011/09/27_11:57:50 notice: unpack_config: On loss of CCM Quorum: Ignore
>>> ptest[751]: 2011/09/27_11:57:50 WARN: unpack_nodes: Blind faith: not fencing unseen nodes
>>> ptest[751]: 2011/09/27_11:57:50 notice: group_print:  Resource Group: grpDRBD
>>> ptest[751]: 2011/09/27_11:57:50 notice: native_print:      dummy01 (ocf::pacemaker:Dummy): Started bl460g1n13
>>> ptest[751]: 2011/09/27_11:57:50 notice: native_print:      dummy02 (ocf::pacemaker:Dummy): Started bl460g1n13
>>> ptest[751]: 2011/09/27_11:57:50 notice: native_print:      dummy03 (ocf::pacemaker:Dummy): Started bl460g1n13
>>> ptest[751]: 2011/09/27_11:57:50 notice: clone_print:  Master/Slave Set: msDRBD
>>> ptest[751]: 2011/09/27_11:57:50 notice: short_print:      Masters: [ bl460g1n13 ]
>>> ptest[751]: 2011/09/27_11:57:50 notice: short_print:      Slaves: [ bl460g1n14 ]
>>> ptest[751]: 2011/09/27_11:57:50 WARN: common_apply_stickiness: Forcing dummy01 away from bl460g1n13 after 1 failures (max=1)
>>> ptest[751]: 2011/09/27_11:57:50 notice: RecurringOp:  Start recurring monitor (10s) for dummy01 on bl460g1n14
>>> ptest[751]: 2011/09/27_11:57:50 notice: RecurringOp:  Start recurring monitor (10s) for dummy02 on bl460g1n14
>>> ptest[751]: 2011/09/27_11:57:50 notice: RecurringOp:  Start recurring monitor (10s) for dummy03 on bl460g1n14
>>> ptest[751]: 2011/09/27_11:57:50 notice: RecurringOp:  Start recurring monitor (20s) for prmDRBD:0 on bl460g1n13
>>> ptest[751]: 2011/09/27_11:57:50 notice: RecurringOp:  Start recurring monitor (10s) for prmDRBD:1 on bl460g1n14
>>> ptest[751]: 2011/09/27_11:57:50 notice: RecurringOp:  Start recurring monitor (20s) for prmDRBD:0 on bl460g1n13
>>> ptest[751]: 2011/09/27_11:57:50 notice: RecurringOp:  Start recurring monitor (10s) for prmDRBD:1 on bl460g1n14
>>> ptest[751]: 2011/09/27_11:57:50 notice: LogActions: Move resource dummy01       (Started bl460g1n13 -> bl460g1n14)
>>> ptest[751]: 2011/09/27_11:57:50 notice: LogActions: Move resource dummy02       (Started bl460g1n13 -> bl460g1n14)
>>> ptest[751]: 2011/09/27_11:57:50 notice: LogActions: Move resource dummy03       (Started bl460g1n13 -> bl460g1n14)
>>> ptest[751]: 2011/09/27_11:57:50 notice: LogActions: Demote prmDRBD:0 (Master -> Slave bl460g1n13)
>>> ptest[751]: 2011/09/27_11:57:50 notice: LogActions: Promote prmDRBD:1 (Slave -> Master bl460g1n14)
>>> INFO: install graphviz to see a transition graph
>>>
>>> Pacemaker 1.0.10 moved the failed resource to the other node.
>>> That is the expected behavior.
>>>
>>> I attached the hb_report which includes the above pe-input-7.bz2.
>>>
>>> Thanks,
>>> Junko
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>>
>>>