[Pacemaker] Can't failover Master/Slave with group(primitive x3) setting

Junko IKEDA tsukishima.ha at gmail.com
Fri Sep 30 01:44:08 EDT 2011


Hi,

sorry for the confusion.

Pacemaker 1.0.10: OK (the group resources can fail over)
Pacemaker 1.0.11: NG (the group resources just stop and cannot fail over)
Pacemaker 1.1 (the latest hg): NG (the group resources just stop and cannot fail over)

By the way, your simulation shows dummy01 restarting on bl460g1n13 again,
but dummy01 failed on bl460g1n13, so it should move to bl460g1n14.
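
For reference, the essential part of the configuration is roughly as follows.
This is simplified: the prmDRBD primitive definition and the constraints tying
grpDRBD to the msDRBD master role are omitted, and migration-threshold is
shown here via rsc_defaults although it could also be a per-resource meta
attribute (the value 1 matches the "max=1" in the ptest output quoted below):

  primitive dummy01 ocf:pacemaker:Dummy op monitor interval="10s"
  primitive dummy02 ocf:pacemaker:Dummy op monitor interval="10s"
  primitive dummy03 ocf:pacemaker:Dummy op monitor interval="10s"
  group grpDRBD dummy01 dummy02 dummy03
  ms msDRBD prmDRBD
  rsc_defaults migration-threshold="1"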

Current cluster status:
Online: [ bl460g1n13 bl460g1n14 ]

  Resource Group: grpDRBD
     dummy01    (ocf::pacemaker:Dummy): Started bl460g1n13 FAILED
     dummy02    (ocf::pacemaker:Dummy): Started bl460g1n13
     dummy03    (ocf::pacemaker:Dummy): Started bl460g1n13
  Master/Slave Set: msDRBD [prmDRBD]
     Masters: [ bl460g1n13 ]
     Slaves: [ bl460g1n14 ]

 Transition Summary:
 crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Recover
 dummy01 (Started bl460g1n13)
 crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Restart
 dummy02 (Started bl460g1n13)
 crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Restart
 dummy03 (Started bl460g1n13)
 crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Leave
 prmDRBD:0       (Master bl460g1n13)
 crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Leave
 prmDRBD:1       (Slave bl460g1n14)

 Executing cluster transition:
  * Executing action 14: dummy03_stop_0 on bl460g1n13
  * Executing action 12: dummy02_stop_0 on bl460g1n13
  * Executing action 2: dummy01_stop_0 on bl460g1n13
  * Executing action 11: dummy01_start_0 on bl460g1n13
  * Executing action 1: dummy01_monitor_10000 on bl460g1n13
  * Executing action 13: dummy02_start_0 on bl460g1n13
  * Executing action 3: dummy02_monitor_10000 on bl460g1n13
  * Executing action 15: dummy03_start_0 on bl460g1n13
  * Executing action 4: dummy03_monitor_10000 on bl460g1n13
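
(For reference, this transition should be reproducible from the pe-input-7.bz2
included in the attached hb_report, either with something like

   # crm_simulate -x pe-input-7.bz2 -S

or by importing it into the crm shell as in the ptest runs quoted below.)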


Thanks,
Junko



2011/9/29 Andrew Beekhof <andrew at beekhof.net>:
> On Tue, Sep 27, 2011 at 2:31 PM, Junko IKEDA <tsukishima.ha at gmail.com> wrote:
>> Hi,
>>
>>> Which version did you check?
>>
>> Pacemaker 1.0.11.
>
> I meant of 1.1 since you said:
>
>  "Pacemaker 1.1 shows the same behavior."
>
>>
>>> The latest from git seems to work fine:
>>>
>>> Current cluster status:
>>> Online: [ bl460g1n13 bl460g1n14 ]
>>>
>>>  Resource Group: grpDRBD
>>>     dummy01    (ocf::pacemaker:Dummy): Started bl460g1n13 FAILED
>>>     dummy02    (ocf::pacemaker:Dummy): Started bl460g1n13
>>>     dummy03    (ocf::pacemaker:Dummy): Started bl460g1n13
>>>  Master/Slave Set: msDRBD [prmDRBD]
>>>     Masters: [ bl460g1n13 ]
>>>     Slaves: [ bl460g1n14 ]
>>>
>>> Transition Summary:
>>> crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Recover
>>> dummy01 (Started bl460g1n13)
>>> crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Restart
>>> dummy02 (Started bl460g1n13)
>>> crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Restart
>>> dummy03 (Started bl460g1n13)
>>> crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Leave
>>> prmDRBD:0       (Master bl460g1n13)
>>> crm_simulate[13781]: 2011/09/26_15:00:05 notice: LogActions: Leave
>>> prmDRBD:1       (Slave bl460g1n14)
>>>
>>> Executing cluster transition:
>>>  * Executing action 14: dummy03_stop_0 on bl460g1n13
>>>  * Executing action 12: dummy02_stop_0 on bl460g1n13
>>>  * Executing action 2: dummy01_stop_0 on bl460g1n13
>>>  * Executing action 11: dummy01_start_0 on bl460g1n13
>>>  * Executing action 1: dummy01_monitor_10000 on bl460g1n13
>>>  * Executing action 13: dummy02_start_0 on bl460g1n13
>>>  * Executing action 3: dummy02_monitor_10000 on bl460g1n13
>>>  * Executing action 15: dummy03_start_0 on bl460g1n13
>>>  * Executing action 4: dummy03_monitor_10000 on bl460g1n13
>>
>> dummy01 has a fail-count now,
>> so it should move from bl460g1n13 to bl460g1n14.
>> Why does it restart on the node where it failed?
>>
>> I got the latest changeset from hg;
>>
>> # hg log | head -n 7
>> changeset:   15777:a15ead49e20f
>> branch:      stable-1.0
>> tag:         tip
>> user:        Andrew Beekhof <andrew at beekhof.net>
>> date:        Thu Aug 25 16:49:59 2011 +1000
>> summary:     changeset: 15775:fe18a1ad46f8
>>
>> # crm
>> crm(live)# cib import pe-input-7.bz2
>> crm(pe-input-7)# configure ptest vvv
>> ptest[19194]: 2011/09/27_11:53:45 notice: unpack_config: On loss of
>> CCM Quorum: Ignore
>> ptest[19194]: 2011/09/27_11:53:45 WARN: unpack_nodes: Blind faith: not
>> fencing unseen nodes
>> ptest[19194]: 2011/09/27_11:53:45 notice: group_print:  Resource Group: grpDRBD
>> ptest[19194]: 2011/09/27_11:53:45 notice: native_print:      dummy01
>>  (ocf::pacemaker:Dummy): Started bl460g1n13
>> ptest[19194]: 2011/09/27_11:53:45 notice: native_print:      dummy02
>>  (ocf::pacemaker:Dummy): Started bl460g1n13
>> ptest[19194]: 2011/09/27_11:53:45 notice: native_print:      dummy03
>>  (ocf::pacemaker:Dummy): Started bl460g1n13
>> ptest[19194]: 2011/09/27_11:53:45 notice: clone_print:  Master/Slave Set: msDRBD
>> ptest[19194]: 2011/09/27_11:53:45 notice: short_print:      Masters: [
>> bl460g1n13 ]
>> ptest[19194]: 2011/09/27_11:53:45 notice: short_print:      Slaves: [
>> bl460g1n14 ]
>> ptest[19194]: 2011/09/27_11:53:45 WARN: common_apply_stickiness:
>> Forcing dummy01 away from bl460g1n13 after 1 failures (max=1)
>> ptest[19194]: 2011/09/27_11:53:45 notice: LogActions: Stop    resource
>> dummy01  (bl460g1n13)
>> ptest[19194]: 2011/09/27_11:53:45 notice: LogActions: Stop    resource
>> dummy02  (bl460g1n13)
>> ptest[19194]: 2011/09/27_11:53:45 notice: LogActions: Stop    resource
>> dummy03  (bl460g1n13)
>> ptest[19194]: 2011/09/27_11:53:45 notice: LogActions: Leave   resource
>> prmDRBD:0        (Master bl460g1n13)
>> ptest[19194]: 2011/09/27_11:53:45 notice: LogActions: Leave   resource
>> prmDRBD:1        (Slave bl460g1n14)
>> INFO: install graphviz to see a transition graph
>> crm(pe-input-7)# quit
>>
>>
>> Then I reverted to Pacemaker 1.0.11:
>>
>> # hg revert -a -r b2e39d318fda
>> # make install
>>
>> # crm
>> crm(live)# cib import pe-input-7.bz2
>> crm(pe-input-7)# configure ptest vvv
>> ptest[751]: 2011/09/27_11:57:50 notice: unpack_config: On loss of CCM
>> Quorum: Ignore
>> ptest[751]: 2011/09/27_11:57:50 WARN: unpack_nodes: Blind faith: not
>> fencing unseen nodes
>> ptest[751]: 2011/09/27_11:57:50 notice: group_print:  Resource Group: grpDRBD
>> ptest[751]: 2011/09/27_11:57:50 notice: native_print:      dummy01
>>  (ocf::pacemaker:Dummy): Started bl460g1n13
>> ptest[751]: 2011/09/27_11:57:50 notice: native_print:      dummy02
>>  (ocf::pacemaker:Dummy): Started bl460g1n13
>> ptest[751]: 2011/09/27_11:57:50 notice: native_print:      dummy03
>>  (ocf::pacemaker:Dummy): Started bl460g1n13
>> ptest[751]: 2011/09/27_11:57:50 notice: clone_print:  Master/Slave Set: msDRBD
>> ptest[751]: 2011/09/27_11:57:50 notice: short_print:      Masters: [
>> bl460g1n13 ]
>> ptest[751]: 2011/09/27_11:57:50 notice: short_print:      Slaves: [ bl460g1n14 ]
>> ptest[751]: 2011/09/27_11:57:50 WARN: common_apply_stickiness: Forcing
>> dummy01 away from bl460g1n13 after 1 failures (max=1)
>> ptest[751]: 2011/09/27_11:57:50 notice: RecurringOp:  Start recurring
>> monitor (10s) for dummy01 on bl460g1n14
>> ptest[751]: 2011/09/27_11:57:50 notice: RecurringOp:  Start recurring
>> monitor (10s) for dummy02 on bl460g1n14
>> ptest[751]: 2011/09/27_11:57:50 notice: RecurringOp:  Start recurring
>> monitor (10s) for dummy03 on bl460g1n14
>> ptest[751]: 2011/09/27_11:57:50 notice: RecurringOp:  Start recurring
>> monitor (20s) for prmDRBD:0 on bl460g1n13
>> ptest[751]: 2011/09/27_11:57:50 notice: RecurringOp:  Start recurring
>> monitor (10s) for prmDRBD:1 on bl460g1n14
>> ptest[751]: 2011/09/27_11:57:50 notice: RecurringOp:  Start recurring
>> monitor (20s) for prmDRBD:0 on bl460g1n13
>> ptest[751]: 2011/09/27_11:57:50 notice: RecurringOp:  Start recurring
>> monitor (10s) for prmDRBD:1 on bl460g1n14
>> ptest[751]: 2011/09/27_11:57:50 notice: LogActions: Move resource
>> dummy01       (Started bl460g1n13 -> bl460g1n14)
>> ptest[751]: 2011/09/27_11:57:50 notice: LogActions: Move resource
>> dummy02       (Started bl460g1n13 -> bl460g1n14)
>> ptest[751]: 2011/09/27_11:57:50 notice: LogActions: Move resource
>> dummy03       (Started bl460g1n13 -> bl460g1n14)
>> ptest[751]: 2011/09/27_11:57:50 notice: LogActions: Demote prmDRBD:0
>>  (Master -> Slave bl460g1n13)
>> ptest[751]: 2011/09/27_11:57:50 notice: LogActions: Promote prmDRBD:1
>>  (Slave -> Master bl460g1n14)
>> INFO: install graphviz to see a transition graph
>>
>> Pacemaker 1.0.10 moved the failed resource to the other node,
>> which is the expected behavior.
>>
>> I attached the hb_report which includes the above pe-input-7.bz2.
>>
>> Thanks,
>> Junko
>>
>



