[Pacemaker] [Problem] Fail-over is delayed. (State transition is not calculated.)

renayama19661014 at ybb.ne.jp renayama19661014 at ybb.ne.jp
Tue Feb 18 22:39:06 EST 2014


Hi Andrew,

> I'll follow up on the bug.

Thanks!

Hideo Yamauchi.

--- On Wed, 2014/2/19, Andrew Beekhof <andrew at beekhof.net> wrote:

> I'll follow up on the bug.
> 
> On 19 Feb 2014, at 10:55 am, renayama19661014 at ybb.ne.jp wrote:
> 
> > Hi David,
> > 
> > Thank you for comments.
> > 
> >> You have resource-stickiness=INFINITY; this is what is preventing the failover from occurring. Set resource-stickiness to 1 or 0 and the failover should occur.
> >> 
> > 
> > However, the resource only moves when the next state transition is calculated.
> > Shouldn't the resource be able to move in the transition calculated for the first failure?
> > 
> > In addition, the resource does move when I delete the following colocation constraint.
> > 
> > colocation rsc_colocation-master-3 INFINITY: vip-rep msPostgresql:Master
> > 
> > Is there a problem with Pacemaker's handling of this kind of colocation?
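> > 
> > (For reference, a sketch of how that constraint can be removed with the crm shell.
> > This is just an illustration, assuming crmsh is in use:)
> > 
> > crm configure delete rsc_colocation-master-3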
> > 
> > Best Regards,
> > Hideo Yamauchi.
> > 
> > --- On Wed, 2014/2/19, David Vossel <dvossel at redhat.com> wrote:
> > 
> >> 
> >> ----- Original Message -----
> >>> From: renayama19661014 at ybb.ne.jp
> >>> To: "PaceMaker-ML" <pacemaker at oss.clusterlabs.org>
> >>> Sent: Monday, February 17, 2014 7:06:53 PM
> >>> Subject: [Pacemaker] [Problem] Fail-over is delayed. (State transition is not calculated.)
> >>> 
> >>> Hi All,
> >>> 
> >>> I checked the behaviour at the time of a failure in a Master/Slave configuration in
> >>> Pacemaker 1.1.11.
> >>> 
> >>> -------------------------------------
> >>> 
> >>> Step 1) Build the cluster.
> >>> 
> >>> [root at srv01 ~]# crm_mon -1 -Af
> >>> Last updated: Tue Feb 18 18:07:24 2014
> >>> Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01
> >>> Stack: corosync
> >>> Current DC: srv01 (3232238180) - partition with quorum
> >>> Version: 1.1.10-9d39a6b
> >>> 2 Nodes configured
> >>> 6 Resources configured
> >>> 
> >>> 
> >>> Online: [ srv01 srv02 ]
> >>> 
> >>>   vip-master     (ocf::heartbeat:Dummy): Started srv01
> >>>   vip-rep        (ocf::heartbeat:Dummy): Started srv01
> >>>   Master/Slave Set: msPostgresql [pgsql]
> >>>       Masters: [ srv01 ]
> >>>       Slaves: [ srv02 ]
> >>>   Clone Set: clnPingd [prmPingd]
> >>>       Started: [ srv01 srv02 ]
> >>> 
> >>> Node Attributes:
> >>> * Node srv01:
> >>>      + default_ping_set                  : 100
> >>>      + master-pgsql                      : 10
> >>> * Node srv02:
> >>>      + default_ping_set                  : 100
> >>>      + master-pgsql                      : 5
> >>> 
> >>> Migration summary:
> >>> * Node srv01:
> >>> * Node srv02:
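> >>> 
> >>> (The master-pgsql attributes above are the promotion scores published by the pgsql
> >>> resource agent; agents typically maintain them with crm_master, roughly as in this
> >>> sketch, where the value 10 is simply the score seen above:)
> >>> 
> >>> # run by the agent on the node it prefers as Master (needs OCF_RESOURCE_INSTANCE set)
> >>> crm_master -l reboot -v 10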
> >>> 
> >>> Step 2) Cause a monitor error on vip-master.
> >>> 
> >>> [root at srv01 ~]# rm -rf /var/run/resource-agents/Dummy-vip-master.state
> >>> 
> >>> [root at srv01 ~]# crm_mon -1 -Af
> >>> Last updated: Tue Feb 18 18:07:58 2014
> >>> Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01
> >>> Stack: corosync
> >>> Current DC: srv01 (3232238180) - partition with quorum
> >>> Version: 1.1.10-9d39a6b
> >>> 2 Nodes configured
> >>> 6 Resources configured
> >>> 
> >>> 
> >>> Online: [ srv01 srv02 ]
> >>> 
> >>>   Master/Slave Set: msPostgresql [pgsql]
> >>>       Masters: [ srv01 ]
> >>>       Slaves: [ srv02 ]
> >>>   Clone Set: clnPingd [prmPingd]
> >>>       Started: [ srv01 srv02 ]
> >>> 
> >>> Node Attributes:
> >>> * Node srv01:
> >>>      + default_ping_set                  : 100
> >>>      + master-pgsql                      : 10
> >>> * Node srv02:
> >>>      + default_ping_set                  : 100
> >>>      + master-pgsql                      : 5
> >>> 
> >>> Migration summary:
> >>> * Node srv01:
> >>>     vip-master: migration-threshold=1 fail-count=1 last-failure='Tue Feb 18
> >>>     18:07:50 2014'
> >>> * Node srv02:
> >>> 
> >>> Failed actions:
> >>>      vip-master_monitor_10000 on srv01 'not running' (7): call=30,
> >>>      status=complete, last-rc-change='Tue Feb 18 18:07:50 2014', queued=0ms,
> >>>      exec=0ms
> >>> -------------------------------------
> >>> 
> >>> However, the resource does not fail over.
> >>> 
> >>> But when I check the CIB with crm_simulate at this point in time, the fail-over
> >>> is calculated.
> >>> 
> >>> -------------------------------------
> >>> [root at srv01 ~]# crm_simulate -L -s
> >>> 
> >>> Current cluster status:
> >>> Online: [ srv01 srv02 ]
> >>> 
> >>>   vip-master     (ocf::heartbeat:Dummy): Stopped
> >>>   vip-rep        (ocf::heartbeat:Dummy): Stopped
> >>>   Master/Slave Set: msPostgresql [pgsql]
> >>>       Masters: [ srv01 ]
> >>>       Slaves: [ srv02 ]
> >>>   Clone Set: clnPingd [prmPingd]
> >>>       Started: [ srv01 srv02 ]
> >>> 
> >>> Allocation scores:
> >>> clone_color: clnPingd allocation score on srv01: 0
> >>> clone_color: clnPingd allocation score on srv02: 0
> >>> clone_color: prmPingd:0 allocation score on srv01: INFINITY
> >>> clone_color: prmPingd:0 allocation score on srv02: 0
> >>> clone_color: prmPingd:1 allocation score on srv01: 0
> >>> clone_color: prmPingd:1 allocation score on srv02: INFINITY
> >>> native_color: prmPingd:0 allocation score on srv01: INFINITY
> >>> native_color: prmPingd:0 allocation score on srv02: 0
> >>> native_color: prmPingd:1 allocation score on srv01: -INFINITY
> >>> native_color: prmPingd:1 allocation score on srv02: INFINITY
> >>> clone_color: msPostgresql allocation score on srv01: 0
> >>> clone_color: msPostgresql allocation score on srv02: 0
> >>> clone_color: pgsql:0 allocation score on srv01: INFINITY
> >>> clone_color: pgsql:0 allocation score on srv02: 0
> >>> clone_color: pgsql:1 allocation score on srv01: 0
> >>> clone_color: pgsql:1 allocation score on srv02: INFINITY
> >>> native_color: pgsql:0 allocation score on srv01: INFINITY
> >>> native_color: pgsql:0 allocation score on srv02: 0
> >>> native_color: pgsql:1 allocation score on srv01: -INFINITY
> >>> native_color: pgsql:1 allocation score on srv02: INFINITY
> >>> pgsql:1 promotion score on srv02: 5
> >>> pgsql:0 promotion score on srv01: 1
> >>> native_color: vip-master allocation score on srv01: -INFINITY
> >>> native_color: vip-master allocation score on srv02: INFINITY
> >>> native_color: vip-rep allocation score on srv01: -INFINITY
> >>> native_color: vip-rep allocation score on srv02: INFINITY
> >>> 
> >>> Transition Summary:
> >>>   * Start   vip-master   (srv02)
> >>>   * Start   vip-rep      (srv02)
> >>>   * Demote  pgsql:0      (Master -> Slave srv01)
> >>>   * Promote pgsql:1      (Slave -> Master srv02)
> >>> 
> >>> -------------------------------------
> >>> 
> >>> In addition, the fail-over is calculated once the "cluster_recheck_interval" timer
> >>> fires.
> >>> 
> >>> The fail-over is also carried out if I run cibadmin -B.
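> >>> 
> >>> (For context: cibadmin -B just bumps the CIB version, which makes the policy engine
> >>> recalculate a transition, and the cluster-recheck-interval property, 15 minutes by
> >>> default, does the same on a timer. A sketch of shortening it with the crm shell,
> >>> where the 5min value is only an example:)
> >>> 
> >>> crm configure property cluster-recheck-interval="5min"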
> >>> 
> >>> -------------------------------------
> >>> [root at srv01 ~]# cibadmin -B
> >>> 
> >>> [root at srv01 ~]# crm_mon -1 -Af
> >>> Last updated: Tue Feb 18 18:21:15 2014
> >>> Last change: Tue Feb 18 18:21:00 2014 via cibadmin on srv01
> >>> Stack: corosync
> >>> Current DC: srv01 (3232238180) - partition with quorum
> >>> Version: 1.1.10-9d39a6b
> >>> 2 Nodes configured
> >>> 6 Resources configured
> >>> 
> >>> 
> >>> Online: [ srv01 srv02 ]
> >>> 
> >>>   vip-master     (ocf::heartbeat:Dummy): Started srv02
> >>>   vip-rep        (ocf::heartbeat:Dummy): Started srv02
> >>>   Master/Slave Set: msPostgresql [pgsql]
> >>>       Masters: [ srv02 ]
> >>>       Slaves: [ srv01 ]
> >>>   Clone Set: clnPingd [prmPingd]
> >>>       Started: [ srv01 srv02 ]
> >>> 
> >>> Node Attributes:
> >>> * Node srv01:
> >>>      + default_ping_set                  : 100
> >>>      + master-pgsql                      : 5
> >>> * Node srv02:
> >>>      + default_ping_set                  : 100
> >>>      + master-pgsql                      : 10
> >>> 
> >>> Migration summary:
> >>> * Node srv01:
> >>>     vip-master: migration-threshold=1 fail-count=1 last-failure='Tue Feb 18
> >>>     18:07:50 2014'
> >> 
> >> You have resource-stickiness=INFINITY; this is what is preventing the failover from occurring. Set resource-stickiness to 1 or 0 and the failover should occur.
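> >> 
> >> For example, assuming the stickiness is set cluster-wide in rsc_defaults, something
> >> like this sketch with the crm shell should do it:
> >> 
> >> # allow a failed resource to move by lowering the default stickiness
> >> crm configure rsc_defaults resource-stickiness=1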
> >> 
> >> -- Vossel
> >> 
> >>> * Node srv02:
> >>> 
> >>> Failed actions:
> >>>      vip-master_monitor_10000 on srv01 'not running' (7): call=30,
> >>>      status=complete, last-rc-change='Tue Feb 18 18:07:50 2014', queued=0ms,
> >>>      exec=0ms
> >>> 
> >>> -------------------------------------
> >>> 
> >>> It is a problem that the fail-over is delayed.
> >>> I think that Pacemaker is the cause of this delay between the error and the fail-over.
> >>> 
> >>> I have registered these details and the log information in Bugzilla.
> >>>   * http://bugs.clusterlabs.org/show_bug.cgi?id=5197
> >>> 
> >>> Best Regards,
> >>> Hideo Yamauchi.
> >>> 
> >>> 
> >>> _______________________________________________
> >>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>> 
> >>> Project Home: http://www.clusterlabs.org
> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> Bugs: http://bugs.clusterlabs.org
> >>> 
> >> 
> > 
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> 
> 



