[Pacemaker] [Problem] Fail-over is delayed. (State transition is not calculated.)

David Vossel dvossel at redhat.com
Tue Feb 18 11:18:44 EST 2014


----- Original Message -----
> From: renayama19661014 at ybb.ne.jp
> To: "PaceMaker-ML" <pacemaker at oss.clusterlabs.org>
> Sent: Monday, February 17, 2014 7:06:53 PM
> Subject: [Pacemaker] [Problem] Fail-over is delayed. (State transition is not calculated.)
> 
> Hi All,
> 
> I observed the following behavior when a failure occurs on the Master side
> of a Master/Slave resource in Pacemaker 1.1.11.
> 
> -------------------------------------
> 
> Step 1) Build the cluster.
> 
> [root at srv01 ~]# crm_mon -1 -Af
> Last updated: Tue Feb 18 18:07:24 2014
> Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01
> Stack: corosync
> Current DC: srv01 (3232238180) - partition with quorum
> Version: 1.1.10-9d39a6b
> 2 Nodes configured
> 6 Resources configured
> 
> 
> Online: [ srv01 srv02 ]
> 
>  vip-master     (ocf::heartbeat:Dummy): Started srv01
>  vip-rep        (ocf::heartbeat:Dummy): Started srv01
>  Master/Slave Set: msPostgresql [pgsql]
>      Masters: [ srv01 ]
>      Slaves: [ srv02 ]
>  Clone Set: clnPingd [prmPingd]
>      Started: [ srv01 srv02 ]
> 
> Node Attributes:
> * Node srv01:
>     + default_ping_set                  : 100
>     + master-pgsql                      : 10
> * Node srv02:
>     + default_ping_set                  : 100
>     + master-pgsql                      : 5
> 
> Migration summary:
> * Node srv01:
> * Node srv02:
> 
> Step 2) Cause a monitor failure on vip-master.
> 
> [root at srv01 ~]# rm -rf /var/run/resource-agents/Dummy-vip-master.state
> 
> [root at srv01 ~]# crm_mon -1 -Af
> Last updated: Tue Feb 18 18:07:58 2014
> Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01
> Stack: corosync
> Current DC: srv01 (3232238180) - partition with quorum
> Version: 1.1.10-9d39a6b
> 2 Nodes configured
> 6 Resources configured
> 
> 
> Online: [ srv01 srv02 ]
> 
>  Master/Slave Set: msPostgresql [pgsql]
>      Masters: [ srv01 ]
>      Slaves: [ srv02 ]
>  Clone Set: clnPingd [prmPingd]
>      Started: [ srv01 srv02 ]
> 
> Node Attributes:
> * Node srv01:
>     + default_ping_set                  : 100
>     + master-pgsql                      : 10
> * Node srv02:
>     + default_ping_set                  : 100
>     + master-pgsql                      : 5
> 
> Migration summary:
> * Node srv01:
>    vip-master: migration-threshold=1 fail-count=1 last-failure='Tue Feb 18
>    18:07:50 2014'
> * Node srv02:
> 
> Failed actions:
>     vip-master_monitor_10000 on srv01 'not running' (7): call=30,
>     status=complete, last-rc-change='Tue Feb 18 18:07:50 2014', queued=0ms,
>     exec=0ms
> -------------------------------------
> 
> However, the resource does not fail over.
> 
> But when I check the CIB with crm_simulate at this point, the fail-over
> is calculated:
> 
> -------------------------------------
> [root at srv01 ~]# crm_simulate -L -s
> 
> Current cluster status:
> Online: [ srv01 srv02 ]
> 
>  vip-master     (ocf::heartbeat:Dummy): Stopped
>  vip-rep        (ocf::heartbeat:Dummy): Stopped
>  Master/Slave Set: msPostgresql [pgsql]
>      Masters: [ srv01 ]
>      Slaves: [ srv02 ]
>  Clone Set: clnPingd [prmPingd]
>      Started: [ srv01 srv02 ]
> 
> Allocation scores:
> clone_color: clnPingd allocation score on srv01: 0
> clone_color: clnPingd allocation score on srv02: 0
> clone_color: prmPingd:0 allocation score on srv01: INFINITY
> clone_color: prmPingd:0 allocation score on srv02: 0
> clone_color: prmPingd:1 allocation score on srv01: 0
> clone_color: prmPingd:1 allocation score on srv02: INFINITY
> native_color: prmPingd:0 allocation score on srv01: INFINITY
> native_color: prmPingd:0 allocation score on srv02: 0
> native_color: prmPingd:1 allocation score on srv01: -INFINITY
> native_color: prmPingd:1 allocation score on srv02: INFINITY
> clone_color: msPostgresql allocation score on srv01: 0
> clone_color: msPostgresql allocation score on srv02: 0
> clone_color: pgsql:0 allocation score on srv01: INFINITY
> clone_color: pgsql:0 allocation score on srv02: 0
> clone_color: pgsql:1 allocation score on srv01: 0
> clone_color: pgsql:1 allocation score on srv02: INFINITY
> native_color: pgsql:0 allocation score on srv01: INFINITY
> native_color: pgsql:0 allocation score on srv02: 0
> native_color: pgsql:1 allocation score on srv01: -INFINITY
> native_color: pgsql:1 allocation score on srv02: INFINITY
> pgsql:1 promotion score on srv02: 5
> pgsql:0 promotion score on srv01: 1
> native_color: vip-master allocation score on srv01: -INFINITY
> native_color: vip-master allocation score on srv02: INFINITY
> native_color: vip-rep allocation score on srv01: -INFINITY
> native_color: vip-rep allocation score on srv02: INFINITY
> 
> Transition Summary:
>  * Start   vip-master   (srv02)
>  * Start   vip-rep      (srv02)
>  * Demote  pgsql:0      (Master -> Slave srv01)
>  * Promote pgsql:1      (Slave -> Master srv02)
> 
> -------------------------------------
> 
> In addition, the fail-over is calculated when cluster-recheck-interval
> expires.
> 
> The fail-over is also carried out if I run "cibadmin -B".
> 
> -------------------------------------
> [root at srv01 ~]# cibadmin -B
> 
> [root at srv01 ~]# crm_mon -1 -Af
> Last updated: Tue Feb 18 18:21:15 2014
> Last change: Tue Feb 18 18:21:00 2014 via cibadmin on srv01
> Stack: corosync
> Current DC: srv01 (3232238180) - partition with quorum
> Version: 1.1.10-9d39a6b
> 2 Nodes configured
> 6 Resources configured
> 
> 
> Online: [ srv01 srv02 ]
> 
>  vip-master     (ocf::heartbeat:Dummy): Started srv02
>  vip-rep        (ocf::heartbeat:Dummy): Started srv02
>  Master/Slave Set: msPostgresql [pgsql]
>      Masters: [ srv02 ]
>      Slaves: [ srv01 ]
>  Clone Set: clnPingd [prmPingd]
>      Started: [ srv01 srv02 ]
> 
> Node Attributes:
> * Node srv01:
>     + default_ping_set                  : 100
>     + master-pgsql                      : 5
> * Node srv02:
>     + default_ping_set                  : 100
>     + master-pgsql                      : 10
> 
> Migration summary:
> * Node srv01:
>    vip-master: migration-threshold=1 fail-count=1 last-failure='Tue Feb 18
>    18:07:50 2014'

You have resource-stickiness=INFINITY; that is what is preventing the
fail-over from occurring. Set resource-stickiness to 1 or 0 and the
fail-over should occur.

-- Vossel
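For reference, a minimal sketch of that change, assuming the stickiness is
defined as a resource default in the CIB (if it is set per-resource via a
meta attribute instead, update that attribute rather than rsc_defaults):

```shell
# Lower the default stickiness so the policy engine is willing to move
# resources after a failure. Must be run against a live cluster; 1 keeps
# a mild preference for the current node, 0 removes it entirely.
crm_attribute --type rsc_defaults --name resource-stickiness --update 1

# Confirm the value now in effect:
crm_attribute --type rsc_defaults --name resource-stickiness --query
```

With that in place, the demote/promote transition that crm_simulate already
computed should actually be executed on the next policy-engine run.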

> * Node srv02:
> 
> Failed actions:
>     vip-master_monitor_10000 on srv01 'not running' (7): call=30,
>     status=complete, last-rc-change='Tue Feb 18 18:07:50 2014', queued=0ms,
>     exec=0ms
> 
> -------------------------------------
> 
> The delay before the fail-over takes place is a problem.
> I think the cause of the delay between the error and the fail-over lies in
> Pacemaker itself.
> 
> I registered these contents and log information with Bugzilla.
>  * http://bugs.clusterlabs.org/show_bug.cgi?id=5197
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



