[Pacemaker] [Problem] Fail-over is delayed. (State transition is not calculated.)

Mon Feb 17 20:06:53 EST 2014

Hi All,

I checked how Pacemaker 1.1.11 behaves when a resource fails in a Master/Slave configuration.

-------------------------------------

Step1) Start the cluster.

[root@srv01 ~]# crm_mon -1 -Af
Last updated: Tue Feb 18 18:07:24 2014
Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01
Stack: corosync
Current DC: srv01 (3232238180) - partition with quorum
Version: 1.1.10-9d39a6b
2 Nodes configured
6 Resources configured

Online: [ srv01 srv02 ]

 vip-master     (ocf::heartbeat:Dummy): Started srv01 
 vip-rep        (ocf::heartbeat:Dummy): Started srv01 
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ srv01 ]
     Slaves: [ srv02 ]
 Clone Set: clnPingd [prmPingd]
     Started: [ srv01 srv02 ]

Node Attributes:
* Node srv01:
    + default_ping_set                  : 100       
    + master-pgsql                      : 10        
* Node srv02:
    + default_ping_set                  : 100       
    + master-pgsql                      : 5         

Migration summary:
* Node srv01: 
* Node srv02: 

Step2) Cause a monitor error in vip-master by removing its state file.

[root@srv01 ~]# rm -rf /var/run/resource-agents/Dummy-vip-master.state

[root@srv01 ~]# crm_mon -1 -Af
Last updated: Tue Feb 18 18:07:58 2014
Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01
Stack: corosync
Current DC: srv01 (3232238180) - partition with quorum
Version: 1.1.10-9d39a6b
2 Nodes configured
6 Resources configured

Online: [ srv01 srv02 ]

 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ srv01 ]
     Slaves: [ srv02 ]
 Clone Set: clnPingd [prmPingd]
     Started: [ srv01 srv02 ]

Node Attributes:
* Node srv01:
    + default_ping_set                  : 100       
    + master-pgsql                      : 10        
* Node srv02:
    + default_ping_set                  : 100       
    + master-pgsql                      : 5         

Migration summary:
* Node srv01: 
   vip-master: migration-threshold=1 fail-count=1 last-failure='Tue Feb 18 18:07:50 2014'
* Node srv02: 

Failed actions:
    vip-master_monitor_10000 on srv01 'not running' (7): call=30, status=complete, last-rc-change='Tue Feb 18 18:07:50 2014', queued=0ms, exec=0ms
-------------------------------------
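
In the Migration summary above, vip-master on srv01 has fail-count=1 with migration-threshold=1, so the resource should no longer be allowed to run on srv01. As a minimal sketch, assuming the crm_mon output format shown above (the function name over_threshold is hypothetical and not part of Pacemaker), that condition can be picked out of the output like this:

```shell
# Hypothetical helper, not part of Pacemaker.
# Reads "crm_mon -1 -Af" output on stdin and prints "<resource> <node>"
# for every resource whose fail-count has reached migration-threshold.
over_threshold() {
    awk '
        # Remember which node the following summary lines belong to.
        /^\* Node / { node = $3; sub(/:$/, "", node) }

        # Example line:
        #   vip-master: migration-threshold=1 fail-count=1 last-failure=...
        /migration-threshold=/ {
            res = $1; sub(/:$/, "", res)
            thr = ""; cnt = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^migration-threshold=/) { split($i, a, "="); thr = a[2] }
                if ($i ~ /^fail-count=/)          { split($i, a, "="); cnt = a[2] }
            }
            # Skip lines with a missing or non-numeric threshold.
            if (thr == "" || cnt == "" || thr == "INFINITY") next
            if (cnt == "INFINITY" || cnt + 0 >= thr + 0) print res, node
        }
    '
}
```

It would be used as "crm_mon -1 -Af | over_threshold". With the Step2 output above it should print "vip-master srv01".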

However, the resource does not fail over.

But when I check the CIB with crm_simulate at this point, the fail-over is calculated.

-------------------------------------
[root@srv01 ~]# crm_simulate -L -s

Current cluster status:
Online: [ srv01 srv02 ]

 vip-master     (ocf::heartbeat:Dummy): Stopped 
 vip-rep        (ocf::heartbeat:Dummy): Stopped 
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ srv01 ]
     Slaves: [ srv02 ]
 Clone Set: clnPingd [prmPingd]
     Started: [ srv01 srv02 ]

Allocation scores:
clone_color: clnPingd allocation score on srv01: 0
clone_color: clnPingd allocation score on srv02: 0
clone_color: prmPingd:0 allocation score on srv01: INFINITY
clone_color: prmPingd:0 allocation score on srv02: 0
clone_color: prmPingd:1 allocation score on srv01: 0
clone_color: prmPingd:1 allocation score on srv02: INFINITY
native_color: prmPingd:0 allocation score on srv01: INFINITY
native_color: prmPingd:0 allocation score on srv02: 0
native_color: prmPingd:1 allocation score on srv01: -INFINITY
native_color: prmPingd:1 allocation score on srv02: INFINITY
clone_color: msPostgresql allocation score on srv01: 0
clone_color: msPostgresql allocation score on srv02: 0
clone_color: pgsql:0 allocation score on srv01: INFINITY
clone_color: pgsql:0 allocation score on srv02: 0
clone_color: pgsql:1 allocation score on srv01: 0
clone_color: pgsql:1 allocation score on srv02: INFINITY
native_color: pgsql:0 allocation score on srv01: INFINITY
native_color: pgsql:0 allocation score on srv02: 0
native_color: pgsql:1 allocation score on srv01: -INFINITY
native_color: pgsql:1 allocation score on srv02: INFINITY
pgsql:1 promotion score on srv02: 5
pgsql:0 promotion score on srv01: 1
native_color: vip-master allocation score on srv01: -INFINITY
native_color: vip-master allocation score on srv02: INFINITY
native_color: vip-rep allocation score on srv01: -INFINITY
native_color: vip-rep allocation score on srv02: INFINITY

Transition Summary:
 * Start   vip-master   (srv02)
 * Start   vip-rep      (srv02)
 * Demote  pgsql:0      (Master -> Slave srv01)
 * Promote pgsql:1      (Slave -> Master srv02)

-------------------------------------
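
The point of the output above is that the Transition Summary is not empty although the cluster itself is not acting. As a rough sketch, assuming the crm_simulate output format shown above (the function name pending_actions is hypothetical and not part of Pacemaker), the pending actions can be extracted like this:

```shell
# Hypothetical helper, not part of Pacemaker.
# Reads "crm_simulate -L" output on stdin and prints the actions listed
# under "Transition Summary:", one per line, without the leading " * ".
pending_actions() {
    awk '
        /^Transition Summary:/ { in_summary = 1; next }

        # Action lines look like " * Start   vip-master   (srv02)".
        in_summary && /^[ \t]*\*/ {
            sub(/^[ \t]*\*[ \t]*/, "")
            print
            next
        }

        # A blank line ends the summary section.
        in_summary && NF == 0 { in_summary = 0 }
    '
}
```

It would be used as "crm_simulate -L | pending_actions". If this prints anything while the cluster stays idle, the policy engine would calculate a transition that has not actually been started.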

In addition, the fail-over is calculated when the "cluster_recheck_interval" timer fires.

The fail-over is also carried out when I run cibadmin -B.

-------------------------------------
[root@srv01 ~]# cibadmin -B

[root@srv01 ~]# crm_mon -1 -Af
Last updated: Tue Feb 18 18:21:15 2014
Last change: Tue Feb 18 18:21:00 2014 via cibadmin on srv01
Stack: corosync
Current DC: srv01 (3232238180) - partition with quorum
Version: 1.1.10-9d39a6b
2 Nodes configured
6 Resources configured

Online: [ srv01 srv02 ]

 vip-master     (ocf::heartbeat:Dummy): Started srv02 
 vip-rep        (ocf::heartbeat:Dummy): Started srv02 
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ srv02 ]
     Slaves: [ srv01 ]
 Clone Set: clnPingd [prmPingd]
     Started: [ srv01 srv02 ]

Node Attributes:
* Node srv01:
    + default_ping_set                  : 100       
    + master-pgsql                      : 5         
* Node srv02:
    + default_ping_set                  : 100       
    + master-pgsql                      : 10        

Migration summary:
* Node srv01: 
   vip-master: migration-threshold=1 fail-count=1 last-failure='Tue Feb 18 18:07:50 2014'
* Node srv02: 

Failed actions:
    vip-master_monitor_10000 on srv01 'not running' (7): call=30, status=complete, last-rc-change='Tue Feb 18 18:07:50 2014', queued=0ms, exec=0ms

-------------------------------------

It is a problem that the fail-over is delayed.
I think that the delay between the monitor error and the fail-over is caused by Pacemaker.

I registered these details and the log information in Bugzilla.
 * http://bugs.clusterlabs.org/show_bug.cgi?id=5197

Best Regards,
Hideo Yamauchi.