[Pacemaker] Problem: resource placement specified by a colocation constraint does not take effect
renayama19661014 at ybb.ne.jp
Fri Mar 5 00:42:53 UTC 2010
Hi All,
We are testing a somewhat complicated colocation configuration.
We placed resources so that they start together by means of a colocation constraint.
However, with a certain procedure, the constrained resource starts even though the resource it is
colocated with does not start.
We configured the following colocation constraint.
<rsc_colocation id="rsc_colocation01-1" rsc="UMgroup01" with-rsc="clnPingd" score="1000"/>
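For reference, we understand this corresponds roughly to the following crm shell syntax (the mandatory INFINITY variant is shown only for comparison; our configuration uses the score of 1000):

  colocation rsc_colocation01-1 1000: UMgroup01 clnPingd
  # mandatory variant, for comparison only:
  # colocation rsc_colocation01-1 inf: UMgroup01 clnPingd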
Even though clnPingd did not start, we observed that UMgroup01 started.
The procedure to reproduce the phenomenon is as follows.
STEP1) Start corosync.
STEP2) Send cib.xml to Pacemaker.
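(For reference, the commands used in STEP1 and STEP2 are roughly as follows; the corosync init script and the way the CIB is loaded depend on the environment.)

  /etc/init.d/corosync start             # STEP1, on every node
  cibadmin --replace --xml-file cib.xml  # STEP2, on one node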
STEP3) The cluster is stable.
[root@srv01 ~]# crm_mon -1
============
Last updated: Wed Mar 3 13:21:21 2010
Stack: openais
Current DC: srv01 - partition with quorum
Version: 1.0.7-6e1815972fc236825bf3658d7f8451d33227d420
4 Nodes configured, 4 expected votes
13 Resources configured.
============
Online: [ srv01 srv02 srv03 srv04 ]
Resource Group: UMgroup01
    UmVIPcheck (ocf::heartbeat:Dummy): Started srv01
    UmIPaddr (ocf::heartbeat:Dummy): Started srv01
    UmDummy01 (ocf::heartbeat:Dummy): Started srv01
    UmDummy02 (ocf::heartbeat:Dummy): Started srv01
Resource Group: OVDBgroup02-1
    prmExPostgreSQLDB1 (ocf::heartbeat:Dummy): Started srv01
    prmFsPostgreSQLDB1-1 (ocf::heartbeat:Dummy): Started srv01
    prmFsPostgreSQLDB1-2 (ocf::heartbeat:Dummy): Started srv01
    prmFsPostgreSQLDB1-3 (ocf::heartbeat:Dummy): Started srv01
    prmIpPostgreSQLDB1 (ocf::heartbeat:Dummy): Started srv01
    prmApPostgreSQLDB1 (ocf::heartbeat:Dummy): Started srv01
Resource Group: OVDBgroup02-2
    prmExPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv02
    prmFsPostgreSQLDB2-1 (ocf::heartbeat:Dummy): Started srv02
    prmFsPostgreSQLDB2-2 (ocf::heartbeat:Dummy): Started srv02
    prmFsPostgreSQLDB2-3 (ocf::heartbeat:Dummy): Started srv02
    prmIpPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv02
    prmApPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv02
Resource Group: OVDBgroup02-3
    prmExPostgreSQLDB3 (ocf::heartbeat:Dummy): Started srv03
    prmFsPostgreSQLDB3-1 (ocf::heartbeat:Dummy): Started srv03
    prmFsPostgreSQLDB3-2 (ocf::heartbeat:Dummy): Started srv03
    prmFsPostgreSQLDB3-3 (ocf::heartbeat:Dummy): Started srv03
    prmIpPostgreSQLDB3 (ocf::heartbeat:Dummy): Started srv03
    prmApPostgreSQLDB3 (ocf::heartbeat:Dummy): Started srv03
Resource Group: grpStonith1
    prmStonithN1 (stonith:external/ssh): Started srv04
Resource Group: grpStonith2
    prmStonithN2 (stonith:external/ssh): Started srv01
Resource Group: grpStonith3
    prmStonithN3 (stonith:external/ssh): Started srv02
Resource Group: grpStonith4
    prmStonithN4 (stonith:external/ssh): Started srv03
Clone Set: clnUMgroup01
    Started: [ srv01 srv04 ]
Clone Set: clnPingd
    Started: [ srv01 srv02 srv03 srv04 ]
Clone Set: clnDiskd1
    Started: [ srv01 srv02 srv03 srv04 ]
Clone Set: clnG3dummy1
    Started: [ srv01 srv02 srv03 srv04 ]
Clone Set: clnG3dummy2
    Started: [ srv01 srv02 srv03 srv04 ]
STEP4) Simulate a stop failure of pingd on node srv01 by modifying the resource agent.
pingd_stop() {
    exit $OCF_ERR_GENERIC                # injected line: make every stop of pingd fail
    if [ -f $OCF_RESKEY_pidfile ]; then
        pid=`cat $OCF_RESKEY_pidfile`
    fi
STEP5) Stop the clnPingd clone.
[root@srv01 ~]# crm
crm(live)# resource
crm(live)resource# stop clnPingd
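(For reference, the same stop can also be issued non-interactively:)

  crm resource stop clnPingd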
[root@srv01 ~]# crm_mon -1 -f
============
Last updated: Wed Mar 3 13:24:16 2010
Stack: openais
Current DC: srv01 - partition with quorum
Version: 1.0.7-6e1815972fc236825bf3658d7f8451d33227d420
4 Nodes configured, 4 expected votes
13 Resources configured.
============
Online: [ srv01 srv02 srv03 srv04 ]
Resource Group: UMgroup01
    UmVIPcheck (ocf::heartbeat:Dummy): Started srv01
    UmIPaddr (ocf::heartbeat:Dummy): Started srv01
    UmDummy01 (ocf::heartbeat:Dummy): Started srv01
    UmDummy02 (ocf::heartbeat:Dummy): Started srv01
Resource Group: OVDBgroup02-1
    prmExPostgreSQLDB1 (ocf::heartbeat:Dummy): Started srv01
    prmFsPostgreSQLDB1-1 (ocf::heartbeat:Dummy): Started srv01
    prmFsPostgreSQLDB1-2 (ocf::heartbeat:Dummy): Started srv01
    prmFsPostgreSQLDB1-3 (ocf::heartbeat:Dummy): Started srv01
    prmIpPostgreSQLDB1 (ocf::heartbeat:Dummy): Started srv01
    prmApPostgreSQLDB1 (ocf::heartbeat:Dummy): Started srv01
Resource Group: grpStonith1
    prmStonithN1 (stonith:external/ssh): Started srv04
Resource Group: grpStonith2
    prmStonithN2 (stonith:external/ssh): Started srv01
Resource Group: grpStonith3
    prmStonithN3 (stonith:external/ssh): Started srv02
Resource Group: grpStonith4
    prmStonithN4 (stonith:external/ssh): Started srv03
Clone Set: clnUMgroup01
    Started: [ srv01 srv04 ]
Clone Set: clnDiskd1
    Started: [ srv01 srv02 srv03 srv04 ]
Clone Set: clnG3dummy1
    Started: [ srv01 srv02 srv03 srv04 ]
Clone Set: clnG3dummy2
    Started: [ srv01 srv02 srv03 srv04 ]
Migration summary:
* Node srv02:
* Node srv04:
* Node srv03:
* Node srv01:
    clnPrmPingd:0: migration-threshold=10 fail-count=1000000
STEP6) Revert the modification to pingd.
pingd_stop() {
    # exit $OCF_ERR_GENERIC              # the injected failure is commented out again
    if [ -f $OCF_RESKEY_pidfile ]; then
        pid=`cat $OCF_RESKEY_pidfile`
    fi
STEP7) Reboot node srv01.
STEP8) Wait for STONITH to complete. (STONITH completes after a retry.)
[root@srv02 ~]# crm_mon -1
============
Last updated: Wed Mar 3 13:34:12 2010
Stack: openais
Current DC: srv02 - partition with quorum
Version: 1.0.7-6e1815972fc236825bf3658d7f8451d33227d420
4 Nodes configured, 4 expected votes
13 Resources configured.
============
Online: [ srv02 srv03 srv04 ]
OFFLINE: [ srv01 ]
Resource Group: grpStonith1
    prmStonithN1 (stonith:external/ssh): Started srv04
Resource Group: grpStonith2
    prmStonithN2 (stonith:external/ssh): Started srv03
Resource Group: grpStonith3
    prmStonithN3 (stonith:external/ssh): Started srv02
Resource Group: grpStonith4
    prmStonithN4 (stonith:external/ssh): Started srv03
Clone Set: clnUMgroup01
    Started: [ srv04 ]
    Stopped: [ clnUmResource:0 ]
Clone Set: clnDiskd1
    Started: [ srv02 srv03 srv04 ]
    Stopped: [ clnPrmDiskd1:0 ]
Clone Set: clnG3dummy1
    Started: [ srv02 srv03 srv04 ]
    Stopped: [ clnG3dummy01:0 ]
Clone Set: clnG3dummy2
    Started: [ srv02 srv03 srv04 ]
    Stopped: [ clnG3dummy02:0 ]
STEP9) Start corosync on the rebooted node srv01.
[root@srv02 ~]# crm_mon -1
============
Last updated: Wed Mar 3 13:37:57 2010
Stack: openais
Current DC: srv02 - partition with quorum
Version: 1.0.7-6e1815972fc236825bf3658d7f8451d33227d420
4 Nodes configured, 4 expected votes
13 Resources configured.
============
Online: [ srv01 srv02 srv03 srv04 ]
Resource Group: UMgroup01
    UmVIPcheck (ocf::heartbeat:Dummy): Started srv01
    UmIPaddr (ocf::heartbeat:Dummy): Started srv01
    UmDummy01 (ocf::heartbeat:Dummy): Started srv01
    UmDummy02 (ocf::heartbeat:Dummy): Started srv01
Resource Group: OVDBgroup02-1
    prmExPostgreSQLDB1 (ocf::heartbeat:Dummy): Started srv01
    prmFsPostgreSQLDB1-1 (ocf::heartbeat:Dummy): Started srv01
    prmFsPostgreSQLDB1-2 (ocf::heartbeat:Dummy): Started srv01
    prmFsPostgreSQLDB1-3 (ocf::heartbeat:Dummy): Started srv01
    prmIpPostgreSQLDB1 (ocf::heartbeat:Dummy): Started srv01
    prmApPostgreSQLDB1 (ocf::heartbeat:Dummy): Started srv01
Resource Group: grpStonith1
    prmStonithN1 (stonith:external/ssh): Started srv04
Resource Group: grpStonith2
    prmStonithN2 (stonith:external/ssh): Started srv03
Resource Group: grpStonith3
    prmStonithN3 (stonith:external/ssh): Started srv02
Resource Group: grpStonith4
    prmStonithN4 (stonith:external/ssh): Started srv03
Clone Set: clnUMgroup01
    Started: [ srv01 srv04 ]
Clone Set: clnDiskd1
    Started: [ srv01 srv02 srv03 srv04 ]
Clone Set: clnG3dummy1
    Started: [ srv01 srv02 srv03 srv04 ]
Clone Set: clnG3dummy2
    Started: [ srv01 srv02 srv03 srv04 ]
[root@srv02 ~]# ptest -L -s | grep UmVIPcheck
group_color: UmVIPcheck allocation score on srv01: 300
group_color: UmVIPcheck allocation score on srv02: -1000000
group_color: UmVIPcheck allocation score on srv03: -1000000
group_color: UmVIPcheck allocation score on srv04: -1000000
native_color: UmVIPcheck allocation score on srv01: 1600
native_color: UmVIPcheck allocation score on srv02: -1000000
native_color: UmVIPcheck allocation score on srv03: -1000000
native_color: UmVIPcheck allocation score on srv04: -1000000
After this, UMgroup01 starts on srv01 even though clnPingd does not start there.
* Because of the colocation constraint, we did not expect UMgroup01 to start.
Is there an error in our configuration?
Or is it a bug?
Or is this the correct behavior?
I attached an hb_report result to which I added the pengine directory of srv02.
Because the file would otherwise be too big, I removed the information of srv01, srv03 and srv04 before attaching it.
Best Regards,
Hideo Yamauchi.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hb_report996.tar.gz
Type: application/x-gzip-compressed
Size: 256355 bytes
Desc: 2393974864-hb_report996.tar.gz
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100305/1bfaaa37/attachment-0003.bin>