[Pacemaker] About clone instance replacement and handling of the fail count
renayama19661014 at ybb.ne.jp
Fri Mar 12 06:30:47 UTC 2010
Hi,
We tested failure handling of clone resources.
I confirmed the behavior with the following procedure.
Step1) I start all nodes and apply the cib.xml configuration.
============
Last updated: Fri Mar 12 14:53:38 2010
Stack: openais
Current DC: srv01 - partition with quorum
Version: 1.0.7-049006f172774f407e165ec82f7ee09cb73fd0e7
4 Nodes configured, 2 expected votes
13 Resources configured.
============
Online: [ srv01 srv02 srv03 srv04 ]
Resource Group: UMgroup01
UmVIPcheck (ocf::heartbeat:Dummy): Started srv01
UmIPaddr (ocf::heartbeat:Dummy): Started srv01
UmDummy01 (ocf::heartbeat:Dummy): Started srv01
UmDummy02 (ocf::heartbeat:Dummy): Started srv01
Resource Group: OVDBgroup02-1
prmExPostgreSQLDB1 (ocf::heartbeat:Dummy): Started srv01
prmFsPostgreSQLDB1-1 (ocf::heartbeat:Dummy): Started srv01
prmFsPostgreSQLDB1-2 (ocf::heartbeat:Dummy): Started srv01
prmFsPostgreSQLDB1-3 (ocf::heartbeat:Dummy): Started srv01
prmIpPostgreSQLDB1 (ocf::heartbeat:Dummy): Started srv01
prmApPostgreSQLDB1 (ocf::heartbeat:Dummy): Started srv01
Resource Group: OVDBgroup02-2
prmExPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv02
prmFsPostgreSQLDB2-1 (ocf::heartbeat:Dummy): Started srv02
prmFsPostgreSQLDB2-2 (ocf::heartbeat:Dummy): Started srv02
prmFsPostgreSQLDB2-3 (ocf::heartbeat:Dummy): Started srv02
prmIpPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv02
prmApPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv02
Resource Group: OVDBgroup02-3
prmExPostgreSQLDB3 (ocf::heartbeat:Dummy): Started srv03
prmFsPostgreSQLDB3-1 (ocf::heartbeat:Dummy): Started srv03
prmFsPostgreSQLDB3-2 (ocf::heartbeat:Dummy): Started srv03
prmFsPostgreSQLDB3-3 (ocf::heartbeat:Dummy): Started srv03
prmIpPostgreSQLDB3 (ocf::heartbeat:Dummy): Started srv03
prmApPostgreSQLDB3 (ocf::heartbeat:Dummy): Started srv03
Resource Group: grpStonith1
prmStonithN1 (stonith:external/ssh): Started srv04
Resource Group: grpStonith2
prmStonithN2 (stonith:external/ssh): Started srv01
Resource Group: grpStonith3
prmStonithN3 (stonith:external/ssh): Started srv02
Resource Group: grpStonith4
prmStonithN4 (stonith:external/ssh): Started srv03
Clone Set: clnUMgroup01
Started: [ srv01 srv04 ]
Clone Set: clnPingd
Started: [ srv01 srv02 srv03 srv04 ]
Clone Set: clnDiskd1
Started: [ srv01 srv02 srv03 srv04 ]
Clone Set: clnG3dummy1
Started: [ srv01 srv02 srv03 srv04 ]
Clone Set: clnG3dummy2
Started: [ srv01 srv02 srv03 srv04 ]
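(For reference: the status above is crm_mon output. Running crm_mon with the -f option also prints the Migration summary with the per-instance fail-counts that appears later in this mail.)

# Show the cluster status once, including failure counts
crm_mon -1 -f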
Step2) I cause a failure of the clnUMgroup01 clone on the N1 (srv01) node.
[root@srv01 ~]# rm -rf /var/run/heartbeat/rsctmp/Dummy-clnUMdummy02\:0.state
* The clone instances are replaced (their instance numbers are swapped).
[root@srv01 ~]# ls /var/run/heartbeat/rsctmp/Dummy-clnUMdummy0*
/var/run/heartbeat/rsctmp/Dummy-clnUMdummy01:1.state
/var/run/heartbeat/rsctmp/Dummy-clnUMdummy02:1.state
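(Deleting the state file works as fault injection because ocf:heartbeat:Dummy treats the presence of its state file as "running". A simplified sketch of the agent's monitor logic, not the verbatim agent code:)

# Sketch of the ocf:heartbeat:Dummy monitor action (simplified)
dummy_monitor() {
    if [ -f "${OCF_RESKEY_state}" ]; then
        return $OCF_SUCCESS      # state file present -> reported as running
    fi
    return $OCF_NOT_RUNNING      # state file missing -> monitor failure, fail-count rises
}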
Step3) I cause a failure of the clnUMgroup01 clone on the N1 (srv01) node again.
[root@srv01 ~]# rm -rf /var/run/heartbeat/rsctmp/Dummy-clnUMdummy02\:1.state
[root@srv01 ~]# ls /var/run/heartbeat/rsctmp/Dummy-clnUMdummy0*
/var/run/heartbeat/rsctmp/Dummy-clnUMdummy01:0.state
/var/run/heartbeat/rsctmp/Dummy-clnUMdummy02:0.state
* The clone instances are replaced again (the instance numbers are swapped back).
============
Last updated: Fri Mar 12 14:56:19 2010
Stack: openais
Current DC: srv01 - partition with quorum
Version: 1.0.7-049006f172774f407e165ec82f7ee09cb73fd0e7
4 Nodes configured, 2 expected votes
13 Resources configured.
============
Online: [ srv01 srv02 srv03 srv04 ]
(snip)
Migration summary:
* Node srv02:
* Node srv03:
* Node srv04:
* Node srv01:
clnUMdummy02:0: migration-threshold=5 fail-count=1
clnUMdummy02:1: migration-threshold=5 fail-count=1
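(The fail-counts in this Migration summary can also be inspected and reset per instance from the crm shell; the exact syntax may differ slightly between Pacemaker versions:)

# Show the fail-count of one clone instance on srv01
crm resource failcount clnUMdummy02:0 show srv01
# Clear the failure state of that instance after the test
crm resource cleanup clnUMdummy02:0 srv01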
Step4) I cause a failure of the clnUMgroup01 clone on the N4 (srv04) node.
[root@srv04 ~]# rm -rf /var/run/heartbeat/rsctmp/Dummy-clnUMdummy02\:1.state
[root@srv04 ~]# ls /var/run/heartbeat/rsctmp/Dummy-clnUMdummy02*
/var/run/heartbeat/rsctmp/Dummy-clnUMdummy02:1.state
* The clone instances are not replaced.
Step5) I cause a failure of the clnUMgroup01 clone on the N4 (srv04) node again.
* The clone instances are not replaced.
(snip)
Migration summary:
* Node srv02:
* Node srv03:
* Node srv04:
clnUMdummy02:1: migration-threshold=5 fail-count=2
* Node srv01:
clnUMdummy02:0: migration-threshold=5 fail-count=1
clnUMdummy02:1: migration-threshold=5 fail-count=1
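(The counts stay attached to :0 and :1 separately because each fail-count is stored as its own transient attribute, named fail-count-<resource>, in the status section of the CIB. It can be seen directly, for example:)

# Dump the status section and pick out the per-instance fail-count attributes
cibadmin -Q -o status | grep fail-count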
Step6) I cause failures of the clnUMgroup01 clone repeatedly on the N4 (srv04) and N1 (srv01) nodes.
* On the N4 node, clnUMdummy02 reaches its limit after five failures, but on the N1 node it takes
many more failures, because each replacement moves the failure onto a different instance.
(snip)
Clone Set: clnUMgroup01
Started: [ srv01 ]
Stopped: [ clnUmResource:1 ]
Clone Set: clnPingd
Started: [ srv01 srv02 srv03 srv04 ]
Clone Set: clnDiskd1
Started: [ srv01 srv02 srv03 srv04 ]
Clone Set: clnG3dummy1
Started: [ srv01 srv02 srv03 srv04 ]
Clone Set: clnG3dummy2
Started: [ srv01 srv02 srv03 srv04 ]
Migration summary:
* Node srv02:
* Node srv03:
* Node srv04:
clnUMdummy02:1: migration-threshold=5 fail-count=5
* Node srv01:
clnUMdummy02:0: migration-threshold=5 fail-count=3
clnUMdummy02:1: migration-threshold=5 fail-count=3
When "globally-unique=false", is it correct that the clone instances started on the N1 (srv01) node
are replaced? And is it correct behavior that no replacement happens when a clone fails on the
N4 (srv04) node?
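(For context, clnUMgroup01 is declared as an anonymous clone. A minimal sketch of such a declaration in crm shell syntax; the meta values here are illustrative, not our exact configuration:)

# Anonymous clone: instances are interchangeable copies of clnUmResource
clone clnUMgroup01 clnUmResource \
        meta clone-max=2 clone-node-max=1 globally-unique=false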
Furthermore, we think there is a problem because the replacement behavior differs between nodes.
Even if this clone replacement is correct, the number of failures needed to reach the threshold on
the N1 (srv01) node differs from the N4 (srv04) node.
Because of this behavior, we cannot set a reliable failure limit for the clone (see the sketch
below).
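(The limit in question is the migration-threshold=5 visible in the Migration summaries above. One way to set it, in crm shell syntax as a cluster-wide resource default; it can also be set per resource:)

# Apply migration-threshold to all resources as a default
crm configure rsc_defaults migration-threshold=5

Because each replacement moves the failure onto a different instance name, neither :0 nor :1 reaches 5 on srv01 even after many failures.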
Is this by design, or a bug? (Or is it already fixed in the development version?)
Is some setting in cib.xml necessary for it to operate correctly?
If there is a correct way to configure cib.xml, please let us know.
Because the hb_report archive is large, I have not attached it.
If any information from hb_report is needed to resolve the problem, please let me know.
Best Regards,
Hideo Yamauchi.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cib.zip
Type: application/x-zip-compressed
Size: 10575 bytes
Desc: 1726557689-cib.zip
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100312/c1b456b9/attachment-0003.bin>