[Pacemaker] On clone instance replacement and fail-count handling
Andrew Beekhof
andrew at beekhof.net
Tue Mar 23 16:49:42 UTC 2010
2010/3/12 <renayama19661014 at ybb.ne.jp>:
> Hi,
>
> We tested failure handling for clones.
>
> I confirmed the behaviour with the following procedure.
>
> Step 1) I start all nodes and load the cib.xml configuration.
>
> ============
> Last updated: Fri Mar 12 14:53:38 2010
> Stack: openais
> Current DC: srv01 - partition with quorum
> Version: 1.0.7-049006f172774f407e165ec82f7ee09cb73fd0e7
> 4 Nodes configured, 2 expected votes
> 13 Resources configured.
> ============
>
> Online: [ srv01 srv02 srv03 srv04 ]
>
> Resource Group: UMgroup01
> UmVIPcheck (ocf::heartbeat:Dummy): Started srv01
> UmIPaddr (ocf::heartbeat:Dummy): Started srv01
> UmDummy01 (ocf::heartbeat:Dummy): Started srv01
> UmDummy02 (ocf::heartbeat:Dummy): Started srv01
> Resource Group: OVDBgroup02-1
> prmExPostgreSQLDB1 (ocf::heartbeat:Dummy): Started srv01
> prmFsPostgreSQLDB1-1 (ocf::heartbeat:Dummy): Started srv01
> prmFsPostgreSQLDB1-2 (ocf::heartbeat:Dummy): Started srv01
> prmFsPostgreSQLDB1-3 (ocf::heartbeat:Dummy): Started srv01
> prmIpPostgreSQLDB1 (ocf::heartbeat:Dummy): Started srv01
> prmApPostgreSQLDB1 (ocf::heartbeat:Dummy): Started srv01
> Resource Group: OVDBgroup02-2
> prmExPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv02
> prmFsPostgreSQLDB2-1 (ocf::heartbeat:Dummy): Started srv02
> prmFsPostgreSQLDB2-2 (ocf::heartbeat:Dummy): Started srv02
> prmFsPostgreSQLDB2-3 (ocf::heartbeat:Dummy): Started srv02
> prmIpPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv02
> prmApPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv02
> Resource Group: OVDBgroup02-3
> prmExPostgreSQLDB3 (ocf::heartbeat:Dummy): Started srv03
> prmFsPostgreSQLDB3-1 (ocf::heartbeat:Dummy): Started srv03
> prmFsPostgreSQLDB3-2 (ocf::heartbeat:Dummy): Started srv03
> prmFsPostgreSQLDB3-3 (ocf::heartbeat:Dummy): Started srv03
> prmIpPostgreSQLDB3 (ocf::heartbeat:Dummy): Started srv03
> prmApPostgreSQLDB3 (ocf::heartbeat:Dummy): Started srv03
> Resource Group: grpStonith1
> prmStonithN1 (stonith:external/ssh): Started srv04
> Resource Group: grpStonith2
> prmStonithN2 (stonith:external/ssh): Started srv01
> Resource Group: grpStonith3
> prmStonithN3 (stonith:external/ssh): Started srv02
> Resource Group: grpStonith4
> prmStonithN4 (stonith:external/ssh): Started srv03
> Clone Set: clnUMgroup01
> Started: [ srv01 srv04 ]
> Clone Set: clnPingd
> Started: [ srv01 srv02 srv03 srv04 ]
> Clone Set: clnDiskd1
> Started: [ srv01 srv02 srv03 srv04 ]
> Clone Set: clnG3dummy1
> Started: [ srv01 srv02 srv03 srv04 ]
> Clone Set: clnG3dummy2
> Started: [ srv01 srv02 srv03 srv04 ]
>
> Step 2) I trigger a failure of the clnUMgroup01 clone on node N1 (srv01).
>
> [root at srv01 ~]# rm -rf /var/run/heartbeat/rsctmp/Dummy-clnUMdummy02\:0.state
>
> * The clone instances are replaced (renumbered: ":0" becomes ":1").
>
> [root at srv01 ~]# ls /var/run/heartbeat/rsctmp/Dummy-clnUMdummy0*
> /var/run/heartbeat/rsctmp/Dummy-clnUMdummy01:1.state
> /var/run/heartbeat/rsctmp/Dummy-clnUMdummy02:1.state
>
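The mechanism behind Step 2, as I understand it: ocf:heartbeat:Dummy tracks state only through that file, so removing it makes the next monitor report a failure. A minimal sketch (an approximation for illustration, not the shipped agent; the temp file stands in for Dummy-clnUMdummy02:0.state):

```shell
# Approximate sketch of the Dummy agent's monitor logic: the resource
# is considered running only while its state file exists.
state_file=$(mktemp)          # stand-in for the real .state file

dummy_monitor() {
    if [ -f "$state_file" ]; then
        return 0              # OCF_SUCCESS: resource looks healthy
    else
        return 7              # OCF_NOT_RUNNING: a monitor failure is reported
    fi
}

dummy_monitor && echo "running"
rm -f "$state_file"           # what Step 2's rm does
dummy_monitor || echo "monitor fails with rc=$?"
```

So each `rm` of a state file produces exactly one monitor failure, which is what increments the fail-count seen later in the migration summaries.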
> Step 3) Again, I trigger a failure of the clnUMgroup01 clone on node N1 (srv01).
>
> [root at srv01 ~]# rm -rf /var/run/heartbeat/rsctmp/Dummy-clnUMdummy02\:1.state
>
> [root at srv01 ~]# ls /var/run/heartbeat/rsctmp/Dummy-clnUMdummy0*
> /var/run/heartbeat/rsctmp/Dummy-clnUMdummy01:0.state
> /var/run/heartbeat/rsctmp/Dummy-clnUMdummy02:0.state
>
> * The clone instances are replaced (renumbered) again.
>
> ============
> Last updated: Fri Mar 12 14:56:19 2010
> Stack: openais
> Current DC: srv01 - partition with quorum
> Version: 1.0.7-049006f172774f407e165ec82f7ee09cb73fd0e7
> 4 Nodes configured, 2 expected votes
> 13 Resources configured.
> ============
> Online: [ srv01 srv02 srv03 srv04 ]
> (snip)
> Migration summary:
> * Node srv02:
> * Node srv03:
> * Node srv04:
> * Node srv01:
> clnUMdummy02:0: migration-threshold=5 fail-count=1
> clnUMdummy02:1: migration-threshold=5 fail-count=1
>
> Step 4) I trigger a failure of the clnUMgroup01 clone on node N4 (srv04).
>
> [root at srv04 ~]# rm -rf /var/run/heartbeat/rsctmp/Dummy-clnUMdummy02\:1.state
> [root at srv04 ~]# ls /var/run/heartbeat/rsctmp/Dummy-clnUMdummy02*
> /var/run/heartbeat/rsctmp/Dummy-clnUMdummy02:1.state
>
> * The clone instances are not replaced.
>
> Step 5) Again, I trigger a failure of the clnUMgroup01 clone on node N4 (srv04).
>
> * The clone instances are not replaced.
>
> (snip)
> Migration summary:
> * Node srv02:
> * Node srv03:
> * Node srv04:
> clnUMdummy02:1: migration-threshold=5 fail-count=2
> * Node srv01:
> clnUMdummy02:0: migration-threshold=5 fail-count=1
> clnUMdummy02:1: migration-threshold=5 fail-count=1
>
>
> Step 6) Again, I trigger failures of the clnUMgroup01 clone on node N4 (srv04) and node N1 (srv01).
>
> * On node N4, clnUMdummy02 reaches its failure limit after five failures, but on node N1 the
> instance replacement means many more failures are needed before the limit is reached.
>
> (snip)
> Clone Set: clnUMgroup01
> Started: [ srv01 ]
> Stopped: [ clnUmResource:1 ]
> Clone Set: clnPingd
> Started: [ srv01 srv02 srv03 srv04 ]
> Clone Set: clnDiskd1
> Started: [ srv01 srv02 srv03 srv04 ]
> Clone Set: clnG3dummy1
> Started: [ srv01 srv02 srv03 srv04 ]
> Clone Set: clnG3dummy2
> Started: [ srv01 srv02 srv03 srv04 ]
>
> Migration summary:
> * Node srv02:
> * Node srv03:
> * Node srv04:
> clnUMdummy02:1: migration-threshold=5 fail-count=5
> * Node srv01:
> clnUMdummy02:0: migration-threshold=5 fail-count=3
> clnUMdummy02:1: migration-threshold=5 fail-count=3
>
> When "globally-unique=false", is it correct that the clone instances running on node N1 (srv01)
> are replaced (renumbered) after a failure? And is it correct behaviour that no replacement
> happens when a clone instance fails on node N4 (srv04)?
>
> Furthermore, we think the inconsistency of the replacement is itself a problem.
>
> Even if this clone replacement is correct, the number of failures needed to reach the threshold
> on node N1 (srv01) differs from that on node N4 (srv04).
>
> Because of this behaviour, we cannot reliably set a failure limit for the clone.
So if I can summarize, you're saying that clnUMdummy02 should not be
allowed to run on srv01 because the combined number of failures is 6
(and clnUMdummy02 is a non-unique clone).
And that the current behavior is that clnUMdummy02 continues to run.
Is that an accurate summary?
If so, then I agree it's a bug. Could you create a bugzilla entry for it, please?
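To make the expected rule concrete, here is a toy illustration (my reading of the intended semantics, not Pacemaker source code): for an anonymous clone, the per-instance fail-counts on one node should be combined before comparing against migration-threshold, so srv01 in Step 6 should already be over the limit.

```shell
# Toy illustration of the counting rule for anonymous clones
# (globally-unique=false): instances are interchangeable, so their
# fail-counts on a node should be summed against migration-threshold.
threshold=5

# srv04 after Step 6: a single instance with fail-count=5
srv04=5
[ "$srv04" -ge "$threshold" ] && echo "srv04: clnUMdummy02 banned (5 >= 5)"

# srv01 after Step 6: two instances with fail-count=3 each; combined
# they also cross the threshold, so srv01 should ban the clone as well
srv01=$((3 + 3))
[ "$srv01" -ge "$threshold" ] && echo "srv01: clnUMdummy02 banned (6 >= 5)"
```

Under that rule both nodes would refuse the resource; the reported behaviour (srv01 keeps running it) is what makes this a bug.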
>
> Is this intended behaviour or a bug? (Or is it already fixed in the development version?)
> Is some cib.xml setting required for this to operate correctly?
>
> If there is a correct way to configure cib.xml, please tell us.
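For reference, migration-threshold (and globally-unique) are set as clone meta attributes in cib.xml; a sketch with placeholder ids follows — note this only configures the threshold and does not change the instance-replacement behaviour reported above:

```xml
<!-- sketch only: ids are placeholders; adjust to your configuration -->
<clone id="clnUMgroup01">
  <meta_attributes id="clnUMgroup01-meta">
    <nvpair id="clnUMgroup01-migration-threshold"
            name="migration-threshold" value="5"/>
    <nvpair id="clnUMgroup01-globally-unique"
            name="globally-unique" value="false"/>
  </meta_attributes>
  <!-- the cloned group (clnUmResource) goes here -->
</clone>
```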
>
> Because the hb_report archive is large, I have not attached it.
> If any information from hb_report is needed to resolve the problem, please let me
> know.
>
> Best Regards,
> Hideo Yamauchi.
>
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
>