[Pacemaker] [Problem]Lost fail-count.

Thu Sep 30 04:31:26 EDT 2010

I see you've created a bug for this; I'll follow up there.

On Wed, Sep 29, 2010 at 10:15 AM,  <renayama19661014 at ybb.ne.jp> wrote:
> Hi,
>
> We tested how a resource failure is handled while the cluster is partitioned and when the
> cluster recovers.
>
> However, when the cluster recovered, the fail-count disappeared.
> The Failed actions entries did not disappear.
>
> It occurred with the following procedure.
>
> Step1) We start Heartbeat.
>
> Step2) We isolate the cgl60 node using iptables.
>
> Step3) Once the sfex resource has started on the cgl63 node, we remove the isolation of the cgl60 node.
>
> Step4) On the cgl63 node, the start of VIPcheck and sfex fails.
>  * VIPcheck and sfex are the resources that detect a double start.
>
> Step5) The fail-count is lost.
>
> ============
> Last updated: Thu Sep 16 17:26:10 2010
> Stack: Heartbeat
> Current DC: cgl63 (16349f88-0203-40d1-ba48-b7a5c4547a26) - partition with quorum
> Version: 1.0.9-74392a28b7f3 stable-1.0 tip
> 4 Nodes configured, unknown expected votes
> 10 Resources configured.
> ============
>
> Online: [ cgl60 cgl61 cgl62 cgl63 ]
>
>  Resource Group: UMgroup01
>     UmVIPcheck (ocf::heartbeat:VIPcheck):      Started cgl60
>     UmIPaddr   (ocf::heartbeat:IPaddr2):       Started cgl60
>     UmDummy01  (ocf::pacemaker:Dummy): Started cgl60
>     UmDummy02  (ocf::pacemaker:Dummy): Started cgl60
>  Resource Group: OVDBgroup02-1
>     prmExPostgreSQLDB1 (ocf::heartbeat:sfex):  Started cgl60
>     prmFsPostgreSQLDB1-1       (ocf::heartbeat:Filesystem):    Started cgl60
>     prmFsPostgreSQLDB1-2       (ocf::heartbeat:Filesystem):    Started cgl60
>     prmFsPostgreSQLDB1-3       (ocf::heartbeat:Filesystem):    Started cgl60
>     prmIpPostgreSQLDB1 (ocf::heartbeat:IPaddr2):       Started cgl60
>     prmApPostgreSQLDB1 (ocf::heartbeat:pgsql): Started cgl60
>  Resource Group: OVDBgroup02-2
>     prmExPostgreSQLDB2 (ocf::heartbeat:sfex):  Started cgl61
>     prmFsPostgreSQLDB2-1       (ocf::heartbeat:Filesystem):    Started cgl61
>     prmFsPostgreSQLDB2-2       (ocf::heartbeat:Filesystem):    Started cgl61
>     prmFsPostgreSQLDB2-3       (ocf::heartbeat:Filesystem):    Started cgl61
>     prmIpPostgreSQLDB2 (ocf::heartbeat:IPaddr2):       Started cgl61
>     prmApPostgreSQLDB2 (ocf::heartbeat:pgsql): Started cgl61
>  Resource Group: OVDBgroup02-3
>     prmExPostgreSQLDB3 (ocf::heartbeat:sfex):  Started cgl62
>     prmFsPostgreSQLDB3-1       (ocf::heartbeat:Filesystem):    Started cgl62
>     prmFsPostgreSQLDB3-2       (ocf::heartbeat:Filesystem):    Started cgl62
>     prmFsPostgreSQLDB3-3       (ocf::heartbeat:Filesystem):    Started cgl62
>     prmIpPostgreSQLDB3 (ocf::heartbeat:IPaddr2):       Started cgl62
>     prmApPostgreSQLDB3 (ocf::heartbeat:pgsql): Started cgl62
> (snip)
> Migration summary:
> * Node cgl60:
> * Node cgl61:
> * Node cgl62:
> * Node cgl63: -----> Lost fail-count.....
>
> Failed actions:
>    prmExPostgreSQLDB1_start_0 (node=cgl63, call=46, rc=1, status=complete): unknown error
>    UmVIPcheck_start_0 (node=cgl63, call=45, rc=1, status=complete): unknown error
>
>
> The log shows that the failure of the start operation was detected.
>
> Sep 16 17:25:29 cgl63 crmd: [9757]: info: process_lrm_event: LRM operation prmExPostgreSQLDB1_start_0 (call=46, rc=1, cib-update=91, confirmed=true) unknown error
>
> What is the cause of the disappearance of the fail-count?
>
> I have attached the log:
>  * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2496
>
> Best Regards,
> Hideo Yamauchi.
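
For anyone trying to reproduce the symptom, the mismatch in the quoted status output (entries under "Failed actions:" but nothing under the cgl63 node in "Migration summary:") can be checked mechanically. The following is only a minimal sketch, not part of Pacemaker. The function name nodes_missing_failcount is hypothetical, and the parser assumes that a fail-count line in the migration summary looks like "   <resource>: migration-threshold=N fail-count=N"; that format does not appear in the report above and is an assumption.

```python
import re

# Operation names recognised in "Failed actions:" entries (assumed list).
OPS = r"(?:start|stop|monitor|promote|demote)"

def nodes_missing_failcount(status_text):
    """Return {node: [resources]} that have a failed action but no
    fail-count entry for that node in the migration summary."""
    failcounts = {}  # node -> resources with a fail-count line
    failed = {}      # node -> resources listed under "Failed actions:"
    section = None
    node = None
    for raw in status_text.splitlines():
        # Drop a mail quote prefix ("> ") if the text was pasted from a reply.
        line = re.sub(r"^(>\s?)+", "", raw).rstrip()
        if line.startswith("Migration summary:"):
            section = "migration"
            continue
        if line.startswith("Failed actions:"):
            section = "failed"
            continue
        if section == "migration":
            m = re.match(r"\* Node (\S+):", line)
            if m:
                node = m.group(1)
                failcounts.setdefault(node, set())
                continue
            # Assumed format: "   rsc: migration-threshold=N fail-count=N"
            m = re.match(r"\s+(\S+): .*fail-count=(\S+)", line)
            if m and node:
                failcounts[node].add(m.group(1))
        elif section == "failed":
            m = re.match(r"\s+(\S+?)_" + OPS + r"_\d+ \(node=([^,]+),", line)
            if m:
                failed.setdefault(m.group(2), set()).add(m.group(1))
    result = {}
    for n, resources in failed.items():
        missing = resources - failcounts.get(n, set())
        if missing:
            result[n] = sorted(missing)
    return result

# The relevant part of the status output from the report.
SAMPLE = """\
> Migration summary:
> * Node cgl60:
> * Node cgl61:
> * Node cgl62:
> * Node cgl63: -----> Lost fail-count.....
>
> Failed actions:
>    prmExPostgreSQLDB1_start_0 (node=cgl63, call=46, rc=1, status=complete): unknown error
>    UmVIPcheck_start_0 (node=cgl63, call=45, rc=1, status=complete): unknown error
"""

if __name__ == "__main__":
    print(nodes_missing_failcount(SAMPLE))
```

Run against the output in the report, this lists both failed resources on cgl63 as having no fail-count. It only compares the two sections of the status text and says nothing about why the fail-count was cleared.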