[Pacemaker] [Problem]Lost fail-count.
Andrew Beekhof
andrew at beekhof.net
Thu Sep 30 08:31:26 UTC 2010
I see you've created a bug for this; I'll follow up there.
On Wed, Sep 29, 2010 at 10:15 AM, <renayama19661014 at ybb.ne.jp> wrote:
> Hi,
>
> We tested how a resource failure is handled during a cluster partition and the subsequent
> recovery of the cluster.
>
> However, when the cluster recovered, the fail-count disappeared.
> The Failed actions entries did not disappear.
>
> It occurred with the following procedure.
>
> Step1) We start Heartbeat.
>
> Step2) We isolate the cgl60 node from the cluster using iptables.
>
> Step3) Once a sfex resource has started on the cgl63 node, we remove the isolation of the cgl60 node.
>
> Step4) On the cgl63 node, the start of VIPcheck and sfex fails with an error.
> * VIPcheck and sfex are the resources that detect a double start.
>
> Step5) The fail-count is lost.
>
> ============
> Last updated: Thu Sep 16 17:26:10 2010
> Stack: Heartbeat
> Current DC: cgl63 (16349f88-0203-40d1-ba48-b7a5c4547a26) - partition with quorum
> Version: 1.0.9-74392a28b7f3 stable-1.0 tip
> 4 Nodes configured, unknown expected votes
> 10 Resources configured.
> ============
>
> Online: [ cgl60 cgl61 cgl62 cgl63 ]
>
> Resource Group: UMgroup01
> UmVIPcheck (ocf::heartbeat:VIPcheck): Started cgl60
> UmIPaddr (ocf::heartbeat:IPaddr2): Started cgl60
> UmDummy01 (ocf::pacemaker:Dummy): Started cgl60
> UmDummy02 (ocf::pacemaker:Dummy): Started cgl60
> Resource Group: OVDBgroup02-1
> prmExPostgreSQLDB1 (ocf::heartbeat:sfex): Started cgl60
> prmFsPostgreSQLDB1-1 (ocf::heartbeat:Filesystem): Started cgl60
> prmFsPostgreSQLDB1-2 (ocf::heartbeat:Filesystem): Started cgl60
> prmFsPostgreSQLDB1-3 (ocf::heartbeat:Filesystem): Started cgl60
> prmIpPostgreSQLDB1 (ocf::heartbeat:IPaddr2): Started cgl60
> prmApPostgreSQLDB1 (ocf::heartbeat:pgsql): Started cgl60
> Resource Group: OVDBgroup02-2
> prmExPostgreSQLDB2 (ocf::heartbeat:sfex): Started cgl61
> prmFsPostgreSQLDB2-1 (ocf::heartbeat:Filesystem): Started cgl61
> prmFsPostgreSQLDB2-2 (ocf::heartbeat:Filesystem): Started cgl61
> prmFsPostgreSQLDB2-3 (ocf::heartbeat:Filesystem): Started cgl61
> prmIpPostgreSQLDB2 (ocf::heartbeat:IPaddr2): Started cgl61
> prmApPostgreSQLDB2 (ocf::heartbeat:pgsql): Started cgl61
> Resource Group: OVDBgroup02-3
> prmExPostgreSQLDB3 (ocf::heartbeat:sfex): Started cgl62
> prmFsPostgreSQLDB3-1 (ocf::heartbeat:Filesystem): Started cgl62
> prmFsPostgreSQLDB3-2 (ocf::heartbeat:Filesystem): Started cgl62
> prmFsPostgreSQLDB3-3 (ocf::heartbeat:Filesystem): Started cgl62
> prmIpPostgreSQLDB3 (ocf::heartbeat:IPaddr2): Started cgl62
> prmApPostgreSQLDB3 (ocf::heartbeat:pgsql): Started cgl62
> (snip)
> Migration summary:
> * Node cgl60:
> * Node cgl61:
> * Node cgl62:
> * Node cgl63: -----> Lost fail-count.....
>
> Failed actions:
> prmExPostgreSQLDB1_start_0 (node=cgl63, call=46, rc=1, status=complete): unknown error
> UmVIPcheck_start_0 (node=cgl63, call=45, rc=1, status=complete): unknown error
>
>
> Looking at the log, the failure of the start operation does appear to have been detected.
>
> Sep 16 17:25:29 cgl63 crmd: [9757]: info: process_lrm_event: LRM operation prmExPostgreSQLDB1_start_0
> (call=46, rc=1, cib-update=91, confirmed=true) unknown error
>
> What is the cause of the disappearance of fail-count?
>
> I have attached the logs to the following bug report:
> * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2496
>
> Best Regards,
> Hideo Yamauchi.
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>
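
The symptom reported above (entries under "Failed actions" with no matching fail-count under "Migration summary") can be checked mechanically. What follows is only a minimal sketch, not part of Pacemaker: the function name missing_fail_counts is hypothetical, the input is assumed to be crm_mon text with the "> " quoting removed, and the "<resource>: migration-threshold=N fail-count=M" line format is an assumption about how a recorded fail-count would be displayed, since the output in this report contains no such line.

```python
import re

# "* Node cgl63:" lines inside the "Migration summary:" section.
NODE_RE = re.compile(r"^\s*\* Node (?P<node>\S+):")
# ASSUMED format of a per-resource fail-count line under a node, e.g.
# "   prmExPostgreSQLDB1: migration-threshold=1 fail-count=1000000".
COUNT_RE = re.compile(r"^\s*(?P<rsc>\S+): .*fail-count=(?P<count>\S+)")
# Failed-action lines as quoted in the report, e.g.
# "prmExPostgreSQLDB1_start_0 (node=cgl63, call=46, rc=1, ...): unknown error".
FAILED_RE = re.compile(r"^\s*(?P<rsc>\S+)_(?P<op>[a-z]+)_\d+ \(node=(?P<node>[^,)]+)")


def missing_fail_counts(status_text):
    """Return sorted (node, resource) pairs that have a failed action
    but no fail-count entry in the migration summary."""
    counted = set()
    failed = set()
    section = None
    node = None
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped == "Migration summary:":
            section, node = "summary", None
            continue
        if stripped == "Failed actions:":
            section = "failed"
            continue
        if section == "summary":
            m = NODE_RE.match(line)
            if m:
                node = m.group("node")
                continue
            m = COUNT_RE.match(line)
            if m and node is not None:
                counted.add((node, m.group("rsc")))
        elif section == "failed":
            m = FAILED_RE.match(line)
            if m:
                failed.add((m.group("node"), m.group("rsc")))
    return sorted(failed - counted)


# The relevant part of the status output from the report, unquoted.
REPORTED = """\
Migration summary:
* Node cgl60:
* Node cgl61:
* Node cgl62:
* Node cgl63:

Failed actions:
    prmExPostgreSQLDB1_start_0 (node=cgl63, call=46, rc=1, status=complete): unknown error
    UmVIPcheck_start_0 (node=cgl63, call=45, rc=1, status=complete): unknown error
"""

if __name__ == "__main__":
    for node, rsc in missing_fail_counts(REPORTED):
        print(f"{node}: {rsc} has a failed action but no fail-count")
```

Run against the output quoted in this report, the sketch flags both UmVIPcheck and prmExPostgreSQLDB1 on cgl63, which matches the lost fail-count described in Step5.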