[Pacemaker] [Problem]Lost fail-count.
renayama19661014 at ybb.ne.jp
Wed Sep 29 08:15:04 UTC 2010
Hi,
We tested what happens when a resource fails while the cluster is partitioned and the cluster is then
rejoined.
However, when the cluster recovered, the fail-count of the failed resources disappeared.
The entries under "Failed actions" did not disappear.
The problem occurred with the following procedure.
Step1) We start Heartbeat.
Step2) We isolate the cgl60 node from the rest of the cluster with iptables.
Step3) When the sfex resource starts on the cgl63 node, we remove the isolation of the cgl60 node.
Step4) On the cgl63 node, the start of VIPcheck and sfex fails with an error.
* VIPcheck and sfex are resources that detect a double start.
Step5) The fail-count is lost.
============
Last updated: Thu Sep 16 17:26:10 2010
Stack: Heartbeat
Current DC: cgl63 (16349f88-0203-40d1-ba48-b7a5c4547a26) - partition with quorum
Version: 1.0.9-74392a28b7f3 stable-1.0 tip
4 Nodes configured, unknown expected votes
10 Resources configured.
============
Online: [ cgl60 cgl61 cgl62 cgl63 ]
Resource Group: UMgroup01
UmVIPcheck (ocf::heartbeat:VIPcheck): Started cgl60
UmIPaddr (ocf::heartbeat:IPaddr2): Started cgl60
UmDummy01 (ocf::pacemaker:Dummy): Started cgl60
UmDummy02 (ocf::pacemaker:Dummy): Started cgl60
Resource Group: OVDBgroup02-1
prmExPostgreSQLDB1 (ocf::heartbeat:sfex): Started cgl60
prmFsPostgreSQLDB1-1 (ocf::heartbeat:Filesystem): Started cgl60
prmFsPostgreSQLDB1-2 (ocf::heartbeat:Filesystem): Started cgl60
prmFsPostgreSQLDB1-3 (ocf::heartbeat:Filesystem): Started cgl60
prmIpPostgreSQLDB1 (ocf::heartbeat:IPaddr2): Started cgl60
prmApPostgreSQLDB1 (ocf::heartbeat:pgsql): Started cgl60
Resource Group: OVDBgroup02-2
prmExPostgreSQLDB2 (ocf::heartbeat:sfex): Started cgl61
prmFsPostgreSQLDB2-1 (ocf::heartbeat:Filesystem): Started cgl61
prmFsPostgreSQLDB2-2 (ocf::heartbeat:Filesystem): Started cgl61
prmFsPostgreSQLDB2-3 (ocf::heartbeat:Filesystem): Started cgl61
prmIpPostgreSQLDB2 (ocf::heartbeat:IPaddr2): Started cgl61
prmApPostgreSQLDB2 (ocf::heartbeat:pgsql): Started cgl61
Resource Group: OVDBgroup02-3
prmExPostgreSQLDB3 (ocf::heartbeat:sfex): Started cgl62
prmFsPostgreSQLDB3-1 (ocf::heartbeat:Filesystem): Started cgl62
prmFsPostgreSQLDB3-2 (ocf::heartbeat:Filesystem): Started cgl62
prmFsPostgreSQLDB3-3 (ocf::heartbeat:Filesystem): Started cgl62
prmIpPostgreSQLDB3 (ocf::heartbeat:IPaddr2): Started cgl62
prmApPostgreSQLDB3 (ocf::heartbeat:pgsql): Started cgl62
(snip)
Migration summary:
* Node cgl60:
* Node cgl61:
* Node cgl62:
* Node cgl63: -----> Lost fail-count.....
Failed actions:
prmExPostgreSQLDB1_start_0 (node=cgl63, call=46, rc=1, status=complete): unknown error
UmVIPcheck_start_0 (node=cgl63, call=45, rc=1, status=complete): unknown error
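The inconsistency in the output above (failed actions on cgl63, but no fail-count for cgl63 in the migration summary) can also be checked mechanically. The following Python code is only a rough sketch for that purpose. The function name find_missing_failcounts and the regular expressions are hypothetical, and the fail-count line format ("resource: migration-threshold=N fail-count=N" under each node line) is an assumption, because no such line appears in the output above.

```python
import re

# "* Node cgl63:" lines in the "Migration summary" section.
NODE_RE = re.compile(r"^\*\s*Node\s+(\S+?):")
# ASSUMED format of a fail-count line under a node, e.g.
# "UmVIPcheck: migration-threshold=1 fail-count=1000000".
FAILCOUNT_RE = re.compile(r"^(\S+):.*\bfail-count=(\S+)")
# "<resource>_<op>_<interval> (node=<name>, ..." lines under "Failed actions".
FAILED_RE = re.compile(r"^(\S+)\s+\(node=([^,\s]+),")


def find_missing_failcounts(crm_mon_text):
    """Return (node, resource, op) for each failed action that has no
    fail-count entry for the same node and resource."""
    section = None
    node = None
    failcounts = set()  # (node, resource) pairs that have a fail-count
    failed = []         # (node, resource, op) from "Failed actions"
    for raw in crm_mon_text.splitlines():
        line = raw.strip()
        if line.startswith("Migration summary:"):
            section = "migration"
            continue
        if line.startswith("Failed actions:"):
            section = "failed"
            continue
        if section == "migration":
            m = NODE_RE.match(line)
            if m:
                node = m.group(1)
                continue
            m = FAILCOUNT_RE.match(line)
            if m and node:
                failcounts.add((node, m.group(1)))
        elif section == "failed":
            m = FAILED_RE.match(line)
            if m:
                parts = m.group(1).rsplit("_", 2)  # resource, op, interval
                if len(parts) == 3:
                    failed.append((m.group(2), parts[0], parts[1]))
    return [f for f in failed if (f[0], f[1]) not in failcounts]


# Sample taken from the crm_mon output in this mail.
SAMPLE = """\
Migration summary:
* Node cgl60:
* Node cgl61:
* Node cgl62:
* Node cgl63: -----> Lost fail-count.....
Failed actions:
prmExPostgreSQLDB1_start_0 (node=cgl63, call=46, rc=1, status=complete): unknown error
UmVIPcheck_start_0 (node=cgl63, call=45, rc=1, status=complete): unknown error
"""

for node_name, resource, op in find_missing_failcounts(SAMPLE):
    print("no fail-count for %s (%s) on %s" % (resource, op, node_name))
```

With the sample above, both failed start operations on cgl63 are reported, which matches the symptom described in this mail.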
According to the log, the failure of the start operation itself was detected.
Sep 16 17:25:29 cgl63 crmd: [9757]: info: process_lrm_event: LRM operation prmExPostgreSQLDB1_start_0
(call=46, rc=1, cib-update=91, confirmed=true) unknown error
What causes the fail-count to disappear?
The log is attached here:
* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2496
Best Regards,
Hideo Yamauchi.