[Pacemaker] [Problem]Lost fail-count.

Wed Sep 29 04:15:04 EDT 2010

Hi,

We tested how a resource failure is handled during a cluster partition and the subsequent
recovery of the cluster.

However, when the cluster recovered, the fail-count disappeared.
The Failed actions entries did not disappear at that time.

It occurred with the following procedure.

Step1) We start Heartbeat.

Step2) We isolate the cgl60 node from the rest of the cluster with iptables.

Step3) Once the sfex resource has started on the cgl63 node, we remove the isolation of the cgl60 node.

Step4) On the cgl63 node, the start of VIPcheck and sfex fails with an error.
 * VIPcheck and sfex are the resources that detect a double start.

Step5) The fail-count is lost.

============
Last updated: Thu Sep 16 17:26:10 2010
Stack: Heartbeat
Current DC: cgl63 (16349f88-0203-40d1-ba48-b7a5c4547a26) - partition with quorum
Version: 1.0.9-74392a28b7f3 stable-1.0 tip
4 Nodes configured, unknown expected votes
10 Resources configured.
============

Online: [ cgl60 cgl61 cgl62 cgl63 ]

 Resource Group: UMgroup01
     UmVIPcheck (ocf::heartbeat:VIPcheck):      Started cgl60
     UmIPaddr   (ocf::heartbeat:IPaddr2):       Started cgl60
     UmDummy01  (ocf::pacemaker:Dummy): Started cgl60
     UmDummy02  (ocf::pacemaker:Dummy): Started cgl60
 Resource Group: OVDBgroup02-1
     prmExPostgreSQLDB1 (ocf::heartbeat:sfex):  Started cgl60
     prmFsPostgreSQLDB1-1       (ocf::heartbeat:Filesystem):    Started cgl60
     prmFsPostgreSQLDB1-2       (ocf::heartbeat:Filesystem):    Started cgl60
     prmFsPostgreSQLDB1-3       (ocf::heartbeat:Filesystem):    Started cgl60
     prmIpPostgreSQLDB1 (ocf::heartbeat:IPaddr2):       Started cgl60
     prmApPostgreSQLDB1 (ocf::heartbeat:pgsql): Started cgl60
 Resource Group: OVDBgroup02-2
     prmExPostgreSQLDB2 (ocf::heartbeat:sfex):  Started cgl61
     prmFsPostgreSQLDB2-1       (ocf::heartbeat:Filesystem):    Started cgl61
     prmFsPostgreSQLDB2-2       (ocf::heartbeat:Filesystem):    Started cgl61
     prmFsPostgreSQLDB2-3       (ocf::heartbeat:Filesystem):    Started cgl61
     prmIpPostgreSQLDB2 (ocf::heartbeat:IPaddr2):       Started cgl61
     prmApPostgreSQLDB2 (ocf::heartbeat:pgsql): Started cgl61
 Resource Group: OVDBgroup02-3
     prmExPostgreSQLDB3 (ocf::heartbeat:sfex):  Started cgl62
     prmFsPostgreSQLDB3-1       (ocf::heartbeat:Filesystem):    Started cgl62
     prmFsPostgreSQLDB3-2       (ocf::heartbeat:Filesystem):    Started cgl62
     prmFsPostgreSQLDB3-3       (ocf::heartbeat:Filesystem):    Started cgl62
     prmIpPostgreSQLDB3 (ocf::heartbeat:IPaddr2):       Started cgl62
     prmApPostgreSQLDB3 (ocf::heartbeat:pgsql): Started cgl62
(snip)
Migration summary:
* Node cgl60:
* Node cgl61:
* Node cgl62:
* Node cgl63: -----> Lost fail-count.....

Failed actions:
    prmExPostgreSQLDB1_start_0 (node=cgl63, call=46, rc=1, status=complete): unknown error
    UmVIPcheck_start_0 (node=cgl63, call=45, rc=1, status=complete): unknown error
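
The inconsistency in the output above (a node appears under "Failed actions" but has no fail-count under "Migration summary") can also be checked mechanically. The following is only a rough sketch: the function name nodes_missing_failcount is hypothetical and not part of Pacemaker, and it assumes that a recorded fail-count would appear as a line containing "fail-count=" under the node entry in the migration summary.

```python
import re


def nodes_missing_failcount(crm_mon_text):
    """Return nodes that have failed actions but no fail-count in the summary.

    Hypothetical helper for illustration only. Assumption: a recorded
    fail-count shows up as a line containing "fail-count=" below the
    "* Node <name>:" entry in the "Migration summary:" section.
    """
    # Nodes named in failed-action entries, e.g. "(node=cgl63, call=46, ...)".
    failed_nodes = set(re.findall(r"\(node=([^,\s]+),", crm_mon_text))

    nodes_with_count = set()
    current_node = None
    in_summary = False
    for line in crm_mon_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Migration summary:"):
            in_summary = True
            continue
        if stripped.startswith("Failed actions:"):
            in_summary = False
            continue
        if not in_summary:
            continue
        match = re.match(r"\* Node (\S+?):", stripped)
        if match:
            current_node = match.group(1)
        elif current_node and "fail-count=" in stripped:
            nodes_with_count.add(current_node)

    return sorted(failed_nodes - nodes_with_count)


# Excerpt of the status output from this report.
SAMPLE = """\
Migration summary:
* Node cgl60:
* Node cgl61:
* Node cgl62:
* Node cgl63:

Failed actions:
    prmExPostgreSQLDB1_start_0 (node=cgl63, call=46, rc=1, status=complete): unknown error
    UmVIPcheck_start_0 (node=cgl63, call=45, rc=1, status=complete): unknown error
"""

if __name__ == "__main__":
    print(nodes_missing_failcount(SAMPLE))
```

For the excerpt above, the sketch reports cgl63, which matches the node where the fail-count was lost.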

Looking at the log, the failure of the start operation does appear to have been detected.

Sep 16 17:25:29 cgl63 crmd: [9757]: info: process_lrm_event: LRM operation prmExPostgreSQLDB1_start_0 (call=46, rc=1, cib-update=91, confirmed=true) unknown error

What causes the fail-count to disappear?

I have attached the log to the following bug report.
 * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2496

Best Regards,
Hideo Yamauchi.