[Pacemaker] [Problem] A restart caused by a clone resource failure affects resources on other nodes.
renayama19661014 at ybb.ne.jp
Thu Mar 31 01:15:01 UTC 2011
Hi All,
We tested a failure of a clone resource using the following procedure.
Step1) We start a cluster of three nodes.
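For reference, the one-shot cluster status shown below can be produced with crm_mon options such as the following (grouped by node, including inactive resources and the migration summary):
[root at srv01 ~]# crm_mon -1 -n -r -f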
============
Last updated: Thu Mar 31 10:01:47 2011
Stack: Heartbeat
Current DC: srv03 (e2ffc1ed-3ebe-47e2-b51b-b0f04b454311) - partition with quorum
Version: 1.0.10-9342a4147fc69f2081f8563a34509da5be0a89d0
3 Nodes configured, unknown expected votes
4 Resources configured.
============
Node srv01 (45f985d7-e7c8-4834-b01b-16b99526672b): online
    main_rsc (ocf::pacemaker:Dummy) Started
    prmDummy1:0 (ocf::pacemaker:Dummy) Started
    prmPingd:0 (ocf::pacemaker:ping) Started
Node srv02 (ed7fdcbf-9c17-4f31-8a27-a831a6b39ed5): online
    prmDummy1:1 (ocf::pacemaker:Dummy) Started
    main_rsc2 (ocf::pacemaker:Dummy) Started
    prmPingd:1 (ocf::pacemaker:ping) Started
Node srv03 (e2ffc1ed-3ebe-47e2-b51b-b0f04b454311): online
    prmDummy1:2 (ocf::pacemaker:Dummy) Started
    prmPingd:2 (ocf::pacemaker:ping) Started
Inactive resources:
Migration summary:
* Node srv01: pingd=1
* Node srv03: pingd=1
* Node srv02: pingd=1
Step2) On node srv01, we cause a failure of the clone resource by deleting its state file:
[root at srv01 ~]# rm -rf /var/run/Dummy-prmDummy1.state
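For reference, the Dummy agent decides its monitor result from this state file, so deleting the file makes the next recurring monitor fail with rc=7. A simplified sketch of the monitor logic (not the exact agent source):

# Simplified sketch of the ocf:pacemaker:Dummy monitor logic:
dummy_monitor() {
    if [ -f "${OCF_RESKEY_state}" ]; then
        return $OCF_SUCCESS      # state file present -> resource "running"
    fi
    return $OCF_NOT_RUNNING      # state file deleted -> rc=7, "not running"
}

This is what produces the failed monitor (rc=7) reported for prmDummy1:0 on srv01 below.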
Step3) On node srv02, the pingd clone instance is restarted. As a side effect of this restart, main_rsc2 is restarted as well.
* In addition, the clone instance numbering becomes strange.
[root at srv02 ~]# tail -f /var/log/ha-log | grep stop
Mar 31 10:02:22 srv02 crmd: [24471]: info: do_lrm_rsc_op: Performing key=29:4:0:6c32b0f8-d37a-4ebc-8365-30e2e02ba9d3 op=prmPingd:1_stop_0 )
Mar 31 10:02:25 srv02 lrmd: [24468]: info: rsc:prmPingd:1:12: stop
Mar 31 10:02:25 srv02 crmd: [24471]: info: process_lrm_event: LRM operation prmPingd:1_stop_0 (call=12, rc=0, cib-update=21, confirmed=true) ok
Mar 31 10:02:33 srv02 crmd: [24471]: info: do_lrm_rsc_op: Performing key=9:5:0:6c32b0f8-d37a-4ebc-8365-30e2e02ba9d3 op=main_rsc2_stop_0 )
Mar 31 10:02:33 srv02 lrmd: [24468]: info: rsc:main_rsc2:14: stop
Mar 31 10:02:33 srv02 crmd: [24471]: info: process_lrm_event: LRM operation main_rsc2_stop_0 (call=14, rc=0, cib-update=23, confirmed=true) ok
============
Last updated: Thu Mar 31 10:02:40 2011
Stack: Heartbeat
Current DC: srv03 (e2ffc1ed-3ebe-47e2-b51b-b0f04b454311) - partition with quorum
Version: 1.0.10-9342a4147fc69f2081f8563a34509da5be0a89d0
3 Nodes configured, unknown expected votes
4 Resources configured.
============
Node srv01 (45f985d7-e7c8-4834-b01b-16b99526672b): online
Node srv02 (ed7fdcbf-9c17-4f31-8a27-a831a6b39ed5): online
    prmDummy1:1 (ocf::pacemaker:Dummy) Started ---------> :1 (strange)
    prmPingd:0 (ocf::pacemaker:ping) Started ---------> :0 (strange)
Node srv03 (e2ffc1ed-3ebe-47e2-b51b-b0f04b454311): online
    main_rsc (ocf::pacemaker:Dummy) Started
    prmDummy1:2 (ocf::pacemaker:Dummy) Started ---------> :2 (strange)
    prmPingd:1 (ocf::pacemaker:ping) Started ---------> :1 (strange)
Inactive resources:
    main_rsc2 (ocf::pacemaker:Dummy): Stopped
    Clone Set: clnDummy1
        Started: [ srv02 srv03 ]
        Stopped: [ prmDummy1:0 ]
    Clone Set: clnPingd
        Started: [ srv02 srv03 ]
        Stopped: [ prmPingd:2 ]
Migration summary:
* Node srv01:
    prmDummy1:0: migration-threshold=1 fail-count=1
* Node srv03: pingd=1
* Node srv02: pingd=1
Failed actions:
    prmDummy1:0_monitor_10000 (node=srv01, call=8, rc=7, status=complete): not running
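For reference, the fail-count recorded on srv01 can be inspected and cleared between test runs with crm shell commands such as:
[root at srv01 ~]# crm resource failcount prmDummy1 show srv01
[root at srv01 ~]# crm resource cleanup prmDummy1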
We think that the restart of pingd on node srv02 is unnecessary.
Is there a way to solve this problem?
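For reference, the dependency we are describing is roughly sketched below (the real CIB is in the attached hb_report; the constraint ids and scores here are only illustrative):

# Sketch of the kind of configuration assumed (ids/scores illustrative):
clone clnDummy1 prmDummy1
clone clnPingd prmPingd
colocation col_main_rsc2 inf: main_rsc2 clnPingd
order ord_dummy_pingd 0: clnDummy1 clnPingd
order ord_pingd_main 0: clnPingd main_rsc2
# With non-interleaved clones (interleave defaults to "false"), an
# order against the whole clone set means that the stop of one
# instance (here prmDummy1:0 on srv01) can ripple into restarts of
# clone instances and their dependents on other nodes.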
The following bug may possibly be related:
* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2508
I registered the logs in Bugzilla (hb_report attached):
* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2574
Best Regards,
Hideo Yamauchi.