[Pacemaker] [Problem] A restart caused by a clone resource failure affects resources on other nodes.
renayama19661014 at ybb.ne.jp
Thu Mar 31 01:15:01 UTC 2011
Hi All,
We tested a failure of a clone resource using the following procedure.
Step1) We start a cluster of three nodes.
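For reference, the one-shot cluster status shown below can be produced with crm_mon options such as the following (grouped by node, including inactive resources and the migration summary):
[root at srv01 ~]# crm_mon -1 -n -r -f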
============
Last updated: Thu Mar 31 10:01:47 2011
Stack: Heartbeat
Current DC: srv03 (e2ffc1ed-3ebe-47e2-b51b-b0f04b454311) - partition with quorum
Version: 1.0.10-9342a4147fc69f2081f8563a34509da5be0a89d0
3 Nodes configured, unknown expected votes
4 Resources configured.
============
Node srv01 (45f985d7-e7c8-4834-b01b-16b99526672b): online
    main_rsc (ocf::pacemaker:Dummy) Started
    prmDummy1:0 (ocf::pacemaker:Dummy) Started
    prmPingd:0 (ocf::pacemaker:ping) Started
Node srv02 (ed7fdcbf-9c17-4f31-8a27-a831a6b39ed5): online
    prmDummy1:1 (ocf::pacemaker:Dummy) Started
    main_rsc2 (ocf::pacemaker:Dummy) Started
    prmPingd:1 (ocf::pacemaker:ping) Started
Node srv03 (e2ffc1ed-3ebe-47e2-b51b-b0f04b454311): online
    prmDummy1:2 (ocf::pacemaker:Dummy) Started
    prmPingd:2 (ocf::pacemaker:ping) Started
Inactive resources:
Migration summary:
* Node srv01: pingd=1
* Node srv03: pingd=1
* Node srv02: pingd=1
Step2) On node srv01, we cause a failure of the clone resource by deleting its state file:
[root at srv01 ~]# rm -rf /var/run/Dummy-prmDummy1.state
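For reference, the Dummy agent decides its monitor result from this state file, so deleting the file makes the next recurring monitor fail with rc=7. A simplified sketch of the monitor logic (not the exact agent source):

# Simplified sketch of the ocf:pacemaker:Dummy monitor logic:
dummy_monitor() {
    if [ -f "${OCF_RESKEY_state}" ]; then
        return $OCF_SUCCESS      # state file present -> resource "running"
    fi
    return $OCF_NOT_RUNNING      # state file deleted -> rc=7, "not running"
}

This is what produces the failed monitor (rc=7) reported for prmDummy1:0 on srv01 below.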
Step3) On node srv02, the pingd clone instance is restarted. As a side effect of this restart, main_rsc2 is restarted as well.
* In addition, the clone instance numbering becomes strange.
[root at srv02 ~]# tail -f /var/log/ha-log | grep stop
Mar 31 10:02:22 srv02 crmd: [24471]: info: do_lrm_rsc_op: Performing key=29:4:0:6c32b0f8-d37a-4ebc-8365-30e2e02ba9d3 op=prmPingd:1_stop_0 )
Mar 31 10:02:25 srv02 lrmd: [24468]: info: rsc:prmPingd:1:12: stop
Mar 31 10:02:25 srv02 crmd: [24471]: info: process_lrm_event: LRM operation prmPingd:1_stop_0 (call=12, rc=0, cib-update=21, confirmed=true) ok
Mar 31 10:02:33 srv02 crmd: [24471]: info: do_lrm_rsc_op: Performing key=9:5:0:6c32b0f8-d37a-4ebc-8365-30e2e02ba9d3 op=main_rsc2_stop_0 )
Mar 31 10:02:33 srv02 lrmd: [24468]: info: rsc:main_rsc2:14: stop
Mar 31 10:02:33 srv02 crmd: [24471]: info: process_lrm_event: LRM operation main_rsc2_stop_0 (call=14, rc=0, cib-update=23, confirmed=true) ok
============
Last updated: Thu Mar 31 10:02:40 2011
Stack: Heartbeat
Current DC: srv03 (e2ffc1ed-3ebe-47e2-b51b-b0f04b454311) - partition with quorum
Version: 1.0.10-9342a4147fc69f2081f8563a34509da5be0a89d0
3 Nodes configured, unknown expected votes
4 Resources configured.
============
Node srv01 (45f985d7-e7c8-4834-b01b-16b99526672b): online
Node srv02 (ed7fdcbf-9c17-4f31-8a27-a831a6b39ed5): online
    prmDummy1:1 (ocf::pacemaker:Dummy) Started ---------> :1 (strange)
    prmPingd:0 (ocf::pacemaker:ping) Started ---------> :0 (strange)
Node srv03 (e2ffc1ed-3ebe-47e2-b51b-b0f04b454311): online
    main_rsc (ocf::pacemaker:Dummy) Started
    prmDummy1:2 (ocf::pacemaker:Dummy) Started ---------> :2 (strange)
    prmPingd:1 (ocf::pacemaker:ping) Started ---------> :1 (strange)
Inactive resources:
    main_rsc2 (ocf::pacemaker:Dummy): Stopped
    Clone Set: clnDummy1
        Started: [ srv02 srv03 ]
        Stopped: [ prmDummy1:0 ]
    Clone Set: clnPingd
        Started: [ srv02 srv03 ]
        Stopped: [ prmPingd:2 ]
Migration summary:
* Node srv01:
    prmDummy1:0: migration-threshold=1 fail-count=1
* Node srv03: pingd=1
* Node srv02: pingd=1
Failed actions:
    prmDummy1:0_monitor_10000 (node=srv01, call=8, rc=7, status=complete): not running
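For reference, the fail-count recorded on srv01 can be inspected and cleared between test runs with crm shell commands such as:
[root at srv01 ~]# crm resource failcount prmDummy1 show srv01
[root at srv01 ~]# crm resource cleanup prmDummy1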
We think that the restart of pingd on node srv02 is unnecessary.
Is there a way to solve this problem?
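For reference, the dependency we are describing is roughly sketched below (the real CIB is in the attached hb_report; the constraint ids and scores here are only illustrative):

# Sketch of the kind of configuration assumed (ids/scores illustrative):
clone clnDummy1 prmDummy1
clone clnPingd prmPingd
colocation col_main_rsc2 inf: main_rsc2 clnPingd
order ord_dummy_pingd 0: clnDummy1 clnPingd
order ord_pingd_main 0: clnPingd main_rsc2
# With non-interleaved clones (interleave defaults to "false"), an
# order against the whole clone set means that the stop of one
# instance (here prmDummy1:0 on srv01) can ripple into restarts of
# clone instances and their dependents on other nodes.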
The following bug may possibly be related:
* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2508
I registered the logs in Bugzilla (hb_report attached):
* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2574
Best Regards,
Hideo Yamauchi.