<br><br><div class="gmail_quote">On Wed, Oct 14, 2009 at 12:26 AM, Andrew Beekhof <span dir="ltr">&lt;<a href="mailto:andrew@beekhof.net">andrew@beekhof.net</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="im">On Wed, Oct 14, 2009 at 2:48 AM, hj lee &lt;<a href="mailto:kerdosa@gmail.com">kerdosa@gmail.com</a>&gt; wrote:<br>

&gt; Hi,<br>

&gt;<br>

&gt; I configured two nodes cluster on RHEL 5.3 with the following resources.<br>

&gt; Note that I am using pacemaker-1.0.6.<br>

&gt; - IPMI stonith as a clone. Each IPMI clone is monitoring the other node.<br>

&gt; - One Master/Slave resource: Master is running on node1, Slave is running on<br>

&gt; node2.<br>

&gt; - One FakeIPMI resource.<br>

&gt;<br>

&gt; When I manually trigger the failure in monitor and stop operation of<br>

&gt; FakeIPMI at node1, the IPMI stonith running on node2 detects its state<br>

&gt; unclean correctly and it tries to demote Master resource in node1 and reset<br>

&gt; the node1. The problem I am seeing is the promotion happens 60 sec later<br>

&gt; after the stonith reset the node1 successfully.<br>

&gt;<br>

&gt; I want the Slave gets promoted immediately right after the stonith reset<br>

&gt; returned successfully! From the log,<br>

<br>

</div>You mean the one we can&#39;t see or comment on?<br></blockquote><div><br>Hi,<br><br>I have attached the logs. Here is a summary:<br>1. I manually triggered testdummy-res:0 to fail on its monitor and stop operations. The pengine detected it.<br>

Oct 14 13:10:24 node1 pengine: [4033]: info: unpack_rsc_op: testdummy-res:0_monitor_10000 on node2 returned 1 (unknown error) instead of the expected value: 0 (ok)<br> <br>2. The stop failed on testdummy-res:0<br>Oct 14 13:10:24 node1 crmd: [4034]: WARN: status_from_rc: Action 6 (testdummy-res:0_stop_0) on node2 failed (target: 0 vs. rc: 1): Error<br>

</div></div><br>3. The stonith operation is scheduled for node2<br>Oct 14 13:10:24 node1 pengine: [4033]: WARN: stage6: Scheduling Node node2 for STONITH<br><br>4. The demote and promote are scheduled<br>Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Stop resource testdummy-res:0#011(node2)<br>

Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Leave resource testdummy-res:1#011(Started node1)<br>Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Stop resource ipmi-stonith-res:0#011(node2)<br>Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Leave resource ipmi-stonith-res:1#011(Started node1)<br>

Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Demote vmrd-res:0#011(Master -&gt; Stopped node2)<br>Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Stop resource vmrd-res:0#011(node2)<br>Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Promote vmrd-res:1#011(Slave -&gt; Master node1)<br>

Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Move resource vsstvm-res#011(Started node2 -&gt; node1)<br><br>5. The stonith reset is issued and returns successfully<br>Oct 14 13:10:24 node1 stonithd: [4029]: info: stonith_operate_locally::2688: sending fencing op RESET for node2 to ipmi-stonith-res:1 (external/ipmi) (pid=12365)<br>

...........<br>Oct 14 13:10:24 node1 stonithd: [12365]: debug: external_run_cmd: &#39;/usr/lib/stonith/plugins/external/ipmi reset node2&#39; output: Chassis Power Control: Reset<br>Oct 14 13:10:24 node1 stonithd: [12365]: debug: external_reset_req: running &#39;ipmi reset&#39; returned 0<br>

<br>6. The demote is initiated on node2<br>Oct 14 13:10:24 node1 crmd: [4034]: info: te_rsc_command: Initiating action 23: demote vmrd-res:0_demote_0 on node2<br><br>7. Pacemaker knows that the demote will fail because node2 is offline<br>

Oct 14 13:10:24 node1 crmd: [4034]: notice: fail_incompletable_actions: Action 23 (23) is scheduled for node2 (offline)<br><br>8. This action_timer_callback seems to trigger the promote.<br>Oct 14 13:11:30 node1 crmd: [4034]: WARN: action_timer_callback: Timer popped (timeout=6000, abort_level=1000000, complete=false)<br>

<br>9. Eventually the slave on node1 is promoted<br>Oct 14 13:11:31 node1 crmd: [4034]: info: te_rsc_command: Initiating action 63: notify vmrd-res:1_pre_notify_promote_0 on node1 (local)<br><br><br>There is a 66-second gap between step 7 and step 8. I think that once stonith has reset node2 successfully, all operations scheduled on that node should be canceled immediately and the slave should be promoted right away.<br>
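<br>As a sanity check on that gap, here is a minimal Python sketch that computes it from the timestamps of the two log lines in steps 7 and 8. The helper name gap_seconds is made up for illustration, and the year 2009 is an assumption taken from the date of this thread, since syslog lines carry no year.<br>

```python
from datetime import datetime

# Assumption: the year comes from the date of this thread; syslog omits it.
ASSUMED_YEAR = "2009"
STAMP_FORMAT = "%Y %b %d %H:%M:%S"


def parse_stamp(line):
    """Parse the leading 'Mon DD HH:MM:SS' fields of a syslog line."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(" ".join([ASSUMED_YEAR, month, day, clock]), STAMP_FORMAT)


def gap_seconds(earlier_line, later_line):
    """Hypothetical helper: whole seconds elapsed between two syslog lines."""
    delta = parse_stamp(later_line) - parse_stamp(earlier_line)
    return int(delta.total_seconds())


# Log lines copied from steps 7 and 8 of the summary.
step7 = ("Oct 14 13:10:24 node1 crmd: [4034]: notice: fail_incompletable_actions: "
         "Action 23 (23) is scheduled for node2 (offline)")
step8 = ("Oct 14 13:11:30 node1 crmd: [4034]: WARN: action_timer_callback: "
         "Timer popped (timeout=6000, abort_level=1000000, complete=false)")

print(gap_seconds(step7, step8))
```

<br>The same helper can be pointed at any two lines of the attached logs to measure other delays in the transition.<br>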

<br>Thank you very much in advance<br><br><br>