[Pacemaker] promote is delayed more than 60 sec after stonith reset

hj lee kerdosa at gmail.com
Wed Oct 14 21:03:40 UTC 2009


On Wed, Oct 14, 2009 at 12:26 AM, Andrew Beekhof <andrew at beekhof.net> wrote:

> On Wed, Oct 14, 2009 at 2:48 AM, hj lee <kerdosa at gmail.com> wrote:
> > Hi,
> >
> > I configured a two-node cluster on RHEL 5.3 with the following resources.
> > Note that I am using pacemaker-1.0.6.
> > - IPMI stonith as a clone. Each IPMI clone instance monitors the other node.
> > - One Master/Slave resource: the Master runs on node1, the Slave runs on
> > node2.
> > - One FakeIPMI resource.
> >
> > When I manually trigger a failure in the monitor and stop operations of
> > FakeIPMI at node1, the IPMI stonith running on node2 correctly detects
> > that its state is unclean, tries to demote the Master resource on node1,
> > and resets node1. The problem I am seeing is that the promotion only
> > happens 60 sec after the stonith reset of node1 has already succeeded.
> >
> > I want the Slave to be promoted immediately after the stonith reset
> > returns successfully! From the log,
>
> You mean the one we can't see or comment on?
>

Hi,

I have attached the logs. Here is a summary:
1. I manually triggered testdummy-res:0 to fail its monitor and stop
operations; the pengine detected it.
Oct 14 13:10:24 node1 pengine: [4033]: info: unpack_rsc_op:
testdummy-res:0_monitor_10000 on node2 returned 1 (unknown error) instead of
the expected value: 0 (ok)

2. The stop of testdummy-res:0 failed
Oct 14 13:10:24 node1 crmd: [4034]: WARN: status_from_rc: Action 6
(testdummy-res:0_stop_0) on node2 failed (target: 0 vs. rc: 1): Error

3. A stonith operation is scheduled for node2
Oct 14 13:10:24 node1 pengine: [4033]: WARN: stage6: Scheduling Node node2
for STONITH

4. The demote and promote actions are scheduled
Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Stop resource
testdummy-res:0#011(node2)
Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Leave resource
testdummy-res:1#011(Started node1)
Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Stop resource
ipmi-stonith-res:0#011(node2)
Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Leave resource
ipmi-stonith-res:1#011(Started node1)
Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Demote
vmrd-res:0#011(Master -> Stopped node2)
Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Stop resource
vmrd-res:0#011(node2)
Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Promote
vmrd-res:1#011(Slave -> Master node1)
Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Move resource
vsstvm-res#011(Started node2 -> node1)

5. The stonith reset is issued and returns successfully
Oct 14 13:10:24 node1 stonithd: [4029]: info: stonith_operate_locally::2688:
sending fencing op RESET for node2 to ipmi-stonith-res:1 (external/ipmi)
(pid=12365)
...........
Oct 14 13:10:24 node1 stonithd: [12365]: debug: external_run_cmd:
'/usr/lib/stonith/plugins/external/ipmi reset node2' output: Chassis Power
Control: Reset
Oct 14 13:10:24 node1 stonithd: [12365]: debug: external_reset_req: running
'ipmi reset' returned 0

6. The demote is initiated on node2
Oct 14 13:10:24 node1 crmd: [4034]: info: te_rsc_command: Initiating action
23: demote vmrd-res:0_demote_0 on node2

7. Pacemaker knows that the demote will fail because node2 is offline
Oct 14 13:10:24 node1 crmd: [4034]: notice: fail_incompletable_actions:
Action 23 (23) is scheduled for node2 (offline)

8. This action_timer_callback seems to trigger the promote.
Oct 14 13:11:30 node1 crmd: [4034]: WARN: action_timer_callback: Timer
popped (timeout=6000, abort_level=1000000, complete=false)

9. Eventually node1 is promoted
Oct 14 13:11:31 node1 crmd: [4034]: info: te_rsc_command: Initiating action
63: notify vmrd-res:1_pre_notify_promote_0 on node1 (local)


There is a 66 sec gap between step 7 and step 8. I think that once stonith
has reset node2 successfully, all the operations scheduled on that node
should be canceled immediately and the slave should be promoted immediately.
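
For reference, this is roughly what the configuration looks like. It is
only a minimal sketch in crm shell syntax: the resource names match my
logs, but the agent types, the IPMI parameters, and the per-instance
targeting of the peer node are placeholders rather than my exact values.

# Cloned IPMI stonith, so each node can fence its peer
# (placeholder parameters; each instance actually targets the other node)
primitive ipmi-stonith-res stonith:external/ipmi \
        params hostname="node2" ipaddr="x.x.x.x" userid="admin" passwd="xxxx" \
        op monitor interval="60s"
clone ipmi-stonith-clone ipmi-stonith-res meta clone-max="2"

# Master/Slave resource (notify is enabled, as the pre_notify_promote
# action in step 9 shows); Stateful is a placeholder agent
primitive vmrd-res ocf:pacemaker:Stateful \
        op monitor interval="10s" role="Slave" \
        op monitor interval="11s" role="Master"
ms vmrd-ms vmrd-res meta master-max="1" clone-max="2" notify="true"

# The cloned resource whose monitor/stop failure I trigger by hand
# (Dummy is a placeholder agent)
primitive testdummy-res ocf:pacemaker:Dummy \
        op monitor interval="10s"
clone testdummy-clone testdummy-res meta clone-max="2"

property stonith-enabled="true"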

Thank you very much in advance
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.zip
Type: application/zip
Size: 21702 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20091014/b32a38b4/attachment-0004.zip>

