[ClusterLabs] successful ipmi stonith still times out

Ken Gaillot kgaillot at redhat.com
Thu Dec 17 19:04:12 CET 2015


On 12/17/2015 10:32 AM, Ron Kerry wrote:
> I have a customer (running SLE 11 SP4 HAE) who is seeing the following
> stonith behavior running the ipmi stonith plugin.
> 
> Dec 15 14:21:43 test4 pengine[24002]:  warning: pe_fence_node: Node
> test3 will be fenced because termination was requested
> Dec 15 14:21:43 test4 pengine[24002]:  warning: determine_online_status:
> Node test3 is unclean
> Dec 15 14:21:43 test4 pengine[24002]:  warning: stage6: Scheduling Node
> test3 for STONITH
> 
> ... it issues the reset and it is noted ...
> Dec 15 14:21:45 test4 external/ipmi(STONITH-test3)[177184]: [177197]:
> debug: ipmitool output: Chassis Power Control: Reset
> Dec 15 14:21:46 test4 stonith-ng[23999]:   notice: log_operation:
> Operation 'reboot' [177179] (call 2 from crmd.24003) for host 'test3'
> with device 'STONITH-test3' returned: 0 (OK)
> 
> ... test3 does go down ...
> Dec 15 14:22:21 test4 kernel: [90153.906461] Cell 2 (test3) left the
> membership
> 
> ... but the stonith operation times out (it said OK earlier) ...
> Dec 15 14:22:56 test4 stonith-ng[23999]:   notice: remote_op_timeout:
> Action reboot (a399a8cb-541a-455e-8d7c-9072d48667d1) for test3
> (crmd.24003) timed out
> Dec 15 14:23:05 test4 external/ipmi(STONITH-test3)[177667]: [177678]:
> debug: ipmitool output: Chassis Power is on
> 
> Dec 15 14:23:56 test4 crmd[24003]:    error:
> stonith_async_timeout_handler: Async call 2 timed out after 132000ms
> Dec 15 14:23:56 test4 crmd[24003]:   notice: tengine_stonith_callback:
> Stonith operation 2/51:100:0:f43dc87c-faf0-4034-8b51-be0c13c95656: Timer
> expired (-62)
> Dec 15 14:23:56 test4 crmd[24003]:   notice: tengine_stonith_callback:
> Stonith operation 2 for test3 failed (Timer expired): aborting transition.
> Dec 15 14:23:56 test4 crmd[24003]:   notice: abort_transition_graph:
> Transition aborted: Stonith failed (source=tengine_stonith_callback:697, 0)
> 
> This looks like a bug but a quick search did not turn up anything. Does
> anyone recognize this problem?

Fence timeouts can be tricky to troubleshoot because there are multiple
timeouts involved. The process goes like this:

1. crmd asks the local stonithd to do the fence.

2. The local stonithd queries all stonithd's to ensure it has the latest
status of all fence devices.

3. The local stonithd chooses a fence device (or possibly devices, if
topology is involved) and picks the best stonithd (or stonithd's) to
actually execute the fencing.

4. The chosen stonithd (or stonithd's) runs the fence agent to do the
actual fencing, then replies to the original stonithd, which replies to
the original requester.
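(As an aside, that same path can be exercised by hand with stonith_admin,
which can help show which hop is the slow one; note that it really does
fence the target. Exact options may vary a bit by version, so treat this
as a sketch.)

    # request a reboot of test3 through stonithd, with an explicit
    # timeout in seconds
    stonith_admin --reboot test3 --timeout 120

    # list the devices stonithd considers capable of fencing test3
    stonith_admin --list test3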

So the crmd can time out waiting for a reply from stonithd, the local
stonithd can time out waiting for query replies from all stonithd's, the
local stonithd can time out waiting for a reply from one or more
executing stonithd's, or an executing stonithd can time out waiting for
a reply from the fence device.
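To make that concrete, here is roughly where those knobs live, in crm
shell syntax. This is only a sketch with placeholder IPMI parameters,
not anything taken from your configuration: stonith-timeout is the
cluster-wide cap, and the pcmk_*_timeout device attributes are consumed
by stonithd itself rather than passed to the agent.

    primitive STONITH-test3 stonith:external/ipmi \
        params hostname=test3 ipaddr=10.0.0.13 userid=admin passwd=secret \
               interface=lan \
               pcmk_reboot_timeout=120s \
        op monitor interval=600 timeout=60

    property stonith-enabled=true \
        stonith-timeout=150s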

Another factor is that some reboots can be remapped to off then on. This
will happen, for example, if the fence device doesn't have a reboot
command, or if it's in a fence topology level with other devices. So in
that case, there's the possibility of a separate timeout for the off
command and for the on command.
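If remapping is in play, the off and on phases each have their own
device-level timeouts that can be set explicitly. A hypothetical example
(values made up, and crmsh syntax may differ slightly between versions):

    # give the off and on phases their own timeouts on the existing device
    crm resource param STONITH-test3 set pcmk_off_timeout 90s
    crm resource param STONITH-test3 set pcmk_on_timeout 90s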

In this case, one thing that's odd is that the "Async call 2 timed out"
message is the timeout for the crmd waiting for a reply from stonithd.
The crmd timeout is always a minute longer than stonithd's timeout,
which should be more than enough time for stonithd to reply. I'm not
sure what's going on there.
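If I'm doing the arithmetic right (and assuming that extra minute is
exactly 60s in this build):

    132000 ms (crmd timer) - 60000 ms (crmd's extra minute) = 72000 ms

which would mean stonithd was working to roughly a 72s timeout for this
call. That also fits the remote_op_timeout firing about 73 seconds after
the request at 14:21:43.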

I'd look closely at the entire fence configuration (is topology
involved? what are the configured timeouts? are the configuration
options correct?), and trace through the logs to see what step or steps
are actually timing out.
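A couple of commands that might help pull that together (assuming crmsh
is in use; adjust the names to match your configuration):

    # dump the fence device definition and the cluster-wide properties
    # (stonith-timeout, if set, lives with the other cluster properties)
    crm configure show STONITH-test3
    crm configure show cib-bootstrap-options

    # pull the stonith-related lines for the window in question
    grep -E 'stonith-ng|external/ipmi' /var/log/messages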

I do see here that the reboot times out before the "Chassis Power is on"
message, so it's possible the reboot timeout is too short to account for
a full cycle. But I'm not sure why it would report OK before that,
unless maybe that was for one step of the larger process.
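If it does turn out that the device-level timeout is the one expiring,
lengthening it on the device (and keeping the cluster-wide
stonith-timeout at least as large) would be the usual fix. A hedged
sketch with made-up values; the right numbers depend on how long a full
power cycle of that hardware actually takes:

    crm resource param STONITH-test3 set pcmk_reboot_timeout 180s
    crm configure property stonith-timeout=240s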


