[ClusterLabs] STONITH error: stonith_async_timeout_handler despite successful fence

Tue May 12 10:28:51 CEST 2015

I ended up writing my own STONITH device so that I could log exactly what was going on. I can confirm that there are no unexpected calls to the device, but the behaviour remains the same:

- The device responds "OK" to the reboot.
- 132s later, crmd complains about the timeout.

At this point I am convinced that crmd is somehow losing track of the timer it started to guard the call to stonith-ng. Is there any logging I could gather to help diagnose the problem? I tried the blackbox feature, but Ubuntu does not seem to build or ship the viewer utility :-(

Thanks, Shaheed

-----Original Message-----
From: Shaheedur Haque (shahhaqu) 
Sent: 09 May 2015 07:23
To: users at clusterlabs.org
Subject: RE: STONITH error: stonith_async_timeout_handler despite successful fence

Hi,

I am working in a virtualised environment where, for now at least, I am simply deleting a clustered VM and then expecting the rest of the cluster to recover using the "null" STONITH device. As far as I can see from the log, the (simulated) reboot returned OK, but the timeout fired anyway:

============
May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: can_fence_host_with_device: stonith-octl-01 can fence octl-01: dynamic-list
May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: can_fence_host_with_device: stonith-octl-02 can not fence octl-01: dynamic-list
May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: log_operation: Operation 'reboot' [16994] (call 51 from crmd.15635) for host 'octl-01' with device 'stonith-octl-01' returned: 0 (OK)
May  8 18:28:03 octl-03 stonith: [16995]: info: Host null-reset: octl-01
May  8 18:30:15 octl-03 crmd[15635]:    error: stonith_async_timeout_handler: Async call 51 timed out after 132000ms
May  8 18:30:15 octl-03 crmd[15635]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
May  8 18:30:15 octl-03 crmd[15635]:   notice: run_graph: Transition 158 (Complete=3, Pending=0, Fired=0, Skipped=25, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
May  8 18:30:15 octl-03 crmd[15635]:   notice: tengine_stonith_callback: Stonith operation 51 for octl-01 failed (Timer expired): aborting transition.
May  8 18:30:15 octl-03 crmd[15635]:   notice: tengine_stonith_callback: Stonith operation 51/45:158:0:6f5821b3-2644-40c1-8bbc-cfcdf049656b: Timer expired (-62)
May  8 18:30:15 octl-03 crmd[15635]:   notice: too_many_st_failures: Too many failures to fence octl-01 (50), giving up
============
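
The timing in the excerpt can be checked mechanically by pairing each stonith-ng result line with the crmd timeout line that carries the same call id. The following is only a minimal sketch: the regular expressions are assumptions derived from the lines quoted above rather than a documented Pacemaker log format, and the names `correlate`, `parse_ts`, and `SAMPLE` are hypothetical.

```python
import re
from datetime import datetime

# Assumption: these patterns mirror the two log lines quoted above;
# other Pacemaker versions may word these messages differently.
RESULT_RE = re.compile(
    r"log_operation: Operation '(\w+)' \[\d+\] \(call (\d+) from [^)]*\)"
    r".*returned: (-?\d+)"
)
TIMEOUT_RE = re.compile(
    r"stonith_async_timeout_handler: Async call (\d+) timed out after (\d+)ms"
)

# The two relevant lines from the excerpt, copied verbatim.
SAMPLE = [
    "May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: log_operation: "
    "Operation 'reboot' [16994] (call 51 from crmd.15635) for host 'octl-01' "
    "with device 'stonith-octl-01' returned: 0 (OK)",
    "May  8 18:30:15 octl-03 crmd[15635]:    error: "
    "stonith_async_timeout_handler: Async call 51 timed out after 132000ms",
]


def parse_ts(line, year=2015):
    """Parse the leading syslog timestamp; syslog omits the year, so it is supplied."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def correlate(lines):
    """Return (call_id, return_code, gap_seconds, timeout_ms) for every call
    that logged a result and was later reported as timed out."""
    results = {}  # call id -> (time the result was logged, return code)
    report = []
    for line in lines:
        m = RESULT_RE.search(line)
        if m:
            results[m.group(2)] = (parse_ts(line), int(m.group(3)))
            continue
        m = TIMEOUT_RE.search(line)
        if m and m.group(1) in results:
            done_at, rc = results[m.group(1)]
            gap = (parse_ts(line) - done_at).total_seconds()
            report.append((int(m.group(1)), rc, gap, int(m.group(2))))
    return report


if __name__ == "__main__":
    for call_id, rc, gap, timeout_ms in correlate(SAMPLE):
        print(f"call {call_id}: rc={rc}, timeout logged {gap:.0f}s after the "
              f"result (configured timeout {timeout_ms}ms)")
```

If the gap between the result line and the timeout line equals the full timeout, as it appears to here, that suggests the timer ran to completion from about the time the operation was issued and was not cancelled when the result arrived.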

Any thoughts on whether I am doing something wrong, or whether this is a new issue? I have seen other fixes in this area in the relatively recent past, such as https://github.com/beekhof/pacemaker/commit/dbbb6a6, but it is not clear to me whether this is the same problem or a different one.

FWIW, I am on Ubuntu Trusty (the change log is here: https://launchpad.net/ubuntu/+source/pacemaker/1.1.10+git20130802-1ubuntu2.3), but I cannot tell which fixes from 1.1.11 or 1.1.12 have been backported.

Thanks, Shaheed