[ClusterLabs] STONITH error: stonith_async_timeout_handler despite successful fence
Shaheedur Haque (shahhaqu)
shahhaqu at cisco.com
Mon May 18 17:08:54 CEST 2015
Thanks for clarifying.
Is it fair to assume that not much has changed on the wire between the components from 1.1.10 to 1.1.12/13, so that I could just replace the crmd that comes with 14.04 with one I built locally? If so, I could test that myself and then report back here or to Canonical as needed. (Changing more than one binary is clearly also an option, so if more or all of them need replacing, that would be good to know too. Apologies for sounding like a noob; this is all very new to me.)
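To check whether a locally built crmd still shows the problem, something like the following could scan syslog for the symptom in the log quoted further down: stonith-ng reporting "returned: 0 (OK)" for a call id, followed by crmd's stonith_async_timeout_handler firing for the same call id. This is only a rough sketch. The regular expressions are derived solely from the log lines in this thread (other Pacemaker versions may word them differently), and the function name find_spurious_timeouts is made up for illustration.

```python
import re
from datetime import datetime

# Patterns inferred from the log excerpt in this thread (assumption: the
# message wording is the same on the build being tested).
FENCE_OK = re.compile(
    r"(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+stonith-ng\[\d+\]:.*"
    r"\(call (?P<call>\d+) from crmd\.\d+\).*returned: 0 \(OK\)"
)
TIMEOUT = re.compile(
    r"(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+crmd\[\d+\]:.*"
    r"stonith_async_timeout_handler: Async call (?P<call>\d+) "
    r"timed out after (?P<ms>\d+)ms"
)


def _parse_ts(ts):
    # Syslog timestamps carry no year, so a fixed one is prepended; that is
    # good enough for short deltas. Assumes English month abbreviations.
    return datetime.strptime("2000 " + " ".join(ts.split()), "%Y %b %d %H:%M:%S")


def find_spurious_timeouts(lines):
    """Return (call_id, seconds_from_ok_to_timeout, timeout_ms) tuples for
    calls that stonith-ng reported as OK but crmd later timed out."""
    ok_at = {}   # call id -> time stonith-ng logged the OK result
    hits = []
    for line in lines:
        line = line.strip()
        m = FENCE_OK.match(line)
        if m:
            ok_at[m.group("call")] = _parse_ts(m.group("ts"))
            continue
        m = TIMEOUT.match(line)
        if m and m.group("call") in ok_at:
            delta = _parse_ts(m.group("ts")) - ok_at[m.group("call")]
            hits.append(
                (int(m.group("call")), int(delta.total_seconds()), int(m.group("ms")))
            )
    return hits
```

It could be fed an open log file, for example find_spurious_timeouts(open("/var/log/syslog")), assuming that is where the daemons log on the node. Call ids are matched naively, so ids reused after a crmd restart would confuse it.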
-----Original Message-----
From: Dejan Muhamedagic [mailto:dejanmm at fastmail.fm]
Sent: 18 May 2015 15:25
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] STONITH error: stonith_async_timeout_handler despite successful fence
On Mon, May 18, 2015 at 08:58:30AM +0000, Shaheedur Haque (shahhaqu) wrote:
> I'm a little uncertain what the suggestion is; if this sounds like a
> bug that has been fixed, then presumably it would be best if I
> could point Canonical to the upstream fix (all I could come up with
> was dbbb6a6, in possibly the same area, but to my untrained eye, it is
> hard to guess if this could be the same thing). If it is thought to be
> a new bug, then presumably I am better off working with upstream?
I'm also not sure if it's a new bug. It's just that there were a number of changes since 1.1.10.
> Either way, if a new bug is needed, it seems I should start with a bug here...
Well, I'd suggest the other way around, as the Ubuntu maintainers should know better how to handle it, where to check if it's a new bug or not, etc. Though it is of course fine to inquire here.
Thanks,
Dejan
> -----Original Message-----
> From: Dejan Muhamedagic [mailto:dejanmm at fastmail.fm]
> Sent: 15 May 2015 19:00
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] STONITH error:
> stonith_async_timeout_handler despite successful fence
>
> Hi,
>
> On Tue, May 12, 2015 at 08:28:51AM +0000, Shaheedur Haque (shahhaqu) wrote:
> > I ended up writing my own STONITH device so I could clearly log/see what was going on, and I can confirm that I see no unexpected calls to the device but the behaviour remains the same:
> >
> > - The device responds "OK" to the reboot.
> > - 132s later, crmd complains about the timeout.
>
> It seems you can open a bug report for this. It could be, however, that the bug has already been fixed in the meantime, so it is best to file the bug with Ubuntu.
>
> Thanks,
>
> Dejan
>
> > I am convinced at this point that somehow, crmd is losing track of the timer it started to protect the call to stonith-ng. Is there any logging etc. I could gather to help diagnose the problem? (I tried the blackbox stuff, but Ubuntu seems not to build/ship the viewer utility :-().
> >
> > Thanks, Shaheed
> >
> > -----Original Message-----
> > From: Shaheedur Haque (shahhaqu)
> > Sent: 09 May 2015 07:23
> > To: users at clusterlabs.org
> > Subject: RE: STONITH error: stonith_async_timeout_handler despite
> > successful fence
> >
> > Hi,
> >
> > I am working in a virtualised environment where, for now at least, I am simply deleting a clustered VM and then expecting the rest of the cluster to recover using the "null" STONITH device. As far as I can see from the log, the (simulated) reboot returned OK, but the timeout fired anyway:
> >
> > ============
> > May 8 18:28:03 octl-03 stonith-ng[15633]: notice: can_fence_host_with_device: stonith-octl-01 can fence octl-01: dynamic-list
> > May 8 18:28:03 octl-03 stonith-ng[15633]: notice: can_fence_host_with_device: stonith-octl-02 can not fence octl-01: dynamic-list
> > May 8 18:28:03 octl-03 stonith-ng[15633]: notice: log_operation: Operation 'reboot' [16994] (call 51 from crmd.15635) for host 'octl-01' with device 'stonith-octl-01' returned: 0 (OK)
> > May 8 18:28:03 octl-03 stonith: [16995]: info: Host null-reset: octl-01
> > May 8 18:30:15 octl-03 crmd[15635]: error: stonith_async_timeout_handler: Async call 51 timed out after 132000ms
> > May 8 18:30:15 octl-03 crmd[15635]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> > May 8 18:30:15 octl-03 crmd[15635]: notice: run_graph: Transition 158 (Complete=3, Pending=0, Fired=0, Skipped=25, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
> > May 8 18:30:15 octl-03 crmd[15635]: notice: tengine_stonith_callback: Stonith operation 51 for octl-01 failed (Timer expired): aborting transition.
> > May 8 18:30:15 octl-03 crmd[15635]: notice: tengine_stonith_callback: Stonith operation 51/45:158:0:6f5821b3-2644-40c1-8bbc-cfcdf049656b: Timer expired (-62)
> > May 8 18:30:15 octl-03 crmd[15635]: notice: too_many_st_failures: Too many failures to fence octl-01 (50), giving up
> > ============
> >
> > Any thoughts on whether I might be doing something wrong or if this is a new issue? I've seen some other fixes in this area in the relatively recent past such as https://github.com/beekhof/pacemaker/commit/dbbb6a6, but it is not clear to me if this is the same thing or a different issue.
> >
> > FWIW, I am on Ubuntu Trusty (the change log is here: https://launchpad.net/ubuntu/+source/pacemaker/1.1.10+git20130802-1ubuntu2.3), but I cannot seem to tell just what fixes from 1.1.11 or 1.1.12 have been backported.
> >
> > Thanks, Shaheed
> >
> >
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org Getting started:
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org