[Pacemaker] heartbeat stop hangs sometimes

Mon Feb 22 16:52:18 EST 2010

On Mon, Feb 22, 2010 at 08:46:23PM +0100, Andrew Beekhof wrote:
> On Mon, Feb 22, 2010 at 5:10 PM, Lars Ellenberg
> <lars.ellenberg at linbit.com> wrote:
> > On Mon, Feb 22, 2010 at 01:00:29PM +0100, Markus M. wrote:
> >> Hello,
> >>
> >> sometimes "heartbeat stop" seems to hang (latest packages from
> >> clusterlabs.org, RHEL5 x86_64, 2-node cluster with only one node
> >> running).
> >>
> >> The last lines from ha-debug are like this:
> >>
> >> Feb 22 12:52:48 dbprod21 ccm: [24053]: info: client (pid=24058) removed from ccm
> >> Feb 22 12:52:48 dbprod21 crmd: [24058]: info: do_ha_control: Disconnected from Heartbeat
> >> Feb 22 12:52:48 dbprod21 crmd: [24058]: info: do_cib_control: Disconnecting CIB
> >> Feb 22 12:52:48 dbprod21 cib: [24054]: info: cib_process_readwrite: We are now in R/O mode
> >> Feb 22 12:52:48 dbprod21 crmd: [24058]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
> >> Feb 22 12:52:48 dbprod21 cib: [24054]: WARN: send_ipc_message: IPC Channel to 24058 is not connected
> >> Feb 22 12:52:48 dbprod21 crmd: [24058]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
> >> Feb 22 12:52:48 dbprod21 cib: [24054]: WARN: send_via_callback_channel: Delivery of reply to client 24058/d9c9c281-4f38-46d8-b83e-54135f6c75e9 failed
> >> Feb 22 12:52:48 dbprod21 crmd: [24058]: info: free_mem: Dropping I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]
> >> Feb 22 12:52:48 dbprod21 cib: [24054]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
> >> Feb 22 12:52:48 dbprod21 crmd: [24058]: info: do_exit: [crmd] stopped (0)
> >> Feb 22 12:52:48 dbprod21 heartbeat: [24040]: info: killing /usr/lib64/heartbeat/attrd process group 24057 with signal 15
> >
> > Yep.
> > I've seen this too,
> > a few times.
> > Apparently attrd sometimes "ignores" a term signal.
> > Once an additional signal comes in, or any message is processed, i.e.
> > once the mainloop() actually _processes_ the signal,
> > it is handled and attrd dies.
> >
> > Unfortunately I don't see exactly where this signal is "lost", though.
> > The signal is delivered, the flag is raised, mainloop should recognize
> > and handle it...
> > And I don't yet have a reliable way to reproduce it, either.
> > If you have one, let us know!
> >
> > Maybe the following helps (sorry, patch is likely not whitespace clean)
> >
> > diff -r 1a6d0f690c3e lib/common/mainloop.c
> > --- a/lib/common/mainloop.c     Thu Feb 18 22:36:49 2010 +0100
> > +++ b/lib/common/mainloop.c     Mon Feb 22 17:09:31 2010 +0100
> > @@ -191,7 +191,12 @@
> >     CRM_ASSERT(sizeof(crm_signal_t) > sizeof(GSource));
> >     source = g_source_new(&crm_signal_funcs, sizeof(crm_signal_t));
> >
> > -    crm_signals[sig] = (crm_signal_t*)mainloop_setup_trigger(source, G_PRIORITY_HIGH, NULL, NULL);
> > +    crm_signals[sig] = (crm_signal_t*)mainloop_setup_trigger(source,
> > +                   /* TERM is higher priority than other signals,
> > +                    * signals are higher priority than other ipc.
> > +                    * yes, minus: smaller is "higher". */
> > +                   G_PRIORITY_HIGH - (sig == SIGTERM ? 2 : 1),
> > +                   NULL, NULL);
> >     CRM_ASSERT(crm_signals[sig] != NULL);
> >
> >     crm_signals[sig]->handler = dispatch;
> 
> I've applied a similar patch to stable.
> Also in stable is a patch that waits up to 2.5 minutes for post-crmd
> clients to terminate.
> So either way we should have this resolved.

Thanks.
I likely have to port the 2½ minute patch over to heartbeat proper...
Now that will be fun ;-)
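
For anyone wondering about the "yes, minus" comment in the patch: GLib
main loop priorities are plain integers, and a numerically smaller value
is dispatched earlier. Below is a minimal standalone sketch of that
ordering, not the actual mainloop.c code. The sketch_* names and the
struct are hypothetical and exist only for illustration. The value -100
is GLib's documented G_PRIORITY_HIGH, redefined here so the sketch builds
without GLib. A real GLib loop dispatches every ready source at the best
priority in one iteration; this sketch picks just one source per call.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* GLib defines G_PRIORITY_HIGH as -100; copied here so the sketch
 * compiles without GLib headers. */
#define SKETCH_PRIORITY_HIGH (-100)

/* Hypothetical stand-in for a main loop source (not a GSource). */
struct sketch_source {
    const char *name;
    int priority;   /* smaller value means dispatched earlier */
    int ready;      /* non-zero if the source has work pending */
};

/* Same expression as in the patch: SIGTERM gets -102, any other
 * signal gets -101, both ahead of sources at G_PRIORITY_HIGH. */
int sketch_signal_priority(int sig)
{
    return SKETCH_PRIORITY_HIGH - (sig == SIGTERM ? 2 : 1);
}

/* Return the ready source with the numerically smallest priority,
 * or NULL if nothing is ready. */
struct sketch_source *sketch_next_ready(struct sketch_source *src, size_t n)
{
    struct sketch_source *best = NULL;
    size_t i;

    assert(src != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (!src[i].ready)
            continue;
        if (best == NULL || src[i].priority < best->priority)
            best = &src[i];
    }
    return best;
}
```

With an IPC source at SKETCH_PRIORITY_HIGH, a SIGHUP source, and a
SIGTERM source all ready at once, sketch_next_ready() returns the SIGTERM
source first, then SIGHUP, then IPC. That is the ordering the patch is
after. Whether it also cures the "lost" signal is a separate question.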

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.