[Pacemaker] heartbeat stop hangs sometimes

Lars Ellenberg lars.ellenberg at linbit.com
Thu Mar 4 19:10:33 UTC 2010


On Thu, Mar 04, 2010 at 04:21:22PM +0100, Markus M. wrote:
> Andrew Beekhof wrote:
> 
> >>Unfortunately it doesn't fix the problem. Heartbeat still hangs:
> >
> >The pacemaker patch won't affect heartbeat-based clusters. Sorry.
> 
> Maybe I wasn't very clear in my communication, we _are_ using
> pacemaker together with heartbeat for the cluster communication.
> 
> I applied Lars' patch to lib/common/mainloop.c and rebuilt &
> installed the pacemaker and pacemaker-libs rpms. But while testing
> it hangs again (after working about 20 times).

If it is again attrd, then apparently even though the signal is
delivered, its trigger is not processed.
Maybe mainloop is used inappropriately, or is buggy in itself?
Or attrd does some blocking operation outside of mainloop, and restarts
it on EINTR or EAGAIN or whatever that syscall may be reporting,
without returning to mainloop?
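
To illustrate that second guess, here is a minimal sketch. It is Python,
not the C of attrd, and it is not taken from the Pacemaker sources. The
handler only records that a signal arrived, roughly the way a mainloop
signal source would set a trigger. SIGALRM stands in for TERM so the
example can signal itself, and time.sleep() stands in for whatever
blocking call might get restarted after being interrupted. The names
demo_delayed_trigger and on_signal are made up for illustration.

```python
import signal
import time


def demo_delayed_trigger(signal_after=0.1, block_for=0.5):
    """Return (seconds until trigger was set, seconds until loop could see it)."""
    state = {"set_at": None}
    start = time.monotonic()

    def on_signal(signum, frame):
        # Only record the event; the real work is left to the main loop.
        state["set_at"] = time.monotonic() - start

    old_handler = signal.signal(signal.SIGALRM, on_signal)
    signal.setitimer(signal.ITIMER_REAL, signal_after)  # one-shot timer
    try:
        # Blocking call outside the loop. Since Python 3.5 (PEP 475) it is
        # resumed after the handler returns instead of failing with EINTR.
        time.sleep(block_for)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)      # cancel the timer
        signal.signal(signal.SIGALRM, old_handler)   # restore old handler

    # Only here is control back in the "main loop", where the trigger
    # could finally be processed.
    seen_at = time.monotonic() - start
    return state["set_at"], seen_at


if __name__ == "__main__":
    set_at, seen_at = demo_delayed_trigger()
    print(f"trigger set after {set_at:.2f}s, "
          f"loop regained control after {seen_at:.2f}s")
```

The signal is delivered early, but nothing can act on it until the
blocking call finishes. If that call never finishes, or is restarted in
a loop, the trigger is never processed at all.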

If you can easily reproduce, you could attach strace to attrd,
and once it reproduces again, find the line about catching TERM,
and what attrd does next...
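
Once you have a trace in a file, a small helper can pull out the
interesting part. This is only a sketch: it assumes the trace was
written to a file with strace's -o option, and it relies on strace
marking signal deliveries with lines containing "--- SIGTERM". The
function name after_sigterm and the default of 20 context lines are
arbitrary choices for illustration.

```python
def after_sigterm(lines, context=20):
    """Return the first SIGTERM delivery line plus up to `context` lines after it.

    `lines` is any iterable of strace output lines. An empty list is
    returned if no SIGTERM delivery shows up in the trace.
    """
    lines = list(lines)
    for i, line in enumerate(lines):
        # Substring match, so pid or timestamp prefixes (-f, -tt) do not matter.
        if "--- SIGTERM" in line:
            return lines[i:i + 1 + context]
    return []
```

For example, after running strace with -p <pid of attrd> and
-o attrd.trace until the hang reproduces, open attrd.trace, pass the
lines to after_sigterm(), and print the result to see which calls
attrd makes after catching TERM.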

Without going too deep into that, the "pacemaker patch" that "won't
affect heartbeat-based clusters.  Sorry." is the one about shutdown
escalation, i.e. sending an additional KILL to child processes
if they take more than 2½ minutes to shut down.
It lives in lib/ais/plugin.c.

That "fail safe" part would need to be ported to heartbeat proper.
Child mainloops could then be arbitrarily broken: if they don't die
after KILL, it is usually not a userland problem ;)
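
The general shape of such an escalation, as a minimal Python sketch. It
is not the code in lib/ais/plugin.c and not a proposed heartbeat patch.
The function name stop_with_escalation is hypothetical, and the 150
second default simply mirrors the 2½ minutes mentioned above.

```python
import subprocess


def stop_with_escalation(proc: subprocess.Popen, grace: float = 150.0) -> str:
    """Ask a child to stop with TERM, escalate to KILL after `grace` seconds.

    Returns "TERM" if the child exited within the grace period,
    "KILL" if it had to be killed.
    """
    proc.terminate()                  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace)      # give the child time to shut down cleanly
        return "TERM"
    except subprocess.TimeoutExpired:
        proc.kill()                   # sends SIGKILL, which cannot be caught
        proc.wait()                   # reap the child
        return "KILL"
```

With something like this in the parent, a child whose signal handling
is broken delays the shutdown by at most the grace period instead of
hanging it.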

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



