[Pacemaker] [Openais] very slow pacemaker/corosync shutdown
Andrew Beekhof
andrew at beekhof.net
Thu Sep 19 23:50:34 UTC 2013
On 20/09/2013, at 8:19 AM, Lists <lists at benjamindsmith.com> wrote:
> On 09/18/2013 06:49 PM, Andrew Beekhof wrote:
>> On 19/09/2013, at 8:25 AM, David Lang <david at lang.hm> wrote:
>>
>>> What's the best way to see what it's getting stuck doing?
>> Log files.
>>
>>> Is there a good way to tell if this is a pacemaker or corosync problem (so I can drop one of the lists from the thread)?
>> Not without further information
>>
>
> We've had the same problem here, trying to get an HA dns/named service working. It works great for a day or so, then seizes up: simple commands like `crm_standby -v true` time out after 120 seconds, etc. We're testing for release and keep running into issues like this. At first we suspected firewall issues, but even after confirmed operation and several hand-offs of HA services back and forth, it still dies within a day or so.
>
> We're on CentOS 6/64 with yum packages augmented from http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/RedHat_RHEL-6/
> with exclude=pacemaker* corosync*
>
> To make the log files available, I've excerpted a time period during which it becomes unresponsive; the excerpt is visible at http://hal.schoolpathways.com/details/
>
> I don't know the exact moment,
I do.
It is right when you start seeing messages like:
Sep 19 00:56:09 [9004] nomad.schoolpathways.com crmd: info: send_ais_text: Peer overloaded or membership in flux: Re-sending message (Attempt 1 of 20)
Eventually that escalates to:
Sep 19 00:59:39 [9004] nomad.schoolpathways.com crmd: error: send_ais_text: Sending message 94 via cpg: FAILED (rc=6): Try again: Success (0)
From this we can infer that corosync has gotten horribly confused and, as a consequence, pacemaker can't talk to its peers anymore.
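
For anyone who wants to locate the same moment in their own logs, here is a rough sketch in Python. It only searches for the two crmd messages quoted above; the function name and the regular expressions are my own illustration, not part of pacemaker, and the wording of these messages may differ between versions.

```python
import re

# Patterns modelled on the two crmd log lines quoted above (assumed wording).
RESEND = re.compile(
    r"send_ais_text:\s+Peer overloaded or membership in flux:\s+"
    r"Re-sending message \(Attempt (\d+) of (\d+)\)"
)
FAILED = re.compile(
    r"send_ais_text:\s+Sending message (\d+) via cpg:\s+FAILED \(rc=(\d+)\)"
)


def find_cpg_trouble(lines):
    """Return (first re-send line, first FAILED line); either may be None."""
    first_resend = None
    first_failed = None
    for line in lines:
        if first_resend is None and RESEND.search(line):
            first_resend = line.rstrip("\n")
        if first_failed is None and FAILED.search(line):
            first_failed = line.rstrip("\n")
        if first_resend is not None and first_failed is not None:
            break  # both moments found, no need to read further
    return first_resend, first_failed
```

Passing an open log file (or any iterable of lines) gives back the first re-send message and the first outright failure; the timestamp on the former is roughly where the cluster stopped being able to talk to its peers.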
> this is a test cluster and not being monitored by a netmon. Any other details I could provide that would be useful/helpful?
Shortly before this, Corosync claims:
Sep 19 00:47:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 19 00:56:09 [9004] nomad.schoolpathways.com crmd: info: pcmk_cpg_membership: Left[2.0] crmd.1
Sep 19 00:56:09 [9004] nomad.schoolpathways.com crmd: info: crm_update_peer_proc: pcmk_cpg_membership: Node bender.schoolpathways.com[1] - corosync-cpg is now offline
Sep 19 00:56:09 [9004] nomad.schoolpathways.com crmd: info: peer_update_callback: Client bender.schoolpathways.com/peer now has status [offline] (DC=true)
Is this true?
If not, perhaps some timeouts need to be adjusted. A switch to udpu (instead of multicast) may also be helpful.
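
As a rough illustration of both suggestions, a totem section along these lines could be tried in corosync.conf. This is only a sketch and assumes a corosync 1.4.x series (I am assuming that is what CentOS 6 ships); the addresses are documentation placeholders and the token value is an arbitrary example, not a recommendation.

```
totem {
    version: 2
    # Unicast UDP instead of multicast
    transport: udpu
    # Token timeout in milliseconds (example value only)
    token: 5000
    interface {
        ringnumber: 0
        # Placeholder network and node addresses; substitute your own
        bindnetaddr: 192.0.2.0
        mcastport: 5405
        member {
            memberaddr: 192.0.2.11
        }
        member {
            memberaddr: 192.0.2.12
        }
    }
}
```

With udpu every node has to be listed explicitly as a member, and the file needs to be the same on all nodes.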