[Pacemaker] [Openais] very slow pacemaker/corosync shutdown
Lists
lists at benjamindsmith.com
Fri Sep 20 00:46:55 UTC 2013
On 09/19/2013 04:50 PM, Andrew Beekhof wrote:
> From this we can infer that corosync has gotten horribly confused and, as a consequence, pacemaker can't talk to its peers anymore.
>
>> this is a test cluster and not being monitored by a netmon. Any other details I could provide that would be useful/helpful?
> Shortly before this, Corosync claims:
>
> Sep 19 00:47:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Sep 19 00:56:09 [9004] nomad.schoolpathways.com crmd: info: pcmk_cpg_membership: Left[2.0] crmd.1
> Sep 19 00:56:09 [9004] nomad.schoolpathways.com crmd: info: crm_update_peer_proc: pcmk_cpg_membership: Node bender.schoolpathways.com[1] - corosync-cpg is now offline
> Sep 19 00:56:09 [9004] nomad.schoolpathways.com crmd: info: peer_update_callback: Client bender.schoolpathways.com/peer now has status [offline] (DC=true)
>
> Is this true?
> If not, perhaps some timeouts need to be adjusted. A switch to udpu (instead of multicast) may also be helpful.
The times you specifically mention were probably due to failures I
created intentionally, but similar messages appeared later, clearly
outside the window in which I was testing. I've updated corosync.conf
to use udpu, based on an example config, and will continue testing.

What timeout values might be useful to consider? These two machines sit
next to each other on the same gigabit switch, and no packet loss has
ever been detected. The truth is that I'm unsure what it would be
waiting for.
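
For reference, here is a minimal sketch of a udpu totem section showing
the timeout settings in question. It assumes corosync 1.4-style syntax
(newer releases list members in a separate nodelist section instead).
The addresses, port, and timeout values are illustrative placeholders
only, not the settings from this cluster and not recommendations.

```
totem {
    version: 2
    secauth: off

    # Unicast UDP instead of multicast.
    transport: udpu

    # Milliseconds without a token before a loss is declared
    # (illustrative value).
    token: 5000

    # Token retransmit attempts before a new configuration is formed
    # (illustrative value).
    token_retransmits_before_loss_const: 10

    # Milliseconds to wait for consensus before starting a new
    # membership round; must be at least 1.2 * token.
    consensus: 6000

    interface {
        ringnumber: 0
        # Placeholder network and member addresses.
        bindnetaddr: 192.168.1.0
        mcastport: 5405
        member {
            memberaddr: 192.168.1.11
        }
        member {
            memberaddr: 192.168.1.12
        }
    }
}
```

The same file would need to be present on both nodes, and corosync
restarted, before the new transport or timeouts take effect.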