[Pacemaker] [Openais] very slow pacemaker/corosync shutdown

Fri Sep 20 00:46:55 UTC 2013

On 09/19/2013 04:50 PM, Andrew Beekhof wrote:
>  From this we can infer that corosync has gotten horribly confused and, as a consequence, pacemaker can't talk to its peers anymore.
>
>> this is a test cluster and not being monitored by a netmon. Any other details I could provide that would be useful/helpful?
> Shortly before this, Corosync claims:
>
> Sep 19 00:47:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Sep 19 00:56:09 [9004] nomad.schoolpathways.com       crmd:     info: pcmk_cpg_membership: 	Left[2.0] crmd.1
> Sep 19 00:56:09 [9004] nomad.schoolpathways.com       crmd:     info: crm_update_peer_proc: 	pcmk_cpg_membership: Node bender.schoolpathways.com[1] - corosync-cpg is now offline
> Sep 19 00:56:09 [9004] nomad.schoolpathways.com       crmd:     info: peer_update_callback: 	Client bender.schoolpathways.com/peer now has status [offline] (DC=true)
>
> Is this true?
> If not, perhaps some timeouts need to be adjusted.  A switch to udpu (instead of multicast) may also be helpful.

Although the times you specifically mention were probably due to
intentionally created failures, similar messages appeared later, at
times clearly outside the window in which I was testing. I've updated
corosync.conf to use udpu, based on an example config, and will continue
testing.
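
For reference, a udpu setup for a two-node cluster might look roughly like
the sketch below. It assumes the corosync 1.x layout, where member entries
sit inside the interface block (corosync 2.x uses a separate nodelist
section instead). The addresses, network, and port are placeholders for
illustration, not values from this cluster.

```
totem {
    version: 2
    secauth: off

    # Unicast UDP instead of multicast
    transport: udpu

    interface {
        ringnumber: 0

        # Placeholder network address of the cluster interface
        bindnetaddr: 192.168.1.0
        mcastport: 5405

        # One member block per node; placeholder addresses
        member {
            memberaddr: 192.168.1.10
        }
        member {
            memberaddr: 192.168.1.11
        }
    }
}
```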

What timeout values might be useful to consider? These two machines are
next to each other on the same gigabit switch, and no packet loss has
ever been detected. Truth is that I'm unsure what would be waiting.
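
For context, the timeouts that usually come up in this kind of discussion
are the totem timing options in corosync.conf. The fragment below only
shows where they go; the numbers are illustrative assumptions, not
recommendations for this cluster, and would be merged into the existing
totem section rather than added as a second one.

```
totem {
    # Time in ms without receiving the token before a loss is declared
    token: 5000

    # Token retransmit attempts before a new configuration is formed
    token_retransmits_before_loss_const: 10

    # Time in ms to wait for consensus before starting a new membership
    # round; conventionally at least 1.2 times the token value
    consensus: 6000

    # Time in ms to wait for join messages in the membership protocol
    join: 60
}
```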