[Pacemaker] [Openais] very slow pacemaker/corosync shutdown

Thu Sep 19 22:19:57 UTC 2013

On 09/18/2013 06:49 PM, Andrew Beekhof wrote:
> On 19/09/2013, at 8:25 AM, David Lang <david at lang.hm> wrote:
>
>> What's the best way to see what it's getting stuck doing?
> Log files.
>
>> Is there a good way to tell if this is a pacemaker or corosync problem (so I can drop one of the lists from the thread)?
> Not without further information
>

We've had the same problem here while trying to get an HA DNS (named)
service working. It works great for a day or so, then seizes up: simple
commands like `crm_standby -v true` time out after 120 seconds, etc.
We're testing for release and keep running into issues like this. At
first we suspected firewall issues, but even after we confirmed the
cluster was operating and handed HA services back and forth several
times, it still dies within a day or so.

We're on CentOS 6 (64-bit), with yum packages augmented from
http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/RedHat_RHEL-6/
with exclude=pacemaker* corosync*

To make the log files available, I've snipped out the time period
during which the cluster becomes unresponsive and posted it at
http://hal.schoolpathways.com/details/
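
For anyone wanting to cut a comparable window out of their own logs, here
is a minimal sketch in Python. It assumes each entry starts with a
traditional syslog timestamp (e.g. `Sep 19 14:03:22`, which carries no
year) and that month names are in English; the function name
`extract_window` and its parameters are made up for illustration and are
not part of pacemaker or corosync.

```python
from datetime import datetime

def extract_window(lines, start, end, year=2013):
    """Yield log lines whose timestamp falls within [start, end].

    Assumes a 15-character syslog-style prefix such as "Sep 19 14:03:22".
    The year is not present in that format, so it is supplied by the caller.
    """
    keep = False
    for line in lines:
        try:
            # %b relies on English month abbreviations (C/English locale).
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            # No parsable timestamp: treat as a continuation of the
            # previous entry and keep it only if that entry was kept.
            if keep:
                yield line
            continue
        keep = start <= ts <= end
        if keep:
            yield line
```

Typical use would be to open the log file, pass the file object as
`lines`, and write the yielded lines to a new file. The bounds are
inclusive, so it is safe to pick them generously when the exact moment
of the failure is unknown.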

I don't know the exact moment it failed, since this is a test cluster
and isn't being monitored by a netmon. Are there any other details I
could provide that would be useful?