[Pacemaker] fail over just failed

Mon Apr 22 20:32:32 EDT 2013

On 19/04/2013, at 11:02 AM, Daniel Black <daniel.black at openquery.com> wrote:

> Hi,
> 
> I had an incident this morning (16:22) where each node lost its corosync connection to the cluster and went offline at minute intervals. Once quorum was lost, each server disabled its managed IPs. Servers were rebooted around 17:52:23.
> 
> It's a fairly simple resource setup of 5 IPv4 addresses and 5 IPv6 addresses. (Ignore web1_v5 - it was a mistake, was stopped, and is now deleted.)
> 
> The cluster IPs are different from the managed ones. I checked the time on all servers and they are all within the same second, so NTP was working.
> 
> crm --version        - 1.2.5 (Build da93d3523e6a5b76753cc752eb2701a8a1fcacca)
> corosync -v          - Corosync Cluster Engine, version '2.3.0'
> pacemakerd --version - Pacemaker 1.1.9
> 
> Logs from all servers and config attached.
> 
> On all nodes:
> 
> sudo crm_verify -LV
>    crit: get_timet_now:        Defaulting to 'now'
>    crit: get_timet_now:        Defaulting to 'now'
> ...
> 
> What does this message mean?

It means some idiot (me) forgot to downgrade a debug message before committing. 
You can safely ignore it.
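
If the repeated line gets in the way when reading the output, it can be filtered out. A minimal sketch, assuming the message text stays exactly as quoted above; the function name filter_timet_noise is purely illustrative and not part of Pacemaker:

```shell
# Hypothetical helper: drop the stray get_timet_now debug line from
# whatever is piped in, and pass every other line through unchanged.
filter_timet_noise() {
    grep -v 'get_timet_now:.*Defaulting to'
}

# Usage (merge stderr into stdout first, in case the message is logged there):
#   sudo crm_verify -LV 2>&1 | filter_timet_noise
```

This only hides the message from view; it does not change what crm_verify checks or reports.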

Were there any other issues resulting from the connectivity loss?

> 
> Any clue as to the cause? Have I got something dumb in my config?
> 
> It's on a VPS provider, hence UDPU and not multicast. Would using their API for stonith help me?
> 
> 
> --
> Daniel Black, Engineer @ Open Query (http://openquery.com)
> Remote expertise & maintenance for MySQL/MariaDB server environments.
> 
> <corosync_offline.tar.gz>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org