[Pacemaker] fail over just failed

Fri Apr 19 01:02:55 UTC 2013

Hi,

I had an incident this morning (16:22) where each node lost its corosync connection to the cluster and went offline at one-minute intervals. Once quorum was lost, each server disabled its managed IPs. The servers were rebooted around 17:52:23.

It's a fairly simple set of resources: 5 IPv4 addresses and 5 IPv6 addresses. (Ignore web1_v5; it was stopped, was a mistake, and is now deleted.)

The IPs used for cluster communication are different from the managed ones. I checked the time on all servers and they are all within the same second, so ntp was working.

crm --version        - 1.2.5 (Build da93d3523e6a5b76753cc752eb2701a8a1fcacca)
corosync -v          - Corosync Cluster Engine, version '2.3.0'
pacemakerd --version - Pacemaker 1.1.9

Logs from all servers and config attached.

On all nodes:

sudo crm_verify -LV
    crit: get_timet_now:        Defaulting to 'now'
    crit: get_timet_now:        Defaulting to 'now'
...

What does this message mean?

Any clue as to the cause? Have I got something dumb in my config?

It's on a VPS provider, hence UDPU and not multicast. Would using their API for stonith help me?
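
For context, a UDPU setup in corosync 2.x generally looks something like the sketch below. This is only an illustration: the addresses and node IDs are placeholders from the documentation range, not the values in the attached config, and a real cluster would list every node.

```
# Illustrative corosync.conf fragment (placeholder values, not the attached config)
totem {
    version: 2
    # unicast UDP instead of multicast
    transport: udpu
}

nodelist {
    # with udpu, each member is listed explicitly
    node {
        ring0_addr: 192.0.2.11
        nodeid: 1
    }
    node {
        ring0_addr: 192.0.2.12
        nodeid: 2
    }
    node {
        ring0_addr: 192.0.2.13
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
}
```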

--
Daniel Black, Engineer @ Open Query (http://openquery.com)
Remote expertise & maintenance for MySQL/MariaDB server environments.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync_offline.tar.gz
Type: application/x-compressed-tar
Size: 19135 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20130419/fc323b96/attachment-0003.bin>