[Pacemaker] fail over just failed
Daniel Black
daniel.black at openquery.com
Fri Apr 19 01:02:55 UTC 2013
Hi,
I had an incident this morning (16:22) where each node lost its corosync connection to the cluster and went offline at minute intervals. Once quorum was lost, each server disabled its managed IPs. The servers were rebooted around 17:52:23.
It's a fairly simple set of resources: 5 IPv4 addresses and 5 IPv6 addresses. (Ignore web1_v5. It was stopped, was a mistake, and is now deleted.)
The IPs used for cluster communication are different from the managed ones. I checked the time on all servers and they are all within the same second, so NTP was working.
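For anyone wanting to repeat the clock comparison, here is a minimal sketch of one way to do it, not the exact check that was run. The helper name max_skew and the hostnames node1, node2 and node3 are placeholders, and it assumes SSH access to each node.

```shell
# max_skew: print the spread (in seconds) between the largest and
# smallest epoch timestamps passed as arguments.
# Hypothetical helper, not part of corosync or pacemaker.
max_skew() {
  printf '%s\n' "$@" | sort -n | awk 'NR==1{min=$1} {max=$1} END{print max-min}'
}

# Example use (placeholder hostnames, requires SSH access):
#   max_skew $(for h in node1 node2 node3; do ssh "$h" date +%s; done)
```

A result of 0 or 1 is consistent with all nodes being within the same second. The SSH round trips themselves add some delay, so this only gives a coarse check.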
crm --version: 1.2.5 (Build da93d3523e6a5b76753cc752eb2701a8a1fcacca)
corosync -v: Corosync Cluster Engine, version '2.3.0'
pacemakerd --version: Pacemaker 1.1.9
Logs from all servers and config attached.
On all nodes:
sudo crm_verify -LV
crit: get_timet_now: Defaulting to 'now'
crit: get_timet_now: Defaulting to 'now'
...
What does this message mean?
Any clue as to the cause? Have I got something dumb in my config?
It's on a VPS provider, hence UDPU and not multicast. Would using their API for STONITH help me?
--
Daniel Black, Engineer @ Open Query (http://openquery.com)
Remote expertise & maintenance for MySQL/MariaDB server environments.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync_offline.tar.gz
Type: application/x-compressed-tar
Size: 19135 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20130419/fc323b96/attachment-0003.bin>