[Pacemaker] Pacemaker cluster took almost 2 hours to migrate
Andrew Beekhof
andrew at beekhof.net
Fri Apr 4 02:26:30 UTC 2014
On 24 Mar 2014, at 8:23 pm, Sergey A. Tachenov <stachenov at runbox.com> wrote:
> At this point the second node finally realizes something is wrong there,
> fences the first node and takes over. After reboot, everything looks
> like it's working fine now. Needless to say, 1 hour 45 minutes is a bit
> too long for a recovery.
>
> Got any ideas where to look? Basically I'd like Pacemaker to detect
> whatever happened and migrate to another node before trying to monitor,
> restart or whatever else it tried to do with those resources.
>
> As far as I understand, Pacemaker is supposed to restart a service as
> soon as the monitor operation fails (provided that I didn't specify
> on-fail for the monitor action). Why didn't it try to restart any
> resources until 45 minutes later? I expected to see something like this:
>
> monitor fails -> restart fails -> STONITH
So would I.
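For reference, the sequence expected above can be written out as a toy model. This is only a sketch of the escalation described in the quoted question, not Pacemaker's actual decision logic; the function name and the step labels are hypothetical and chosen purely for illustration.

```python
def expected_recovery(monitor_ok, restart_ok):
    """Return the recovery steps the quoted question expects, in order.

    Toy model only (an assumption for illustration): a failed monitor
    leads to a restart attempt, and a failed restart leads to fencing.
    """
    steps = []
    if monitor_ok:
        # A healthy monitor result needs no recovery at all.
        return steps
    # monitor fails -> try to restart the resource
    steps.append("restart")
    if not restart_ok:
        # restart fails -> fence the node (STONITH)
        steps.append("stonith")
    return steps


if __name__ == "__main__":
    print(expected_recovery(monitor_ok=False, restart_ok=False))
```

The point of the model is that every step follows directly from the previous failure, so a long gap between the failed monitor and the first restart attempt is not part of the expected behaviour.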
At this point, though, I would suggest an upgrade:
1. Fedora 16 is EOL
2. This looks like an lrmd issue, and the lrmd was rewritten for 1.1.9
3. http://blog.clusterlabs.org/blog/2014/potential-for-data-corruption-in-pacemaker-1-dot-1-6-through-1-dot-1-9/
Why not try CentOS, which ships 1.1.10 via official channels?