[Pacemaker] Pacemaker cluster took almost 2 hours to migrate

Fri Apr 4 02:26:30 UTC 2014

On 24 Mar 2014, at 8:23 pm, Sergey A. Tachenov <stachenov at runbox.com> wrote:

> At this point the second node finally realizes something is wrong there, 
> fences the first node and takes over. After reboot, everything looks 
> like it's working fine now. Needless to say, 1 hour 45 minutes is a bit 
> too long for a recovery.
> 
> Got any ideas where to look? Basically I'd like Pacemaker to detect 
> whatever happened and migrate to another node before trying to monitor, 
> restart or whatever else it tried to do with those resources.
> 
> As far as I understand, Pacemaker is supposed to restart a service as 
> soon as the monitor operation fails (provided that I didn't specify 
> on-fail for the monitor action). Why didn't it try to restart any 
> resources until 45 minutes later? I expected to see something like this:
> 
> monitor fails -> restart fails -> STONITH

So would I.
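
For reference, here is a toy sketch of the escalation we would both expect. It is only an illustration, not Pacemaker code: the function name, arguments and step strings are all made up. It assumes the usual defaults as I understand them: on-fail=restart for monitor, on-fail=fence for a failed stop when STONITH is enabled (block otherwise), and a failed start pushing the resource to another node.

```python
def expected_recovery(monitor_ok, stop_ok, start_ok, stonith_enabled=True):
    """Toy model of the expected escalation for one resource on one node.

    Hypothetical helper for illustration only; it does not talk to a cluster.
    Returns the ordered list of steps the cluster would be expected to take.
    """
    if monitor_ok:
        return ["monitor ok"]

    steps = ["monitor failed"]

    # Assumed default for monitor: on-fail=restart, i.e. stop then start.
    if not stop_ok:
        steps.append("stop failed")
        # Assumed default for stop: fence if STONITH is enabled, else block.
        steps.append("fence node" if stonith_enabled else "block resource")
        return steps

    steps.append("stopped")
    if start_ok:
        steps.append("started on same node")
    else:
        # Assumed default: a failed start bans the resource from this node.
        steps.append("start failed")
        steps.append("move to another node")
    return steps


if __name__ == "__main__":
    # The case described in the quoted message: the restart itself fails.
    print(" -> ".join(expected_recovery(False, False, True)))
```

In this model, expected_recovery(False, False, True) yields "monitor failed", "stop failed", "fence node", which is the short chain from the quoted message rather than a 45 minute wait before the first restart attempt.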

At this point though I would suggest an upgrade:

1. Fedora 16 is EOL
2. This looks like an lrmd issue and the lrmd was rewritten for 1.1.9
3. http://blog.clusterlabs.org/blog/2014/potential-for-data-corruption-in-pacemaker-1-dot-1-6-through-1-dot-1-9/

Why not try CentOS, which ships 1.1.10 via official channels?