[Pacemaker] Long failover

Fri Nov 14 13:44:17 UTC 2014

On Fri, Nov 14, 2014 at 4:33 PM, Dmitry Matveichev
<d.matveichev at mfisoft.ru> wrote:
> We've already tried to set it but it didn't help.
>

I doubt it is possible to say anything without logs.

> ------------------------
> Kind regards,
> Dmitriy Matveichev.
>
>
> -----Original Message-----
> From: Andrei Borzenkov [mailto:arvidjaar at gmail.com]
> Sent: Friday, November 14, 2014 4:12 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Long failover
>
> On Fri, Nov 14, 2014 at 2:57 PM, Dmitry Matveichev <d.matveichev at mfisoft.ru> wrote:
>> Hello,
>>
>> We have a cluster configured via pacemaker+corosync+crm. The
>> configuration is:
>>
>> node master
>> node slave
>> primitive HA-VIP1 IPaddr2 \
>>         params ip=192.168.22.71 nic=bond0 \
>>         op monitor interval=1s
>> primitive HA-variator lsb:variator \
>>         op monitor interval=1s \
>>         meta migration-threshold=1 failure-timeout=1s
>> group HA-Group HA-VIP1 HA-variator
>> property cib-bootstrap-options: \
>>         dc-version=1.1.10-14.el6-368c726 \
>>         cluster-infrastructure="classic openais (with plugin)" \
>>         expected-quorum-votes=2 \
>>         stonith-enabled=false \
>>         no-quorum-policy=ignore \
>>         last-lrm-refresh=1383871087
>> rsc_defaults rsc-options: \
>>         resource-stickiness=100
>>
>> First I take the variator service down on the master node (actually
>> I delete the service binary and kill the variator process, so the
>> variator fails to restart). Resources move to the slave node very
>> quickly, as expected. Then I restore the binary on the master and
>> restart the variator service. Now I do the same thing with the binary
>> and the service on the slave node. The crm status command quickly
>> shows me HA-variator (lsb:variator): Stopped. But it takes too much
>> time (for us) before the resources are switched to the master node
>> (around 1 min). Then the line
>>
>> Failed actions:
>>     HA-variator_monitor_1000 on slave 'unknown error' (1): call=-1,
>> status=Timed Out, last-rc-change='Sat Dec 21 03:59:45 2013',
>> queued=0ms, exec=0ms
>>
>> appears in the crm status and the resources are switched.
>>
>> What is that timeout? Where can I change it?
>>
>
> This is the operation timeout. You can change it in the operation definition:
> op monitor interval=1s timeout=5s
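>
> A minimal sketch of where that timeout would sit in the configuration
> quoted above, assuming the rest of the HA-variator primitive stays
> unchanged. The 5s value is only the example figure from this reply,
> not a tuned recommendation:
>
> ```
> primitive HA-variator lsb:variator \
>         op monitor interval=1s timeout=5s \
>         meta migration-threshold=1 failure-timeout=1s
> ```
>
> One way to apply it would be through "crm configure edit", changing
> only the op monitor line of that primitive.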
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org