[Pacemaker] Long failover
Dmitry Matveichev
d.matveichev at mfisoft.ru
Mon Nov 17 09:31:52 UTC 2014
Hello,
Debug logs from slave are attached. Hope it helps.
------------------------
Kind regards,
Dmitriy Matveichev.
-----Original Message-----
From: Andrew Beekhof [mailto:andrew at beekhof.net]
Sent: Monday, November 17, 2014 10:48 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Long failover
> On 17 Nov 2014, at 6:17 pm, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>
> On Mon, Nov 17, 2014 at 9:34 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>
>>> On 14 Nov 2014, at 10:57 pm, Dmitry Matveichev <d.matveichev at mfisoft.ru> wrote:
>>>
>>> Hello,
>>>
>>> We have a cluster configured via pacemaker+corosync+crm. The configuration is:
>>>
>>> node master
>>> node slave
>>> primitive HA-VIP1 IPaddr2 \
>>> params ip=192.168.22.71 nic=bond0 \
>>> op monitor interval=1s
>>> primitive HA-variator lsb:variator \
>>> op monitor interval=1s \
>>> meta migration-threshold=1 failure-timeout=1s
>>> group HA-Group HA-VIP1 HA-variator
>>> property cib-bootstrap-options: \
>>> dc-version=1.1.10-14.el6-368c726 \
>>> cluster-infrastructure="classic openais (with plugin)" \
>>
>> General advice, don't use the plugin. See:
>>
>> http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
>> http://blog.clusterlabs.org/blog/2013/pacemaker-on-rhel6-dot-4/
>>
>>> expected-quorum-votes=2 \
>>> stonith-enabled=false \
>>> no-quorum-policy=ignore \
>>> last-lrm-refresh=1383871087
>>> rsc_defaults rsc-options: \
>>> resource-stickiness=100
>>>
>>> First I bring the variator service down on the master node (I actually delete the service binary and kill the variator process, so variator fails to restart). Resources move to the slave node very quickly, as expected. Then I restore the binary on the master and restart the variator service. Next I do the same thing with the binary and service on the slave node. The crm status command quickly shows HA-variator (lsb:variator): Stopped. But it takes too much time (for us), around 1 minute, before the resources are switched to the master node.
>>
>> I see what you mean:
>>
>> 2013-12-21T07:04:12.230827+04:00 master crmd[14267]: notice: te_rsc_command: Initiating action 2: monitor HA-variator_monitor_1000 on slave.mfisoft.ru
>> 2013-12-21T05:45:09+04:00 slave crmd[7086]: notice: process_lrm_event: slave.mfisoft.ru-HA-variator_monitor_1000:106 [ variator.x is stopped\n ]
>>
>> (1 minute goes by)
>>
>> 2013-12-21T07:05:14.232029+04:00 master crmd[14267]: error: print_synapse: [Action 2]: In-flight rsc op HA-variator_monitor_1000 on slave.mfisoft.ru (priority: 0, waiting: none)
>> 2013-12-21T07:05:14.232102+04:00 master crmd[14267]: warning: cib_action_update: rsc_op 2: HA-variator_monitor_1000 on slave.mfisoft.ru timed out
>>
>
> Is it possible that pacemaker is confused by the time difference between
> master and slave?
Timeouts are all calculated locally, so it shouldn't be an issue (aside from making the logs harder to read).
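
As a quick check of the timestamps quoted in this thread, here is a small Python sketch (not part of the original exchange). It only does arithmetic on the three log timestamps shown above. The constant and function names are made up for illustration, and the "apparent offset" assumes the slave logged its monitor result at roughly the same real moment the master initiated the action, which the logs alone do not prove.

```python
from datetime import datetime

# Timestamps copied from the log lines quoted in this thread.
MASTER_INITIATED = "2013-12-21T07:04:12.230827+04:00"  # te_rsc_command on master
MASTER_TIMED_OUT = "2013-12-21T07:05:14.232029+04:00"  # print_synapse on master
SLAVE_RESULT = "2013-12-21T05:45:09+04:00"             # process_lrm_event on slave


def seconds_between(start: str, end: str) -> float:
    """Return end - start in seconds for two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds()


# Time the master waited before reporting the monitor as timed out.
wait_on_master = seconds_between(MASTER_INITIATED, MASTER_TIMED_OUT)

# Apparent clock offset, ASSUMING the slave's result was logged at about
# the same real moment as the master initiated the action.
apparent_offset = seconds_between(SLAVE_RESULT, MASTER_INITIATED)

print(f"master waited {wait_on_master:.3f} s")
print(f"apparent clock offset {apparent_offset:.0f} s")
```

Under those assumptions the master waited about 62 seconds, and the two nodes' log clocks differ by roughly 1 hour 19 minutes, which is why the entries are awkward to line up by eye.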
>
>> Is there a corosync log file configured? That would have more detail on slave.
>>
>>> Then the line
>>> Failed actions:
>>> HA-variator_monitor_1000 on slave 'unknown error' (1): call=-1, status=Timed Out, last-rc-change='Sat Dec 21 03:59:45 2013', queued=0ms, exec=0ms
>>> appears in the crm status and the resources are switched.
>>>
>>> What is that timeout? Where can I change it?
>>>
>>> ------------------------
>>> Kind regards,
>>> Dmitriy Matveichev.
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org Getting started:
>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 1.log
Type: application/octet-stream
Size: 161865 bytes
Desc: 1.log
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141117/0f7d3265/attachment-0004.obj>