[Pacemaker] Pengine behavior

Andrew Beekhof andrew at beekhof.net
Sun Jul 29 20:59:53 EDT 2012


On Fri, Jul 20, 2012 at 6:39 PM, Виталий Давудов
<vitaliy.davudov at vts24.ru> wrote:
> Hi, David!
>
> Yes, you are right, I'm trying to do active call failover. I hope to reduce
> the silence during a call to 3 seconds (it is currently 5 seconds). If there
> is any kind of directive in corosync to monitor the node more aggressively
> (every 1 sec), I'll be very happy.


man corosync.conf has a few.  I'm guessing you need to further tune
one or more of
       token
       token_retransmit
       hold
       token_retransmits_before_loss_const
       join
       send_join
       consensus
       merge
       downcheck
       fail_recv_const
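
For example, a more aggressive totem section might look something like the
sketch below.  The values are purely illustrative and untested -- tune them
for your own network and test failover before relying on them:

       totem {
               version: 2
               # time (ms) to wait for the token before declaring a
               # processor failed; lower means faster detection, but a
               # higher risk of false failovers on a loaded network
               token: 500
               # retransmit attempts before the token is declared lost
               token_retransmits_before_loss_const: 4
               # must be at least 1.2 * token
               consensus: 600
               # keep your existing interface/bindnetaddr settings here
       }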



>
>
> On 19.07.2012 18:43, David Vossel wrote:
>
>> ----- Original Message -----
>>>
>>> From: "Виталий Давудов" <vitaliy.davudov at vts24.ru>
>>> To: "The Pacemaker cluster resource manager"
>>> <pacemaker at oss.clusterlabs.org>
>>> Sent: Thursday, July 19, 2012 8:08:12 AM
>>> Subject: Re: [Pacemaker] Pengine behavior
>>>
>>>
>>> Hi!
>>>
>>> I have moved my cluster from heartbeat to corosync.
>>> Here is the corosync.conf content:
>>>
>>> compatibility: whitetank
>>>
>>> totem {
>>>         version: 2
>>>         token: 500
>>>         downcheck: 500
>>>         secauth: off
>>>         threads: 0
>>>         interface {
>>>                 ringnumber: 0
>>>                 bindnetaddr: 10.10.1.0
>>>                 mcastaddr: 226.94.1.1
>>>                 mcastport: 5405
>>>         }
>>> }
>>>
>>> logging {
>>>         fileline: off
>>>         to_stderr: no
>>>         to_logfile: yes
>>>         to_syslog: yes
>>>         logfile: /var/log/corosync.log
>>>         debug: on
>>>         timestamp: on
>>>         logger_subsys {
>>>                 subsys: AMF
>>>                 debug: off
>>>         }
>>> }
>>>
>>> amf {
>>>         mode: disabled
>>> }
>>>
>>> quorum {
>>>         provider: corosync_votequorum
>>>         expected_votes: 1
>>> }
>>>
>>> The Pacemaker configuration is unchanged.
>>>
>>> After the first node crashed, I can see in corosync.log that monitoring
>>> stopped at 15:53:24 (i.e. the node crashed at 15:53:24):
>>>
>>> Jul 19 15:53:22 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP2:12:
>>> monitor
>>> Jul 19 15:53:22 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP1:10:
>>> monitor
>>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP3:14:
>>> monitor
>>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:fs:16: monitor
>>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: RA output:
>>> (fs:monitor:stdout) OK
>>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP2:12:
>>> monitor
>>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP1:10:
>>> monitor
>>> Jul 19 15:53:24 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP3:14:
>>> monitor
>>> Jul 19 15:55:00 corosync [MAIN ] Corosync Cluster Engine ('1.2.7'):
>>> started and ready to provide service.
>>> Jul 19 15:55:00 corosync [MAIN ] Corosync built-in features: nss rdma
>>>
>>> On the second node, corosync.log shows:
>>>
>>> Jul 19 15:53:27 corosync [TOTEM ] The token was lost in the
>>> OPERATIONAL state.
>>> Jul 19 15:53:27 corosync [TOTEM ] A processor failed, forming new
>>> configuration.
>>> Jul 19 15:53:27 corosync [TOTEM ] Receive multicast socket recv
>>> buffer size (262142 bytes).
>>> Jul 19 15:53:27 corosync [TOTEM ] Transmit multicast socket send
>>> buffer size (262142 bytes).
>>> Jul 19 15:53:27 corosync [TOTEM ] entering GATHER state from 2.
>>> Jul 19 15:53:28 corosync [TOTEM ] entering GATHER state from 0.
>>>
>>> I.e. the second node detected the crash after 3 seconds.
>>>
>>> Is there any way to reduce this amount of time?
>>>
>> Are you trying to do active call failover or something?  How quickly do
>> you need this failure detected?  Are you hoping the failover will just be a
>> blip in the audio?  There may be a way to monitor the node more aggressively
>> with some sort of ping, but less than 3 seconds is very aggressive.
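>>
>> For example (just an untested sketch -- the host_list address is only a
>> placeholder for something you actually want reachability to), a cloned
>> ocf:pacemaker:ping resource monitored at a short interval would give you
>> connectivity data from each node:
>>
>>     primitive p_ping ocf:pacemaker:ping \
>>         params host_list="10.10.1.254" multiplier="1000" dampen="2s" \
>>         op monitor interval="1s" timeout="5s"
>>     clone cl_ping p_ping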
>>
>> I haven't dealt with trying to optimize this to the point you are probably
>> needing.  Hopefully someone else has some ideas.  I'm sure you have more
>> potential for optimization using the corosync stack though.
>>
>> -- Vossel
>>
>>> Thanks in advance for all yours hints.
>>>
>>>
>>> On 12.07.2012 10:47, Vitaliy Davudov wrote:
>>>
>>>
>>> David, thanks for your answer!
>>>
>>> I'll try to migrate to corosync.
>>>
>>> On 11.07.2012 22:40, David Vossel wrote:
>>>
>>>
>>>
>>> ----- Original Message -----
>>>
>>>
>>> From: "Виталий Давудов" <vitaliy.davudov at vts24.ru>
>>> To: pacemaker at oss.clusterlabs.org
>>> Sent: Wednesday, July 11, 2012 7:34:08 AM
>>> Subject: [Pacemaker] Pengine behavior
>>>
>>>
>>> Hi, list!
>>>
>>> I have configured a cluster for a VoIP application.
>>> Here is my configuration:
>>>
>>> # crm configure show
>>> node $id="552f91eb-e70a-40a5-ac43-cb16e063fdba" freeswitch1 \
>>>         attributes standby="off"
>>> Ah... right here is your problem. You are using freeswitch instead of
>>> Asterisk :P
>>>
>>>
>>>
>>> node $id="c86ab64d-26c4-4595-aa32-bf9d18f714e7" freeswitch2 \
>>>         attributes standby="off"
>>> primitive FailoverIP1 ocf:heartbeat:IPaddr2 \
>>>         params iflabel="FoIP1" ip="91.211.219.142" cidr_netmask="30" nic="eth1.50" \
>>>         op monitor interval="1s"
>>> primitive FailoverIP2 ocf:heartbeat:IPaddr2 \
>>>         params iflabel="FoIP2" ip="172.30.0.1" cidr_netmask="16" nic="eth1.554" \
>>>         op monitor interval="1s"
>>> primitive FailoverIP3 ocf:heartbeat:IPaddr2 \
>>>         params iflabel="FoIP3" ip="10.18.1.1" cidr_netmask="24" nic="eth1.552" \
>>>         op monitor interval="1s"
>>> primitive fs lsb:FSSofia \
>>>         op monitor interval="1s" enabled="false" timeout="2s" on-fail="standby" \
>>>         meta target-role="Started"
>>> group HAServices FailoverIP1 FailoverIP2 FailoverIP3 \
>>>         meta target-role="Started"
>>> order FS-after-IP inf: HAServices fs
>>> property $id="cib-bootstrap-options" \
>>>         dc-version="1.0.12-unknown" \
>>>         cluster-infrastructure="Heartbeat" \
>>>         stonith-enabled="false" \
>>>         expected-quorum-votes="1" \
>>>         no-quorum-policy="ignore" \
>>>         last-lrm-refresh="1299964019"
>>> rsc_defaults $id="rsc-options" \
>>>         resource-stickiness="100"
>>>
>>> When the 1st node crashed, the 2nd node became active. During this
>>> process I found the following lines in the ha-debug file:
>>>
>>> ...
>>> Jul 06 17:16:42 freeswitch1 crmd: [3385]: info: start_subsystem: Starting sub-system "pengine"
>>> Jul 06 17:16:42 freeswitch1 pengine: [3675]: info: Invoked: /usr/lib64/heartbeat/pengine
>>> Jul 06 17:16:42 freeswitch1 pengine: [3675]: info: main: Starting pengine
>>> Jul 06 17:16:46 freeswitch1 crmd: [3385]: info: do_dc_takeover: Taking over DC status for this partition
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_readwrite: We are now in R/W mode
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/11, version=0.391.20): ok (rc=0)
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/12, version=0.391.20): ok (rc=0)
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/14, version=0.391.20): ok (rc=0)
>>> ...
>>>
>>> After "Starting pengine", only thru 4 seconds occured next action.
>>> What happens at this time? Is it possible to reduce this time?
>>> I seem to remember seeing something related to this in the code at
>>> one point. I believe it is limited to the use of heartbeat as the
>>> messaging layer. After starting the pengine, the crmd sleeps while
>>> waiting for the pengine to come up before contacting it. The sleep is
>>> just a guess at how long it will take for the pengine to be up and
>>> ready to accept a connection. That's why it is so long... the hope is
>>> that the gap will be large enough that no one ever runs into problems
>>> with it (I am not a big fan of this type of logic at all). I'd
>>> recommend moving to corosync and seeing if this delay goes away.
>>>
>>> -- Vossel
>>>
>>>
>>>
>>> Thanks in advance.
>>> --
>>> Best regards,
>>> Vitaly
>>>
>>> --
>>> Best regards,
>>> Vitaly
>>>
>
>
> --
>
> Best regards,
> Vitaly
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



