[Pacemaker] Pengine behavior

Виталий Давудов vitaliy.davudov at vts24.ru
Thu Jul 19 09:08:12 EDT 2012


Hi!

I have moved my cluster from heartbeat to corosync.
Here is the corosync.conf content:

compatibility: whitetank

totem {
         version: 2
         token: 500
         downcheck: 500
         secauth: off
         threads: 0
         interface {
                 ringnumber: 0
                 bindnetaddr: 10.10.1.0
                 mcastaddr: 226.94.1.1
                 mcastport: 5405
         }
}

logging {
         fileline: off
         to_stderr: no
         to_logfile: yes
         to_syslog: yes
         logfile: /var/log/corosync.log
         debug: on
         timestamp: on
         logger_subsys {
                 subsys: AMF
                 debug: off
         }
}

amf {
         mode: disabled
}

quorum {
         provider: corosync_votequorum
         expected_votes: 1
}

Pacemaker configuration is not changed.

After the first node crashed, corosync.log shows that monitoring 
stopped at 15:53:24 (i.e. the node crashed at 15:53:24):

Jul 19 15:53:22 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP2:12: 
monitor
Jul 19 15:53:22 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP1:10: 
monitor
Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP3:14: 
monitor
Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:fs:16: monitor
Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: RA output: 
(fs:monitor:stdout) OK
Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP2:12: 
monitor
Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP1:10: 
monitor
Jul 19 *15:53:24* freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP3:14: 
monitor
Jul 19 15:55:00 corosync [MAIN  ] Corosync Cluster Engine ('1.2.7'): 
started and ready to provide service.
Jul 19 15:55:00 corosync [MAIN  ] Corosync built-in features: nss rdma

On second node in corosync.log:

Jul 19 *15:53:27* corosync [TOTEM ] The token was lost in the 
OPERATIONAL state.
Jul 19 15:53:27 corosync [TOTEM ] A processor failed, forming new 
configuration.
Jul 19 15:53:27 corosync [TOTEM ] Receive multicast socket recv buffer 
size (262142 bytes).
Jul 19 15:53:27 corosync [TOTEM ] Transmit multicast socket send buffer 
size (262142 bytes).
Jul 19 15:53:27 corosync [TOTEM ] entering GATHER state from 2.
Jul 19 15:53:28 corosync [TOTEM ] entering GATHER state from 0.

I.e. the second node detected the crash after about 3 seconds.

Is there any way to reduce this detection time?
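For what it's worth, in corosync 1.x the detection latency is governed mainly by the totem timeouts: a node is declared failed roughly after the token timeout (plus retransmits) expires and consensus is formed. A hedged sketch of the relevant knobs follows; the values are illustrative, not tuned recommendations, and the exact defaults and semantics should be checked against the corosync.conf(5) man page for your version:

```
totem {
         version: 2
         # A processor is declared failed after the token timeout expires.
         # 500 ms is already set in the config above; it may be worth
         # verifying the running value (e.g. with corosync-objctl on the
         # 1.x series) to confirm it actually took effect.
         token: 500
         # How many times the token is retransmitted before loss is
         # declared; fewer retransmits mean faster detection but more
         # false positives on a congested network.
         token_retransmits_before_loss_const: 4
         # Time to wait for consensus before starting a new membership
         # round; must be larger than the token timeout.
         consensus: 750
}
```

Lowering these speeds up failure detection at the cost of spurious membership changes under network load, so any change should be tested under realistic traffic.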

Thanks in advance for any hints.

12.07.2012 10:47, Виталий Давудов wrote:
> David, thanks for your answer!
>
> I'll try to migrate to corosync.
>
> 11.07.2012 22:40, David Vossel wrote:
>>
>> ----- Original Message -----
>>> From: "Виталий Давудов" <vitaliy.davudov at vts24.ru>
>>> To: pacemaker at oss.clusterlabs.org
>>> Sent: Wednesday, July 11, 2012 7:34:08 AM
>>> Subject: [Pacemaker] Pengine behavior
>>>
>>>
>>> Hi, list!
>>>
>>> I have configured a cluster for a VoIP application.
>>> Here is my configuration:
>>>
>>> # crm configure show
>>> node $id="552f91eb-e70a-40a5-ac43-cb16e063fdba" freeswitch1 \
>>> attributes standby="off"
>> Ah... right here is your problem. You are using freeswitch instead of 
>> Asterisk :P
>>
>>> node $id="c86ab64d-26c4-4595-aa32-bf9d18f714e7" freeswitch2 \
>>> attributes standby="off"
>>> primitive FailoverIP1 ocf:heartbeat:IPaddr2 \
>>> params iflabel="FoIP1" ip="91.211.219.142" cidr_netmask="30"
>>> nic="eth1.50" \
>>> op monitor interval="1s"
>>> primitive FailoverIP2 ocf:heartbeat:IPaddr2 \
>>> params iflabel="FoIP2" ip="172.30.0.1" cidr_netmask="16"
>>> nic="eth1.554" \
>>> op monitor interval="1s"
>>> primitive FailoverIP3 ocf:heartbeat:IPaddr2 \
>>> params iflabel="FoIP3" ip="10.18.1.1" cidr_netmask="24"
>>> nic="eth1.552" \
>>> op monitor interval="1s"
>>> primitive fs lsb:FSSofia \
>>> op monitor interval="1s" enabled="false" timeout="2s"
>>> on-fail="standby" \
>>> meta target-role="Started"
>>> group HAServices FailoverIP1 FailoverIP2 FailoverIP3 \
>>> meta target-role="Started"
>>> order FS-after-IP inf: HAServices fs
>>> property $id="cib-bootstrap-options" \
>>> dc-version="1.0.12-unknown" \
>>> cluster-infrastructure="Heartbeat" \
>>> stonith-enabled="false" \
>>> expected-quorum-votes="1" \
>>> no-quorum-policy="ignore" \
>>> last-lrm-refresh="1299964019"
>>> rsc_defaults $id="rsc-options" \
>>> resource-stickiness="100"
>>>
>>> When the 1st node crashed, the 2nd node became active. During this
>>> process I found the following lines in the ha-debug file:
>>>
>>> ...
>>> Jul 06 17:16:42 freeswitch1 crmd: [3385]: info: start_subsystem:
>>> Starting sub-system "pengine"
>>> Jul 06 17:16:42 freeswitch1 pengine: [3675]: info: Invoked:
>>> /usr/lib64/heartbeat/pengine
>>> Jul 06 17:16:42 freeswitch1 pengine: [3675]: info: main: Starting
>>> pengine
>>> Jul 06 17:16:46 freeswitch1 crmd: [3385]: info: do_dc_takeover:
>>> Taking over DC status for this partition
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_readwrite:
>>> We are now in R/W mode
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_request:
>>> Operation complete: op cib_master for section 'all'
>>> (origin=local/crmd/11, version=0.391.20): ok (
>>> rc=0)
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_request:
>>> Operation complete: op cib_modify for section cib
>>> (origin=local/crmd/12, version=0.391.20): ok (rc
>>> =0)
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_request:
>>> Operation complete: op cib_modify for section crm_config
>>> (origin=local/crmd/14, version=0.391.20):
>>> ok (rc=0)
>>> ...
>>>
>>> After "Starting pengine", only thru 4 seconds occured next action.
>>> What happens at this time? Is it possible to reduce this time?
>> I seem to remember seeing something related to this in the code at 
>> one point.  I believe it is limited only to the use of heartbeat as 
>> the messaging layer.  After starting the pengine, the crmd sleeps 
>> waiting for the pengine to start before contacting it.  The sleep is 
>> just a guess at how long it will take before the pengine will be up 
>> and ready to accept a connection though.  That's why it is so long... 
>> so the gap will hopefully be large enough that no one will ever run 
>> into any problems with it (I am not a big fan of this type of logic 
>> at all)  I'd recommend moving to corosync and seeing if this delay 
>> goes away.
>>
>> -- Vossel
>>
>>> Thanks in advance.
>>> -- 
>>> Best regards,
>>> Vitaly
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started:
>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>

-- 
Best regards,
Vitaly
