[Pacemaker] Question regarding starting of master/slave resources and ELECTIONs

Andrew Beekhof andrew at beekhof.net
Thu Apr 14 09:14:40 UTC 2011


On Thu, Apr 14, 2011 at 10:49 AM, Andrew Beekhof <andrew at beekhof.net> wrote:

>>> I noticed that 4 of the master/slave resources will start right away but
>>> the
>>> 5 master/slave resource seems to take a minute or so and I am only running
>>> with one node.
>>> Is this expected?
>>
>> Probably, if the other 4 take around a minute each to start.
>> There is an lrmd config variable that controls how much parallelism it
>> allows (but i forget the name).
>> <Bob> It's max-children and I set it to 40 for this test to see if it would
>> change the behavior.  (/sbin/lrmadmin -p max-children 40)
>
> Thats surprising.  I'll have a look at the logs.

Looking at the logs, I see a couple of things:


This is very bad:
Apr 12 19:33:42 mgraid-S000030311-1 crmd: [17529]: WARN: get_uuid:
Could not calculate UUID for mgraid-s000030311-0
Apr 12 19:33:42 mgraid-S000030311-1 crmd: [17529]: WARN:
populate_cib_nodes_ha: Node mgraid-s000030311-0: no uuid found

For some reason Pacemaker can't get the node's UUID from heartbeat.


So we start a few things:

Apr 12 19:33:41 mgraid-S000030311-1 crmd: [17529]: info:
do_lrm_rsc_op: Performing
key=23:3:0:48aac631-8177-4cda-94ea-48dfa9b1a90f
op=SSS000030311:0_start_0 )
Apr 12 19:33:41 mgraid-S000030311-1 crmd: [17529]: info:
do_lrm_rsc_op: Performing
key=49:3:0:48aac631-8177-4cda-94ea-48dfa9b1a90f
op=SSJ000030312:0_start_0 )
Apr 12 19:33:41 mgraid-S000030311-1 crmd: [17529]: info:
do_lrm_rsc_op: Performing
key=75:3:0:48aac631-8177-4cda-94ea-48dfa9b1a90f
op=SSJ000030313:0_start_0 )
Apr 12 19:33:41 mgraid-S000030311-1 crmd: [17529]: info:
do_lrm_rsc_op: Performing
key=101:3:0:48aac631-8177-4cda-94ea-48dfa9b1a90f
op=SSJ000030314:0_start_0 )

But then another change comes in:

Apr 12 19:33:41 mgraid-S000030311-1 crmd: [17529]: info:
abort_transition_graph: need_abort:59 - Triggered transition abort
(complete=0) : Non-status change

Normally we'd recompute and keep going, but it was a(nother) replace
operation, so:

Apr 12 19:33:42 mgraid-S000030311-1 crmd: [17529]: info:
do_state_transition: State transition S_TRANSITION_ENGINE ->
S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL
origin=do_cib_replaced ]
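
For anyone wanting to pull these state changes out of a longer log, here is a rough Python sketch. It is not part of Pacemaker, and the function names (unwrap, state_transitions) are made up for illustration. It assumes syslog-style lines like the ones quoted in this mail, possibly wrapped by a mail client, and it only knows the do_state_transition layout shown above:

```python
import re

# A syslog-style entry starts with "Mon DD HH:MM:SS "; any other line is
# treated as a continuation of the previous (mail-wrapped) entry.
STAMP = re.compile(r"^[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d ")

# Layout assumed from the do_state_transition line quoted in this mail.
TRANSITION = re.compile(
    r"do_state_transition: State transition (\S+) -> (\S+) "
    r"\[ input=(\S+) cause=(\S+) origin=(\S+) \]"
)


def unwrap(lines):
    """Join wrapped continuation lines back onto their log entry."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if STAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def state_transitions(lines):
    """Return (timestamp, old, new, input, cause, origin) tuples."""
    found = []
    for entry in unwrap(lines):
        m = TRANSITION.search(entry)
        if m:
            # The first 15 characters hold the "Mon DD HH:MM:SS" stamp.
            found.append((entry[:15],) + m.groups())
    return found
```

Run over a full log (for example state_transitions(open(path)), with path pointing at wherever your crmd messages end up), any entry whose origin is do_cib_replaced marks a point where a CIB replace forced an election.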

All the time goes here:

Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Timer popped (timeout=20000,
abort_level=1000000, complete=true)
Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Ignoring timeout while not in transition
Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Timer popped (timeout=20000,
abort_level=1000000, complete=true)
Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Ignoring timeout while not in transition
Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Timer popped (timeout=20000,
abort_level=1000000, complete=true)
Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Ignoring timeout while not in transition
Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Timer popped (timeout=20000,
abort_level=1000000, complete=true)
Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Ignoring timeout while not in transition
Apr 12 19:37:00 mgraid-S000030311-1 crmd: [17529]: ERROR:
crm_timer_popped: Integration Timer (I_INTEGRATED) just popped!

but it's not at all clear to me why, although avoiding the election
would certainly help.
Is there any chance of loading all the changes at once?
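
To see where the time goes in a longer log, a small Python sketch along these lines could list the largest gaps between consecutive timestamped lines. The function name largest_gaps is hypothetical, and the year is an assumption passed in by the caller because syslog stamps do not carry one. Note that %b depends on the locale, so this expects English month names:

```python
import re
from datetime import datetime

# Captures the leading "Mon DD HH:MM:SS" stamp of a syslog-style line.
STAMP = re.compile(r"^([A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d) ")


def largest_gaps(lines, year=2011, top=3):
    """Return up to `top` (seconds, before, after) tuples, largest gap first.

    `year` is an assumption: the log lines themselves have no year.
    Lines without a leading timestamp (wrapped continuations) are skipped.
    """
    stamps = []
    for line in lines:
        m = STAMP.match(line)
        if m:
            when = datetime.strptime(
                f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S"
            )
            stamps.append((when, m.group(1)))
    gaps = [
        ((b[0] - a[0]).total_seconds(), a[1], b[1])
        for a, b in zip(stamps, stamps[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]
```

On the excerpts in this mail, the biggest jump is the 109 seconds between the switch to S_ELECTION at 19:33:42 and the first timer pop at 19:35:31.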


The delay is possibly related to the UUID issue above, or it might
be related to one of these two patches that went in after 1.0.9:

andrew (stable-1.0)	High: crmd: Make sure we always poke the FSA after
a transition to clear any TE_HALT actions
CS: 9187c0506fd3 On: 2010-07-07

andrew (stable-1.0)	High: crmd: Reschedule the PE_START action if its
not already running when we try to use it
CS: e44dfe49e448 On: 2010-11-11

Could you try turning on debug and/or a more recent version?
