[Pacemaker] Utilization & resource stickiness strange behaviour - sorted resources
Andreas Kurz
andreas at hastexo.com
Thu Jan 19 15:58:07 UTC 2012
Hello,
On 01/19/2012 01:39 PM, agutxi Agustin wrote:
> Hi all,
> I am trying to set up a cluster of virtual machine hosts, and while
> doing so I ran into some very strange behaviour (I think it may be
> a bug); I hope you can lend me a hand in debugging this.
> To reproduce the behaviour observed in my production environment, I
> set up 2 new simple machines with no location/colocation/order
> constraints and swapped the Xen resource agent for the Dummy
> resource agent; the behaviour was the same.
Have you tried Pacemaker 1.1.6? There have been some utilization fixes.
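A quick way to confirm what each node is actually running (assuming the
standard Pacemaker CLI tools are installed, and pacemakerd on builds that
ship the master process) is to compare the installed version with the
dc-version property that already appears in your configuration below:

  # pacemakerd --version
  # crm_attribute --type crm_config --name dc-version --query

dc-version only reflects the build of the current DC, so check every node
before and after upgrading.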
Regards,
Andreas
--
Need help with Pacemaker?
http://www.hastexo.com/now
>
> The scenario is the following:
> - The strategy is "utilization".
> - 2 nodes, vmHost1 and vmHost2, with 2 cores each, handle 5 resources:
> DummyVM1-5, with resource-stickiness="INFINITY".
>
> What happens is the following:
> - If I start resources DummyVM1-4, everything is fine: 2 resources
> run on each of the nodes.
> - Now I start DummyVM5, but utilization is full, so it does not
> start (cool :)
> - Then, if I stop any of the running resources, everything goes
> smoothly and DummyVM5 starts up. That's cool too.
> * Here comes the strange part:
> Utilization is full and resource-stickiness is INFINITY, so starting
> a new resource shouldn't change anything in the cluster status. BUT
> if the freshly started resource sorts alphabetically before a
> running one, the cluster stops the alphabetically last running
> resource and starts the stopped one in its place.
> I don't think this is the expected behaviour; please correct me if I'm wrong.
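One way to see where this decision comes from is to dump the policy
engine's allocation scores. Assuming your build ships crm_simulate (older
installs have the roughly equivalent ptest), running it against the live
cluster prints the scores and the actions it would schedule:

  # crm_simulate --live-check --show-scores

With resource-stickiness=INFINITY every running resource should score
INFINITY on its current node, so any stop decision should stand out
immediately in that output.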
>
> Thank you kindly,
> Agustin
>
> status:
> ====
> Online: [ vmHost1 vmHost2 ]
>
> DummyVM1 (ocf::pacemaker:Dummy): Started vmHost1
> DummyVM2 (ocf::pacemaker:Dummy): Started vmHost1
> DummyVM3 (ocf::pacemaker:Dummy): Started vmHost2
> DummyVM5 (ocf::pacemaker:Dummy): Started vmHost2
> crm(live)# resource start DummyVM4
>
> My configuration:
> ============
> crm(live)# configure show
> node vmHost1 \
> utilization cores="2"
> node vmHost2 \
> utilization cores="2"
> primitive DummyVM1 ocf:pacemaker:Dummy \
> op monitor interval="60s" timeout="60s" \
> op start on-fail="restart" interval="0" \
> op stop on-fail="ignore" interval="0" \
> utilization cores="1" \
> meta is-managed="true" migration-threshold="2"
> primitive DummyVM2 ocf:pacemaker:Dummy \
> op monitor interval="60s" timeout="60s" \
> op start on-fail="restart" interval="0" \
> op stop on-fail="ignore" interval="0" \
> utilization cores="1" \
> meta is-managed="true" migration-threshold="2"
> primitive DummyVM3 ocf:pacemaker:Dummy \
> op monitor interval="60s" timeout="60s" \
> op start on-fail="restart" interval="0" \
> op stop on-fail="ignore" interval="0" \
> utilization cores="1" \
> meta is-managed="true" migration-threshold="2"
> primitive DummyVM4 ocf:pacemaker:Dummy \
> op monitor interval="60s" timeout="60s" \
> op start on-fail="restart" interval="0" \
> op stop on-fail="ignore" interval="0" \
> utilization cores="1" \
> meta is-managed="true" migration-threshold="2" target-role="Started"
> primitive DummyVM5 ocf:pacemaker:Dummy \
> op monitor interval="60s" timeout="60s" \
> op start on-fail="restart" interval="0" \
> op stop on-fail="ignore" interval="0" \
> utilization cores="1" \
> meta is-managed="true" migration-threshold="2" target-role="Started"
> property $id="cib-bootstrap-options" \
> dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> stop-all-resources="false" \
> placement-strategy="utilization" \
> no-quorum-policy="ignore" \
> stop-orphan-resources="true" \
> stop-orphan-actions="true" \
> last-lrm-refresh="1326975274"
> rsc_defaults $id="rsc-options" \
> resource-stickiness="INFINITY"
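As a side note, one way to double-check how the cluster itself sees the
utilization attributes (rather than how the crm shell renders them) is to
query the relevant CIB sections directly, assuming cibadmin is available:

  # cibadmin -Q -o nodes | grep -A2 utilization
  # cibadmin -Q -o resources | grep -A2 utilization

Each node should carry an nvpair cores="2" and each DummyVM an nvpair
cores="1" inside its <utilization> block.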
>
>
> and the important part of /var/log/syslog:
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- <cib
> admin_epoch="0" epoch="75" num_updates="4" >
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- <configuration >
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- <resources >
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- <primitive
> id="DummyVM4" >
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff-
> <meta_attributes id="DummyVM4-meta_attributes" >
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- <nvpair
> value="Stopped" id="DummyVM4-meta_attributes-target-role" />
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- </meta_attributes>
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- </primitive>
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- </resources>
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- </configuration>
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- </cib>
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ <cib epoch="76"
> num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2"
> crm_feature_set="3.0.5" have-quorum="1" cib-last-written="Thu Jan 19
> 12:35:47 2012" dc-uuid="vmHost1" >
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ <configuration >
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ <resources >
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ <primitive
> class="ocf" id="DummyVM4" provider="pacemaker" type="Dummy" >
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+
> <meta_attributes id="DummyVM4-meta_attributes" >
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ <nvpair
> id="DummyVM4-meta_attributes-target-role" name="target-role"
> value="Started" />
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ </meta_attributes>
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ </primitive>
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ </resources>
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ </configuration>
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ </cib>
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: abort_transition_graph:
> te_update_diff:131 - Triggered transition abort (complete=1, tag=diff,
> id=(null), magic=NA, cib=0.76.1) : Non-status change
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_state_transition: All 2
> cluster nodes are eligible to run resources.
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_pe_invoke: Query 191:
> Requesting the current CIB: S_POLICY_ENGINE
> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib_process_request:
> Operation complete: op cib_replace for section resources
> (origin=local/cibadmin/2, version=0.76.1): ok (rc=0)
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_pe_invoke_callback:
> Invoking the PE: query=191, ref=pe_calc-dc-1326976579-119, seq=64,
> quorate=1
> Jan 19 13:36:19 vmHost1 pengine: [728]: notice: unpack_config: On loss
> of CCM Quorum: Ignore
> Jan 19 13:36:19 vmHost1 pengine: [728]: notice: native_print:
> DummyVM1#011(ocf::pacemaker:Dummy):#011Started vmHost1
> Jan 19 13:36:19 vmHost1 pengine: [728]: notice: native_print:
> DummyVM2#011(ocf::pacemaker:Dummy):#011Started vmHost1
> Jan 19 13:36:19 vmHost1 pengine: [728]: notice: native_print:
> DummyVM3#011(ocf::pacemaker:Dummy):#011Started vmHost2
> Jan 19 13:36:19 vmHost1 pengine: [728]: notice: native_print:
> DummyVM4#011(ocf::pacemaker:Dummy):#011Stopped
> Jan 19 13:36:19 vmHost1 pengine: [728]: notice: native_print:
> DummyVM5#011(ocf::pacemaker:Dummy):#011Started vmHost2
> Jan 19 13:36:19 vmHost1 pengine: [728]: notice: RecurringOp: Start
> recurring monitor (60s) for DummyVM4 on vmHost2
> Jan 19 13:36:19 vmHost1 pengine: [728]: notice: LogActions: Leave
> DummyVM1#011(Started vmHost1)
> Jan 19 13:36:19 vmHost1 pengine: [728]: notice: LogActions: Leave
> DummyVM2#011(Started vmHost1)
> Jan 19 13:36:19 vmHost1 pengine: [728]: notice: LogActions: Leave
> DummyVM3#011(Started vmHost2)
> Jan 19 13:36:19 vmHost1 pengine: [728]: notice: LogActions: Start
> DummyVM4#011(vmHost2)
> Jan 19 13:36:19 vmHost1 pengine: [728]: notice: LogActions: Stop
> DummyVM5#011(vmHost2)
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=handle_response ]
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: unpack_graph: Unpacked
> transition 33: 6 actions in 6 synapses
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_te_invoke: Processing
> graph 33 (ref=pe_calc-dc-1326976579-119) derived from
> /var/lib/pengine/pe-input-717.bz2
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: te_rsc_command: Initiating
> action 19: stop DummyVM5_stop_0 on vmHost2
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: te_pseudo_action: Pseudo
> action 6 fired and confirmed
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: match_graph_event: Action
> DummyVM5_stop_0 (19) confirmed on vmHost2 (rc=0)
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: te_pseudo_action: Pseudo
> action 7 fired and confirmed
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: te_pseudo_action: Pseudo
> action 5 fired and confirmed
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: te_rsc_command: Initiating
> action 17: start DummyVM4_start_0 on vmHost2
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: match_graph_event: Action
> DummyVM4_start_0 (17) confirmed on vmHost2 (rc=0)
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: te_rsc_command: Initiating
> action 18: monitor DummyVM4_monitor_60000 on vmHost2
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: match_graph_event: Action
> DummyVM4_monitor_60000 (18) confirmed on vmHost2 (rc=0)
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: run_graph:
> ====================================================
> Jan 19 13:36:19 vmHost1 crmd: [729]: notice: run_graph: Transition 33
> (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pengine/pe-input-717.bz2): Complete
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: te_graph_trigger:
> Transition 33 is now complete
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: notify_crmd: Transition 33
> status: done - <null>
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_state_transition:
> Starting PEngine Recheck Timer
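For what it's worth, the pe-input file named in the transition above can
be replayed offline, which also makes this easy to attach to a bug report:

  # crm_simulate --xml-file /var/lib/pengine/pe-input-717.bz2 \
      --show-scores --simulate

re-runs the policy engine on exactly that input and prints the allocation
scores plus the resulting actions (newer builds also accept
--show-utilization, which is handy with placement-strategy=utilization).
If this still reproduces on 1.1.6, that file plus the configuration above
should be everything a bug report needs.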