[Pacemaker] Configuration recommendations for (very?) large cluster
Andrew Beekhof
andrew at beekhof.net
Thu Aug 14 09:35:34 CEST 2014
On 14 Aug 2014, at 3:28 pm, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
> 14.08.2014 05:24, Andrew Beekhof wrote:
>>
>> On 14 Aug 2014, at 12:05 am, Lars Ellenberg <Lars.Ellenberg at linbit.com> wrote:
>>
>>> On Wed, Aug 13, 2014 at 10:33:55AM +1000, Andrew Beekhof wrote:
>>>> On 13 Aug 2014, at 2:02 am, Cédric Dufour - Idiap Research Institute <cedric.dufour at idiap.ch> wrote:
>>>>> On 12/08/14 07:52, Andrew Beekhof wrote:
>>>>>> On 11 Aug 2014, at 10:10 pm, Cédric Dufour - Idiap Research Institute <cedric.dufour at idiap.ch> wrote:
>>>
>>> ...
>>>
>>>>> While I still had the ~450 resources, I also "accidentally" brought
>>>>> all 22 nodes back to life together (well, actually started the DC
>>>>> alone and then started the remaining 21 nodes together). As could be
>>>>> expected, the DC got quite busy (dispatching/executing the ~450*22
>>>>> monitoring operations on all nodes). It took 40 minutes for the
>>>>> cluster to stabilize. But it did stabilize, with no timeouts and no
>>>>> monitor operation failures! A few "high CIB load detected / throttle
>>>>> down mode" messages popped up but all went well.
>>>
>>> Cool.
>>>
>>>> That's about 0.12s per operation, not too bad.
>>>> More importantly, I'm glad to hear that real-world clusters are seeing
>>>> the same kind of improvements as those in the lab.
>>>>
>>>> It would be interesting to know how the 40 minutes compares to bringing one node online at a time.
>>>>
>>>>>
>>>>> Q: Is there a way to favor more powerful nodes for the DC (in other words, push the DC "election" process in a preferred direction)?
>>>>
>>>> Only by starting it first and ensuring it doesn't die (we prioritize the node with the largest crmd process uptime).
>>>
>>> Uhm, there was a patch once for pacemaker-1.0.
>>> The latest version I found right now is below.
>>> Written by Klaus Wenninger, iirc.
>>
>> The new CIB code actually reduces the case for a patch like this - since all updates are performed on all hosts.
>> So the workload from the CIB should be pretty much identical on all nodes.
>>
>> The load from the crmd is mostly from talking to the lrmd, which is dependent on resource placement rather than being (or not being) the DC.
>
> I've seen a different picture with 1024 unique clone instances. crmd's
> CPU load on the DC is much higher in that case during probe/start/stop.
That's with the new CIB code?
>
>>
>> About the only reason would be to make the pengine go faster - and I'm not completely convinced that is a sufficient justification.
>
> In one cluster I have heterogeneous HW - some nodes are Xeons, some are
> Atoms. I'd prefer to have a way to perform pe calculations on the former
> ones.
>
>>
>>>
>>> The idea was to communicate via environment
>>> a "HA_dc_prio" value, with meanings:
>>>
>>> unset => use default of 1
>>> <= -1: Node does not become DC (does not vote)
>>
>> The "< 0" behaviour sounds dangerous.
>> That would preclude a DC from being elected in a partition where all nodes had this value.
>>
>>> == 0: Node may only become DC if no node with >= 1 is available.
>>> It also will trigger an election whenever a node joins.
>>> (== 1: default)
>>> >= 1: "classic pacemaker behavior",
>>> but changed so positive prio will be checked first,
>>> and higher positive prio will win
>>>
>>> It may still apply (with white space changes) to current pacemaker 1.0.
>>>
>>> It will need some more adjustments for pacemaker 1.1,
>>> but a quick browse through the code suggests it won't be too much work.
>>>
>>> Lars
>>>
>>>
>>> --- crmd/election.c.orig 2011-11-28 16:24:54.345431668 +0100
>>> +++ crmd/election.c 2011-11-28 16:39:18.008420543 +0100
>>> @@ -33,6 +33,7 @@
>>> GHashTable *voted = NULL;
>>> uint highest_born_on = -1;
>>> static int current_election_id = 1;
>>> +static int our_dc_prio = INT_MIN; /* INT_MIN/<0/==0/>0 not_set/not_voting/retrigger_election/default_behaviour_plus_prio */
>>>
>>> static int
>>> crm_uptime(struct timeval *output)
>>> @@ -107,6 +108,20 @@
>>> break;
>>> }
>>>
>>> + if (our_dc_prio == INT_MIN) {
>>> + char * dc_prio_str = getenv("HA_dc_prio");
>>> +
>>> + if (dc_prio_str == NULL) {
>>> + our_dc_prio = 1;
>>> + } else {
>>> + our_dc_prio = atoi(dc_prio_str);
>>> + }
>>> + }
>>> +
>>> + if (our_dc_prio < 0) {
>>> + not_voting = TRUE;
>>> + }
>>> +
>>> if (not_voting == FALSE) {
>>> if (is_set(fsa_input_register, R_STARTING)) {
>>> not_voting = TRUE;
>>> @@ -123,6 +138,7 @@
>>> current_election_id++;
>>> crm_xml_add(vote, F_CRM_ELECTION_OWNER, fsa_our_uuid);
>>> crm_xml_add_int(vote, F_CRM_ELECTION_ID, current_election_id);
>>> + crm_xml_add_int(vote, F_CRM_DC_PRIO, our_dc_prio);
>>>
>>> crm_uptime(&age);
>>> crm_xml_add_int(vote, F_CRM_ELECTION_AGE_S, age.tv_sec);
>>> @@ -241,8 +258,9 @@
>>> {
>>> struct timeval your_age;
>>> int age;
>>> int election_id = -1;
>>> + int your_dc_prio = 1;
>>> int log_level = LOG_INFO;
>>> gboolean use_born_on = FALSE;
>>> gboolean done = FALSE;
>>> gboolean we_loose = FALSE;
>>> @@ -273,6 +291,18 @@
>>> your_version = crm_element_value(vote->msg, F_CRM_VERSION);
>>> election_owner = crm_element_value(vote->msg, F_CRM_ELECTION_OWNER);
>>> crm_element_value_int(vote->msg, F_CRM_ELECTION_ID, &election_id);
>>> + crm_element_value_int(vote->msg, F_CRM_DC_PRIO, &your_dc_prio);
>>> +
>>> + if (our_dc_prio == INT_MIN) {
>>> + char * dc_prio_str = getenv("HA_dc_prio");
>>> +
>>> + if (dc_prio_str == NULL) {
>>> + our_dc_prio = 1;
>>> + } else {
>>> + our_dc_prio = atoi(dc_prio_str);
>>> + }
>>> + }
>>> +
>>> crm_element_value_int(vote->msg, F_CRM_ELECTION_AGE_S, (int *)&(your_age.tv_sec));
>>> crm_element_value_int(vote->msg, F_CRM_ELECTION_AGE_US, (int *)&(your_age.tv_usec));
>>>
>>> @@ -334,6 +364,13 @@
>>> reason = "Recorded";
>>> done = TRUE;
>>>
>>> + } else if(our_dc_prio < your_dc_prio) {
>>> + reason = "DC Prio";
>>> + we_loose = TRUE;
>>> +
>>> + } else if(our_dc_prio > your_dc_prio) {
>>> + reason = "DC Prio";
>>> +
>>> } else if (compare_version(your_version, CRM_FEATURE_SET) < 0) {
>>> reason = "Version";
>>> we_loose = TRUE;
>>> @@ -400,6 +437,7 @@
>>>
>>> crm_xml_add(novote, F_CRM_ELECTION_OWNER, election_owner);
>>> crm_xml_add_int(novote, F_CRM_ELECTION_ID, election_id);
>>> + crm_xml_add_int(novote, F_CRM_DC_PRIO, 0); /* rather don't advertise a negative value */
>>>
>>> send_cluster_message(vote_from, crm_msg_crmd, novote, TRUE);
>>> free_xml(novote);
>>> --- include/crm/msg_xml.h.orig 2011-11-28 16:41:47.309414327 +0100
>>> +++ include/crm/msg_xml.h 2011-11-28 16:42:23.921417584 +0100
>>> @@ -33,6 +33,7 @@
>>> # define F_CRM_USER "crm_user"
>>> # define F_CRM_JOIN_ID "join_id"
>>> # define F_CRM_ELECTION_ID "election-id"
>>> +# define F_CRM_DC_PRIO "dc-prio"
>>> # define F_CRM_ELECTION_AGE_S "election-age-sec"
>>> # define F_CRM_ELECTION_AGE_US "election-age-nano-sec"
>>> # define F_CRM_ELECTION_OWNER "election-owner"
>>> --- lib/ais/plugin.c.orig 2011-11-28 16:42:57.002411543 +0100
>>> +++ lib/ais/plugin.c 2011-11-28 16:44:22.160413844 +0100
>>> @@ -409,6 +409,9 @@
>>> get_config_opt(pcmk_api, local_handle, "use_logd", &value, "no");
>>> pcmk_env.use_logd = value;
>>>
>>> + get_config_opt(pcmk_api, local_handle, "dc_prio", &value, "1");
>>> + pcmk_env.dc_prio = value;
>>> +
>>> get_config_opt(pcmk_api, local_handle, "use_mgmtd", &value, "no");
>>> if (ais_get_boolean(value) == FALSE) {
>>> int lpc = 0;
>>> @@ -599,6 +602,7 @@
>>> pcmk_env.logfile = NULL;
>>> pcmk_env.use_logd = "false";
>>> pcmk_env.syslog = "daemon";
>>> + pcmk_env.dc_prio = "1";
>>>
>>> if (cs_uid != root_uid) {
>>> ais_err("Corosync must be configured to start as 'root',"
>>> --- lib/ais/utils.c.orig 2011-11-28 16:45:01.940415754 +0100
>>> +++ lib/ais/utils.c 2011-11-28 16:45:33.018412117 +0100
>>> @@ -237,6 +237,7 @@
>>> setenv("HA_logfacility", pcmk_env.syslog, 1);
>>> setenv("HA_LOGFACILITY", pcmk_env.syslog, 1);
>>> setenv("HA_use_logd", pcmk_env.use_logd, 1);
>>> + setenv("HA_dc_prio", pcmk_env.dc_prio, 1);
>>> setenv("HA_quorum_type", pcmk_env.quorum, 1);
>>> /* *INDENT-ON* */
>>>
>>> --- lib/ais/utils.h.orig 2011-11-28 16:45:45.143412597 +0100
>>> +++ lib/ais/utils.h 2011-11-28 16:46:37.026410208 +0100
>>> @@ -238,6 +238,7 @@
>>> const char *syslog;
>>> const char *logfile;
>>> const char *use_logd;
>>> + const char *dc_prio;
>>> const char *quorum;
>>> };
>>>
>>> --- crmd/messages.c.orig 2012-05-25 16:23:22.913106180 +0200
>>> +++ crmd/messages.c 2012-05-25 16:28:30.330263392 +0200
>>> @@ -36,6 +36,8 @@
>>> #include <crmd_messages.h>
>>> #include <crmd_lrm.h>
>>>
>>> +static int our_dc_prio = INT_MIN;
>>> +
>>> GListPtr fsa_message_queue = NULL;
>>> extern void crm_shutdown(int nsig);
>>>
>>> @@ -693,7 +695,19 @@
>>> /*========== DC-Only Actions ==========*/
>>> if (AM_I_DC) {
>>> if (strcmp(op, CRM_OP_JOIN_ANNOUNCE) == 0) {
>>> - return I_NODE_JOIN;
>>> + if (our_dc_prio == INT_MIN) {
>>> + char * dc_prio_str = getenv("HA_dc_prio");
>>> +
>>> + if (dc_prio_str == NULL) {
>>> + our_dc_prio = 1;
>>> + } else {
>>> + our_dc_prio = atoi(dc_prio_str);
>>> + }
>>> + }
>>> + if (our_dc_prio == 0)
>>> + return I_ELECTION;
>>> + else
>>> + return I_NODE_JOIN;
>>>
>>> } else if (strcmp(op, CRM_OP_JOIN_REQUEST) == 0) {
>>> return I_JOIN_REQUEST;
>>>
>>>
>>> --
>>> : Lars Ellenberg
>>> : LINBIT | Your Way to High Availability
>>> : DRBD/HA support and consulting http://www.linbit.com
>>>
>>> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org