[Pacemaker] About quorum control when stopping services (no-quorum-policy=freeze)

Andrew Beekhof andrew at beekhof.net
Mon Sep 13 09:32:27 EDT 2010


On Fri, Sep 10, 2010 at 7:22 AM,  <renayama19661014 at ybb.ne.jp> wrote:
> Hi,
>
> We confirmed the behavior of no-quorum-policy=freeze in a four-node configuration.
>
> Of course, we understand that quorum control does not work well on the Heartbeat stack.
>
> We tested stopping the four nodes with the following procedure.
>
> Step 1) We start the four nodes (3ACT:1STB).
>
> Step 2) We load cib.xml.
>
> ============
> Last updated: Fri Sep 10 14:16:30 2010
> Stack: Heartbeat
> Current DC: srv04 (96faf899-13a6-4550-9d3b-b784f7241d06) - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 4 Nodes configured, unknown expected votes
> 7 Resources configured.
> ============
>
> Online: [ srv01 srv02 srv03 srv04 ]
>
>  Resource Group: Group01
>     Dummy01    (ocf::heartbeat:Dummy): Started srv01
>     Dummy01-2  (ocf::heartbeat:Dummy): Started srv01
>  Resource Group: Group02
>     Dummy02    (ocf::heartbeat:Dummy): Started srv02
>     Dummy02-2  (ocf::heartbeat:Dummy): Started srv02
>  Resource Group: Group03
>     Dummy03    (ocf::heartbeat:Dummy): Started srv03
>     Dummy03-2  (ocf::heartbeat:Dummy): Started srv03
>  Resource Group: grpStonith1
>     prmStonith1-3      (stonith:external/ssh): Started srv01
>  Resource Group: grpStonith2
>     prmStonith2-3      (stonith:external/ssh): Started srv02
>  Resource Group: grpStonith3
>     prmStonith3-3      (stonith:external/ssh): Started srv03
>  Resource Group: grpStonith4
>     prmStonith4-3      (stonith:external/ssh): Started srv04
>
> Step 3) After the cluster is stable, we stop the first node.
>
> [root at srv02 ~]# crm_mon -1
> ============
> Last updated: Fri Sep 10 14:17:07 2010
> Stack: Heartbeat
> Current DC: srv04 (96faf899-13a6-4550-9d3b-b784f7241d06) - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 4 Nodes configured, unknown expected votes
> 7 Resources configured.
> ============
>
> Online: [ srv02 srv03 srv04 ]
> OFFLINE: [ srv01 ]
>
>  Resource Group: Group01
>     Dummy01    (ocf::heartbeat:Dummy): Started srv04 ---->FO
>     Dummy01-2  (ocf::heartbeat:Dummy): Started srv04 ---->FO
>  Resource Group: Group02
>     Dummy02    (ocf::heartbeat:Dummy): Started srv02
>     Dummy02-2  (ocf::heartbeat:Dummy): Started srv02
>  Resource Group: Group03
>     Dummy03    (ocf::heartbeat:Dummy): Started srv03
>     Dummy03-2  (ocf::heartbeat:Dummy): Started srv03
>  Resource Group: grpStonith1
>     prmStonith1-3      (stonith:external/ssh): Started srv03
>  Resource Group: grpStonith2
>     prmStonith2-3      (stonith:external/ssh): Started srv02
>  Resource Group: grpStonith3
>     prmStonith3-3      (stonith:external/ssh): Started srv03
>  Resource Group: grpStonith4
>     prmStonith4-3      (stonith:external/ssh): Started srv04
>
>
> Step 4) After the cluster is stable again, we stop the next node.
>  * Because the notification from ccm that quorum has been lost arrives late, the two remaining nodes move the resource.

That's not strictly true.
The movement is initiated before the second node shuts down, so it is
considered safe because we still had quorum at the point the decision
was made.
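
You can see this by checking the have-quorum flag in the CIB on the DC
at the time the transition is computed; a quick check (output abridged,
the values will differ in your cluster) looks something like:

  # cibadmin -Q | grep have-quorum
  <cib ... have-quorum="1" dc-uuid="..." ...>

As long as it still reads 1 when the pengine runs, moving the resources
away from the node that is leaving is an ordinary, quorate decision.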

>
> [root at srv03 ~]# crm_mon -1
> ============
> Last updated: Fri Sep 10 14:17:59 2010
> Stack: Heartbeat
> Current DC: srv04 (96faf899-13a6-4550-9d3b-b784f7241d06) - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 4 Nodes configured, unknown expected votes
> 7 Resources configured.
> ============
>
> Online: [ srv03 srv04 ]
> OFFLINE: [ srv01 srv02 ]
>
>  Resource Group: Group01
>     Dummy01    (ocf::heartbeat:Dummy): Started srv04
>     Dummy01-2  (ocf::heartbeat:Dummy): Started srv04
>  Resource Group: Group02
>     Dummy02    (ocf::heartbeat:Dummy): Started srv04 ---->FO
>     Dummy02-2  (ocf::heartbeat:Dummy): Started srv04 ---->FO
>  Resource Group: Group03
>     Dummy03    (ocf::heartbeat:Dummy): Started srv03
>     Dummy03-2  (ocf::heartbeat:Dummy): Started srv03
>  Resource Group: grpStonith1
>     prmStonith1-3      (stonith:external/ssh): Started srv03
>  Resource Group: grpStonith2
>     prmStonith2-3      (stonith:external/ssh): Started srv04
>  Resource Group: grpStonith3
>     prmStonith3-3      (stonith:external/ssh): Started srv03
>  Resource Group: grpStonith4
>     prmStonith4-3      (stonith:external/ssh): Started srv04
>
> Step 5) After the cluster is stable again, we stop one more node.
>  * After we stopped it, the cib changed to have-quorum=0.
>
> [root at srv03 ~]# cibadmin -Q | more
> <cib epoch="102" num_updates="3" admin_epoch="0" validate-with="pacemaker-1.0" crm_feature_set="3.0.1"
> have-quorum="0" dc-uuid="96faf899-13a6-4550-9d3b-b784f7241d06">
>
> Step 6) Some resources moved to the last remaining node.
>
> [root at srv04 ~]# crm_mon -1
> ============
> Last updated: Fri Sep 10 14:19:43 2010
> Stack: Heartbeat
> Current DC: srv04 (96faf899-13a6-4550-9d3b-b784f7241d06) - partition WITHOUT quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 4 Nodes configured, unknown expected votes
> 7 Resources configured.
> ============
>
> Online: [ srv04 ]
> OFFLINE: [ srv01 srv02 srv03 ]
>
>  Resource Group: Group01
>     Dummy01    (ocf::heartbeat:Dummy): Started srv04
>     Dummy01-2  (ocf::heartbeat:Dummy): Started srv04
>  Resource Group: Group02
>     Dummy02    (ocf::heartbeat:Dummy): Started srv04
>     Dummy02-2  (ocf::heartbeat:Dummy): Started srv04
>  Resource Group: Group03
>     Dummy03    (ocf::heartbeat:Dummy): Started srv04 ---->Why FO?
>     Dummy03-2  (ocf::heartbeat:Dummy): Started srv04 ---->Why FO?

In this case, it is because a member of our partition owned the
resource at the time we initiated the move.

Unfortunately the scenario here isn't quite testing what you had in mind.
You only achieve the expected behavior if you remove the second and
third machines from the cluster _ungracefully_.
I.e., by fencing them or unplugging them.
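
For example (just one way to simulate an ungraceful failure; this
assumes Heartbeat's default udpport 694 in ha.cf, adjust for your
environment), on the node to be removed you could do:

  # killall -9 heartbeat      # hard-kill the cluster stack, no graceful leave

or cut it off from the other nodes instead:

  # iptables -A INPUT  -p udp --dport 694 -j DROP
  # iptables -A OUTPUT -p udp --dport 694 -j DROP

The remaining nodes then see the node fail rather than leave cleanly,
which is the situation no-quorum-policy is really about.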

>  Resource Group: grpStonith1
>     prmStonith1-3      (stonith:external/ssh): Started srv04
>  Resource Group: grpStonith2
>     prmStonith2-3      (stonith:external/ssh): Started srv04
>  Resource Group: grpStonith4
>     prmStonith4-3      (stonith:external/ssh): Started srv04
>
>
> We expected that the resources left on the node we stopped in Step 5 would not be moved to the last node,
> because no-quorum-policy=freeze is specified.

Freeze still allows recovery within a partition.
Recovery can also occur for graceful shutdowns because the partition
owned the resource beforehand.
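
For reference, the policy itself is just a cluster property; with the
crm shell from a 1.0 install something like this sets it (a sketch, not
taken from your configuration):

  # crm configure property no-quorum-policy=freeze

With freeze, a partition that has lost quorum will not start resources
it did not already own, but it can still recover resources that were
already running inside the partition, which is what you are seeing here.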

> However, looking at the source code, it seems that a resource which is already started can still be moved
> under no-quorum-policy=freeze.
>
> (snip)
> action_t *
> custom_action(resource_t *rsc, char *key, const char *task,
>              node_t *on_node, gboolean optional, gboolean save_action,
>              pe_working_set_t *data_set)
> {
>        action_t *action = NULL;
>        GListPtr possible_matches = NULL;
>        CRM_CHECK(key != NULL, return NULL);
>        CRM_CHECK(task != NULL, return NULL);
> (snip)
>                } else if(is_set(data_set->flags, pe_flag_have_quorum) == FALSE
>                        && data_set->no_quorum_policy == no_quorum_freeze) {
>                        crm_debug_3("Check resource is already active");
>                        if(rsc->fns->active(rsc, TRUE) == FALSE) {
>                                action->runnable = FALSE;
>                                crm_debug("%s\t%s (cancelled : quorum freeze)",
>                                          action->node->details->uname,
>                                          action->uuid);
>                        }
>
>                } else {
>
> (snip)
>
> Is this the correct, specified behavior of no-quorum-policy=freeze?
> Is there a detailed explanation of the no-quorum-policy=freeze behavior somewhere?
>
> Best Regards,
> Hideo Yamauchi.
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>



