[Pacemaker] DRBD promotion timeout after pacemaker stop on other node

Mon Nov 11 19:15:19 EST 2013

Can you try with these two patches please?

+ Andrew Beekhof (4 seconds ago) fec946a: Fix: crmd: When the DC gracefully shuts down, record the new expected state into the cib  (HEAD, master)
+ Andrew Beekhof (10 seconds ago) 740122a: Fix: crmd: When a peer expectedly shuts down, record the new join and expected states into the cib 

On 12 Nov 2013, at 11:05 am, Andrew Beekhof <andrew at beekhof.net> wrote:

> 
> On 12 Nov 2013, at 10:29 am, Andrew Beekhof <andrew at beekhof.net> wrote:
> 
>> 
>> On 12 Nov 2013, at 2:46 am, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>> 
>>> 11.11.2013 09:00, Vladislav Bogdanov wrote:
>>> ...
>>>>>>> Looking at the crm-fence-peer.sh script, it determines the peer state as
>>>>>>> offline immediately if the node state meets all of the following:
>>>>>>> * doesn't contain "expected" tag or has it set to "down"
>>>>>>> * has "in_ccm" tag set to false
>>>>>>> * has "crmd" tag set to anything except "online"
>>>>>>> 
>>>>>>> On the other hand, crmd sets "expected" = "down" only after fencing is
>>>>>>> complete (probably the same for "in_ccm"?). Shouldn't it do the same (or
>>>>>>> maybe just remove that tag) when a clean shutdown is about to complete?
>>>>>> 
>>>>>> That would make sense.  Are you using the plugin, cman or corosync 2?
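
For readers following the thread, the offline test described above for
crm-fence-peer.sh can be sketched in C as a single predicate. This is only an
illustration of the three quoted conditions, not the script's actual code:
struct node_state_attrs and peer_considered_offline() are hypothetical names,
and treating a missing "in_ccm" or "crmd" attribute as "condition not met" is
an assumption the thread does not settle.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical holder for the three node_state attributes named in the
 * thread. A NULL pointer means the attribute is absent from the CIB entry. */
struct node_state_attrs {
    const char *expected;  /* e.g. "down", or NULL when the tag is missing */
    const char *in_ccm;    /* "true" or "false" */
    const char *crmd;      /* "online" or anything else */
};

/* Returns true only when ALL three quoted conditions hold. */
bool peer_considered_offline(const struct node_state_attrs *ns)
{
    /* Condition 1: no "expected" tag at all, or it is set to "down". */
    bool expected_down =
        ns->expected == NULL || strcmp(ns->expected, "down") == 0;

    /* Condition 2: "in_ccm" is present and set to "false".
     * Assumption: a missing attribute does not satisfy the condition. */
    bool out_of_membership =
        ns->in_ccm != NULL && strcmp(ns->in_ccm, "false") == 0;

    /* Condition 3: "crmd" is present and is anything except "online".
     * Assumption: a missing attribute does not satisfy the condition. */
    bool crmd_not_online =
        ns->crmd != NULL && strcmp(ns->crmd, "online") != 0;

    return expected_down && out_of_membership && crmd_not_online;
}
```

Under this reading, a cleanly stopped peer whose "expected" attribute has not
yet been set to "down" would not count as offline immediately, which is
consistent with the behaviour discussed in this thread.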
>>>> 
>>> 
>>> This one works in all tests I was able to imagine, but I'm not sure it is
>>> completely safe to set expected="down" for the old DC (in the test where DRBD is promoted on the DC and it reboots).
>>> 
>>> From ddfccc8a40cfece5c29d61f44a4467954d5c5da8 Mon Sep 17 00:00:00 2001
>>> From: Vladislav Bogdanov <bubble at hoster-ok.com>
>>> Date: Mon, 11 Nov 2013 14:32:48 +0000
>>> Subject: [PATCH] Update node values in cib on clean shutdown
>>> 
>>> ---
>>> crmd/callbacks.c  |    6 +++++-
>>> crmd/membership.c |    2 +-
>>> 2 files changed, 6 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/crmd/callbacks.c b/crmd/callbacks.c
>>> index 3dae17b..9cfb973 100644
>>> --- a/crmd/callbacks.c
>>> +++ b/crmd/callbacks.c
>>> @@ -162,6 +162,8 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
>>>           } else if (safe_str_eq(node->uname, fsa_our_dc) && crm_is_peer_active(node) == FALSE) {
>>>               /* Did the DC leave us? */
>>>               crm_notice("Our peer on the DC (%s) is dead", fsa_our_dc);
>>> +                /* FIXME: is it safe? */
>> 
>> Not at all safe.  It will prevent fencing.
>> 
>>> +                crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
>>>               register_fsa_input(C_CRMD_STATUS_CALLBACK, I_ELECTION, NULL);
>>>           }
>>>           break;
>>> @@ -169,6 +171,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
>>> 
>>>   if (AM_I_DC) {
>>>       xmlNode *update = NULL;
>>> +        int flags = node_update_peer;
>>>       gboolean alive = crm_is_peer_active(node);
>>>       crm_action_t *down = match_down_event(0, node->uuid, NULL, appeared);
>>> 
>>> @@ -199,6 +202,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
>>> 
>>>               crm_update_peer_join(__FUNCTION__, node, crm_join_none);
>>>               crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
>>> +                flags |= node_update_cluster | node_update_join | node_update_expected;
>> 
>> This does look ok though
> 
> With the exception of 'node_update_cluster'.  
> That didn't change here and shouldn't be touched until it really does leave the membership.
> 
>> 
>>>               check_join_state(fsa_state, __FUNCTION__);
>>> 
>>>               update_graph(transition_graph, down);
>>> @@ -221,7 +225,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
>>>           crm_trace("Other %p", down);
>>>       }
>>> 
>>> -        update = do_update_node_cib(node, node_update_peer, NULL, __FUNCTION__);
>>> +        update = do_update_node_cib(node, flags, NULL, __FUNCTION__);
>>>       fsa_cib_anon_update(XML_CIB_TAG_STATUS, update,
>>>                           cib_scope_local | cib_quorum_override | cib_can_create);
>>>       free_xml(update);
>>> diff --git a/crmd/membership.c b/crmd/membership.c
>>> index be1863a..d68b3aa 100644
>>> --- a/crmd/membership.c
>>> +++ b/crmd/membership.c
>>> @@ -152,7 +152,7 @@ do_update_node_cib(crm_node_t * node, int flags, xmlNode * parent, const char *s
>>>   crm_xml_add(node_state, XML_ATTR_UNAME, node->uname);
>>> 
>>>   if (flags & node_update_cluster) {
>>> -        if (safe_str_eq(node->state, CRM_NODE_ACTIVE)) {
>>> +        if (crm_is_peer_active(node)) {
>> 
>> This is also wrong.  XML_NODE_IN_CLUSTER is purely a record of whether the node is in the current corosync/cman/heartbeat membership.
>> 
>>>           value = XML_BOOLEAN_YES;
>>>       } else if (node->state) {
>>>           value = XML_BOOLEAN_NO;
>>> -- 
>>> 1.7.1
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
