[Pacemaker] lrm monitor failure status lost during DC election

Andrew Beekhof andrew at beekhof.net
Tue Apr 30 20:45:20 EDT 2013


On 19/04/2013, at 6:36 AM, David Adair <david_adair at xyratex.com> wrote:

> Hello.
> 
> I have an issue with pacemaker 1.1.6.1 but believe this may still be
> present in the latest git versions, and would like to know if the fix
> makes sense.
> 
> 
> What I see is the following:
> Setup:
> - 2 node cluster
> - ocf:heartbeat:Dummy resource on the non-DC node.
> - Force DC reboot or stonith and fail resource while there is no DC.
> 
> Result:
> - node with failed monitor becomes DC (good)
> 
> - lrmd reports the resource as failed during every monitor interval, but
> since these failures are not rc status changes they are not sent to crmd.
> (good -- it is failing, but ...)
> 
> - crm_mon / cibadmin --query report the resource as running OK. (not good)
> 
> 
> The resource has failed but is never restarted.  I believe the failing
> resource and any group it belongs to should be recovered during/after
> the DC election.
> 
> I think  this is due to the operation of build_active_RAs on the surviving node:
> 
>         build_operation_update(xml_rsc, &(entry->rsc), entry->last, __FUNCTION__);
>         build_operation_update(xml_rsc, &(entry->rsc), entry->failed, __FUNCTION__);
>         for (gIter = entry->recurring_op_list; gIter != NULL; gIter = gIter->next) {
>             build_operation_update(xml_rsc, &(entry->rsc), gIter->data, __FUNCTION__);
>         }
> 
> What this produces is:
> 
>   last             failed               list[0]              list[1]
>   start_0: rc=0    monitor_1000: rc=7   monitor_1000: rc=7   monitor_1000: rc=0

list[] should only have one element, as both entries are for monitor_1000.

I have a vague recollection of an old bug in this area and strongly suspect that something more recent won't have the same problem.
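
For what it's worth, the ordering question really does come down to where
new entries land in the GList.  Here is a toy GLib example (illustrative
only, not Pacemaker code) showing why the entry iterated last is the one
that effectively wins when the status section is rebuilt:

    #include <glib.h>
    #include <stdio.h>

    int main(void)
    {
        GList *prepended = NULL, *appended = NULL, *iter = NULL;
        /* Results arrive oldest to newest: the initial rc=0, then two failures */
        gpointer ops[] = { "monitor_1000 rc=0", "monitor_1000 rc=7", "monitor_1000 rc=7" };

        for (int i = 0; i < 3; i++) {
            prepended = g_list_prepend(prepended, ops[i]);
            appended = g_list_append(appended, ops[i]);
        }

        /* With prepend, the newest result is visited first and the stale
         * initial rc=0 is visited -- and therefore written out -- last */
        for (iter = prepended; iter != NULL; iter = iter->next)
            printf("prepend: %s\n", (const char *) iter->data);

        /* With append, the newest result is the last one written, so it wins */
        for (iter = appended; iter != NULL; iter = iter->next)
            printf("append:  %s\n", (const char *) iter->data);

        g_list_free(prepended);
        g_list_free(appended);
        return 0;
    }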

> 
> The final result in the CIB appears to be the last entry, which is from
> the initial transition of the monitor from rc=-1 to rc=0.
> 
> To fix this I swapped the order of recurring_op_list so that the last
> transition is at the end of the list rather than the beginning.  With this
> change I see what I believe is the desired behavior -- the resource is
> stopped and restarted when the DC election is finalized.
> 
> The memcpy is a backport of a corresponding change in lrmd_copy_event
> to simplify debugging by maintaining the rcchanged time.
> 
> ---------------------
> This patch swaps the order of recurring operations (monitors) in the
> lrm history cache.  By placing the most recent change at the end of the
> list it is properly detected by pengine after a DC election.
> 
> With the new events placed at the start of the list, the last thing
> in the list is the initial startup with rc=0.  This makes pengine
> believe the resource is working properly even though lrmd is reporting
> constant failure.
> 
> It is fairly easy to get into this situation when a shared resource
> (storage enclosure) fails and causes the DC to be stonithed.
> 
> diff --git a/crmd/lrm.c b/crmd/lrm.c
> index 187db76..f8974f6 100644
> --- a/crmd/lrm.c
> +++ b/crmd/lrm.c
> @@ -217,7 +217,7 @@ update_history_cache(lrm_rsc_t * rsc, lrm_op_t * op)
> 
>     if (op->interval > 0) {
>         crm_trace("Adding recurring op: %s_%s_%d", op->rsc_id, op->op_type, op->interval);
> -        entry->recurring_op_list = g_list_prepend(entry->recurring_op_list, copy_lrm_op(op));
> +        entry->recurring_op_list = g_list_append(entry->recurring_op_list, copy_lrm_op(op));
> 
>     } else if (entry->recurring_op_list && safe_str_eq(op->op_type, RSC_STATUS) == FALSE) {
>         GList *gIter = entry->recurring_op_list;
> @@ -1756,6 +1756,9 @@ copy_lrm_op(const lrm_op_t * op)
> 
>     crm_malloc0(op_copy, sizeof(lrm_op_t));
> 
> +       /* Copy all int values, pointers fixed below */
> +       memcpy(op_copy, op, sizeof(lrm_op_t));
> +
>     op_copy->op_type = crm_strdup(op->op_type);
>     /* input fields */
>     op_copy->params = g_hash_table_new_full(crm_str_hash, g_str_equal,
> 
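
The memcpy hunk is the usual shallow-copy-then-fix-the-pointers pattern.
A rough sketch of that pattern (the struct and field names below are made
up for illustration, this is not the real lrm_op_t):

    #include <stdlib.h>
    #include <string.h>

    struct op_record {
        int   interval;
        int   rc;
        long  rcchanged;   /* scalar fields like this survive the memcpy as-is */
        char *op_type;     /* pointer fields must be re-duplicated afterwards */
    };

    static struct op_record *copy_op_record(const struct op_record *op)
    {
        struct op_record *copy = calloc(1, sizeof(*copy));

        if (copy == NULL) {
            return NULL;
        }

        /* Copy every scalar member in one shot; pointers are fixed up below */
        memcpy(copy, op, sizeof(*copy));

        /* Duplicate pointer members so the copy does not alias the original */
        copy->op_type = op->op_type ? strdup(op->op_type) : NULL;

        return copy;
    }
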
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




