[Pacemaker] lrm monitor failure status lost during DC election

David Adair david_adair at xyratex.com
Thu Apr 18 16:36:57 EDT 2013


Hello.

I have an issue with pacemaker 1.1.6.1, but I believe it may still be present
in the latest git versions, and I would like to know whether the fix below
makes sense.


What I see is the following:
Setup:
- 2 node cluster
- ocf:heartbeat:Dummy resource on the non-DC node.
- Force a DC reboot or stonith, and fail the resource while there is no DC.

Result:
- node with failed monitor becomes DC (good)

- lrmd reports the resource as failed on every monitor interval, but since
  these failures are not rc changes they are not sent to crmd.
  (good -- it is failing, but ...)

- crm_mon / cibadmin --query report the resource as running OK. (not good)


The resource has failed but is never restarted.  I believe the failing
resource, and any group it belongs to, should be recovered during/after
the DC election.

I think this is due to the operation of build_active_RAs on the surviving node:

        build_operation_update(xml_rsc, &(entry->rsc), entry->last, __FUNCTION__);
        build_operation_update(xml_rsc, &(entry->rsc), entry->failed, __FUNCTION__);
        for (gIter = entry->recurring_op_list; gIter != NULL; gIter = gIter->next) {
            build_operation_update(xml_rsc, &(entry->rsc), gIter->data, __FUNCTION__);
        }

What this produces is:

    last             failed               list[0]              list[1]
    start_0: rc=0    monitor_1000: rc=7   monitor_1000: rc=7   monitor_1000: rc=0

The final result in the cib appears to be the last entry written, which is
the one from the initial transition of the monitor from rc=-1 to rc=0.
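
To make the ordering effect concrete, here is a small stand-alone GLib sketch
(made-up operation strings and file name, not the actual Pacemaker code path)
that mimics the head-to-tail walk above:

    /* Stand-alone sketch, not Pacemaker code: with g_list_prepend() the newest
     * op sits at the head, so a head-to-tail walk emits the oldest op last --
     * and the last entry written is what ends up in the CIB update.
     * Build with: gcc demo.c $(pkg-config --cflags --libs glib-2.0)
     */
    #include <glib.h>
    #include <stdio.h>

    int main(void)
    {
        GList *ops = NULL;

        /* History as it accumulates over time. */
        ops = g_list_prepend(ops, (gpointer) "monitor_1000: rc=0"); /* initial, oldest */
        ops = g_list_prepend(ops, (gpointer) "monitor_1000: rc=7"); /* failure, newest */

        /* build_active_RAs-style walk: head to tail. */
        for (GList *iter = ops; iter != NULL; iter = iter->next) {
            printf("update: %s\n", (const char *) iter->data);
        }
        /* Prints rc=7 first and rc=0 last, so the stale rc=0 wins.
         * With g_list_append() the failure (rc=7) is the last one written. */

        g_list_free(ops);
        return 0;
    }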

To fix this I swapped the order of recurring_op_list so that the last transition
is at the end of the list rather than the beginning.  With this change I see
what I believe is the desired behavior -- the resource is stopped and
re-started when the DC election is finalized.

The memcpy is a backport of a corresponding change in lrmd_copy_event
to simplify debugging by maintaining the rcchanged time.
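
For illustration, the idiom it follows is an ordinary shallow-copy-then-fix-the-pointers
copy; a minimal sketch with a made-up struct (not the real lrm_op_t layout)
looks like this:

    /* Sketch only, with a made-up struct: memcpy() carries over every scalar
     * field (rc, interval, timestamps such as the rc-changed time) in one shot,
     * and the pointer members are then re-duplicated so the copy owns them. */
    #include <stdlib.h>
    #include <string.h>

    struct fake_op {
        int rc;
        int interval;
        long t_rc_change;   /* stand-in for the "rcchanged" timestamp */
        char *op_type;      /* pointer member: must be deep-copied    */
    };

    static struct fake_op *copy_fake_op(const struct fake_op *op)
    {
        struct fake_op *copy = calloc(1, sizeof(*copy));

        if (copy == NULL) {
            return NULL;
        }
        memcpy(copy, op, sizeof(*copy));   /* all scalars; pointers fixed below */
        copy->op_type = op->op_type ? strdup(op->op_type) : NULL;
        return copy;
    }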

---------------------
This patch swaps the order of recurring operations (monitors) in the
lrm history cache.  By placing the most recent change at the end of the
list, it is properly detected by pengine after a DC election.

With new events placed at the start of the list, the last thing
in the list is the initial startup with rc=0.  This makes pengine
believe the resource is working properly even though lrmd is reporting
constant failure.

It is fairly easy to get into this situation when a shared resource
(storage enclosure) fails and causes the DC to be stonithed.

diff --git a/crmd/lrm.c b/crmd/lrm.c
index 187db76..f8974f6 100644
--- a/crmd/lrm.c
+++ b/crmd/lrm.c
@@ -217,7 +217,7 @@ update_history_cache(lrm_rsc_t * rsc, lrm_op_t * op)

     if (op->interval > 0) {
         crm_trace("Adding recurring op: %s_%s_%d", op->rsc_id,
op->op_type, op->interval);
-        entry->recurring_op_list =
g_list_prepend(entry->recurring_op_list, copy_lrm_op(op));
+        entry->recurring_op_list =
g_list_append(entry->recurring_op_list, copy_lrm_op(op));

     } else if (entry->recurring_op_list && safe_str_eq(op->op_type, RSC_STATUS) == FALSE) {
         GList *gIter = entry->recurring_op_list;
@@ -1756,6 +1756,9 @@ copy_lrm_op(const lrm_op_t * op)

     crm_malloc0(op_copy, sizeof(lrm_op_t));

+    /* Copy all int values, pointers fixed below */
+    memcpy(op_copy, op, sizeof(lrm_op_t));
+
     op_copy->op_type = crm_strdup(op->op_type);
     /* input fields */
     op_copy->params = g_hash_table_new_full(crm_str_hash, g_str_equal,



