[ClusterLabs] Regular pengine warnings after a transient failure

Mon Mar 7 22:49:37 UTC 2016

On Tue, Mar 8, 2016 at 7:03 AM, Ferenc Wágner <wferi at niif.hu> wrote:

> Ken Gaillot <kgaillot at redhat.com> writes:
>
> > On 03/07/2016 07:31 AM, Ferenc Wágner wrote:
> >
> >> 12:55:13 vhbl07 crmd[8484]: notice: Transition aborted by
> vm-eiffel_monitor_60000 'create' on vhbl05: Foreign event
> (magic=0:0;521:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.98,
> source=process_graph_event:600, 0)
> >
> > That means the action was initiated by a different node (the previous DC
> > presumably),

I suspect s/previous/other/

With a stuck machine its entirely possible that the other nodes elected a
new leader.
Would I be right in guessing that fencing is disabled?

> so the new DC wants to recalculate everything.
>
> Time travel was sort of possible in that situation, and recurring
> monitor operations are not logged, so this is indeed possible.  The main
> thing is that it wasn't mishandled.
>
> >> recovery actions turned into start actions for the resources stopped
> >> during the previous transition.  However, almost all other recovery
> >> actions just disappeared without any comment.  This was actually
> >> correct, but I really wonder why the cluster decided to paper over
> >> the previous monitor operation timeouts.  Maybe the operations
> >> finished meanwhile and got accounted somehow, just not logged?
> >
> > I'm not sure why the PE decided recovery was not necessary. Operation
> > results wouldn't be accepted without being logged.
>
> At which logging level?  I can't see recurring monitor operation logs in
> syslog (at default logging level: notice) nor in /var/log/pacemaker.log
> (which contains info level messages as well).
>

The DC will log that the recurring monitor was successfully started, but
due to noise it doesn't log subsequent successes.

>
> However, the info level logs contain more "Transition aborted" lines, as
> if only the first of them got logged with notice level.  This would make
> sense, since the later ones don't make any difference on an already
> aborted transition, so they aren't that important.  And in fact such
> lines were suppressed from the syslog I checked first, for example:
>
> 12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:     Diff: ---
> 0.613.120 2
> 12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:     Diff: +++
> 0.613.121 (null)
> 12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:     +  /cib:
> @num_updates=121
> 12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:     ++
> /cib/status/node_state[@id='167773707']/lrm[@id='167773707']/lrm_resources/lrm_resource[@id='vm-elm']:
> <lrm_rsc_op id="vm-elm_monitor_60000" operation_key="vm-elm_monitor_60000"
> operation="monitor" crm-debug-origin="do_update_resource"
> crm_feature_set="3.0.10"
> transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7"
> transition-magic="0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7"
> on_node="vhbl05" call-id="645" rc-code="0" op-st
> 12:55:39 [8479] vhbl07        cib:     info: cib_process_request:
> Completed cib_modify operation for section status: OK (rc=0,
> origin=vhbl05/crmd/362, version=0.613.121)
> 12:55:39 [8484] vhbl07       crmd:     info: abort_transition_graph:
>  Transition aborted by vm-elm_monitor_60000 'create' on vhbl05: Foreign
> event (magic=0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7,
> cib=0.613.121, source=process_graph_event:600, 0)
> 12:55:39 [8484] vhbl07       crmd:     info: process_graph_event:
> Detected action (0.473) vm-elm_monitor_60000.645=ok: initiated by a
> different node
>
> I can very much imagine this cancelling the FAILED state induced by a
> monitor timeout like:
>
> 12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:     ++
>                                        <lrm_resource id="vm-elm"
> type="TransientDomain" class="ocf" provider="niif">
> 12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:     ++
>                                          <lrm_rsc_op
> id="vm-elm_last_failure_0" operation_key="vm-elm_monitor_60000"
> operation="monitor" crm-debug-origin="build_active_RAs"
> crm_feature_set="3.0.10"
> transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7"
> transition-magic="2:1;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7"
> on_node="vhbl05" call-id="645" rc-code="1" op-status="2" interval="60000"
> last-rc-change="1456833279" exe
> 12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:     ++
>                                          <lrm_rsc_op id="vm-elm_last_0"
> operation_key="vm-elm_start_0" operation="start"
> crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10"
> transition-key="472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7"
> transition-magic="0:0;472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7"
> on_node="vhbl05" call-id="602" rc-code="0" op-status="0" interval="0"
> last-run="1456091121" last-rc-change="1456091121" e
> 12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:     ++
>                                        </lrm_resource>
>
> The transition-keys match, does this mean that the above is a late
> result from the monitor operation which was considered timed-out
> previously?  How did it reach vhbl07, if the DC at that time was vhbl03?
>

Everything goes into the cib (replicated datastore) and the DC(s) get
notified.

>
> > The pe-input files from the transitions around here should help.
>
> They are available.  What shall I look for?
>
> >> Basically, the cluster responded beyond my expectations, sparing lots of
> >> unnecessary recoveries or fencing.  I'm happy, thanks for this wonderful
> >> software!  But I'm left with these "Processing failed op monitor"
> >> warnings emitted every 15 minutes (timer pops).  Is it safe and clever
> >> to cleanup the affected resources?  Would that get rid of them without
> >> invoking other effects, like recoveries for example?
> >
> > That's normal; it's how the cluster maintains the effect of a failure
> > that has not been cleared. The logs can be confusing, because it's not
> > apparent from that message alone whether the failure is new or old.
>
> Ah, do you mean that these are the same thing that appears after "Failed
> Actions:" at the end of the crm_mon output?  They certainly match, and
> the logs are confusing indeed.
>
> > Cleaning up the resource will end the failure condition, so what happens
> > next depends on the configuration and state of the cluster. If the
> > failure was preventing a preferred node from running the resource, the
> > resource could move, depending on other factors such as stickiness.
>
> These resources are (still) running fine, suffered only monitor failures
> and are node-neutral, so it should be safe to cleanup them, I suppose.
> --
> Thanks for your quick and enlightening answer!  I feared the mere length
> of my message would scare everybody away...
> Regards,
> Feri
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20160308/539a847e/attachment.htm>