[Pacemaker] meta failure-timeout: crashed resource is assumed to be Started?
Andrei Borzenkov
arvidjaar at gmail.com
Thu Oct 23 15:59:21 UTC 2014
On Thu, 23 Oct 2014 13:46:00 +0200, Carsten Otto <carsten.otto at andrena.de> wrote:
> Dear all,
>
> I have not received any response so far. Could you please find the
> time to tell me how the "meta failure-timeout" is supposed to work in
> combination with monitor operations?
>
If you attach unedited logs from the point of the FIRST failure, as
well as your configuration, you will probably have a better chance of
getting an answer. The failure-timeout should have no relation to the
monitor operation; most likely the monitor actually reports that FIRST
is running even when it is not.
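As a quick check, the agent's monitor action can be invoked by hand and
its exit code inspected. A minimal sketch, assuming the agent is
ocf:heartbeat:xxx as in the log excerpt below; the OCF_RESKEY_*
parameter is a placeholder for whatever FIRST is actually configured with:

    # point the agent at the standard OCF tree and supply its parameters
    export OCF_ROOT=/usr/lib/ocf
    export OCF_RESKEY_some_param=some_value   # placeholder parameter
    # run the monitor action directly and print the OCF exit code
    /usr/lib/ocf/resource.d/heartbeat/xxx monitor
    echo $?   # 0 = OCF_SUCCESS (running), 7 = OCF_NOT_RUNNING

If that returns 0 while the service is down, the agent, not the
failure-timeout, is the problem.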
> Thanks,
> Carsten
>
> On Thu, Oct 16, 2014 at 05:06:41PM +0200, Carsten Otto wrote:
> > Dear all,
> >
> > I configured meta failure-timeout=60sec on all of my resources. For the
> > sake of simplicity, assume I have a group of two resources FIRST and
> > SECOND (where SECOND is started after FIRST, surprise!).
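For reference, a configuration along these lines would reproduce the
setup described above. A sketch in crm shell syntax; the agents and the
monitor interval are placeholders, only the failure-timeout meta
attribute is essential:

    primitive FIRST ocf:heartbeat:xxx \
        op monitor interval=10s \
        meta failure-timeout=60s
    primitive SECOND ocf:heartbeat:yyy \
        op monitor interval=10s \
        meta failure-timeout=60s
    # a group starts members in order and stops them in reverse
    group GROUP FIRST SECOND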
> >
> > If FIRST now crashes, I see a failure, as expected. I also see that
> > SECOND is stopped, as expected.
> >
> > Sadly, SECOND needs more than 60 seconds to stop. Thus, it can happen
> > that the "failure-timeout" for FIRST is reached and its failure record
> > is cleared. This is also expected.
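Worth noting: an expired failure-timeout is only acted upon when the
policy engine runs, which happens on cluster events or at the latest
after each cluster-recheck-interval (15 minutes by default), so the
expiry is not processed at exactly 60 seconds. If finer granularity is
needed, the interval can be lowered, e.g. in crm shell syntax:

    property cluster-recheck-interval=2min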
> >
> > The problem now is that after the 60sec timeout pacemaker assumes that
> > FIRST is in the Started state. Nothing in the log files indicates a
> > start, and the last monitor operation, which ran just a few seconds
> > earlier, also reported that FIRST is actually not running.
> >
> > As a consequence of the bug, pacemaker tries to restart SECOND on the
> > same system, where it fails to start (as it depends on FIRST, which is
> > actually not running). Only then are the resources started on the
> > other system.
> >
> > So, my question is:
> > Why does pacemaker assume that a previously failed resource is "Started"
> > when the "meta failure-timeout" is triggered? Why is the monitor
> > operation not invoked to determine the correct state?
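Rather than waiting for the failure-timeout, the cluster can also be
told explicitly to re-detect the actual state. A sketch using standard
Pacemaker tools; clearing the operation history triggers a fresh probe,
so the next transition is based on a current monitor result:

    # clear FIRST's failure and operation history, forcing a re-probe
    crm_resource --cleanup --resource FIRST
    # or re-check all resources on this node
    crm_resource --reprobe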
> >
> > The corresponding lines of the log file, about a minute after FIRST
> > crashed and the stop operation for SECOND was triggered:
> >
> > Oct 16 16:27:20 [2100] HOSTNAME [...] (monitor operation indicating that FIRST is not running)
> > [...]
> > Oct 16 16:27:23 [2104] HOSTNAME lrmd: info: log_finished: finished - rsc:SECOND action:stop call_id:123 pid:29314 exit-code:0 exec-time:62827ms queue-time:0ms
> > Oct 16 16:27:23 [2107] HOSTNAME crmd: notice: process_lrm_event: LRM operation SECOND_stop_0 (call=123, rc=0, cib-update=225, confirmed=true) ok
> > Oct 16 16:27:23 [2107] HOSTNAME crmd: info: match_graph_event: Action SECOND_stop_0 (74) confirmed on HOSTNAME (rc=0)
> > Oct 16 16:27:23 [2107] HOSTNAME crmd: notice: run_graph: Transition 40 (Complete=5, Pending=0, Fired=0, Skipped=31, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-2937.bz2): Stopped
> > Oct 16 16:27:23 [2107] HOSTNAME crmd: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
> > Oct 16 16:27:23 [2100] HOSTNAME cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/225, version=0.1450.89)
> > Oct 16 16:27:23 [2100] HOSTNAME cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crmd/226, version=0.1450.89)
> > Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
> > Oct 16 16:27:23 [2106] HOSTNAME pengine: info: determine_online_status_fencing: Node HOSTNAME is active
> > Oct 16 16:27:23 [2106] HOSTNAME pengine: info: determine_online_status: Node HOSTNAME is online
> > [...]
> > Oct 16 16:27:23 [2106] HOSTNAME pengine: info: get_failcount_full: FIRST has failed 1 times on HOSTNAME
> > Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Clearing expired failcount for FIRST on HOSTNAME
> > Oct 16 16:27:23 [2106] HOSTNAME pengine: info: get_failcount_full: FIRST has failed 1 times on HOSTNAME
> > Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Clearing expired failcount for FIRST on HOSTNAME
> > Oct 16 16:27:23 [2106] HOSTNAME pengine: info: get_failcount_full: FIRST has failed 1 times on HOSTNAME
> > Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Clearing expired failcount for FIRST on HOSTNAME
> > Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Re-initiated expired calculated failure FIRST_last_failure_0 (rc=7, magic=0:7;68:31:0:28c68203-6990-48fd-96cc-09f86e2b21f9) on HOSTNAME
> > [...]
> > Oct 16 16:27:23 [2106] HOSTNAME pengine: info: group_print: Resource Group: GROUP
> > Oct 16 16:27:23 [2106] HOSTNAME pengine: info: native_print: FIRST (ocf::heartbeat:xxx): Started HOSTNAME
> > Oct 16 16:27:23 [2106] HOSTNAME pengine: info: native_print: SECOND (ocf::heartbeat:yyy): Stopped
> >
> > Thank you,
> > Carsten