[Pacemaker] Designated reaction of Pacemaker to monitor-op returning rc=7 (OCF_NOT_RUNNING)

Cnut Jansen work at cnutjansen.eu
Fri Sep 24 13:24:06 EDT 2010


On 26.08.2010 10:38, Dejan Muhamedagic wrote:
> Hi,
> 
> On Wed, Aug 25, 2010 at 08:56:08PM +0200, Cnut Jansen wrote:
>> On 25.08.2010 16:00, Dejan Muhamedagic wrote:
>>> Hi,
>>>
>>> On Tue, Aug 24, 2010 at 05:19:23PM +0200, Cnut Jansen wrote:
>>>> Hi,
>>>>
>>>> just (for now) a short question to make sure I didn't miss anything:
>>>> What's the designated reaction of Pacemaker when a resource agent,
>>>> called to monitor a resource that is supposed to be running (so the
>>>> expected return is 0, OCF_SUCCESS), returns 7 (OCF_NOT_RUNNING)?
>>>> Should Pacemaker's very next call be to stop the resource, or should
>>>> it be yet another monitor (or even several)?
>>>
>>> It should be stop, followed by start, either on the same node or
>>> on another depending on the migration-threshold setting and
>>> failcount.
>>
>> Ok, that's what I expected.
>> So there are no so-far-unknown-to-me circumstances where it is by
>> design that Pacemaker, after having gotten rc=7 from the RA (and, since
>> it adds a "FAILED" behind the resource in crm_mon, it obviously
>> understood it correctly), calls the RA several more times for
>> monitoring (while letting the rest of the cluster hang) before finally
>> calling the desired stop, instead of immediately calling the RA for
>> stopping and continuing with the pending transactions and migrations.
> 
> Yes, that sounds quite unusual.

Just for reference:
Though I'm not absolutely sure about it, from today's point of view that
strange not-stopping-the-resource-after-rc=7 behaviour might have been a
symptom or combination of a quite sluggish cluster (Pacemaker still waiting
for returns from RAs and/or from Pacemaker itself) and zombie-monitor-ops
(I only saw in my own RAs' output that they got called for the
monitor action, but not the id or anything else of the monitor-op calling them).
Since yesterday, when we patched to the latest officially released
SLES11-HAE-SP1 packages, the zombie-monitor-ops (as well as many other
problems) are gone (with only a few minor new ones so far (-;); and though
I haven't explicitly looked for it, I lately haven't seen rc=7 being
ignored and monitor actions being re-called several times anymore.
(But lately I could also, thanks to enhancements to my own RAs
(Tomcat6/Apache), remove the 15sec start-delays for the monitor-op, which
sped them up a lot, so they are now only rarely the ones attracting
the zombie-monitor-ops.)

Current version now is (SLES11-HAE-SP1): 1.1.2-0.6.1 (Arch: x86_64)
Displays in crm_mon as: 1.1.2-ecb1e2ea172ba2551f0bd763e557fccde68c849b


>> (btw., jfyi: migration-thresholds are currently completely banned from
> 
> Why? Anything wrong with them?

See my other thread, the bugzilla entry linked in there, and Andrew
Beekhof's note in bugzilla confirming that the fail-count issue has been
cleared upstream.
http://developerbugs.linux-foundation.org/show_bug.cgi?id=2468

migration-threshold and failure-timeout seem to be fixed in this new,
current SLES-release too.
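
For anyone finding this thread later, the expected reaction Dejan described
(stop, then start, on the same node or on another one depending on
migration-threshold and failcount) can be sketched roughly like this. It is
only a simplified model for illustration, not Pacemaker code: the function
name plan_recovery and its parameters are made up, a threshold of 0 is
assumed to mean "no threshold set", and on-fail settings, stickiness and
failure-timeout are ignored.

```python
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


def plan_recovery(monitor_rc, fail_count, migration_threshold=0):
    """Return the ordered actions expected after a recurring monitor result.

    Hypothetical helper for illustration only.
    fail_count is the resource's fail-count on this node *before* the
    current monitor result; migration_threshold=0 is assumed to mean that
    no threshold is configured.
    """
    if monitor_rc == OCF_SUCCESS:
        # Resource is running as expected: nothing to recover.
        return []
    if monitor_rc != OCF_NOT_RUNNING:
        raise ValueError("this sketch only models rc=0 and rc=7")

    # A failed monitor counts as one more failure on this node.
    fail_count += 1

    # Once the fail-count reaches the threshold, the resource is no longer
    # allowed on this node and has to be started elsewhere.
    move_away = migration_threshold > 0 and fail_count >= migration_threshold
    target = "another node" if move_away else "same node"

    # No further monitor calls in between: stop first, then start.
    return ["stop", "start on " + target]
```

With migration_threshold=3, the first and second failure would be recovered
on the same node, and the third would move the resource away.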


>> my configurations, so this is another issue; I probably also have
>> yet another issue / possible bug regarding zombie-(monitor-)operations,
>> with symptoms like those of an off-by-one error)
>
> Please file a bugzilla if you find a bug.

Though I had already collected dozens of hb_reports with
zombie-monitor-operations occurring, and could quite exactly "predict"
such a zombie just from watching crm_mon while nodes switched to
standby, I haven't identified an exact cause for it yet. (It turned
out to at least not show up as an ordinary off-by-one error: in the
beginning it often hit the resources controlled by my own RAs, which
were the ones starting last, but after I had sped them up, it
hit them the least (-#).) Therefore I haven't filed anything about that yet.
Anyway, those zombie-monitor-operations seem to be gone now too, so they
were probably just another long-resolved old-version bug, still around
due to the very conservative policies of enterprise distributions.




