[Pacemaker] Action from a different CRMD transition results in restarting services

Tue Dec 18 01:36:37 UTC 2012

On Tue, Dec 18, 2012 at 1:39 AM, Latrous, Youssef
<YLatrous at broadviewnet.com> wrote:
> Hi Andrew,
>
> Thank you for following up.
>
> I still don't see what went wrong. From the logs, RabbitMQ was working
> just fine around that time until it was ordered to shut down by CRM (for
> the failed monitor?).

Apparently not, otherwise the monitor would not have reported a failure.
Something went wrong, either in the resource agent script or in RabbitMQ itself.

>
> Moreover, I assume that transitions are ordered monotonically, which
> means that Transition ID 16048 happened before Transition ID 18014:
>       16048 << 18014
>
> According to the logs, Transition ID 16048 wasn't present in the logs
> dating several days before transition ID 18014 was generated. I'll then
> assume that it was generated several days ago (if not true, please give
> me a way of finding out when this transition happened - I still
> believe that time is of the essence in this case). Our monitor command
> timers are expressed in seconds.
>
> In that case, how can we say:
>   "It hasn't only just acted now. It's been repeating over and over for
> the last few weeks or so."

Because that's how it's designed. That's what recurring monitors do: the
lrmd schedules them to run over and over, every N seconds, and it lets us
know when something changes.
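
To make that concrete, here is a rough sketch of the idea in Python. It is
not Pacemaker or lrmd code; every name in it (MonitorEvent,
recurring_monitor, describe, the probe function, the return code 7) is
made up for illustration. It only shows how an operation scheduled once by
an old transition keeps running and reports back when its result changes,
still carrying the ID of the transition that scheduled it.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class MonitorEvent:
    scheduled_by: int  # transition that originally scheduled the recurring op
    tick: int          # which repetition produced this result
    rc: int            # return code of the monitor at that repetition


def recurring_monitor(scheduled_by: int,
                      probe: Callable[[int], int],
                      ticks: int) -> Iterator[MonitorEvent]:
    """Run probe once per tick; yield an event only when the result changes."""
    last_rc = None
    for tick in range(ticks):
        rc = probe(tick)
        if rc != last_rc:
            yield MonitorEvent(scheduled_by, tick, rc)
            last_rc = rc


def describe(event: MonitorEvent, current_transition: int) -> str:
    """Say whether the event belongs to the transition currently executing."""
    if event.scheduled_by == current_transition:
        origin = "the current"
    else:
        origin = "a different"
    return (f"monitor rc={event.rc} at tick {event.tick}, "
            f"scheduled by {origin} transition ({event.scheduled_by})")


def example_probe(tick: int) -> int:
    """Hypothetical resource: healthy (0) for five ticks, then failing (7)."""
    return 0 if tick < 5 else 7
```

Running list(recurring_monitor(16048, example_probe, 8)) gives two events:
the first success and, much later, the failure. Both are tagged 16048 even
if the cluster has long since moved on to 18014.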

>
> My understanding is that a transition happens once and only once: it
> succeeds, fails or is aborted altogether.

No.

> Corresponding events can
> repeat over and over, but each time can only be part of a new transition.
> Am I missing something fundamental here?

Yes.  See above.

>
> Sorry to insist, but I have to answer this very simple question:  "What
> did happen here?"

Your resource or resource agent had a problem.
More than that I can't say because I don't have access to your logs.
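
If it helps, a small script along these lines can pull out everything
logged around the time of the failure (Dec  6 22:55:05). This is only a
sketch: it assumes classic syslog lines that start with a 15-character
"Mon DD HH:MM:SS" timestamp, it assumes the year is 2012 because syslog
does not record one, and the function name lines_near and the 60-second
window are my own choices.

```python
from datetime import datetime, timedelta
from typing import Iterable, List

STAMP_FORMAT = "%Y %b %d %H:%M:%S"


def lines_near(lines: Iterable[str], when: str,
               window_seconds: int = 60, year: int = 2012) -> List[str]:
    """Return the lines whose syslog timestamp is within the window of `when`."""
    center = datetime.strptime(f"{year} {when}", STAMP_FORMAT)
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        stamp = line[:15]  # e.g. "Dec  6 22:55:05"
        try:
            ts = datetime.strptime(f"{year} {stamp}", STAMP_FORMAT)
        except ValueError:
            continue  # line does not start with a timestamp; skip it
        if abs(ts - center) <= window:
            hits.append(line)
    return hits
```

For example, with open("/var/log/messages") as f you could call
lines_near(f, "Dec  6 22:55:05") on each node and read what the resource
agent and RabbitMQ logged in that minute.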

>
> I'm sure you can understand my situation here.
>
> Thank you in advance for your help,
>
> Regards,
>
> Youssef
>
> -----Original Message-----
> From: pacemaker-request at oss.clusterlabs.org
> [mailto:pacemaker-request at oss.clusterlabs.org]
> Sent: Friday, December 14, 2012 5:37 AM
> To: pacemaker at oss.clusterlabs.org
> Subject: Pacemaker Digest, Vol 61, Issue 37
>
> Message: 1
> Date: Fri, 14 Dec 2012 13:32:32 +1100
> From: Andrew Beekhof <andrew at beekhof.net>
> To: The Pacemaker cluster resource manager
>         <pacemaker at oss.clusterlabs.org>
> Subject: Re: [Pacemaker] Action from a different CRMD transition
>         results in restarting services
> Message-ID:
> <CAEDLWG0gzrt0w__tsZKbeELXwdaOHi9KGj_Oxm0877kMxgP=BA at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Fri, Dec 14, 2012 at 1:33 AM, Latrous, Youssef
> <YLatrous at broadviewnet.com> wrote:
>>
>> Andrew Beekhof <andrew at beekhof.net> wrote:
>>> 18014 is where we're up to now, 16048 is the (old) one that scheduled
>>> the recurring monitor operation.
>>> I suspect you'll find the action failed earlier in the logs and that's
>>> why it needed to be restarted.
>>>
>>> Not the best log message though :(
>>
>> Thanks Andrew for the quick answer. I still need more info if
>> possible.
>>
>> I searched everywhere for transaction 16048 and I couldn't find a
>> trace of it (looked for up to 5 days of logs prior to transaction
>> 18014).
>> It would have been good if we had timestamps for each transaction
>> involved in this situation :-)
>>
>> Is there a way to find about this old transaction in any other logs (I
>> looked into /var/log/messages on both nodes involved in this cluster)?
>
> It's not really relevant.
> The only important thing is that it's not one we're currently executing.
>
> What you should care about is any logs that hopefully show you why the
> resource failed at around Dec  6 22:55:05.
>
>>
>> To give you an idea of how many transactions happened during this
>> period:
>>    TR_ID 18010 @ 21:52:16
>>    ...
>>    TR_ID 18018 @ 22:55:25
>>
>> Over an hour between these two events.
>>
>> Given this, how come such a (very) old transaction (~2000 transactions
>> before current one) only acts now? Could it be stale information in
>> pacemaker?
>
> No. It hasn't only just acted now. It's been repeating over and over for
> the last few weeks or so.
> The difference is that this time it failed.
>
>>
>> Thanks in advance.
>>
>> Youssef
> _______________________________________________
> Pacemaker mailing list
> Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker