[Pacemaker] Action from a different CRMD transition results in restarting services

Mon Dec 17 14:39:18 UTC 2012

Hi Andrew,

Thank you for following up.

I still don't see what went wrong. From the logs, RabbitMQ was working
just fine around that time until it was ordered to shut down by CRM (for
the failed monitor?).

Moreover, I assume that transitions are ordered monotonically, which
means that Transition ID 16048 happened before Transition ID 18014:
      16048 << 18014

Transition ID 16048 does not appear anywhere in the logs covering the
several days before transition ID 18014 was generated. I'll therefore
assume that it was generated at least several days earlier (if that is
not true, please give me a way of finding out when this transition
happened; I still believe that time is of the essence in this case). Our
monitor command timers are expressed in seconds.

In that case, how can we say:
  "It hasn't only just acted now. It's been repeating over and over for
the last few weeks or so."

My understanding is that a transition happens once and only once: it
succeeds, fails or is aborted altogether. Corresponding events can
repeat over and over, but each time can only be part of a new transition.
Am I missing something fundamental here?

Sorry to insist, but I have to answer this very simple question:  "What
did happen here?"

I'm sure you can understand my situation here.

Thank you in advance for your help,

Regards,

Youssef

-----Original Message-----
From: pacemaker-request at oss.clusterlabs.org
[mailto:pacemaker-request at oss.clusterlabs.org] 
Sent: Friday, December 14, 2012 5:37 AM
To: pacemaker at oss.clusterlabs.org
Subject: Pacemaker Digest, Vol 61, Issue 37

Message: 1
Date: Fri, 14 Dec 2012 13:32:32 +1100
From: Andrew Beekhof <andrew at beekhof.net>
To: The Pacemaker cluster resource manager
	<pacemaker at oss.clusterlabs.org>
Subject: Re: [Pacemaker] Action from a different CRMD transition
	results in restarting services
Message-ID:
	<CAEDLWG0gzrt0w__tsZKbeELXwdaOHi9KGj_Oxm0877kMxgP=BA at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Fri, Dec 14, 2012 at 1:33 AM, Latrous, Youssef
<YLatrous at broadviewnet.com> wrote:
>
> Andrew Beekhof <andrew at beekhof.net> wrote:
>> 18014 is where we're up to now, 16048 is the (old) one that scheduled
>> the recurring monitor operation.
>> I suspect you'll find the action failed earlier in the logs and that's
>> why it needed to be restarted.
>>
>> Not the best log message though :(
>
> Thanks Andrew for the quick answer. I still need more info if
> possible.
>
> I searched everywhere for transaction 16048 and I couldn't find a
> trace of it (looked for up to 5 days of logs prior to transaction
> 18014).
> It would have been good if we had timestamps for each transaction
> involved in this situation :-)
>
> Is there a way to find out about this old transaction in any other
> logs (I looked into /var/log/messages on both nodes involved in this
> cluster)?


It's not really relevant.
The only important thing is that it's not one we're currently executing.

What you should care about is any logs that hopefully show you why the
resource failed at around Dec  6 22:55:05.

>
> To give you an idea of how many transactions happened during this
> period:
>    TR_ID 18010 @ 21:52:16
>    ...
>    TR_ID 18018 @ 22:55:25
>
> Over an hour between these two events.
>
> Given this, how come such a (very) old transaction (~2000 transactions
> before current one) only acts now? Could it be stale information in
> pacemaker?

No. It hasn't only just acted now. It's been repeating over and over for
the last few weeks or so.
The difference is that this time it failed.
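
One way to look for the failure near Dec  6 22:55:05 is to pull every
log line within a few minutes of that time. The following is only a
minimal sketch, not a Pacemaker tool: it assumes the classic syslog
timestamp prefix (for example "Dec  6 22:55:05") as found in
/var/log/messages, and the function name lines_near, the 5-minute
window and the year argument are all illustrative assumptions (syslog
timestamps carry no year, so one has to be supplied).

```python
from datetime import datetime, timedelta


def lines_near(lines, when, window_minutes=5, year=2012):
    """Return syslog lines stamped within +/- window_minutes of `when`.

    Assumes each relevant line starts with a 15-character syslog
    timestamp such as "Dec  6 22:55:05". Lines that do not parse as a
    timestamp are skipped.
    """
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        stamp = line[:15]
        try:
            # The year is prepended because syslog timestamps omit it.
            ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # no leading timestamp on this line
        if abs(ts - when) <= window:
            hits.append(line.rstrip("\n"))
    return hits
```

Passing an open /var/log/messages file object as `lines` and
datetime(2012, 12, 6, 22, 55, 5) as `when`, on each node, would return
the lines logged around the failed monitor; widening window_minutes
helps if the cause was logged earlier.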

>
> Thanks in advance.
>
> Youssef
_______________________________________________
Pacemaker mailing list
Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
