[Pacemaker] Action from a different CRMD transition results in restarting services

Thu Dec 13 21:32:32 EST 2012

On Fri, Dec 14, 2012 at 1:33 AM, Latrous, Youssef
<YLatrous at broadviewnet.com> wrote:
>
> Andrew Beekhof <andrew at beekhof.net> wrote:
>> 18014 is where we're up to now, 16048 is the (old) one that scheduled
>> the recurring monitor operation.
>> I suspect you'll find the action failed earlier in the logs and that's
>> why it needed to be restarted.
>>
>> Not the best log message though :(
>
> Thanks Andrew for the quick answer. I still need more info if possible.
>
> I searched everywhere for transaction 16048 and I couldn't find a trace
> of it (looked for up to 5 days of logs prior to transaction 18014).
> It would have been good if we had timestamps for each transaction
> involved in this situation :-)
>
> Is there a way to find out about this old transaction in any other logs (I
> looked into /var/log/messages on both nodes involved in this cluster)?

It's not really relevant.
The only important thing is that it's not one we're currently executing.

What you should care about is any logs that hopefully show you why the
resource failed at around Dec  6 22:55:05.
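
A minimal sketch of that search, in case it helps. It assumes the classic
syslog prefix seen in the quoted logs ("Dec  6 22:55:05", no year) and an
English locale for the month name. lines_near, window_s and year are names
made up for this illustration, not anything Pacemaker provides. The quoted
update_failcount line says msgbroker was on Node0, so that node's logs are
probably the ones to feed it.

```python
from datetime import datetime

def lines_near(lines, when, resource, window_s=120, year=2012):
    """Yield syslog lines that mention `resource` within window_s seconds of `when`."""
    for line in lines:
        try:
            # split() copes with the double space before single-digit days.
            month, day, clock = line.split()[:3]
            # Syslog omits the year, so the caller has to supply it.
            stamp = datetime.strptime(
                f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            # Blank line or a line without a syslog timestamp: skip it.
            continue
        if abs((stamp - when).total_seconds()) <= window_s and resource in line:
            yield line

# Two lines taken from the quoted logs, plus a blank one.
SAMPLE = [
    "Dec  6 22:55:05 Node1 crmd: [5235]: WARN: update_failcount: Updating "
    "failcount for msgbroker on Node0 after failed monitor: rc=7",
    "Dec  6 22:52:18 Node1 crmd: [5235]: info: te_graph_trigger: Transition "
    "18014 is now complete",
    "",
]

for hit in lines_near(SAMPLE, datetime(2012, 12, 6, 22, 55, 5), "msgbroker"):
    print(hit)
```

In real use you would pass an open /var/log/messages instead of SAMPLE and
widen window_s until the resource agent's own messages show up.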

>
> To give you an idea of how many transactions happened during this
> period:
>    TR_ID 18010 @ 21:52:16
>    ...
>    TR_ID 18018 @ 22:55:25
>
> Over an hour between these two events.
>
> Given this, how come such a (very) old transaction (~2000 transactions
> before current one) only acts now? Could it be stale information in
> pacemaker?

No. It hasn't only just acted now. It's been repeating over and over
for the last few weeks or so.
The difference is that this time it failed.
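
You can see the same thing in the magic= field of the abort_transition_graph
line you quoted. A rough sketch of pulling it apart follows. The field layout
(op_status:rc;action:transition:target_rc:uuid) is my reading of how the
transition key is put together, so treat it as an assumption. parse_magic and
Magic are illustrative names only.

```python
from typing import NamedTuple

class Magic(NamedTuple):
    op_status: int      # execution status of the operation (assumed: 0 = done)
    rc: int             # return code reported by the resource agent
    action_id: int      # action number within the transition graph
    transition_id: int  # transition that scheduled the operation
    target_rc: int      # return code the transition expected
    te_uuid: str        # identifier of the transition engine instance

def parse_magic(magic: str) -> Magic:
    """Split 'op_status:rc;action:transition:target_rc:uuid' into fields."""
    result, key = magic.split(";", 1)
    op_status, rc = (int(part) for part in result.split(":"))
    # The uuid is last, so limit the split to keep it in one piece.
    action_id, transition_id, target_rc, te_uuid = key.split(":", 3)
    return Magic(op_status, rc, int(action_id), int(transition_id),
                 int(target_rc), te_uuid)

# Value copied from the quoted abort_transition_graph log line.
m = parse_magic("0:7;104:16048:0:5fb57f01-3397-45a8-905f-c48cecdc8692")
current_transition = 18014

print("scheduled by transition:", m.transition_id)
print("from an older transition:", m.transition_id != current_transition)
print("unexpected result:", m.rc != m.target_rc)
```

The transition number in the key (16048) is the one that scheduled the
recurring monitor, so it stops matching the current one (18014) as soon as
the cluster moves on. What triggered the recovery is the rc of 7, which the
quoted pengine line reports as "not running", where 0 was expected.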

>
> Thanks in advance.
>
> Youssef
>
>
> Message: 4  from Pacemaker Digest, Vol 61, Issue 34
> ---------------------------------------------------------------
> Date: Thu, 13 Dec 2012 10:52:42 +1100
> From: Andrew Beekhof <andrew at beekhof.net>
> To: The Pacemaker cluster resource manager
>         <pacemaker at oss.clusterlabs.org>
> Subject: Re: [Pacemaker] Action from a different CRMD transition
>         results in restarting services
> Message-ID:
>
> <CAEDLWG2LtrPuxTRrd=JbV1SxTiLbG3SB0nu0fEyF3yRGrNc9BA at mail.gmail.com>
> Content-Type: text/plain; charset=windows-1252
>
> On Thu, Dec 13, 2012 at 6:31 AM, Latrous, Youssef
> <YLatrous at broadviewnet.com> wrote:
>> Hi,
>>
>>
>>
>> I ran into the following issue and I couldn't find what it really
>> means:
>>
>>
>>
>>         Detected action msgbroker_monitor_10000 from a different
>> transition: 16048 vs. 18014
>
> 18014 is where we're up to now, 16048 is the (old) one that scheduled
> the recurring monitor operation.
> I suspect you'll find the action failed earlier in the logs and that's
> why it needed to be restarted.
>
> Not the best log message though :(
>
>>
>>
>>
>> I can see that its impact is to stop/start a service but I'd like to
>> understand it a bit more.
>>
>>
>>
>> Thank you in advance for any information.
>>
>>
>>
>>
>>
>> Logs about this issue:
>>
>> ...
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: process_graph_event:
>> Detected action msgbroker_monitor_10000 from a different transition:
>> 16048 vs. 18014
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
>> process_graph_event:477 - Triggered transition abort (complete=1,
>> tag=lrm_rsc_op, id=msgbroker_monitor_10000,
>> magic=0:7;104:16048:0:5fb57f01-3397-45a8-905f-c48cecdc8692,
>> cib=0.971.5) : Old event
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: WARN: update_failcount: Updating
>> failcount for msgbroker on Node0 after failed monitor: rc=7
>> (update=value++,
>> time=1354852505)
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: State
>> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
>> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: All 2
>> cluster nodes are eligible to run resources.
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28069:
>> Requesting the current CIB: S_POLICY_ENGINE
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
>> te_update_diff:142 - Triggered transition abort (complete=1,
>> tag=nvpair, id=status-Node0-fail-count-msgbroker, magic=NA,
>> cib=0.971.6) : Transient
>> attribute: update
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28070:
>> Requesting the current CIB: S_POLICY_ENGINE
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
>> te_update_diff:142 - Triggered transition abort (complete=1,
>> tag=nvpair, id=status-Node0-last-failure-msgbroker, magic=NA,
>> cib=0.971.7) : Transient
>> attribute: update
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28071:
>> Requesting the current CIB: S_POLICY_ENGINE
>>
>> Dec  6 22:55:05 Node1 attrd: [5232]: info: find_hash_entry: Creating
>> hash entry for last-failure-msgbroker
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke_callback:
>> Invoking the PE: query=28071, ref=pe_calc-dc-1354852505-39407, seq=12,
>> quorate=1
>>
>> Dec  6 22:55:05 Node1 pengine: [5233]: notice: unpack_config: On loss
>> of CCM
>> Quorum: Ignore
>>
>> Dec  6 22:55:05 Node1 pengine: [5233]: notice: unpack_rsc_op:
>> Operation
>> txpublisher_monitor_0 found resource txpublisher active on Node1
>>
>> Dec  6 22:55:05 Node1 pengine: [5233]: WARN: unpack_rsc_op: Processing
>> failed op msgbroker_monitor_10000 on Node0: not running (7)
>>
>> ...
>>
>> Dec  6 22:55:05 Node1 pengine: [5233]: notice: common_apply_stickiness:
>> msgbroker can fail 999999 more times on Node0 before being forced off
>>
>> ...
>>
>> Dec  6 22:55:05 Node1 pengine: [5233]: notice: RecurringOp:  Start
>> recurring monitor (10s) for msgbroker on Node0
>>
>> ...
>>
>> Dec  6 22:55:05 Node1 pengine: [5233]: notice: LogActions: Recover
>> msgbroker (Started Node0)
>>
>> ...
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: te_rsc_command: Initiating
>> action
>> 37: stop msgbroker_stop_0 on Node0
>>
>>
>>
>>
>>
>> Transition 18014 details:
>>
>>
>>
>> Dec  6 22:52:18 Node1 pengine: [5233]: notice: process_pe_message:
>> Transition 18014: PEngine Input stored in:
>> /var/lib/pengine/pe-input-3270.bz2
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State
>> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
>> cause=C_IPC_MESSAGE origin=handle_response ]
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: unpack_graph: Unpacked
>> transition
>> 18014: 0 actions in 0 synapses
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: do_te_invoke: Processing
>> graph
>> 18014 (ref=pe_calc-dc-1354852338-39406) derived from
>> /var/lib/pengine/pe-input-3270.bz2
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: run_graph:
>> ====================================================
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: notice: run_graph: Transition
>> 18014 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pengine/pe-input-3270.bz2): Complete
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: te_graph_trigger: Transition
>> 18014 is now complete
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: notify_crmd: Transition
>> 18014
>> status: done - <null>
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State
>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition:
>> Starting PEngine Recheck Timer
>>
>>
>>
>>
>>
>> Youssef
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>
>
>
> ------------------------------
>
> Message: 5
> Date: Thu, 13 Dec 2012 01:17:17 +0000
> From: Xavier Lashmar <xlashmar at uottawa.ca>
> To: The Pacemaker cluster resource manager
>         <pacemaker at oss.clusterlabs.org>
> Subject: Re: [Pacemaker] gfs2 / dlm on centos 6.2
> Message-ID:
>
> <CC445C0CEB8B8A4C87297D880D8F903BBCC0FA95 at CMS-P04.uottawa.o.univ>
> Content-Type: text/plain; charset="windows-1252"
>
> I see, thanks very much for pointing me in the right direction!
>
> Xavier Lashmar
> Université d'Ottawa / University of Ottawa
> Analyste de Systèmes | Systems Analyst
> Service étudiants, service de l'informatique et des communications |
> Student services, computing and communications services.
> 1 Nicholas Street (810)
> Ottawa ON K1N 7B7
> Tél. | Tel. 613-562-5800 (2120)
> ________________________________
> From: Andrew Beekhof [andrew at beekhof.net]
> Sent: Tuesday, December 11, 2012 9:30 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] gfs2 / dlm on centos 6.2
>
>
>
> On Wed, Dec 12, 2012 at 1:29 AM, Xavier Lashmar
> <xlashmar at uottawa.ca> wrote:
> Hello,
>
> We are attempting to mount gfs2 partitions on CentOS using DRBD +
> COROSYNC + PACEMAKER.  Unfortunately we consistently get the following
> error:
>
> You'll need to configure pacemaker to use cman for this.
> See:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/ch08s02.html
>
>
> # mount /dev/vg_data/lv_data /webdata/ -t gfs2 -v
> mount /dev/dm-2 /webdata
> parse_opts: opts = "rw"
>   clear flag 1 for "rw", flags = 0
> parse_opts: flags = 0
> parse_opts: extra = ""
> parse_opts: hostdata = ""
> parse_opts: lockproto = ""
> parse_opts: locktable = ""
> gfs_controld join connect error: Connection refused
> error mounting lockproto lock_dlm
>
> We are trying to find out where to get the lock_dlm libraries and
> packages for Centos 6.2 and 6.3
>
> Also, I found that the Fedora 17 version of the document "Pacemaker 1.1 -
> Clusters from Scratch" is a bit problematic.  I'm also running a Fedora
> 17 system and found no package "dlm" as per the instructions in section
> 8.1.1
>
> yum install -y gfs2-utils dlm kernel-modules-extra
>
> Any idea if an external repository is needed?  If so, which one, and
> which package do we need to install for CentOS 6+?
>
> Thanks very much
>
>
>
>
>
>
> Xavier Lashmar
> Analyste de Systèmes | Systems Analyst
> Service étudiants, service de l'informatique et des
> communications/Student services, computing and communications services.
> 1 Nicholas Street (810)
> Ottawa ON K1N 7B7
> Tél. | Tel. 613-562-5800 (2120)
>
>
>
>
>
>
> ------------------------------
>
> End of Pacemaker Digest, Vol 61, Issue 34
> *****************************************
>