[Pacemaker] Action from a different CRMD transition results in restarting services
Latrous, Youssef
YLatrous at BroadViewNet.com
Thu Dec 13 14:33:56 UTC 2012
Andrew Beekhof <andrew at beekhof.net> wrote:
> 18014 is where we're up to now, 16048 is the (old) one that scheduled
> the recurring monitor operation.
> I suspect you'll find the action failed earlier in the logs and that's
> why it needed to be restarted.
>
> Not the best log message though :(
Thanks Andrew for the quick answer. I still need more info if possible.
I searched everywhere for transition 16048 and I couldn't find a trace
of it (I looked through up to 5 days of logs prior to transition 18014).
It would have been good if we had a timestamp for each transition
involved in this situation :-)
Is there a way to find out about this old transition in any other logs?
(I looked in /var/log/messages on both nodes involved in this cluster.)
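For reference, a rough sketch of how the older transition might be
tracked down, assuming the default Pacemaker 1.1 locations
(/var/log/messages and /var/lib/pengine) and that rotated logs are
still on disk:

# search plain and rotated/compressed syslogs on both nodes; zgrep also
# reads .gz rotations that plain grep would skip
zgrep -ih "transition 16048" /var/log/messages*

# every "process_pe_message: Transition NNNNN: PEngine Input stored in: ..."
# line ties a transition number to a pe-input file; the file itself is
# just bzip2-compressed XML and can be inspected directly
bzcat /var/lib/pengine/pe-input-3270.bz2 | less

If nothing matches, the pe-input file for 16048 has most likely been
rotated away already (pe-input-series-max), and the number then only
survives in the CIB (see below).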
To give you an idea of how many transitions happened during this
period:
TR_ID 18010 @ 21:52:16
...
TR_ID 18018 @ 22:55:25
Over an hour between these two events.
Given this, how can such a (very) old transition (~2000 transitions
before the current one) only act now? Could it be stale information in
Pacemaker?
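For what it's worth, a sketch of where that old number might still be
visible, assuming a standard CIB layout: the recurring monitor's last
recorded result in the status section keeps the transition-key of the
transition that originally scheduled it, which is what the crmd
compares against the current transition when a new result arrives.

# dump the live CIB and look at the recorded operation; the transition-key
# (104:16048:0:5fb57f01-...) is the "magic" string from the log message
cibadmin --query | grep -B1 -A3 "msgbroker_monitor_10000"

Cleaning up the resource's history, e.g. with
crm_resource --cleanup --resource msgbroker, rewrites that entry, after
which the recorded transition number moves forward again.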
Thanks in advance.
Youssef
Message: 4 from Pacemaker Digest, Vol 61, Issue 34
---------------------------------------------------------------
Date: Thu, 13 Dec 2012 10:52:42 +1100
From: Andrew Beekhof <andrew at beekhof.net>
To: The Pacemaker cluster resource manager
<pacemaker at oss.clusterlabs.org>
Subject: Re: [Pacemaker] Action from a different CRMD transition
results in restarting services
Message-ID:
<CAEDLWG2LtrPuxTRrd=JbV1SxTiLbG3SB0nu0fEyF3yRGrNc9BA at mail.gmail.com>
Content-Type: text/plain; charset=windows-1252
On Thu, Dec 13, 2012 at 6:31 AM, Latrous, Youssef
<YLatrous at broadviewnet.com> wrote:
> Hi,
>
>
>
> I ran into the following issue and I couldn't find what it really
> means:
>
>
>
> Detected action msgbroker_monitor_10000 from a different transition:
> 16048 vs. 18014
18014 is where we're up to now, 16048 is the (old) one that scheduled
the recurring monitor operation.
I suspect you'll find the action failed earlier in the logs and that's
why it needed to be restarted.
Not the best log message though :(
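One quick way to check for that earlier failure, assuming standard
Pacemaker command-line tools, is to look at the fail counts the cluster
has already accumulated:

# one-shot cluster status including per-resource fail counts
crm_mon -1 -f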
>
>
>
> I can see that its impact is to stop/start a service but I'd like to
> understand it a bit more.
>
>
>
> Thank you in advance for any information.
>
>
>
>
>
> Logs about this issue:
>
> …
>
> Dec 6 22:55:05 Node1 crmd: [5235]: info: process_graph_event:
> Detected action msgbroker_monitor_10000 from a different transition:
> 16048 vs. 18014
>
> Dec 6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
> process_graph_event:477 - Triggered transition abort (complete=1,
> tag=lrm_rsc_op, id=msgbroker_monitor_10000,
> magic=0:7;104:16048:0:5fb57f01-3397-45a8-905f-c48cecdc8692,
> cib=0.971.5) :
> Old event
>
> Dec 6 22:55:05 Node1 crmd: [5235]: WARN: update_failcount: Updating
> failcount for msgbroker on Node0 after failed monitor: rc=7
> (update=value++,
> time=1354852505)
>
> Dec 6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>
> Dec 6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: All 2
> cluster nodes are eligible to run resources.
>
> Dec 6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28069:
> Requesting the current CIB: S_POLICY_ENGINE
>
> Dec 6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
> te_update_diff:142 - Triggered transition abort (complete=1,
> tag=nvpair, id=status-Node0-fail-count-msgbroker, magic=NA,
> cib=0.971.6) : Transient
> attribute: update
>
> Dec 6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28070:
> Requesting the current CIB: S_POLICY_ENGINE
>
> Dec 6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
> te_update_diff:142 - Triggered transition abort (complete=1,
> tag=nvpair, id=status-Node0-last-failure-msgbroker, magic=NA,
> cib=0.971.7) : Transient
> attribute: update
>
> Dec 6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28071:
> Requesting the current CIB: S_POLICY_ENGINE
>
> Dec 6 22:55:05 Node1 attrd: [5232]: info: find_hash_entry: Creating
> hash entry for last-failure-msgbroker
>
> Dec 6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke_callback:
> Invoking the PE: query=28071, ref=pe_calc-dc-1354852505-39407, seq=12,
> quorate=1
>
> Dec 6 22:55:05 Node1 pengine: [5233]: notice: unpack_config: On loss
> of CCM
> Quorum: Ignore
>
> Dec 6 22:55:05 Node1 pengine: [5233]: notice: unpack_rsc_op:
> Operation
> txpublisher_monitor_0 found resource txpublisher active on Node1
>
> Dec 6 22:55:05 Node1 pengine: [5233]: WARN: unpack_rsc_op: Processing
> failed op msgbroker_monitor_10000 on Node0: not running (7)
>
> …
>
> Dec 6 22:55:05 Node1 pengine: [5233]: notice:
> common_apply_stickiness:
> msgbroker can fail 999999 more times on Node0 before being forced off
>
> …
>
> Dec 6 22:55:05 Node1 pengine: [5233]: notice: RecurringOp: Start
> recurring monitor (10s) for msgbroker on Node0
>
> …
>
> Dec 6 22:55:05 Node1 pengine: [5233]: notice: LogActions: Recover
> msgbroker (Started Node0)
>
> …
>
> Dec 6 22:55:05 Node1 crmd: [5235]: info: te_rsc_command: Initiating
> action
> 37: stop msgbroker_stop_0 on Node0
>
>
>
>
>
> Transition 18014 details:
>
>
>
> Dec 6 22:52:18 Node1 pengine: [5233]: notice: process_pe_message:
> Transition 18014: PEngine Input stored in:
> /var/lib/pengine/pe-input-3270.bz2
>
> Dec 6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=handle_response ]
>
> Dec 6 22:52:18 Node1 crmd: [5235]: info: unpack_graph: Unpacked
> transition
> 18014: 0 actions in 0 synapses
>
> Dec 6 22:52:18 Node1 crmd: [5235]: info: do_te_invoke: Processing
> graph
> 18014 (ref=pe_calc-dc-1354852338-39406) derived from
> /var/lib/pengine/pe-input-3270.bz2
>
> Dec 6 22:52:18 Node1 crmd: [5235]: info: run_graph:
> ====================================================
>
> Dec 6 22:52:18 Node1 crmd: [5235]: notice: run_graph: Transition
> 18014 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pengine/pe-input-3270.bz2): Complete
>
> Dec 6 22:52:18 Node1 crmd: [5235]: info: te_graph_trigger: Transition
> 18014 is now complete
>
> Dec 6 22:52:18 Node1 crmd: [5235]: info: notify_crmd: Transition
> 18014
> status: done - <null>
>
> Dec 6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
>
> Dec 6 22:52:18 Node1 crmd: [5235]: info: do_state_transition:
> Starting PEngine Recheck Timer
>
>
>
>
>
> Youssef
>
>
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
------------------------------
Message: 5
Date: Thu, 13 Dec 2012 01:17:17 +0000
From: Xavier Lashmar <xlashmar at uottawa.ca>
To: The Pacemaker cluster resource manager
<pacemaker at oss.clusterlabs.org>
Subject: Re: [Pacemaker] gfs2 / dlm on centos 6.2
Message-ID:
<CC445C0CEB8B8A4C87297D880D8F903BBCC0FA95 at CMS-P04.uottawa.o.univ>
Content-Type: text/plain; charset="windows-1252"
I see, thanks very much for pointing me in the right direction!
Xavier Lashmar
Université d'Ottawa / University of Ottawa
Analyste de Systèmes | Systems Analyst
Service étudiants, service de l'informatique et des communications |
Student services, computing and communications services.
1 Nicholas Street (810)
Ottawa ON K1N 7B7
Tél. | Tel. 613-562-5800 (2120)
________________________________
From: Andrew Beekhof [andrew at beekhof.net]
Sent: Tuesday, December 11, 2012 9:30 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] gfs2 / dlm on centos 6.2
On Wed, Dec 12, 2012 at 1:29 AM, Xavier Lashmar
<xlashmar at uottawa.ca> wrote:
Hello,
We are attempting to mount gfs2 partitions on CentOS using DRBD +
COROSYNC + PACEMAKER. Unfortunately we consistently get the following
error:
You'll need to configure pacemaker to use cman for this.
See:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/ch08s02.html
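On CentOS 6.x that means the cman-based stack; dlm_controld and
gfs_controld come with cman there, so there is no separate "dlm"
package to install (a sketch, assuming the stock CentOS 6
repositories):

# install the cman stack plus the gfs2 userspace tools
yum install -y cman gfs2-utils

# describe the cluster in /etc/cluster/cluster.conf, then start cman
# before pacemaker so gfs_controld/dlm_controld are running at mount time
service cman start
service pacemaker start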
# mount /dev/vg_data/lv_data /webdata/ -t gfs2 -v
mount /dev/dm-2 /webdata
parse_opts: opts = "rw"
clear flag 1 for "rw", flags = 0
parse_opts: flags = 0
parse_opts: extra = ""
parse_opts: hostdata = ""
parse_opts: lockproto = ""
parse_opts: locktable = ""
gfs_controld join connect error: Connection refused
error mounting lockproto lock_dlm
We are trying to find out where to get the lock_dlm libraries and
packages for CentOS 6.2 and 6.3.
Also, I found that the Fedora 17 version of the document "Pacemaker 1.1
- Clusters from Scratch" is a bit problematic. I'm also running a
Fedora 17 system and found no package "dlm" as per the instructions in
section 8.1.1:
yum install -y gfs2-utils dlm kernel-modules-extra
Any idea if an external repository is needed? If so, which one, and
which package do we need to install for CentOS 6+?
Thanks very much
Xavier Lashmar
Analyste de Systèmes | Systems Analyst
Service étudiants, service de l'informatique et des communications |
Student services, computing and communications services.
1 Nicholas Street (810)
Ottawa ON K1N 7B7
Tél. | Tel. 613-562-5800 (2120)