<br><br><div class="gmail_quote">On Fri, Nov 12, 2010 at 6:46 AM, jiaju liu <span dir="ltr">&lt;<a href="mailto:liujiaju86@yahoo.com.cn">liujiaju86@yahoo.com.cn</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<table cellspacing="0" cellpadding="0" border="0"><tbody><tr><td valign="top" style="font:inherit"><br>

<blockquote style="padding-left:5px;margin-left:5px;border-left:rgb(16,16,255) 2px solid">

<div><div class="im"><br><br>&gt; Hi,<br>&gt; I rebooted my node, and this message appears:<br>&gt; node2 pingd: [3932]: info: stand_alone_ping: Node 192.168.10.100 is<br>&gt; unreachable (read)<br>&gt; and the node could not start.<br>

&gt;<br>&gt; 192.168.10.100 is on the IB network, which I only bring up after the node has started. Do<br>&gt; you have any idea how to let the node start first? Thanks very much. :-)<br>&gt;<br>&gt;<br><br>Don&#39;t use IP resources as ping nodes.<br>

You should use the IP of something outside of your cluster, like an external<br>router.<br><br></div>stand_alone_ping is started automatically; I have never started it by hand. How do I set it to ping an external router?</div></blockquote>

</td></tr></tbody></table></blockquote><div><br></div><div>Find where you configured &quot;<span class="Apple-style-span" style="font-family: arial, sans-serif; font-size: 13px; color: rgb(80, 0, 80); ">192.168.10.100&quot; as the ping target and set it to an address outside the cluster, such as your router.</span></div>
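<div><br></div><div>A minimal sketch of how you might look for it, assuming the address lives either in the CIB (for example as the host_list of a pingd resource) or in Heartbeat&#39;s ha.cf. Both locations are guesses about your setup, and find_ping_target is a hypothetical helper name, not a Pacemaker command:</div>
<pre>
```shell
#!/bin/sh
# Hypothetical helper (not part of Pacemaker): print every line in the
# given files that mentions the ping target, so you know what to edit.
find_ping_target() {
    target="$1"
    shift
    # -F matches the address literally (dots are not wildcards),
    # -n prints line numbers, and the extra /dev/null argument makes
    # grep print the file name even when only one file is given.
    grep -n -F -- "$target" /dev/null "$@"
}
```
</pre>
<div>For example, save the output of cibadmin -Q to a file, then run find_ping_target 192.168.10.100 on that file and on /etc/ha.d/ha.cf. Whatever line it reports is the one to change to an address outside the cluster.</div>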

<div><span class="Apple-style-span" style="font-family: arial, sans-serif; font-size: 13px; color: rgb(80, 0, 80); "><br></span></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<table cellspacing="0" cellpadding="0" border="0"><tbody><tr><td valign="top" style="font:inherit"><blockquote style="padding-left:5px;margin-left:5px;border-left:rgb(16,16,255) 2px solid">

<div>Thanks <br><div class="im">&gt;<br>&gt;<br>&gt; _______________________________________________<br>&gt; Pacemaker mailing list: <a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=Pacemaker@oss.clusterlabs.org" target="_blank">Pacemaker@oss.clusterlabs.org</a><br>

&gt; <a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>&gt;<br>&gt; Project Home: <a href="http://www.clusterlabs.org/" target="_blank">http://www.clusterlabs.org</a><br>

&gt; Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>&gt; Bugs:<br>&gt; <a href="http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker" target="_blank">http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker</a><br>

&gt;<br>&gt;<br></div>

<br>------------------------------<br><br>Message: 3<br>Date: Thu, 11 Nov 2010 11:38:24 +0100<br>From: Simon Jansen &lt;<a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=simon.jansen1@googlemail.com" target="_blank">simon.jansen1@googlemail.com</a>&gt;<br>

To: <a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=pacemaker@oss.clusterlabs.org" target="_blank">pacemaker@oss.clusterlabs.org</a><br>Subject: Re: [Pacemaker] Multistate Resources is not promoted<br>    automatically<br>

Message-ID:<br>    &lt;AANLkTikwgMy4nutZ4807vv2x=nN_sMj+E8Y1PRu6X1eT@mail.gmail.com&gt;<br>Content-Type: text/plain; charset=&quot;iso-8859-1&quot;<br>

<br>Hi Andrew,<br><br>thank you for your answer.<br><br>&gt; Does the ocf:heartbeat:Rsyslog script call crm_master?<br>&gt; It needs to to tell pacemaker which instance to promote.<br>&gt;<br>Yes it does. But I forgot to call crm_master with the option -D in the stop<br>

action. I think that this was the error. After correcting this issue the RA<br>starts as expected.<br><br>&gt; Two questions though...<br>&gt; 1) Why use master/slave for rsyslog?<br>&gt;<br>In the master role the rsyslog daemon should function as central log server<br>

and write the entries received on UDP port 514 into a MySQL database.<br>On the passive node the rsyslog service should be started with the standard<br>config.<br>Do you think there is a better solution to solve this

 requirement?<br><br><br>&gt; 2) Is this an upstream RA? If not, you shouldn&#39;t be using the<br>&gt; ocf:heartbeat namespace.<br>&gt;<br>Ok thank you for the advice. Should I use the pacemaker class instead or<br>should I define a custom namespace?<br>

<br>--<br><br>Regards,<br><br>Simon Jansen<br><br><br>---------------------------<br>Simon Jansen<br>64291 Darmstadt<br>

<br>------------------------------<br><br>Message: 4<br>Date: Thu, 11 Nov 2010 11:44:47 +0100<br>From: Andrew Beekhof &lt;<a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=andrew@beekhof.net" target="_blank">andrew@beekhof.net</a>&gt;<br>

To: The

 Pacemaker cluster resource manager<br>    &lt;<a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=pacemaker@oss.clusterlabs.org" target="_blank">pacemaker@oss.clusterlabs.org</a>&gt;<br>Subject: Re: [Pacemaker] Infinite fail-count and migration-threshold<br>

    after node fail-back<br>Message-ID:<br>    &lt;AANLkTimmLWZMhKxSZCHu95x0d2WGnJcujN-B7eowbFXE@mail.gmail.com&gt;<br>

Content-Type: text/plain; charset=ISO-8859-1<br><br>On Mon, Oct 11, 2010 at 9:40 AM, Dan Frincu &lt;<a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=dfrincu@streamwide.ro" target="_blank">dfrincu@streamwide.ro</a>&gt; wrote:<br>

&gt; Hi all,<br>&gt;<br>&gt; I&#39;ve managed to make this

 setup work, basically the issue with a<br>&gt; symmetric-cluster=&quot;false&quot; and specifying the resources&#39; location manually<br>&gt; means that the resources will always obey the location constraint, and (as<br>

&gt; far as I could see) disregard the rsc_defaults resource-stickiness values.<br><br>This definitely should not be the case.<br>Possibly your stickiness setting is being eclipsed by the combination<br>of the location constraint scores.<br>

Try INFINITY instead.<br><br>&gt; This behavior is not the expected one, in theory, setting<br>&gt; symmetric-cluster=&quot;false&quot; should affect whether resources are allowed to run<br>&gt; anywhere by default and the resource-stickiness should lock in place the<br>

&gt; resources so they don&#39;t bounce from node to node. Again, this didn&#39;t happen,<br>&gt; but by setting symmetric-cluster=&quot;true&quot;, using the same ordering and<br>&gt; collocation constraints and the resource-stickiness, the behavior is the<br>

&gt; expected

 one.<br>&gt;<br>&gt; I don&#39;t remember seeing anywhere in the docs from <a href="http://clusterlabs.org" target="_blank">clusterlabs.org</a> being<br>&gt; mentioned that the resource-stickiness only works on<br>&gt; symmetric-cluster=&quot;true&quot;, so for anyone that also stumbles upon this issue,<br>

&gt; I hope this helps.<br>&gt;<br>&gt; Regards,<br>&gt;<br>&gt; Dan<br>&gt;<br>&gt; Dan Frincu wrote:<br>&gt;&gt;<br>&gt;&gt; Hi,<br>&gt;&gt;<br>&gt;&gt; Since it was brought to my attention that I should upgrade from<br>

&gt;&gt; openais-0.80 to a more recent version of corosync, I&#39;ve done just that,<br>&gt;&gt; however I&#39;m experiencing a strange behavior on the cluster.<br>&gt;&gt;<br>&gt;&gt; The same setup was used with the below packages:<br>

&gt;&gt;<br>&gt;&gt; # rpm -qa | grep -Ei &quot;(openais|cluster|heartbeat|pacemaker|resource)&quot;<br>&gt;&gt; openais-0.80.5-15.2<br>&gt;&gt; cluster-glue-1.0-12.2<br>&gt;&gt; pacemaker-1.0.5-4.2<br>&gt;&gt; cluster-glue-libs-1.0-12.2<br>

&gt;&gt; resource-agents-1.0-31.5<br>&gt;&gt;

 pacemaker-libs-1.0.5-4.2<br>&gt;&gt; pacemaker-mgmt-1.99.2-7.2<br>&gt;&gt; libopenais2-0.80.5-15.2<br>&gt;&gt; heartbeat-3.0.0-33.3<br>&gt;&gt; pacemaker-mgmt-client-1.99.2-7.2<br>&gt;&gt;<br>&gt;&gt; Now I&#39;ve migrated to the most recent stable packages I could find (on the<br>

&gt;&gt; <a href="http://clusterlabs.org" target="_blank">clusterlabs.org</a> website) for RHEL5:<br>&gt;&gt;<br>&gt;&gt; # rpm -qa | grep -Ei &quot;(openais|cluster|heartbeat|pacemaker|resource)&quot;<br>&gt;&gt; cluster-glue-1.0.6-1.6.el5<br>

&gt;&gt; pacemaker-libs-1.0.9.1-1.el5<br>&gt;&gt; pacemaker-1.0.9.1-1.el5<br>&gt;&gt; heartbeat-libs-3.0.3-2.el5<br>&gt;&gt; heartbeat-3.0.3-2.el5<br>&gt;&gt; openaislib-1.1.3-1.6.el5<br>&gt;&gt; resource-agents-1.0.3-2.el5<br>

&gt;&gt; cluster-glue-libs-1.0.6-1.6.el5<br>&gt;&gt; openais-1.1.3-1.6.el5<br>&gt;&gt;<br>&gt;&gt; Expected behavior:<br>&gt;&gt; - all the resources in the group should go (based on location preference)<br>&gt;&gt; to bench1<br>

&gt;&gt; - if bench1 goes down, resources migrate to

 bench2<br>&gt;&gt; - if bench1 comes back up, resources stay on bench2, unless manually told<br>&gt;&gt; otherwise.<br>&gt;&gt;<br>&gt;&gt; On the previous incantation, this worked, by using the new packages, not<br>&gt;&gt; so much. Now if bench1 goes down (crm node standby `uname -n`), failover<br>

&gt;&gt; occurs, but when bench1 comes back up, resources migrate back, even if<br>&gt;&gt; default-resource-stickiness is set, and more than that, 2 drbd block devices<br>&gt;&gt; reach infinite metrics, most notably because they try to promote the<br>

&gt;&gt; resources to a Master state on bench1, but fail to do so due to the resource<br>&gt;&gt; being held open (by some process, I could not identify it).<br>&gt;&gt;<br>&gt;&gt; Strangely enough, the resources (drbd) fail to be promoted to a Master<br>

&gt;&gt; status on bench1, so they fail back to bench2, where they are mounted<br>&gt;&gt; (functional), but crm_mon shows:<br>&gt;&gt;<br>&gt;&gt;

 Migration summary:<br>&gt;&gt; * Node <a href="http://bench2.streamwide.ro" target="_blank">bench2.streamwide.ro</a>:<br>&gt;&gt; &nbsp;drbd_mysql:1: migration-threshold=1000000 fail-count=1000000<br>&gt;&gt; &nbsp;drbd_home:1: migration-threshold=1000000 fail-count=1000000<br>

&gt;&gt; * Node <a href="http://bench1.streamwide.ro" target="_blank">bench1.streamwide.ro</a>:<br>&gt;&gt;<br>&gt;&gt; .... infinite metrics on bench2, while the drbd resources are available<br>&gt;&gt;<br>&gt;&gt; version: 8.3.2 (api:88/proto:86-90)<br>

&gt;&gt; GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by<br>&gt;&gt; <a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=mockbuild@v20z-x86-64.home.local" target="_blank">mockbuild@v20z-x86-64.home.local</a>, 2009-08-29 14:07:55<br>

&gt;&gt; 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----<br>&gt;&gt; &nbsp; ns:1632 nr:1864 dw:3512 dr:3933 al:11 bm:19 lo:0 pe:0 ua:0 ap:0 ep:1<br>&gt;&gt; wo:b oos:0<br>&gt;&gt; 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----<br>

&gt;&gt; &nbsp; ns:4

 nr:24 dw:28 dr:25 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0<br>&gt;&gt; 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----<br>&gt;&gt; &nbsp; ns:4 nr:24 dw:28 dr:85 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0<br>

&gt;&gt;<br>&gt;&gt; and mounted<br>&gt;&gt;<br>&gt;&gt; /dev/drbd1 on /home type ext3 (rw,noatime,nodiratime)<br>&gt;&gt; /dev/drbd0 on /mysql type ext3 (rw,noatime,nodiratime)<br>&gt;&gt; /dev/drbd2 on /storage type ext3 (rw,noatime,nodiratime)<br>

&gt;&gt;<br>&gt;&gt; Attached is the hb_report.<br>&gt;&gt;<br>&gt;&gt; Thank you in advance.<br>&gt;&gt;<br>&gt;&gt; Best regards<br>&gt;&gt;<br>&gt;<br>&gt; --<br>&gt; Dan FRINCU<br>&gt; Systems Engineer<br>&gt; CCNA, RHCE<br>

&gt; Streamwide Romania<br>&gt;<br><br><br>------------------------------<br><br>Message: 5<br>Date: Thu, 11 Nov 2010 11:46:42 +0100<br>From: Andrew Beekhof &lt;<a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=andrew@beekhof.net" target="_blank">andrew@beekhof.net</a>&gt;<br>

To: The Pacemaker cluster resource manager<br>    &lt;<a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=pacemaker@oss.clusterlabs.org" target="_blank">pacemaker@oss.clusterlabs.org</a>&gt;<br>Subject: Re: [Pacemaker] Multistate Resources is not promoted<br>

    automatically<br>Message-ID:<br>    &lt;AANLkTinAqC-vNWHYCDRrRhgyh5JUtcJux5E9YBvrXZc6@mail.gmail.com&gt;<br>

Content-Type: text/plain; charset=ISO-8859-1<br><br>On Thu, Nov 11, 2010 at 11:38 AM, Simon Jansen<br>&lt;<a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=simon.jansen1@googlemail.com" target="_blank">simon.jansen1@googlemail.com</a>&gt; wrote:<br>

&gt; Hi Andrew,<br>&gt;<br>&gt; thank you for your answer.<br>&gt;<br>&gt;&gt; Does the ocf:heartbeat:Rsyslog script call crm_master?<br>&gt;&gt; It needs to to tell pacemaker which instance to promote.<br>&gt;<br>&gt; Yes it does. But I forgot to call crm_master with the option -D in the stop<br>

&gt; action. I think that this was the error. After correcting this issue the ra<br>&gt; starts as expected.<br>&gt;<br>&gt;&gt; Two questions though...<br>&gt;&gt; 1) Why use master/slave for rsyslog?<br>&gt;<br>&gt; In the master role the rsyslog daemon should function as central log server<br>

&gt; and write the entries received on UDP port 514 into a MySQL database.<br>&gt; On the passive node the rsyslog service should be started with the standard<br>&gt; config.<br><br>Interesting<br><br>&gt; Do you think there is a better solution to solve this

 requirement?<br><br>No, I&#39;d just never heard rsyslog being used in this way.<br><br>&gt;&gt;<br>&gt;&gt; 2) Is this an upstream RA? If not, you shouldn&#39;t be using the<br>&gt;&gt; ocf:heartbeat namespace.<br>&gt;<br>

&gt; Ok thank you for the advice. Should I use the pacemaker class instead or<br>&gt; should I define a custom namespace?<br><br>Custom.<br><br>&gt;<br>&gt; --<br>&gt;<br>&gt; Regards,<br>&gt;<br>&gt; Simon Jansen<br>&gt;<br>

&gt;<br>&gt; ---------------------------<br>&gt; Simon Jansen<br>&gt; 64291 Darmstadt<br>&gt;<br><br><br>------------------------------<br><br>Message: 6<br>Date: Thu, 11 Nov 2010 11:47:35 +0100<br>From: Andrew Beekhof &lt;<a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=andrew@beekhof.net" target="_blank">andrew@beekhof.net</a>&gt;<br>

To: The Pacemaker cluster resource manager<br>    &lt;<a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=pacemaker@oss.clusterlabs.org" target="_blank">pacemaker@oss.clusterlabs.org</a>&gt;<br>Subject: Re: [Pacemaker] start error because &quot;not installed&quot; - stop<br>

    fails with &quot;not installed&quot; - stonith<br>Message-ID:<br>    &lt;AANLkTikXwe6wS2F-LtLF3dvKjEt1gvPZ=5BSVNj1eZ2q@mail.gmail.com&gt;<br>

Content-Type: text/plain; charset=ISO-8859-1<br><br>On Sat, Oct 9, 2010 at 12:36 AM, Andreas Kurz &lt;<a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=andreas.kurz@linbit.com" target="_blank">andreas.kurz@linbit.com</a>&gt; wrote:<br>

&gt; Hello,<br>&gt;<br>&gt; if a resource encounters a start error with rc=5 &quot;not installed&quot; the<br>&gt; stop action is not skipped before a restart is tried.<br><br>I&#39;d not expect a stop action at all.  What version?<br>

<br>&gt;<br>&gt;

 Typically in such a situation the stop will also fail with the same<br>&gt; error and the node will be fenced ... even worse there is a good chance<br>&gt; this happens on all remaining nodes e.g. if there is a typo in a parameter.<br>

&gt;<br>&gt; I would expect the cluster to skip the stop action after a &quot;not<br>&gt; installed&quot; start failure followed by a start retry on a different node.<br>&gt;<br>&gt; So ... is this a feature or a bug? ;-)<br>

&gt;<br>&gt; Regards,<br>&gt; Andreas<br>&gt;<br><br><br>------------------------------<br><br>Message: 7<br>Date: Thu, 11 Nov 2010 11:48:59 +0100<br>From: Andrew Beekhof &lt;<a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=andrew@beekhof.net" target="_blank">andrew@beekhof.net</a>&gt;<br>

To: The Pacemaker cluster resource manager<br>    &lt;<a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=pacemaker@oss.clusterlabs.org" target="_blank">pacemaker@oss.clusterlabs.org</a>&gt;<br>Subject: Re: [Pacemaker] [Problem]Number of

 times control of the<br>    fail-count    is late.<br>Message-ID:<br>    &lt;AANLkTinMfWBqmW_jcA8a+ic7zmfb6HMiEfBD1_SuEe=G@mail.gmail.com&gt;<br>

Content-Type: text/plain; charset=ISO-8859-1<br><br>On Wed, Nov 10, 2010 at 5:20 AM,  &lt;<a href="http://cn.mc157.mail.yahoo.com/mc/compose?to=renayama19661014@ybb.ne.jp" target="_blank">renayama19661014@ybb.ne.jp</a>&gt; wrote:<br>

&gt; Hi,<br>&gt;<br>&gt; We built a cluster consisting of two nodes.<br>&gt; migration-threshold is set to 2.<br>&gt;<br>&gt; We observed the following problem with this procedure.<br>&gt;<br>&gt; Step1) Start two nodes and send config5.crm. (The clnDiskd resource is our own original one.)<br>

&gt;<br>&gt; ============<br>&gt; Last updated: Tue Nov &nbsp;9 21:10:49 2010<br>&gt; Stack: Heartbeat<br>&gt; Current

 DC: srv02 (8c93dc22-a27e-409b-8112-4073de622daf) - partition with quorum<br>&gt; Version: 1.0.9-0a40fd0cb9f2fcedef9d1967115c912314c57438<br>&gt; 2 Nodes configured, unknown expected votes<br>&gt; 5 Resources configured.<br>

&gt; ============<br>&gt;<br>&gt; Online: [ srv01 srv02 ]<br>&gt;<br>&gt; &nbsp;vip &nbsp; &nbsp;(ocf::heartbeat:IPaddr2): &nbsp; &nbsp; &nbsp; Started srv01<br>&gt; &nbsp;Clone Set: clnDiskd<br>&gt; &nbsp; &nbsp; Started: [ srv01 srv02 ]<br>&gt; &nbsp;Clone Set: clnDummy2<br>

&gt; &nbsp; &nbsp; Started: [ srv01 srv02 ]<br>&gt; &nbsp;Clone Set: clnPingd1<br>&gt; &nbsp; &nbsp; Started: [ srv01 srv02 ]<br>&gt;<br>&gt; Node Attributes:<br>&gt; * Node srv01:<br>&gt; &nbsp; &nbsp;+ default_ping_set1 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; : 100<br>&gt; &nbsp; &nbsp;+ diskcheck_status_internal &nbsp; &nbsp; &nbsp; &nbsp; : normal<br>

&gt; * Node srv02:<br>&gt; &nbsp; &nbsp;+ default_ping_set1 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; : 100<br>&gt; &nbsp; &nbsp;+ diskcheck_status_internal &nbsp; &nbsp; &nbsp; &nbsp; : normal<br>&gt;<br>&gt; Migration summary:<br>&gt; * Node srv02:<br>&gt; * Node srv01:<br>&gt;<br>&gt;<br>

&gt; Step2) We edit the

 clnDummy2 resource agent so that its start operation times out (add a sleep).<br>&gt;<br>&gt; &nbsp;dummy_start() {<br>&gt; &nbsp; &nbsp;sleep 180 ----&gt; add sleep<br>&gt; &nbsp; &nbsp;dummy_monitor<br>&gt; &nbsp; &nbsp;if [ $? = &nbsp;$OCF_SUCCESS ]; then<br>&gt;<br>&gt;<br>

&gt; Step3) We cause a monitor error in the clnDummy2 resource.<br>&gt;<br>&gt; &nbsp;# rm -rf /var/run/Dummy-Dummy2.state<br>&gt;<br>&gt; Step4) The restart of clnDummy2 times out.<br>&gt;<br>&gt; However, the log shows further start attempts for clnDummy2 after the first time-out.<br>

&gt; The reason is that pengine does not yet know that fail-count has become INFINITY:<br>&gt;<br>&gt; fail-count is not yet INFINITY in pe-input-2001.bz2.<br>&gt; In pe-input-2002.bz2, fail-count is INFINITY.<br>

&gt;<br>&gt; (snip)<br>&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: WARN: status_from_rc: Action 25 (Dummy2:0_start_0) on srv01 failed<br>&gt; (target: 0 vs. rc: -2): Error<br>&gt; Nov &nbsp;9 21:12:35 srv02

 crmd: [5896]: WARN: update_failcount: Updating failcount for Dummy2:0 on srv01<br>&gt; after failed start: rc=-2 (update=INFINITY, time=1289304755)<br>&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: info: abort_transition_graph: match_graph_event:272 - Triggered<br>

&gt; transition abort (complete=0, tag=lrm_rsc_op, id=Dummy2:0_start_0,<br>&gt; magic=2:-2;25:5:0:275da7f9-7f43-43a2-8308-41d0ab78346e, cib=0.9.39) : Event failed<br>&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: info: match_graph_event: Action Dummy2:0_start_0 (25) confirmed on<br>

&gt; srv01 (rc=4)<br>&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 29 fired and confirmed<br>&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: info: run_graph:<br>&gt; ====================================================<br>

&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: notice: run_graph: Transition 5 (Complete=7, Pending=0, Fired=0,<br>&gt; Skipped=1, Incomplete=0, Source=/var/lib/pengine/pe-input-2000.bz2):

 Stopped<br>&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: info: te_graph_trigger: Transition 5 is now complete<br>&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: info: do_state_transition: State transition S_TRANSITION_ENGINE -&gt;<br>

&gt; S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]<br>&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: info: do_state_transition: All 2 cluster nodes are eligible to run<br>&gt; resources.<br>&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: info: do_pe_invoke: Query 72: Requesting the current CIB:<br>

&gt; S_POLICY_ENGINE<br>&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: info: do_pe_invoke_callback: Invoking the PE: query=72,<br>&gt; ref=pe_calc-dc-1289304755-58, seq=2, quorate=1<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: unpack_config: On loss of CCM Quorum: Ignore<br>

&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: info: unpack_config: Node scores: &#39;red&#39; = -INFINITY, &#39;yellow&#39; =<br>&gt; 0, &#39;green&#39; = 0<br>&gt; Nov &nbsp;9

 21:12:35 srv02 pengine: [7208]: WARN: unpack_nodes: Blind faith: not fencing unseen nodes<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: info: determine_online_status: Node srv02 is online<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: info: determine_online_status: Node srv01 is online<br>

&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: WARN: unpack_rsc_op: Processing failed op<br>&gt; Dummy2:0_monitor_15000 on srv01: not running (7)<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: WARN: unpack_rsc_op: Processing failed op Dummy2:0_start_0 on<br>

&gt; srv01: unknown exec error (-2)<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: native_print: Dummy &nbsp; &nbsp; &nbsp;(ocf::pacemaker:Dummy): Started<br>&gt; srv01<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: native_print: vip &nbsp; &nbsp; &nbsp; &nbsp;(ocf::heartbeat:IPaddr2): &nbsp; &nbsp; &nbsp; Started<br>

&gt; srv01<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: clone_print: &nbsp;Clone Set: clnDiskd<br>&gt; Nov &nbsp;9 21:12:35 srv02

 pengine: [7208]: notice: short_print: &nbsp; &nbsp; &nbsp;Started: [ srv01 srv02 ]<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: clone_print: &nbsp;Clone Set: clnDummy2<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: native_print: &nbsp; &nbsp; &nbsp;Dummy2:0 &nbsp; &nbsp; &nbsp;(ocf::pacemaker:Dummy2):<br>

&gt; Started srv01 FAILED<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: short_print: &nbsp; &nbsp; &nbsp;Started: [ srv02 ]<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: clone_print: &nbsp;Clone Set: clnPingd1<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: short_print: &nbsp; &nbsp; &nbsp;Started: [ srv01 srv02 ]<br>

&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: info: get_failcount: clnDummy2 has failed 1 times on srv01<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: common_apply_stickiness: clnDummy2 can fail 1 more<br>&gt; times on srv01 before being forced off<br>

&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: info: get_failcount: clnDummy2 has failed 1 times on srv01<br>&gt; Nov &nbsp;9

 21:12:35 srv02 pengine: [7208]: notice: common_apply_stickiness: clnDummy2 can fail 1 more<br>&gt; times on srv01 before being forced off<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: ERROR: unpack_operation: Specifying on_fail=fence and<br>

&gt; stonith-enabled=false makes no sense<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: RecurringOp: &nbsp;Start recurring monitor (15s) for<br>&gt; Dummy2:0 on srv01<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource Dummy (Started srv01)<br>

&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource vip &nbsp; (Started srv01)<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmDiskd:0 &nbsp; &nbsp;(Started srv01)<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmDiskd:1 &nbsp; &nbsp;(Started srv02)<br>

&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Recover resource Dummy2:0 &nbsp; &nbsp;(Started srv01)<br>&gt;

 Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource Dummy2:1 &nbsp; &nbsp; &nbsp;(Started srv02)<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmPingd1:0 &nbsp; (Started srv01)<br>

&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmPingd1:1 &nbsp; (Started srv02)<br>&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: info: do_state_transition: State transition S_POLICY_ENGINE -&gt;<br>

&gt; S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]<br>&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: info: unpack_graph: Unpacked transition 6: 8 actions in 8 synapses<br>&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: info: do_te_invoke: Processing graph 6<br>

&gt; (ref=pe_calc-dc-1289304755-58) derived from /var/lib/pengine/pe-input-2001.bz2<br>&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 30 fired and confirmed<br>&gt; Nov &nbsp;9 21:12:35 srv02 crmd: [5896]: info:

 te_rsc_command: Initiating action 5: stop Dummy2:0_stop_0 on<br>&gt; srv01<br>&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: info: process_pe_message: Transition 6: PEngine Input stored<br>&gt; in: /var/lib/pengine/pe-input-2001.bz2<br>

&gt; Nov &nbsp;9 21:12:35 srv02 pengine: [7208]: info: process_pe_message: Configuration ERRORs found during PE<br>&gt; processing. &nbsp;Please run &quot;crm_verify -L&quot; to identify issues.<br>&gt; Nov &nbsp;9 21:12:37 srv02 attrd: [5895]: info: attrd_ha_callback: flush message from srv01<br>

&gt; Nov &nbsp;9 21:12:37 srv02 crmd: [5896]: info: match_graph_event: Action Dummy2:0_stop_0 (5) confirmed on<br>&gt; srv01 (rc=0)<br>&gt; Nov &nbsp;9 21:12:37 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 31 fired and confirmed<br>

&gt; Nov &nbsp;9 21:12:37 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 8 fired and confirmed<br>&gt; Nov &nbsp;9 21:12:37 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 28 fired and confirmed<br>&gt; Nov &nbsp;9

 21:12:37 srv02 crmd: [5896]: info: te_rsc_command: Initiating action 24: start Dummy2:0_start_0<br>&gt; on srv01<br>&gt;<br>&gt; &nbsp;-----&gt; Must not carry out this start.<br>&gt;<br>&gt; Nov &nbsp;9 21:12:37 srv02 crmd: [5896]: info: abort_transition_graph: te_update_diff:146 - Triggered<br>

&gt; transition abort (complete=0, tag=transient_attributes, id=519bb7a2-3c31-414a-b6b2-eaef36a09fb7,<br>&gt; magic=NA, cib=0.9.41) : Transient attribute: update<br>&gt; Nov &nbsp;9 21:12:37 srv02 crmd: [5896]: info: update_abort_priority: Abort priority upgraded from 0 to<br>

&gt; 1000000<br>&gt; Nov &nbsp;9 21:12:37 srv02 crmd: [5896]: info: update_abort_priority: Abort action done superceeded by<br>&gt; restart<br>&gt; Nov &nbsp;9 21:12:37 srv02 crmd: [5896]: info: abort_transition_graph: te_update_diff:146 - Triggered<br>

&gt; transition abort (complete=0, tag=transient_attributes, id=519bb7a2-3c31-414a-b6b2-eaef36a09fb7,<br>&gt; magic=NA, cib=0.9.42) : Transient attribute:

 update<br>&gt; (snip)<br>&gt;<br>&gt; The problem seems to be that the fail-count update arrived late.<br>&gt; It also seems to depend on timing.<br>&gt;<br>&gt; Because the fail-count is counted incorrectly, the failover time of the resource is<br>

&gt; affected.<br>&gt;<br>&gt; Has this problem already been discussed?<br><br>Not that I know of.<br><br>&gt; Isn&#39;t the delay in the fail-count update, which goes through attrd, the problem?<br><br>Indeed.<br><br>&gt;<br>&gt; &nbsp;* I attached the log and some pe-files in Bugzilla.<br>

&gt; &nbsp;* <a href="http://developerbugs.linux-foundation.org/show_bug.cgi?id=2520" target="_blank">http://developerbugs.linux-foundation.org/show_bug.cgi?id=2520</a><br><br>Ok, I&#39;ll follow up there.<br><br>&gt;<br>&gt; Best Regards,<br>

&gt; Hideo Yamauchi.<br>&gt;<br><br><br>------------------------------<br><br>End of Pacemaker Digest, Vol 36, Issue 34<br>*****************************************<br>

</div></blockquote></td></tr></tbody></table><br>


       <br>_______________________________________________<br>

Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>

<a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>

<br>

Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a><br>

Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>

Bugs: <a href="http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker" target="_blank">http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker</a><br>

<br></blockquote></div><br>