Hello Dejan,<br><br>First of all, thank you very much for your reply. I found that one of my nodes had a permission problem: the ownership of /var/lib/pengine was set to &quot;<span style="background-color: rgb(255, 255, 153);">999:999</span>&quot;. I am not sure how that happened, but I have changed it.<br>
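<br>For reference, a minimal sketch of how the ownership of that directory could be checked on each node. The helper names owner_of and check_owner are hypothetical, the stat syntax assumes GNU coreutils (as on Ubuntu), and USER:GROUP is a placeholder for whatever owner the cluster packages expect:<br>
<pre>
```shell
# Print "user:group" for a path (GNU stat syntax, as on Ubuntu).
owner_of() {
    stat -c '%U:%G' "$1"
}

# Compare the owner of a path with the expected "user:group" string.
# Returns 0 on a match and 1 otherwise.
check_owner() {
    path="$1"
    expected="$2"
    actual="$(owner_of "$path")"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $path is owned by $actual"
    else
        echo "MISMATCH: $path is owned by $actual, expected $expected"
        return 1
    fi
}

# Example call (USER:GROUP is a placeholder, not a real owner):
# check_owner /var/lib/pengine USER:GROUP
```
</pre>
Running the same check on both nodes should show quickly whether only one of them is affected.<br>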


<br>Sir, when I pull out the interface cable, I get only this log message:<br><br><span style="font-family: courier new,monospace; background-color: rgb(255, 255, 153);">Feb 18 16:55:58 node2 NetworkManager: &lt;info&gt; (eth0): carrier now OFF (device state 1)</span><br>


<br>And the resource ip is not moving any where at all. It is still there in the same machine... I acn view that the IP is still assigned to the eth0 interface via &quot;<span style="background-color: rgb(255, 255, 51);"># ip addr show</span>&quot;, even though the interface status is &#39;down.&#39;. Is this the split-brain?? If so how can I clear it??<br>


<br>Because of on-fail=standby in the pgsql part of my CIB, I am able to fail over to the other node when I manually stop the postgres service on the active machine. However, even after restarting the postgres service via &quot;/etc/init.d/postgresql-8.4 start&quot;, I have to run<br>


<span style="background-color: rgb(255, 255, 51);">crm resource cleanup &lt;pgclone&gt;</span> <br>to make crm_mon (that is, the cluster) recognise that the service is running again. Until then it is shown as a failed action.<br><br>crm_mon snippet:<br>


--------------------------------------------------------------------<br><span style="font-family: courier new,monospace;">Last updated: Thu Feb 18 20:17:28 2010</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Stack: Heartbeat</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;"><span style="background-color: rgb(255, 255, 51);">Current DC: node2</span> (3952b93e-786c-47d4-8c2f-a882e3d3d105) - partition with quorum</span><br style="font-family: courier new,monospace;">


<br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">2 Nodes configured, unknown expected votes</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">3 Resources configured.</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">============</span><br style="font-family: courier new,monospace;">


<br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Node node2 (3952b93e-786c-47d4-8c2f-a882e3d3d105): standby (on-fail)</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">Online: [ node1 ]</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">vir-ip  (ocf::heartbeat:IPaddr2):       <span style="background-color: rgb(255, 255, 102);">Started node1</span></span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">slony-fail      (lsb:slony_failover):   <span style="background-color: rgb(255, 255, 51);">Started node1</span></span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Clone Set: pgclone</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">        Started: [ node1 ]</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">        Stopped: [ pgsql:0 ]</span><br style="font-family: courier new,monospace;">


<br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Failed actions:</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">    pgsql:0_monitor_15000 (node=node2, call=33, rc=7, status=complete): not running</span><br>


--------------------------------------------------------------------------------<br><br>Is there any way to run crm resource cleanup &lt;resource&gt; periodically?<br><br>I don&#39;t know whether there is a mistake in the pgsql OCF script, sir. I have given all the parameters correctly, but it gives a &quot;syntax error&quot; every time I use it. I used the same meta attributes as for the current LSB resource, as shown below...<br>


<br>Please help me out. Should I reinstall the nodes?<br><br><br><div class="gmail_quote">On Thu, Feb 18, 2010 at 6:50 PM, Dejan Muhamedagic <span dir="ltr">&lt;<a href="mailto:dejanmm@fastmail.fm">dejanmm@fastmail.fm</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Hi,<br>

<div><div></div><div class="h5"><br>

On Thu, Feb 18, 2010 at 05:09:09PM +0530, Jayakrishnan wrote:<br>

&gt; sir,<br>

&gt;<br>

&gt; I have set up a two-node cluster on Ubuntu 9.10. I have added a cluster-ip<br>

&gt; using ocf:heartbeat:IPaddr2, cloned the lsb script &quot;postgresql-8.4&quot; and also<br>

&gt; added a manually created script for slony database replication.<br>

&gt;<br>

&gt; Now every thing works fine but I am not able to use the ocf resource<br>

&gt; scripts. I mean fail over is not taking place or else even resource is not<br>

&gt; even taking. My <a href="http://ha.cf" target="_blank">ha.cf</a> file and cib configuration is attached with this mail<br>

&gt;<br>

&gt; My <a href="http://ha.cf" target="_blank">ha.cf</a> file<br>

&gt;<br>

&gt; autojoin none<br>

&gt; keepalive 2<br>

&gt; deadtime 15<br>

&gt; warntime 5<br>

&gt; initdead 64<br>

&gt; udpport 694<br>

&gt; bcast eth0<br>

&gt; auto_failback off<br>

&gt; node node1<br>

&gt; node node2<br>

&gt; crm respawn<br>

&gt; use_logd yes<br>

&gt;<br>

&gt;<br>

&gt; My cib.xml configuration file in cli format:<br>

&gt;<br>

&gt; node $id=&quot;3952b93e-786c-47d4-8c2f-a882e3d3d105&quot; node2 \<br>

&gt;     attributes standby=&quot;off&quot;<br>

&gt; node $id=&quot;ac87f697-5b44-4720-a8af-12a6f2295930&quot; node1 \<br>

&gt;     attributes standby=&quot;off&quot;<br>

&gt; primitive pgsql lsb:postgresql-8.4 \<br>

&gt;     meta target-role=&quot;Started&quot; resource-stickness=&quot;inherited&quot; \<br>

&gt;     op monitor interval=&quot;15s&quot; timeout=&quot;25s&quot; on-fail=&quot;standby&quot;<br>

&gt; primitive slony-fail lsb:slony_failover \<br>

&gt;     meta target-role=&quot;Started&quot;<br>

&gt; primitive vir-ip ocf:heartbeat:IPaddr2 \<br>

&gt;     params ip=&quot;192.168.10.10&quot; nic=&quot;eth0&quot; cidr_netmask=&quot;24&quot;<br>

&gt; broadcast=&quot;192.168.10.255&quot; \<br>

&gt;     op monitor interval=&quot;15s&quot; timeout=&quot;25s&quot; on-fail=&quot;standby&quot; \<br>

&gt;     meta target-role=&quot;Started&quot;<br>

&gt; clone pgclone pgsql \<br>

&gt;     meta notify=&quot;true&quot; globally-unique=&quot;false&quot; interleave=&quot;true&quot;<br>

&gt; target-role=&quot;Started&quot;<br>

&gt; colocation ip-with-slony inf: slony-fail vir-ip<br>

&gt; order slony-b4-ip inf: vir-ip slony-fail<br>

&gt; property $id=&quot;cib-bootstrap-options&quot; \<br>

&gt;     dc-version=&quot;1.0.5-3840e6b5a305ccb803d29b468556739e75532d56&quot; \<br>

&gt;     cluster-infrastructure=&quot;Heartbeat&quot; \<br>

&gt;     no-quorum-policy=&quot;ignore&quot; \<br>

&gt;     stonith-enabled=&quot;false&quot; \<br>

&gt;     last-lrm-refresh=&quot;1266488780&quot;<br>

&gt; rsc_defaults $id=&quot;rsc-options&quot; \<br>

&gt;     resource-stickiness=&quot;INFINITY&quot;<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt; I am assigning the cluster-ip (192.168.10.10) in eth0 with ip 192.168.10.129<br>

&gt; in one machine and 192.168.10.130 in another machine.<br>

&gt;<br>

&gt; When I pull out the eth0 interface cable fail-over is not taking place.<br>

<br>

</div></div>That&#39;s split brain. More than a resource failure. Without<br>

stonith, you&#39;ll have both nodes running all resources.<br>

<div><div></div><div class="h5"><br>

&gt; This is the log message i am getting while I pull out the cable:<br>

&gt;<br>

&gt; &quot;Feb 18 16:55:58 node2 NetworkManager: &lt;info&gt;  (eth0): carrier now OFF<br>

&gt; (device state 1)&quot;<br>

&gt;<br>

&gt; and after a minute or two<br>

&gt;<br>

&gt; log snippet:<br>

&gt; -------------------------------------------------------------------<br>

&gt; Feb 18 16:57:37 node2 cib: [21940]: info: cib_stats: Processed 3 operations<br>

&gt; (13333.00us average, 0% utilization) in the last 10min<br>

&gt; Feb 18 17:02:53 node2 crmd: [21944]: info: crm_timer_popped: PEngine Recheck<br>

&gt; Timer (I_PE_CALC) just popped!<br>

&gt; Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State<br>

&gt; transition S_IDLE -&gt; S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED<br>

&gt; origin=crm_timer_popped ]<br>

&gt; Feb 18 17:02:53 node2 crmd: [21944]: WARN: do_state_transition: Progressed<br>

&gt; to state S_POLICY_ENGINE after C_TIMER_POPPED<br>

&gt; Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: All 2<br>

&gt; cluster nodes are eligible to run resources.<br>

&gt; Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke: Query 111:<br>

&gt; Requesting the current CIB: S_POLICY_ENGINE<br>

&gt; Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke_callback: Invoking<br>

&gt; the PE: ref=pe_calc-dc-1266492773-121, seq=2, quorate=1<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_config: On loss of<br>

&gt; CCM Quorum: Ignore<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_config: Node scores:<br>

&gt; &#39;red&#39; = -INFINITY, &#39;yellow&#39; = 0, &#39;green&#39; = 0<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node<br>

&gt; node2 is online<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op:<br>

&gt; slony-fail_monitor_0 on node2 returned 0 (ok) instead of the expected value:<br>

&gt; 7 (not running)<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation<br>

&gt; slony-fail_monitor_0 found resource slony-fail active on node2<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op:<br>

&gt; pgsql:0_monitor_0 on node2 returned 0 (ok) instead of the expected value: 7<br>

&gt; (not running)<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation<br>

&gt; pgsql:0_monitor_0 found resource pgsql:0 active on node2<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node<br>

&gt; node1 is online<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print:<br>

&gt; vir-ip#011(ocf::heartbeat:IPaddr2):#011Started node2<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print:<br>

&gt; slony-fail#011(lsb:slony_failover):#011Started node2<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: notice: clone_print: Clone Set:<br>

&gt; pgclone<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: notice: print_list: #011Started: [<br>

&gt; node2 node1 ]<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: notice: RecurringOp:  Start<br>

&gt; recurring monitor (15s) for pgsql:1 on node1<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource<br>

&gt; vir-ip#011(Started node2)<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource<br>

&gt; slony-fail#011(Started node2)<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource<br>

&gt; pgsql:0#011(Started node2)<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource<br>

&gt; pgsql:1#011(Started node1)<br>

&gt; Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State<br>

&gt; transition S_POLICY_ENGINE -&gt; S_TRANSITION_ENGINE [ input=I_PE_SUCCESS<br>

&gt; cause=C_IPC_MESSAGE origin=handle_response ]<br>

&gt; Feb 18 17:02:53 node2 crmd: [21944]: info: unpack_graph: Unpacked transition<br>

&gt; 26: 1 actions in 1 synapses<br>

&gt; Feb 18 17:02:53 node2 crmd: [21944]: info: do_te_invoke: Processing graph 26<br>

&gt; (ref=pe_calc-dc-1266492773-121) derived from<br>

&gt; /var/lib/pengine/pe-input-125.bz2<br>

&gt; Feb 18 17:02:53 node2 crmd: [21944]: info: te_rsc_command: Initiating action<br>

&gt; 15: monitor pgsql:1_monitor_15000 on node1<br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: ERROR: write_last_sequence: Cannout<br>

&gt; open series file /var/lib/pengine/pe-input.last for writing<br>

<br>

</div></div>This is probably a permission problem. /var/lib/pengine should be<br>

owned by hacluster:haclient.<br>

<div class="im"><br>

&gt; Feb 18 17:02:53 node2 pengine: [21982]: info: process_pe_message: Transition<br>

&gt; 26: PEngine Input stored in: /var/lib/pengine/pe-input-125.bz2<br>

&gt; Feb 18 17:02:55 node2 crmd: [21944]: info: match_graph_event: Action<br>

&gt; pgsql:1_monitor_15000 (15) confirmed on node1 (rc=0)<br>

&gt; Feb 18 17:02:55 node2 crmd: [21944]: info: run_graph:<br>

&gt; ====================================================<br>

&gt; Feb 18 17:02:55 node2 crmd: [21944]: notice: run_graph: Transition 26<br>

&gt; (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0,<br>

&gt; Source=/var/lib/pengine/pe-input-125.bz2): Complete<br>

&gt; Feb 18 17:02:55 node2 crmd: [21944]: info: te_graph_trigger: Transition 26<br>

&gt; is now complete<br>

&gt; Feb 18 17:02:55 node2 crmd: [21944]: info: notify_crmd: Transition 26<br>

&gt; status: done - &lt;null&gt;<br>

&gt; Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: State<br>

&gt; transition S_TRANSITION_ENGINE -&gt; S_IDLE [ input=I_TE_SUCCESS<br>

&gt; cause=C_FSA_INTERNAL origin=notify_crmd ]<br>

&gt; Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: Starting<br>

&gt; PEngine Recheck Timer<br>

&gt; ------------------------------------------------------------------------------<br>

<br>

</div>Don&#39;t see anything in the logs about the IP address resource.<br>

<div class="im"><br>

&gt; Also I am not able to use the pgsql ocf script and hence I am using the init<br>

<br>

</div>Why is that? Something wrong with pgsql? If so, then it should be<br>

fixed. It&#39;s always much better to use the OCF instead of LSB RA.<br>

<br>

Thanks,<br>

<br>

Dejan<br>

<div class="im"><br>

&gt; script and cloned it as  I need to run it on both nodes for slony data base<br>

&gt; replication.<br>

&gt;<br>

&gt; I am using the heartbeat and pacemaker debs from the updated ubuntu karmic<br>

&gt; repo. (Heartbeat 2.99)<br>

&gt;<br>

&gt; Please check my configuration and tell me what I am missing.<br>

&gt; --<br>

&gt; Regards,<br>

&gt;<br>

&gt; Jayakrishnan. L<br>

&gt;<br>

&gt; Visit: <a href="http://www.jayakrishnan.bravehost.com" target="_blank">www.jayakrishnan.bravehost.com</a><br>

<br>

<br>

<br>

<br>

</div>&gt; _______________________________________________<br>

&gt; Pacemaker mailing list<br>

&gt; <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>

&gt; <a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>

<br>

<br>


</blockquote></div><br><br clear="all"><br>-- <br>Regards,<br><br>Jayakrishnan. L<br><br>Visit: <a href="http://www.jayakrishnan.bravehost.com">www.jayakrishnan.bravehost.com</a><br><br>