[Pacemaker] Need help!!! resources fail-over not taking place properly...

Thu Feb 18 14:52:26 UTC 2010

Hello Dejan,

First of all, thank you very much for your reply. I found that one of my
nodes has a permission problem. There the ownership of the /var/lib/pengine
directory was set to "999:999"; I am not sure how. However, I have changed it.
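
For reference, the ownership check I did by hand can be sketched as a small
shell function. This is only a sketch: the default of hacluster:haclient is
my assumption about the usual Heartbeat/Pacemaker user and group, and
check_owner is a name made up for illustration.

```shell
#!/bin/sh
# Sketch: compare a directory's owner:group against an expected value.
# The default "hacluster:haclient" is an assumption about the usual
# Heartbeat/Pacemaker account names; pass a second argument to override it.
check_owner() {
    dir=$1
    expected=${2:-hacluster:haclient}
    # GNU stat: %U = owner name, %G = group name
    actual=$(stat -c '%U:%G' "$dir") || return 2
    if [ "$actual" = "$expected" ]; then
        echo "ok: $dir is owned by $actual"
    else
        echo "mismatch: $dir is owned by $actual, expected $expected"
        return 1
    fi
}
```

Calling "check_owner /var/lib/pengine" on each node should show whether the
two nodes differ; a mismatch could then be corrected as root with chown.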

Sir, when I pull out the interface cable, I get only this log message:

Feb 18 16:55:58 node2 NetworkManager: <info> (eth0): carrier now OFF (device state 1)

And the resource IP is not moving anywhere at all. It is still there on the
same machine. I can see that the IP is still assigned to the eth0
interface via "# ip addr show", even though the interface status is 'down'.
Is this the split brain? If so, how can I clear it?
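
The check I am describing can be sketched roughly as below. It is only an
illustration: ip_is_assigned is a made-up helper name, and it reads the
output of "ip -o addr show" on standard input rather than querying the
interface itself.

```shell
#!/bin/sh
# Sketch: succeed if the given IPv4 address appears in `ip -o addr show`
# style output supplied on stdin.
ip_is_assigned() {
    # $1 = address to look for, e.g. 192.168.10.10
    # -F matches the string literally, so the dots are not regex wildcards;
    # the trailing "/" keeps 192.168.10.10 from matching 192.168.10.100.
    grep -qF "inet $1/"
}
```

For example, "ip -o addr show dev eth0 | ip_is_assigned 192.168.10.10"
succeeds while the address is still configured. On Linux the link state can
usually be read separately from /sys/class/net/eth0/carrier.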

Because of on-fail=standby in the pgsql part of my cib, I am able to fail
over to the other node when I manually stop the postgres service on the
active machine. However, even after restarting the postgres service via
"/etc/init.d/postgresql-8.4 start" I have to run
crm resource cleanup <pgclone>
to make crm_mon or the cluster recognize that the service is running. Until
then it is shown as a failed action.

crm_mon snippet
--------------------------------------------------------------------
Last updated: Thu Feb 18 20:17:28 2010
Stack: Heartbeat
Current DC: node2 (3952b93e-786c-47d4-8c2f-a882e3d3d105) - partition with quorum

Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, unknown expected votes
3 Resources configured.
============

Node node2 (3952b93e-786c-47d4-8c2f-a882e3d3d105): standby (on-fail)
Online: [ node1 ]

vir-ip  (ocf::heartbeat:IPaddr2):       Started node1
slony-fail      (lsb:slony_failover):   Started node1
Clone Set: pgclone
        Started: [ node1 ]
        Stopped: [ pgsql:0 ]

Failed actions:
    pgsql:0_monitor_15000 (node=node2, call=33, rc=7, status=complete): not running
--------------------------------------------------------------------------------
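
The manual recovery I do after such a failure can be sketched as the script
below. It only mirrors the two steps I described (start the init script,
then clean up the resource); recover_resource and the RUN dry-run hook are
names made up for illustration, not part of the crm shell.

```shell
#!/bin/sh
# Sketch: repeat the manual recovery steps for a failed resource.
# RUN is a hypothetical dry-run hook: set RUN=echo to print the commands
# instead of executing them.
RUN=${RUN:-}

recover_resource() {
    init_script=$1   # e.g. /etc/init.d/postgresql-8.4
    resource=$2      # e.g. pgclone
    # Start the service by hand, as described above.
    $RUN "$init_script" start || return 1
    # Clear the failed action so the cluster re-checks the resource.
    $RUN crm resource cleanup "$resource"
}
```

Running it with RUN=echo first shows the commands without touching the
cluster; without RUN it needs root on the node where the service failed.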

Is there any way to run crm resource cleanup <resource> periodically??

I don't know if there is a mistake in the pgsql ocf script, sir. I have given
all parameters correctly, but it gives a "syntax error" every
time I use it. I put the same meta attributes as for the current lsb
resource, as shown below.

Please help me out. Should I reinstall the nodes?

On Thu, Feb 18, 2010 at 6:50 PM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:

> Hi,
>
> On Thu, Feb 18, 2010 at 05:09:09PM +0530, Jayakrishnan wrote:
> > sir,
> >
> > I have set up a two-node cluster on Ubuntu 9.10. I have added a cluster IP
> > using ocf:heartbeat:IPaddr2, cloned the lsb script "postgresql-8.4" and also
> > added a manually created script for slony database replication.
> >
> > Now everything works fine, but I am not able to use the ocf resource
> > scripts. I mean failover is not taking place, or else the resource is not
> > even taken over. My ha.cf file and cib configuration are attached to this
> > mail.
> >
> > My ha.cf file
> >
> > autojoin none
> > keepalive 2
> > deadtime 15
> > warntime 5
> > initdead 64
> > udpport 694
> > bcast eth0
> > auto_failback off
> > node node1
> > node node2
> > crm respawn
> > use_logd yes
> >
> >
> > My cib.xml configuration file in cli format:
> >
> > node $id="3952b93e-786c-47d4-8c2f-a882e3d3d105" node2 \
> >     attributes standby="off"
> > node $id="ac87f697-5b44-4720-a8af-12a6f2295930" node1 \
> >     attributes standby="off"
> > primitive pgsql lsb:postgresql-8.4 \
> >     meta target-role="Started" resource-stickness="inherited" \
> >     op monitor interval="15s" timeout="25s" on-fail="standby"
> > primitive slony-fail lsb:slony_failover \
> >     meta target-role="Started"
> > primitive vir-ip ocf:heartbeat:IPaddr2 \
> >     params ip="192.168.10.10" nic="eth0" cidr_netmask="24" broadcast="192.168.10.255" \
> >     op monitor interval="15s" timeout="25s" on-fail="standby" \
> >     meta target-role="Started"
> > clone pgclone pgsql \
> >     meta notify="true" globally-unique="false" interleave="true" target-role="Started"
> > colocation ip-with-slony inf: slony-fail vir-ip
> > order slony-b4-ip inf: vir-ip slony-fail
> > property $id="cib-bootstrap-options" \
> >     dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
> >     cluster-infrastructure="Heartbeat" \
> >     no-quorum-policy="ignore" \
> >     stonith-enabled="false" \
> >     last-lrm-refresh="1266488780"
> > rsc_defaults $id="rsc-options" \
> >     resource-stickiness="INFINITY"
> >
> >
> >
> > I am assigning the cluster IP (192.168.10.10) on eth0, which has IP
> > 192.168.10.129 on one machine and 192.168.10.130 on the other machine.
> >
> > When I pull out the eth0 interface cable fail-over is not taking place.
>
> That's split brain. More than a resource failure. Without
> stonith, you'll have both nodes running all resources.
>
> > This is the log message i am getting while I pull out the cable:
> >
> > "Feb 18 16:55:58 node2 NetworkManager: <info>  (eth0): carrier now OFF (device state 1)"
> >
> > and after a minute or two
> >
> > log snippet:
> > -------------------------------------------------------------------
> > Feb 18 16:57:37 node2 cib: [21940]: info: cib_stats: Processed 3 operations (13333.00us average, 0% utilization) in the last 10min
> > Feb 18 17:02:53 node2 crmd: [21944]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!
> > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> > Feb 18 17:02:53 node2 crmd: [21944]: WARN: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> > Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke: Query 111: Requesting the current CIB: S_POLICY_ENGINE
> > Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke_callback: Invoking the PE: ref=pe_calc-dc-1266492773-121, seq=2, quorate=1
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_config: On loss of CCM Quorum: Ignore
> > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> > Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node node2 is online
> > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op: slony-fail_monitor_0 on node2 returned 0 (ok) instead of the expected value: 7 (not running)
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation slony-fail_monitor_0 found resource slony-fail active on node2
> > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op: pgsql:0_monitor_0 on node2 returned 0 (ok) instead of the expected value: 7 (not running)
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation pgsql:0_monitor_0 found resource pgsql:0 active on node2
> > Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node node1 is online
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print: vir-ip#011(ocf::heartbeat:IPaddr2):#011Started node2
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print: slony-fail#011(lsb:slony_failover):#011Started node2
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: clone_print: Clone Set: pgclone
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: print_list: #011Started: [ node2 node1 ]
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: RecurringOp:  Start recurring monitor (15s) for pgsql:1 on node1
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource vir-ip#011(Started node2)
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource slony-fail#011(Started node2)
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource pgsql:0#011(Started node2)
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource pgsql:1#011(Started node1)
> > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > Feb 18 17:02:53 node2 crmd: [21944]: info: unpack_graph: Unpacked transition 26: 1 actions in 1 synapses
> > Feb 18 17:02:53 node2 crmd: [21944]: info: do_te_invoke: Processing graph 26 (ref=pe_calc-dc-1266492773-121) derived from /var/lib/pengine/pe-input-125.bz2
> > Feb 18 17:02:53 node2 crmd: [21944]: info: te_rsc_command: Initiating action 15: monitor pgsql:1_monitor_15000 on node1
> > Feb 18 17:02:53 node2 pengine: [21982]: ERROR: write_last_sequence: Cannout open series file /var/lib/pengine/pe-input.last for writing
>
> This is probably a permission problem. /var/lib/pengine should be
> owned by hacluster:haclient.
>
> > Feb 18 17:02:53 node2 pengine: [21982]: info: process_pe_message: Transition 26: PEngine Input stored in: /var/lib/pengine/pe-input-125.bz2
> > Feb 18 17:02:55 node2 crmd: [21944]: info: match_graph_event: Action pgsql:1_monitor_15000 (15) confirmed on node1 (rc=0)
> > Feb 18 17:02:55 node2 crmd: [21944]: info: run_graph: ====================================================
> > Feb 18 17:02:55 node2 crmd: [21944]: notice: run_graph: Transition 26 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-125.bz2): Complete
> > Feb 18 17:02:55 node2 crmd: [21944]: info: te_graph_trigger: Transition 26 is now complete
> > Feb 18 17:02:55 node2 crmd: [21944]: info: notify_crmd: Transition 26 status: done - <null>
> > Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> > Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: Starting PEngine Recheck Timer
> >
> > ------------------------------------------------------------------------------
>
> Don't see anything in the logs about the IP address resource.
>
> > Also I am not able to use the pgsql ocf script and hence I am using the init
>
> Why is that? Something wrong with pgsql? If so, then it should be
> fixed. It's always much better to use the OCF instead of LSB RA.
>
> Thanks,
>
> Dejan
>
> > script and cloned it, as I need to run it on both nodes for slony database
> > replication.
> >
> > I am using the heartbeat and pacemaker debs from the updated Ubuntu karmic
> > repo. (Heartbeat 2.99)
> >
> > Please check my configuration and tell me what I am missing....
> > --
> > Regards,
> >
> > Jayakrishnan. L
> >
> > Visit: www.jayakrishnan.bravehost.com
>
>
>
>
> > _______________________________________________
> > Pacemaker mailing list
> > Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker

-- 
Regards,

Jayakrishnan. L

Visit: www.jayakrishnan.bravehost.com