[Pacemaker] Need help!!! resources fail-over not taking place properly...

Thu Feb 18 19:47:24 UTC 2010

Hi,

On Thu, Feb 18, 2010 at 09:53:13PM +0530, Jayakrishnan wrote:
> Sir,
> 
> You got the point. After two weeks of trying, I thought there were some
> mistakes in my configuration. If I enable stonith and manage to shut down
> the failed machine, most of my problems will be solved. Now I feel much
> more confident.

Good.

> But sir, I need to clear the resource failures for my slony_failover script,
> because when the slony failover takes place it prints a warning message
> stating
> 
> Feb 16 14:50:01 node1 lrmd: [2477]: info: RA output: (slony-fail:start:stderr) <stdin>:4: NOTICE:  failedNode: set 1 has no other direct receivers - move now
> 
> to stderr or stdout, and these warning messages are treated as resource
> failures by heartbeat-pacemaker.

It is treated as a failure because the script exited with an
error code, not because it wrote to stderr. Whether that exit
code makes sense or not, I don't know, it's up to the script.
This being an LSB init script, you should evaluate it and test it
thoroughly: unfortunately, they are usually not robust enough for
use in clusters.
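
A minimal sketch of such a test, assuming a POSIX shell. The
function names check_lsb and expect are made up for illustration.
The expected codes are the ones LSB prescribes for init scripts:
start and stop return 0 even when repeated, and status returns 0
while running and 3 while stopped. Note that it really starts and
stops the service, so don't run it on a live node.

```shell
# expect <wanted-rc> <description> <command...>
# Runs the command, discards its output, and compares the exit code.
expect() {
    want=$1; desc=$2; shift 2
    "$@" >/dev/null 2>&1
    got=$?
    if [ "$got" -eq "$want" ]; then
        echo "ok:   $desc (rc=$got)"
    else
        echo "FAIL: $desc (rc=$got, expected $want)"
        failures=$((failures + 1))
    fi
}

# check_lsb <path-to-init-script>
# Walks the script through start/status/stop and returns 0 only if
# every exit code matched what LSB prescribes.
check_lsb() {
    script=$1
    failures=0
    expect 0 "start"                      "$script" start
    expect 0 "status while running"       "$script" status
    expect 0 "start when already running" "$script" start
    expect 0 "stop"                       "$script" stop
    expect 3 "status while stopped"       "$script" status
    expect 0 "stop when already stopped"  "$script" stop
    [ "$failures" -eq 0 ]
}
```

With the functions loaded, "check_lsb /etc/init.d/slony_failover"
would print one line per step. Output on stderr is deliberately
thrown away: only the exit codes are checked.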

> So if I want to add another script for a second database failover, I am
> afraid the first script may block the execution of the second. For now I
> have only one database replication for testing, and the slony-failover
> script runs last during failovers.

You lost me. I guess that "add another script" means "another
resource". Well, if the two are independent, then they can't
influence each other (unless the stop action fails, in which case
the node is fenced).

> And I still don't believe that I am chatting with the person who made the
> "crm-cli".

Didn't know that the cli is so famous.

Thanks,

Dejan

> 
> ---------- Forwarded message ----------
> From: Dejan Muhamedagic <dejanmm at fastmail.fm>
> Date: Thu, Feb 18, 2010 at 9:02 PM
> Subject: Re: [Pacemaker] Need help!!! resources fail-over not taking place
> properly...
> To: pacemaker at oss.clusterlabs.org
> 
> 
> Hi,
> 
> On Thu, Feb 18, 2010 at 08:22:26PM +0530, Jayakrishnan wrote:
> > Hello Dejan,
> >
> > First of all, thank you very much for your reply. I found that one of my
> > nodes had a permission problem: the ownership of /var/lib/pengine
> > was set to "999:999", I am not sure how. However, I changed it.
> >
> > Sir, when I pull out the interface cable I get only this log message:
> >
> > Feb 18 16:55:58 node2 NetworkManager: <info> (eth0): carrier now OFF (device state 1)
> >
> > And the IP resource is not moving anywhere at all. It is still on the
> > same machine. I can see that the IP is still assigned to the eth0
> > interface via "# ip addr show", even though the interface status is
> > "down". Is this the split brain? If so, how can I clear it?
> 
> With fencing (stonith). Please read some documentation available
> here: http://clusterlabs.org/wiki/Documentation
> 
> > Because of on-fail=standby in the pgsql part of my CIB, I am able to fail
> > over to another node when I manually stop the postgres service on the
> > active machine. However, even after restarting the postgres service via
> > "/etc/init.d/postgresql-8.4 start" I have to run
> > crm resource cleanup <pgclone>
> 
> Yes, that's necessary.
> 
> > to make crm_mon or the cluster recognize that the service is on. Until
> > then it is shown as a failed action.
> >
> > crm_mon snippet
> > --------------------------------------------------------------------
> > Last updated: Thu Feb 18 20:17:28 2010
> > Stack: Heartbeat
> > Current DC: node2 (3952b93e-786c-47d4-8c2f-a882e3d3d105) - partition with quorum
> >
> > Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> > 2 Nodes configured, unknown expected votes
> > 3 Resources configured.
> > ============
> >
> > Node node2 (3952b93e-786c-47d4-8c2f-a882e3d3d105): standby (on-fail)
> > Online: [ node1 ]
> >
> > vir-ip  (ocf::heartbeat:IPaddr2):       Started node1
> > slony-fail      (lsb:slony_failover):   Started node1
> > Clone Set: pgclone
> >         Started: [ node1 ]
> >         Stopped: [ pgsql:0 ]
> >
> > Failed actions:
> >     pgsql:0_monitor_15000 (node=node2, call=33, rc=7, status=complete): not running
> >
> > --------------------------------------------------------------------------------
> >
> > Is there any way to run crm resource cleanup <resource> periodically??
> 
> Why would you want to do that? Do you expect your resources to
> fail regularly?
> 
> > I don't know if there is a mistake in the pgsql OCF script, sir. I have
> > given all the parameters correctly, but it gives a "syntax error" every
> > time I use it.
> 
> Best to report such a case: it's either a configuration problem
> (did you read its metadata?) or perhaps a bug in the RA.
> 
> Thanks,
> 
> Dejan
> 
> > I put the same meta attributes as for the current lsb
> > as shown below...
> >
> > Please help me out... should I reinstall the nodes again??
> >
> >
> > On Thu, Feb 18, 2010 at 6:50 PM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> >
> > > Hi,
> > >
> > > On Thu, Feb 18, 2010 at 05:09:09PM +0530, Jayakrishnan wrote:
> > > > sir,
> > > >
> > > > I have set up a two-node cluster on Ubuntu 9.10. I have added a cluster IP
> > > > using ocf:heartbeat:IPaddr2, cloned the LSB script "postgresql-8.4", and
> > > > also added a manually created script for slony database replication.
> > > >
> > > > Now everything works fine, but I am not able to use the OCF resource
> > > > scripts. I mean that failover is not taking place, or else the resource
> > > > is not even being taken over. My ha.cf file and CIB configuration are
> > > > attached to this mail.
> > > >
> > > > My ha.cf file
> > > >
> > > > autojoin none
> > > > keepalive 2
> > > > deadtime 15
> > > > warntime 5
> > > > initdead 64
> > > > udpport 694
> > > > bcast eth0
> > > > auto_failback off
> > > > node node1
> > > > node node2
> > > > crm respawn
> > > > use_logd yes
> > > >
> > > >
> > > > My cib.xml configuration file in cli format:
> > > >
> > > > node $id="3952b93e-786c-47d4-8c2f-a882e3d3d105" node2 \
> > > >     attributes standby="off"
> > > > node $id="ac87f697-5b44-4720-a8af-12a6f2295930" node1 \
> > > >     attributes standby="off"
> > > > primitive pgsql lsb:postgresql-8.4 \
> > > >     meta target-role="Started" resource-stickness="inherited" \
> > > >     op monitor interval="15s" timeout="25s" on-fail="standby"
> > > > primitive slony-fail lsb:slony_failover \
> > > >     meta target-role="Started"
> > > > primitive vir-ip ocf:heartbeat:IPaddr2 \
> > > >     params ip="192.168.10.10" nic="eth0" cidr_netmask="24" broadcast="192.168.10.255" \
> > > >     op monitor interval="15s" timeout="25s" on-fail="standby" \
> > > >     meta target-role="Started"
> > > > clone pgclone pgsql \
> > > >     meta notify="true" globally-unique="false" interleave="true" target-role="Started"
> > > > colocation ip-with-slony inf: slony-fail vir-ip
> > > > order slony-b4-ip inf: vir-ip slony-fail
> > > > property $id="cib-bootstrap-options" \
> > > >     dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
> > > >     cluster-infrastructure="Heartbeat" \
> > > >     no-quorum-policy="ignore" \
> > > >     stonith-enabled="false" \
> > > >     last-lrm-refresh="1266488780"
> > > > rsc_defaults $id="rsc-options" \
> > > >     resource-stickiness="INFINITY"
> > > >
> > > >
> > > >
> > > > I am assigning the cluster IP (192.168.10.10) to eth0, which has IP
> > > > 192.168.10.129 on one machine and 192.168.10.130 on the other.
> > > >
> > > > When I pull out the eth0 interface cable, failover is not taking place.
> > >
> > > That's split brain. More than a resource failure. Without
> > > stonith, you'll have both nodes running all resources.
> > >
> > > > This is the log message I get when I pull out the cable:
> > > >
> > > > "Feb 18 16:55:58 node2 NetworkManager: <info>  (eth0): carrier now OFF (device state 1)"
> > > >
> > > > and after a minute or two
> > > >
> > > > log snippet:
> > > > -------------------------------------------------------------------
> > > > Feb 18 16:57:37 node2 cib: [21940]: info: cib_stats: Processed 3 operations (13333.00us average, 0% utilization) in the last 10min
> > > > Feb 18 17:02:53 node2 crmd: [21944]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!
> > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> > > > Feb 18 17:02:53 node2 crmd: [21944]: WARN: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke: Query 111: Requesting the current CIB: S_POLICY_ENGINE
> > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke_callback: Invoking the PE: ref=pe_calc-dc-1266492773-121, seq=2, quorate=1
> > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_config: On loss of CCM Quorum: Ignore
> > > > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> > > > Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node node2 is online
> > > > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op: slony-fail_monitor_0 on node2 returned 0 (ok) instead of the expected value: 7 (not running)
> > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation slony-fail_monitor_0 found resource slony-fail active on node2
> > > > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op: pgsql:0_monitor_0 on node2 returned 0 (ok) instead of the expected value: 7 (not running)
> > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation pgsql:0_monitor_0 found resource pgsql:0 active on node2
> > > > Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node node1 is online
> > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print: vir-ip#011(ocf::heartbeat:IPaddr2):#011Started node2
> > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print: slony-fail#011(lsb:slony_failover):#011Started node2
> > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: clone_print: Clone Set: pgclone
> > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: print_list: #011Started: [ node2 node1 ]
> > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: RecurringOp:  Start recurring monitor (15s) for pgsql:1 on node1
> > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource vir-ip#011(Started node2)
> > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource slony-fail#011(Started node2)
> > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource pgsql:0#011(Started node2)
> > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource pgsql:1#011(Started node1)
> > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > > > Feb 18 17:02:53 node2 crmd: [21944]: info: unpack_graph: Unpacked transition 26: 1 actions in 1 synapses
> > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_te_invoke: Processing graph 26 (ref=pe_calc-dc-1266492773-121) derived from /var/lib/pengine/pe-input-125.bz2
> > > > Feb 18 17:02:53 node2 crmd: [21944]: info: te_rsc_command: Initiating action 15: monitor pgsql:1_monitor_15000 on node1
> > > > Feb 18 17:02:53 node2 pengine: [21982]: ERROR: write_last_sequence: Cannout open series file /var/lib/pengine/pe-input.last for writing
> > >
> > > This is probably a permission problem. /var/lib/pengine should be
> > > owned by hacluster:haclient.
> > >
> > > > Feb 18 17:02:53 node2 pengine: [21982]: info: process_pe_message: Transition 26: PEngine Input stored in: /var/lib/pengine/pe-input-125.bz2
> > > > Feb 18 17:02:55 node2 crmd: [21944]: info: match_graph_event: Action pgsql:1_monitor_15000 (15) confirmed on node1 (rc=0)
> > > > Feb 18 17:02:55 node2 crmd: [21944]: info: run_graph: ====================================================
> > > > Feb 18 17:02:55 node2 crmd: [21944]: notice: run_graph: Transition 26 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-125.bz2): Complete
> > > > Feb 18 17:02:55 node2 crmd: [21944]: info: te_graph_trigger: Transition 26 is now complete
> > > > Feb 18 17:02:55 node2 crmd: [21944]: info: notify_crmd: Transition 26 status: done - <null>
> > > > Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> > > > Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: Starting PEngine Recheck Timer
> > > >
> > > > ------------------------------------------------------------------------------
> > >
> > > Don't see anything in the logs about the IP address resource.
> > >
> > > > Also I am not able to use the pgsql OCF script and hence I am using the init
> > >
> > > Why is that? Something wrong with pgsql? If so, then it should be
> > > fixed. It's always much better to use the OCF instead of LSB RA.
> > >
> > > Thanks,
> > >
> > > Dejan
> > >
> > > > script and cloned it as I need to run it on both nodes for slony database
> > > > replication.
> > > >
> > > > I am using the heartbeat and pacemaker debs from the updated Ubuntu karmic
> > > > repo (Heartbeat 2.99).
> > > >
> > > > Please check my configuration and tell me what I am missing.
> > > > --
> > > > Regards,
> > > >
> > > > Jayakrishnan. L
> > > >
> > > > Visit: www.jayakrishnan.bravehost.com
> > >
> > >
> > >
> > >
> > > > _______________________________________________
> > > > Pacemaker mailing list
> > > > Pacemaker at oss.clusterlabs.org
> > > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker