[Pacemaker] Need help!!! resources fail-over not taking place properly...

Jayakrishnan jayakrishnanlll at gmail.com
Thu Feb 18 23:58:18 EST 2010


Yes sir, the Slony resource problem is solved. I added the second resource and
colocated it with my cluster IP, so my first slon script (resource) no longer
influences the functioning of the second one. I have also added a cleanup
command at the end of each Slony resource; I will leave that in for now.
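
For reference, a minimal sketch of what that second resource and its colocation
could look like in the crm shell. The resource name slony-fail2 is taken from
the cleanup command in the script below; the lsb:slony_failover2 script name
and the order constraint are assumptions:

primitive slony-fail2 lsb:slony_failover2 \
    meta target-role="Started"
colocation slony2-with-ip inf: slony-fail2 vir-ip
order ip-before-slony2 inf: vir-ip slony-fail2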

My slony resource script:

___________________________________________

#!/bin/bash
# LSB-style wrapper that promotes the local PostgreSQL node via Slony-I
# failover when Pacemaker starts this resource on it.
logger "$0 called with $1"
HOSTNAME=`uname -n`

NODE1_HOST=192.168.10.129
NODE2_HOST=192.168.10.130
slony_USER=postgres
slony_PASSWORD=hcl123
DATABASE_NAME=test2
CLUSTER_NAME=cluster2
PRIMARY_NAME=node1
PORT=5432
NUM=1


#
# Returns 1 (TRUE) if the local database is the master
#
is_master () {
    export PGPASSWORD=$slony_PASSWORD
    RESULT=`psql $DATABASE_NAME -h 127.0.0.1 --user $slony_USER -q -t <<_EOF_
SELECT count(*) FROM _$CLUSTER_NAME.sl_set WHERE
set_origin=_$CLUSTER_NAME.getlocalnodeid('_$CLUSTER_NAME');
_EOF_`
    return $RESULT
}

case "$1" in
start)
    # Promote the local node unless it is already the Slony-I origin.
    is_master
    IS_MASTER=$?
    if [ $IS_MASTER -eq 1 ]; then
        # Already the master.  Nothing to do here.
        echo "The local database is already the master"
        exit 0
    fi
    if [ "$HOSTNAME" = "$PRIMARY_NAME" ]; then
        OLD_MASTER=2
        OLD_SLAVE=1
    else
        OLD_MASTER=1
        OLD_SLAVE=2
    fi

    if [ $NUM -eq 1 ]; then
        /usr/lib/postgresql/8.4/bin/slonik <<_EOF_
cluster name=$CLUSTER_NAME;
node 1 admin conninfo = 'dbname=$DATABASE_NAME host=$NODE1_HOST user=$slony_USER port=$PORT password=$slony_PASSWORD';
node 2 admin conninfo = 'dbname=$DATABASE_NAME host=$NODE2_HOST user=$slony_USER port=$PORT password=$slony_PASSWORD';
failover (id=$OLD_MASTER, backup node=$OLD_SLAVE);
_EOF_
        sleep 8s
        crm resource cleanup slony-fail2 node2
        echo "Done"
    fi
    ;;
stop)
    # Nothing to stop: the failover is a one-shot action.
    ;;
status)
    # If the local database reports itself as the master, exit 0;
    # otherwise exit 3 (LSB "program is not running").
    is_master
    RESULT=$?
    if [ "$RESULT" -eq 1 ]; then
        echo "Local database is the master"
        exit 0
    else
        echo "Local database is a slave"
        exit 3
    fi
    ;;
*)
    echo "Usage: $0 {start|stop|status}"
    exit 1
    ;;
esac

exit 0

____________________________________________________________
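
Since Pacemaker drives this as an LSB init script, what counts is the exit code
of each action, not what the script prints to stderr. A quick sanity check on
each node might look like this (the install path is an assumption):

# status must exit 0 when the local database is the master and 3 otherwise
/etc/init.d/slony_failover status; echo "status rc=$?"
# start and stop must exit 0 on success
/etc/init.d/slony_failover start;  echo "start rc=$?"
/etc/init.d/slony_failover stop;   echo "stop rc=$?"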

But sir, I am still worried about the cluster IP. I am not getting any error
messages other than:

Feb 19 10:14:43 node2 NetworkManager: <info>  (eth0): carrier now OFF (device state 1)

And when I run "ip addr show" (I am on node2 now):
____________________________________

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN qlen 100
    link/ether 00:07:e9:2e:5c:26 brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.130/24 brd 192.168.10.255 scope global eth0
    inet 192.168.10.10/24 brd 192.168.10.255 scope global secondary eth0
    inet6 fe80::207:e9ff:fe2e:5c26/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:07:e9:2e:5c:27 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.2/24 brd 192.168.1.255 scope global eth1
    inet6 fe80::207:e9ff:fe2e:5c27/64 scope link
       valid_lft forever preferred_lft forever
_____________________________________________
And when I connect the cable back, my ha-log shows:

Feb 19 10:23:29 node2 NetworkManager: <info>  (eth0): carrier now ON (device state 1)

eth0 is my cluster-IP interface (192.168.10.10 is the cluster IP), and eth1 is
my heartbeat interface, connected to node1 via a cross-over cable. Sir, how can
I confirm that Heartbeat/Pacemaker is detecting this as a resource failure?
Only then will fencing take place, no?
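
As far as I know, the IPaddr2 monitor only checks that the address is still
configured on the interface, so a pulled cable by itself does not show up as a
resource failure, and no fencing is triggered unless a node is lost or a stop
action fails. One common way to make the cluster react to lost connectivity is
a ping resource plus a location constraint, roughly like the sketch below
(assuming the ocf:pacemaker:pingd agent ships with this Pacemaker version; the
ping target 192.168.10.1 is only an example):

primitive ping-gw ocf:pacemaker:pingd \
    params host_list="192.168.10.1" multiplier="1000" \
    op monitor interval="15s" timeout="20s"
clone ping-clone ping-gw
location vir-ip-on-connected-node vir-ip \
    rule -inf: not_defined pingd or pingd lte 0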

With lots of thanks and regards,
Jayakrishnan.L

On Fri, Feb 19, 2010 at 1:17 AM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:

> Hi,
>
> On Thu, Feb 18, 2010 at 09:53:13PM +0530, Jayakrishnan wrote:
> > Sir,
> >
> > You got the point. After two weeks of trying, I thought there were
> > mistakes in my configuration. If I enable stonith and manage to
> > shut down the failed machine, most of my problems will be solved. Now I
> > feel much more confident.
>
> Good.
>
> > But sir, I need to clear the resource failures for my slony_failover
> > script, because when the Slony failover takes place it writes a warning
> > message such as
> >
> > Feb 16 14:50:01 node1 lrmd: [2477]: info: RA output:
> > (slony-fail:start:stderr) <stdin>:4: NOTICE:  failedNode: set 1 has no
> > other direct receivers - move now
> >
> > to stderr or stdout, and these warning messages are treated as resource
> > failures by Heartbeat/Pacemaker.
>
> It is treated as a failure because the script exited with an
> error code. Whether that makes sense or not, I don't know; that's
> up to the script. This being an LSB init script, you should
> evaluate it and test it thoroughly: unfortunately such scripts are
> usually not robust enough for use in clusters.
>
> > So if I want to add another script for a
> > second database failover, I am afraid the first script may block the
> > execution of the second. For now I have only one database replication for
> > testing, and the slony-failover script runs last during failovers.
>
> You lost me. I guess that "add another script" means "another
> resource". Well, if the two are independent, then they can't
> influence each other (unless the stop action fails in which case
> the node is fenced).
>
> > And I still don't believe that I am chatting with the person who made the
> > "crm-cli".
>
> Didn't know that the cli is so famous.
>
> Thanks,
>
> Dejan
>
> >
> > ---------- Forwarded message ----------
> > From: Dejan Muhamedagic <dejanmm at fastmail.fm>
> > Date: Thu, Feb 18, 2010 at 9:02 PM
> > Subject: Re: [Pacemaker] Need help!!! resources fail-over not taking place properly...
> > To: pacemaker at oss.clusterlabs.org
> >
> >
> > Hi,
> >
> > On Thu, Feb 18, 2010 at 08:22:26PM +0530, Jayakrishnan wrote:
> > > Hello Dejan,
> > >
> > > First of all, thank you very much for your reply. I found that one of my
> > > nodes had a permission problem: the ownership of the /var/lib/pengine
> > > directory was set to "999:999". I am not sure how, but I have changed it.
> > >
> > > Sir, when I pull out the interface cable, I am getting only this log
> > > message:
> > >
> > > Feb 18 16:55:58 node2 NetworkManager: <info> (eth0): carrier now OFF
> > > (device state 1)
> > >
> > > And the resource IP is not moving anywhere at all. It is still on the
> > > same machine. I can see that the IP is still assigned to the eth0
> > > interface via "ip addr show", even though the interface status is
> > > 'down'. Is this split-brain? If so, how can I clear it?
> >
> > With fencing (stonith). Please read some documentation available
> > here: http://clusterlabs.org/wiki/Documentation
> >
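
For a two-node test setup where the nodes can reach each other over ssh, a
minimal fencing sketch in the crm shell could be something like the following;
the external/ssh plugin is only suitable for testing, and its availability in
this Heartbeat build is an assumption:

primitive st-ssh stonith:external/ssh \
    params hostlist="node1 node2"
clone fencing st-ssh
property stonith-enabled="true"
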
> > > Because of on-fail="standby" in the pgsql part of my CIB, I am able to
> > > fail over to the other node when I manually stop the postgres service on
> > > the active machine. However, even after restarting the postgres service
> > > via "/etc/init.d/postgresql-8.4 start", I have to run
> > > crm resource cleanup <pgclone>
> >
> > Yes, that's necessary.
> >
> > > to make crm_mon (or the cluster) recognize that the service is running.
> > > Until then it is shown as a failed action.
> > >
> > > crm_mon snippet
> > > --------------------------------------------------------------------
> > > Last updated: Thu Feb 18 20:17:28 2010
> > > Stack: Heartbeat
> > > Current DC: node2 (3952b93e-786c-47d4-8c2f-a882e3d3d105) - partition with quorum
> > >
> > > Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> > > 2 Nodes configured, unknown expected votes
> > > 3 Resources configured.
> > > ============
> > >
> > > Node node2 (3952b93e-786c-47d4-8c2f-a882e3d3d105): standby (on-fail)
> > > Online: [ node1 ]
> > >
> > > vir-ip  (ocf::heartbeat:IPaddr2):       Started node1
> > > slony-fail      (lsb:slony_failover):   Started node1
> > > Clone Set: pgclone
> > >         Started: [ node1 ]
> > >         Stopped: [ pgsql:0 ]
> > >
> > > Failed actions:
> > >     pgsql:0_monitor_15000 (node=node2, call=33, rc=7, status=complete): not running
> > >
> >
> --------------------------------------------------------------------------------
> > >
> > > Is there any way to run crm resource cleanup <resource> periodically??
> >
> > Why would you want to do that? Do you expect your resources to
> > fail regularly?
> >
> > > I don't know if there is any mistake in the pgsql OCF script, sir. I have
> > > given all the parameters correctly, but it gives a "syntax error" every
> > > time I use it.
> >
> > Best to report such a case; it's either a configuration problem
> > (did you read its metadata?) or perhaps a bug in the RA.
> >
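
Two quick ways to look into that, assuming the usual agent locations (paths may
differ on Ubuntu):

# show the parameters the pgsql RA expects
crm ra info ocf:heartbeat:pgsql
# exercise the agent outside the cluster; resource parameters
# can be passed with -o name=value
ocf-tester -n pgsql /usr/lib/ocf/resource.d/heartbeat/pgsql
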
> > Thanks,
> >
> > Dejan
> >
> > > I put the same meta attributes on it as for the current LSB resource,
> > > as shown below.
> > >
> > > Please help me out. Should I reinstall the nodes?
> > >
> > >
> > > On Thu, Feb 18, 2010 at 6:50 PM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> > >
> > > > Hi,
> > > >
> > > > On Thu, Feb 18, 2010 at 05:09:09PM +0530, Jayakrishnan wrote:
> > > > > Sir,
> > > > >
> > > > > I have set up a two-node cluster on Ubuntu 9.10. I have added a
> > > > > cluster IP using ocf:heartbeat:IPaddr2, cloned the LSB script
> > > > > "postgresql-8.4", and also added a manually created script for Slony
> > > > > database replication.
> > > > >
> > > > > Now everything works fine, but I am not able to use the OCF resource
> > > > > scripts. I mean, failover is not taking place, or the resource is not
> > > > > even taken over. My ha.cf file and CIB configuration are attached to
> > > > > this mail.
> > > > >
> > > > > My ha.cf file
> > > > >
> > > > > autojoin none
> > > > > keepalive 2
> > > > > deadtime 15
> > > > > warntime 5
> > > > > initdead 64
> > > > > udpport 694
> > > > > bcast eth0
> > > > > auto_failback off
> > > > > node node1
> > > > > node node2
> > > > > crm respawn
> > > > > use_logd yes
> > > > >
> > > > >
> > > > > My cib.xml configuration file in cli format:
> > > > >
> > > > > node $id="3952b93e-786c-47d4-8c2f-a882e3d3d105" node2 \
> > > > >     attributes standby="off"
> > > > > node $id="ac87f697-5b44-4720-a8af-12a6f2295930" node1 \
> > > > >     attributes standby="off"
> > > > > primitive pgsql lsb:postgresql-8.4 \
> > > > >     meta target-role="Started" resource-stickness="inherited" \
> > > > >     op monitor interval="15s" timeout="25s" on-fail="standby"
> > > > > primitive slony-fail lsb:slony_failover \
> > > > >     meta target-role="Started"
> > > > > primitive vir-ip ocf:heartbeat:IPaddr2 \
> > > > >     params ip="192.168.10.10" nic="eth0" cidr_netmask="24" broadcast="192.168.10.255" \
> > > > >     op monitor interval="15s" timeout="25s" on-fail="standby" \
> > > > >     meta target-role="Started"
> > > > > clone pgclone pgsql \
> > > > >     meta notify="true" globally-unique="false" interleave="true" target-role="Started"
> > > > > colocation ip-with-slony inf: slony-fail vir-ip
> > > > > order slony-b4-ip inf: vir-ip slony-fail
> > > > > property $id="cib-bootstrap-options" \
> > > > >     dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
> > > > >     cluster-infrastructure="Heartbeat" \
> > > > >     no-quorum-policy="ignore" \
> > > > >     stonith-enabled="false" \
> > > > >     last-lrm-refresh="1266488780"
> > > > > rsc_defaults $id="rsc-options" \
> > > > >     resource-stickiness="INFINITY"
> > > > >
> > > > >
> > > > >
> > > > > I am assigning the cluster IP (192.168.10.10) on eth0, which has IP
> > > > > 192.168.10.129 on one machine and 192.168.10.130 on the other.
> > > > >
> > > > > When I pull out the eth0 interface cable, failover does not take place.
> > > >
> > > > That's split brain. More than a resource failure. Without
> > > > stonith, you'll have both nodes running all resources.
> > > >
> > > > > This is the log message I am getting when I pull out the cable:
> > > > >
> > > > > "Feb 18 16:55:58 node2 NetworkManager: <info>  (eth0): carrier now OFF
> > > > > (device state 1)"
> > > > >
> > > > > and after a minute or two
> > > > >
> > > > > log snippet:
> > > > > -------------------------------------------------------------------
> > > > > Feb 18 16:57:37 node2 cib: [21940]: info: cib_stats: Processed 3 operations (13333.00us average, 0% utilization) in the last 10min
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: WARN: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke: Query 111: Requesting the current CIB: S_POLICY_ENGINE
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke_callback: Invoking the PE: ref=pe_calc-dc-1266492773-121, seq=2, quorate=1
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_config: On loss of CCM Quorum: Ignore
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node node2 is online
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op: slony-fail_monitor_0 on node2 returned 0 (ok) instead of the expected value: 7 (not running)
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation slony-fail_monitor_0 found resource slony-fail active on node2
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op: pgsql:0_monitor_0 on node2 returned 0 (ok) instead of the expected value: 7 (not running)
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation pgsql:0_monitor_0 found resource pgsql:0 active on node2
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node node1 is online
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print: vir-ip#011(ocf::heartbeat:IPaddr2):#011Started node2
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print: slony-fail#011(lsb:slony_failover):#011Started node2
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: clone_print: Clone Set: pgclone
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: print_list: #011Started: [ node2 node1 ]
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: RecurringOp:  Start recurring monitor (15s) for pgsql:1 on node1
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource vir-ip#011(Started node2)
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource slony-fail#011(Started node2)
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource pgsql:0#011(Started node2)
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource pgsql:1#011(Started node1)
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: unpack_graph: Unpacked transition 26: 1 actions in 1 synapses
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_te_invoke: Processing graph 26 (ref=pe_calc-dc-1266492773-121) derived from /var/lib/pengine/pe-input-125.bz2
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: te_rsc_command: Initiating action 15: monitor pgsql:1_monitor_15000 on node1
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: ERROR: write_last_sequence: Cannout open series file /var/lib/pengine/pe-input.last for writing
> > > >
> > > > This is probably a permission problem. /var/lib/pengine should be
> > > > owned by haclient:hacluster.
> > > >
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: info: process_pe_message: Transition 26: PEngine Input stored in: /var/lib/pengine/pe-input-125.bz2
> > > > > Feb 18 17:02:55 node2 crmd: [21944]: info: match_graph_event: Action pgsql:1_monitor_15000 (15) confirmed on node1 (rc=0)
> > > > > Feb 18 17:02:55 node2 crmd: [21944]: info: run_graph: ====================================================
> > > > > Feb 18 17:02:55 node2 crmd: [21944]: notice: run_graph: Transition 26 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-125.bz2): Complete
> > > > > Feb 18 17:02:55 node2 crmd: [21944]: info: te_graph_trigger: Transition 26 is now complete
> > > > > Feb 18 17:02:55 node2 crmd: [21944]: info: notify_crmd: Transition 26 status: done - <null>
> > > > > Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> > > > > Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: Starting PEngine Recheck Timer
> > > > >
> > > >
> >
> ------------------------------------------------------------------------------
> > > >
> > > > Don't see anything in the logs about the IP address resource.
> > > >
> > > > > Also, I am not able to use the pgsql OCF script, and hence I am using the init
> > > >
> > > > Why is that? Something wrong with pgsql? If so, then it should be
> > > > fixed. It's always much better to use the OCF instead of LSB RA.
> > > >
> > > > Thanks,
> > > >
> > > > Dejan
> > > >
> > > > > script and cloned it, as I need to run it on both nodes for Slony database replication.
> > > > >
> > > > > I am using the heartbeat and pacemaker debs from the updated Ubuntu karmic repo (Heartbeat 2.99).
> > > > >
> > > > > Please check my configuration and tell me what I am missing.
> > > > > --
> > > > > Regards,
> > > > >
> > > > > Jayakrishnan. L
> > > > >
> > > > > Visit: www.jayakrishnan.bravehost.com
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Jayakrishnan. L
> > >
> > > Visit: www.jayakrishnan.bravehost.com
> >
> >
> >
> >
> > --
> > Regards,
> >
> > Jayakrishnan. L
> >
> > Visit: www.jayakrishnan.bravehost.com
>
>



-- 
Regards,

Jayakrishnan. L

Visit: www.jayakrishnan.bravehost.com

