[Pacemaker] Need help!!! resources fail-over not taking place properly...

Thu Feb 18 11:39:09 UTC 2010

sir,

I have set up a two-node cluster on Ubuntu 9.10. I have added a cluster IP
using ocf:heartbeat:IPaddr2, cloned the LSB script "postgresql-8.4", and also
added a manually created script for Slony database replication.

Everything starts fine, but I am not able to use the OCF resource scripts
properly: fail-over does not take place, or the resource is not even taken
over by the other node. My ha.cf file and CIB configuration are given below.

My ha.cf file

autojoin none
keepalive 2
deadtime 15
warntime 5
initdead 64
udpport 694
bcast eth0
auto_failback off
node node1
node node2
crm respawn
use_logd yes

My cib.xml configuration file in cli format:

node $id="3952b93e-786c-47d4-8c2f-a882e3d3d105" node2 \
    attributes standby="off"
node $id="ac87f697-5b44-4720-a8af-12a6f2295930" node1 \
    attributes standby="off"
primitive pgsql lsb:postgresql-8.4 \
    meta target-role="Started" resource-stickness="inherited" \
    op monitor interval="15s" timeout="25s" on-fail="standby"
primitive slony-fail lsb:slony_failover \
    meta target-role="Started"
primitive vir-ip ocf:heartbeat:IPaddr2 \
    params ip="192.168.10.10" nic="eth0" cidr_netmask="24" broadcast="192.168.10.255" \
    op monitor interval="15s" timeout="25s" on-fail="standby" \
    meta target-role="Started"
clone pgclone pgsql \
    meta notify="true" globally-unique="false" interleave="true" target-role="Started"
colocation ip-with-slony inf: slony-fail vir-ip
order slony-b4-ip inf: vir-ip slony-fail
property $id="cib-bootstrap-options" \
    dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
    cluster-infrastructure="Heartbeat" \
    no-quorum-policy="ignore" \
    stonith-enabled="false" \
    last-lrm-refresh="1266488780"
rsc_defaults $id="rsc-options" \
    resource-stickiness="INFINITY"
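
As a rough sanity check on the configuration above, the meta attribute names can be compared against known Pacemaker names. The following is only a minimal Python sketch: KNOWN_META is a partial, hand-written list (not exhaustive), and CONFIG, KNOWN_META and suspicious_meta are illustrative names, not part of any Pacemaker tool. It flags names that are not recognised and suggests the closest known spelling.

```python
import difflib
import re

# Partial list of Pacemaker resource/clone meta attributes (assumption: not exhaustive).
KNOWN_META = [
    "target-role", "resource-stickiness", "is-managed", "priority",
    "migration-threshold", "failure-timeout", "multiple-active",
    "notify", "globally-unique", "interleave", "clone-max", "clone-node-max",
]

# The two "meta" lines copied from the configuration above.
CONFIG = '''
primitive pgsql lsb:postgresql-8.4 \\
    meta target-role="Started" resource-stickness="inherited" \\
clone pgclone pgsql \\
    meta notify="true" globally-unique="false" interleave="true" target-role="Started"
'''


def suspicious_meta(config):
    """Map each unrecognised meta attribute name to its closest known name (or None)."""
    suspects = {}
    for line in config.splitlines():
        line = line.strip()
        if not line.startswith("meta "):
            continue
        # Attribute names are the tokens directly in front of an "=" sign.
        for name in re.findall(r'([\w-]+)=', line):
            if name not in KNOWN_META:
                close = difflib.get_close_matches(name, KNOWN_META, n=1, cutoff=0.8)
                suspects[name] = close[0] if close else None
    return suspects


suspects = suspicious_meta(CONFIG)
for name, suggestion in suspects.items():
    print(f"unrecognised meta attribute {name!r}, closest known name: {suggestion!r}")
```

On the configuration above this flags "resource-stickness" on the pgsql primitive, which differs from the "resource-stickiness" spelling used under rsc_defaults. Whether that is related to the fail-over problem is not established here.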

The cluster IP (192.168.10.10) is assigned on eth0. The nodes' own eth0
addresses are 192.168.10.129 on one machine and 192.168.10.130 on the other.
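
The addressing described above can be checked quickly with Python's standard ipaddress module. This is only a sketch to confirm that the cluster IP and both node addresses fall in the same /24 network and that the broadcast address matches the one given in the vir-ip resource; the variable names are illustrative.

```python
import ipaddress

# Addresses taken from the configuration and description above.
cluster_ip = ipaddress.ip_address("192.168.10.10")
node_ips = [
    ipaddress.ip_address("192.168.10.129"),
    ipaddress.ip_address("192.168.10.130"),
]

# cidr_netmask="24" from the vir-ip primitive; strict=False allows host bits to be set.
network = ipaddress.ip_network("192.168.10.10/24", strict=False)

same_subnet = all(ip in network for ip in [cluster_ip, *node_ips])
broadcast_ok = network.broadcast_address == ipaddress.ip_address("192.168.10.255")

print("network:", network)
print("all addresses in the same subnet:", same_subnet)
print("broadcast matches the configured value:", broadcast_ok)
```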

When I pull out the eth0 network cable, fail-over does not take place.

This is the log message I get when I pull out the cable:

"Feb 18 16:55:58 node2 NetworkManager: <info>  (eth0): carrier now OFF (device state 1)"

and after a minute or two

log snippet:
-------------------------------------------------------------------
Feb 18 16:57:37 node2 cib: [21940]: info: cib_stats: Processed 3 operations (13333.00us average, 0% utilization) in the last 10min
Feb 18 17:02:53 node2 crmd: [21944]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!
Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Feb 18 17:02:53 node2 crmd: [21944]: WARN: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke: Query 111: Requesting the current CIB: S_POLICY_ENGINE
Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke_callback: Invoking the PE: ref=pe_calc-dc-1266492773-121, seq=2, quorate=1
Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node node2 is online
Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op: slony-fail_monitor_0 on node2 returned 0 (ok) instead of the expected value: 7 (not running)
Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation slony-fail_monitor_0 found resource slony-fail active on node2
Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op: pgsql:0_monitor_0 on node2 returned 0 (ok) instead of the expected value: 7 (not running)
Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation pgsql:0_monitor_0 found resource pgsql:0 active on node2
Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node node1 is online
Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print: vir-ip#011(ocf::heartbeat:IPaddr2):#011Started node2
Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print: slony-fail#011(lsb:slony_failover):#011Started node2
Feb 18 17:02:53 node2 pengine: [21982]: notice: clone_print: Clone Set: pgclone
Feb 18 17:02:53 node2 pengine: [21982]: notice: print_list: #011Started: [ node2 node1 ]
Feb 18 17:02:53 node2 pengine: [21982]: notice: RecurringOp:  Start recurring monitor (15s) for pgsql:1 on node1
Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource vir-ip#011(Started node2)
Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource slony-fail#011(Started node2)
Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource pgsql:0#011(Started node2)
Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource pgsql:1#011(Started node1)
Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Feb 18 17:02:53 node2 crmd: [21944]: info: unpack_graph: Unpacked transition 26: 1 actions in 1 synapses
Feb 18 17:02:53 node2 crmd: [21944]: info: do_te_invoke: Processing graph 26 (ref=pe_calc-dc-1266492773-121) derived from /var/lib/pengine/pe-input-125.bz2
Feb 18 17:02:53 node2 crmd: [21944]: info: te_rsc_command: Initiating action 15: monitor pgsql:1_monitor_15000 on node1
Feb 18 17:02:53 node2 pengine: [21982]: ERROR: write_last_sequence: Cannout open series file /var/lib/pengine/pe-input.last for writing
Feb 18 17:02:53 node2 pengine: [21982]: info: process_pe_message: Transition 26: PEngine Input stored in: /var/lib/pengine/pe-input-125.bz2
Feb 18 17:02:55 node2 crmd: [21944]: info: match_graph_event: Action pgsql:1_monitor_15000 (15) confirmed on node1 (rc=0)
Feb 18 17:02:55 node2 crmd: [21944]: info: run_graph: ====================================================
Feb 18 17:02:55 node2 crmd: [21944]: notice: run_graph: Transition 26 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-125.bz2): Complete
Feb 18 17:02:55 node2 crmd: [21944]: info: te_graph_trigger: Transition 26 is now complete
Feb 18 17:02:55 node2 crmd: [21944]: info: notify_crmd: Transition 26 status: done - <null>
Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: Starting PEngine Recheck Timer
------------------------------------------------------------------------------
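
The policy engine's decision in the log above can be summarised by extracting the LogActions lines. This is a minimal Python sketch, not a Pacemaker tool: LOG holds the four relevant lines copied from the snippet, and ACTION_RE and planned_actions are illustrative names. The pattern allows any whitespace after "resource", so it should also cope with mail-wrapped lines.

```python
import re

# The four LogActions lines from the log snippet above ("#011" is an escaped tab).
LOG = """\
Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource vir-ip#011(Started node2)
Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource slony-fail#011(Started node2)
Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource pgsql:0#011(Started node2)
Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource pgsql:1#011(Started node1)
"""

# Groups: action, resource name, current state, node.
ACTION_RE = re.compile(r"LogActions: (\w+) resource\s+(\S+?)#011\((\w+) (\S+)\)")


def planned_actions(log_text):
    """Return {resource: (action, node)} for every LogActions line found."""
    return {
        m.group(2): (m.group(1), m.group(4))
        for m in ACTION_RE.finditer(log_text)
    }


actions = planned_actions(LOG)
for resource, (action, node) in actions.items():
    print(f"{resource}: {action} on {node}")
```

For these lines every resource is reported as "Leave", and both nodes are still listed as online, which matches the observation that nothing moves after the cable is pulled.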

Also, I am not able to use the pgsql OCF script, so I am using the init
script instead and have cloned it, as I need to run it on both nodes for
Slony database replication.

I am using the Heartbeat and Pacemaker debs from the updated Ubuntu Karmic
repo (Heartbeat 2.99).

Please check my configuration and tell me what I am missing.
-- 
Regards,

Jayakrishnan. L

Visit: www.jayakrishnan.bravehost.com