[Pacemaker] Fwd: Java application failover problem

Martin Gazak martin.gazak at microstep-mis.sk
Fri Jul 5 05:09:54 EDT 2013


Hello,
we are facing a problem with a simple (I hope) cluster configuration: 2
nodes, ims0 and ims1, and 3 primitives (there is no shared storage or
anything similar where data corruption would be a danger):

- ims: a master/slave Java application with an embedded web server (to
be accessed by clients), normally running on both nodes as master/slave
and managed by our own OCF script (a simplified sketch of its monitor
action follows below)

- ims-ip and ims-ip-src: the shared IP address and outgoing source
address, to run only on the node where ims is master
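
For reference, the monitor action of our OCF agent boils down to
something like the sketch below (the pgrep pattern and the master marker
file here are placeholders for the real checks, not the actual
implementation):

#!/bin/sh
# Simplified sketch of the ocf:microstepmis:imsMS monitor action.
# The process pattern and the marker file are placeholders only.
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

ims_monitor() {
    if ! pgrep -f 'ims.*\.jar' >/dev/null 2>&1; then
        return $OCF_NOT_RUNNING       # rc=7, the code seen in the logs below
    fi
    if [ -e /var/run/ims.master ]; then
        return $OCF_RUNNING_MASTER    # rc=8, promoted (master) instance
    fi
    return $OCF_SUCCESS               # rc=0, running as slave
}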

The software versions, crm configuration and relevant portions of the
corosync logs are listed below.

The problem: most of the time the setup works (i.e. if the master ims
application dies, the slave is promoted and the IP addresses are
remapped), but sometimes when the master ims application stops (fails or
is killed) no failover occurs - the slave ims application remains slave
and the shared IP address stays mapped on the node with the dead ims.

To replicate the problem I even created a testbed of 2 servers, killing
the ims application from cron every 15 minutes on the supposed MAIN
server (the cron entry is sketched below) to simulate a failure and
observe the failover - sometimes it works properly for hours or days.
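
The cron entry is nothing special, roughly the following (file name and
process pattern are placeholders for the real ones):

# /etc/cron.d/ims-kill-test (hypothetical file name)
# kill the ims JVM every 15 minutes to force a failure
*/15 * * * * root /usr/bin/pkill -9 -f 'ims.*\.jar'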

For example, today (July 4, 23:45 local time) the ims instance on ims0
was killed but remained Master - no failover of the IP addresses was
performed and ims on ims1 remained Slave:
============
Last updated: Fri Jul  5 02:07:18 2013
Last change: Thu Jul  4 23:33:46 2013
Stack: openais
Current DC: ims0 - partition with quorum
Version: 1.1.7-61a079313275f3e9d0e85671f62c721d32ce3563
2 Nodes configured, 2 expected votes
6 Resources configured.
============

Online: [ ims1 ims0 ]

 Master/Slave Set: ms-ims [ims]
     Masters: [ ims0 ]
     Slaves: [ ims1 ]
 Clone Set: clone-cluster-mon [cluster-mon]
     Started: [ ims0 ims1 ]
 Resource Group: on-ims-master
     ims-ip     (ocf::heartbeat:IPaddr2):       Started ims0
     ims-ip-src (ocf::heartbeat:IPsrcaddr):     Started ims0

Running 'crm node standby' on ims0 did not fix it either: ims0 remained
master (although in standby):

Node ims0: standby
Online: [ ims1 ]

 Master/Slave Set: ms-ims [ims]
     ims:0      (ocf::microstepmis:imsMS):      Slave ims0 FAILED
     Slaves: [ ims1 ]
 Clone Set: clone-cluster-mon [cluster-mon]
     Started: [ ims1 ]
     Stopped: [ cluster-mon:0 ]

Failed actions:
    ims:0_demote_0 (node=ims0, call=3179, rc=7, status=complete): not
running
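
I assume the usual next step would have been to clear the failed demote
record and let the policy engine recompute, something like:

# clear failed actions for ims (cluster-wide, or limited to ims0)
crm resource cleanup ims
crm resource cleanup ims ims0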

Stopping the openais service on ims0 completely did the trick.

Could someone give me a hint as to what to do?
- provide more information (logs, the OCF script - e.g. a report as
sketched below)?
- change something in the configuration?
- change the environment / versions?
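
If more data would help, I can prepare an archive covering the incident,
e.g. (assuming crm_report, or alternatively hb_report, is available on
this SLES 11 build):

# collect CIB, PE inputs and logs around the failure window
crm_report -f "2013-07-04 23:40:00" -t "2013-07-05 00:10:00" /tmp/ims-failover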

Thanks a lot

Martin Gazak


Software versions:
------------------
libpacemaker3-1.1.7-42.1
pacemaker-1.1.7-42.1
corosync-1.4.3-21.1
libcorosync4-1.4.3-21.1
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 2



Configuration:
--------------
node ims0 \
        attributes standby="off"
node ims1 \
        attributes standby="off"
primitive cluster-mon ocf:pacemaker:ClusterMon \
        params htmlfile="/opt/ims/tomcat/webapps/ims/html/crm_status.html" \
        op monitor interval="10"
primitive ims ocf:microstepmis:imsMS \
        op monitor interval="1" role="Master" timeout="20" \
        op monitor interval="2" role="Slave" timeout="20" \
        op start interval="0" timeout="1800s" \
        op stop interval="0" timeout="120s" \
        op promote interval="0" timeout="180s" \
        meta failure-timeout="360s"
primitive ims-ip ocf:heartbeat:IPaddr2 \
        params ip="192.168.141.13" nic="bond1" iflabel="ims" cidr_netmask="24" \
        op monitor interval="15s" \
        meta failure-timeout="60s"
primitive ims-ip-src ocf:heartbeat:IPsrcaddr \
        params ipaddress="192.168.141.13" cidr_netmask="24" \
        op monitor interval="15s" \
        meta failure-timeout="60s"
group on-ims-master ims-ip ims-ip-src
ms ms-ims ims \
        meta master-max="1" master-node-max="1" clone-max="2" \
        clone-node-max="1" notify="true" target-role="Started" \
        migration-threshold="1"
clone clone-cluster-mon cluster-mon
colocation ims_master inf: on-ims-master ms-ims:Master
order ms-ims-before inf: ms-ims:promote on-ims-master:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-61a079313275f3e9d0e85671f62c721d32ce3563" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        no-quorum-policy="ignore" \
        stonith-enabled="false" \
        cluster-recheck-interval="1m" \
        default-resource-stickiness="1000" \
        last-lrm-refresh="1372951736" \
        maintenance-mode="false"
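
For reference, this is how the fail counts can be inspected (one-shot
crm_mon plus the crm shell):

crm_mon -1 -f                          # one-shot status including fail counts
crm resource failcount ims show ims0   # fail count of ims on node ims0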


corosync.log from ims0:
-----------------------
Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM
operation ims:0_monitor_1000 (call=3046, rc=7, cib-update=6229,
confirmed=false) not running
Jul 04 23:45:02 ims0 crmd: [3935]: info: process_graph_event: Detected
action ims:0_monitor_1000 from a different transition: 4024 vs. 4035
Jul 04 23:45:02 ims0 crmd: [3935]: info: abort_transition_graph:
process_graph_event:476 - Triggered transition abort (complete=1,
tag=lrm_rsc_op, id=ims:0_last_failure_0,
magic=0:7;7:4024:8:e3f096a7-4eb5-4810-9310-eb144f595e20, cib=0.717.6) :
Old event
Jul 04 23:45:02 ims0 crmd: [3935]: WARN: update_failcount: Updating
failcount for ims:0 on ims0 after failed monitor: rc=7 (update=value++,
time=1372952702)
Jul 04 23:45:02 ims0 crmd: [3935]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-ims:0 (1)
Jul 04 23:45:02 ims0 pengine: [3933]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Jul 04 23:45:02 ims0 pengine: [3933]: WARN: unpack_rsc_op: Processing
failed op ims:0_last_failure_0 on ims0: not running (7)
Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Recover ims:0
(Master ims0)
Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Restart
ims-ip	(Started ims0)
Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Restart
ims-ip-src	(Started ims0)
Jul 04 23:45:02 ims0 crmd: [3935]: notice: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Jul 04 23:45:02 ims0 crmd: [3935]: info: do_te_invoke: Processing graph
4036 (ref=pe_calc-dc-1372952702-11907) derived from
/var/lib/pengine/pe-input-2819.bz2
Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating
action 51: stop ims-ip-src_stop_0 on ims0 (local)
Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_perform_update: Sent
update 4439: fail-count-ims:0=1
Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-ims:0 (1372952702)
Jul 04 23:45:02 ims0 lrmd: [3931]: info: cancel_op: operation
monitor[3049] on ims-ip-src for client 3935, its parameters:
CRM_meta_name=[monitor] cidr_netmask=[24] crm_feature_set=[3.0.6]
CRM_meta_timeout=[20000] CRM_meta_interval=[15000]
ipaddress=[192.168.141.13]  cancelled
Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_perform_update: Sent
update 4441: last-failure-ims:0=1372952702
Jul 04 23:45:02 ims0 lrmd: [3931]: info: rsc:ims-ip-src stop[3052] (pid
12111)
Jul 04 23:45:02 ims0 crmd: [3935]: info: abort_transition_graph:
te_update_diff:176 - Triggered transition abort (complete=0, tag=nvpair,
id=status-ims0-fail-count-ims.0, name=fail-count-ims:0, value=1,
magic=NA, cib=0.717.7) : Transient attribute: update
Jul 04 23:45:02 ims0 crmd: [3935]: info: abort_transition_graph:
te_update_diff:176 - Triggered transition abort (complete=0, tag=nvpair,
id=status-ims0-last-failure-ims.0, name=last-failure-ims:0,
value=1372952702, magic=NA, cib=0.717.8) : Transient attribute: update
Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM
operation ims-ip-src_monitor_15000 (call=3049, status=1, cib-update=0,
confirmed=true) Cancelled
Jul 04 23:45:02 ims0 pengine: [3933]: notice: process_pe_message:
Transition 4036: PEngine Input stored in: /var/lib/pengine/pe-input-2819.bz2
Jul 04 23:45:02 ims0 lrmd: [3931]: info: operation stop[3052] on
ims-ip-src for client 3935: pid 12111 exited with return code 0
Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM
operation ims-ip-src_stop_0 (call=3052, rc=0, cib-update=6231,
confirmed=true) ok
Jul 04 23:45:02 ims0 crmd: [3935]: notice: run_graph: ==== Transition
4036 (Complete=3, Pending=0, Fired=0, Skipped=32, Incomplete=19,
Source=/var/lib/pengine/pe-input-2819.bz2): Stopped
Jul 04 23:45:02 ims0 crmd: [3935]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=notify_crmd ]
Jul 04 23:45:02 ims0 pengine: [3933]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Jul 04 23:45:02 ims0 pengine: [3933]: notice: get_failcount: Failcount
for ms-ims on ims0 has expired (limit was 360s)
Jul 04 23:45:02 ims0 pengine: [3933]: notice: unpack_rsc_op: Clearing
expired failcount for ims:0 on ims0
Jul 04 23:45:02 ims0 pengine: [3933]: notice: get_failcount: Failcount
for ms-ims on ims0 has expired (limit was 360s)
Jul 04 23:45:02 ims0 pengine: [3933]: notice: unpack_rsc_op: Clearing
expired failcount for ims:0 on ims0
Jul 04 23:45:02 ims0 pengine: [3933]: WARN: unpack_rsc_op: Processing
failed op ims:0_last_failure_0 on ims0: not running (7)
Jul 04 23:45:02 ims0 pengine: [3933]: notice: get_failcount: Failcount
for ms-ims on ims0 has expired (limit was 360s)
Jul 04 23:45:02 ims0 pengine: [3933]: notice: get_failcount: Failcount
for ms-ims on ims0 has expired (limit was 360s)
Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Recover ims:0
(Master ims0)
Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Restart
ims-ip	(Started ims0)
Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Start
ims-ip-src	(ims0)
Jul 04 23:45:02 ims0 crmd: [3935]: notice: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Jul 04 23:45:02 ims0 crmd: [3935]: info: do_te_invoke: Processing graph
4037 (ref=pe_calc-dc-1372952702-11909) derived from
/var/lib/pengine/pe-input-2820.bz2
Jul 04 23:45:02 ims0 crmd: [3935]: info: te_crm_command: Executing
crm-event (3): clear_failcount on ims0
Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating
action 49: stop ims-ip_stop_0 on ims0 (local)
Jul 04 23:45:02 ims0 lrmd: [3931]: info: cancel_op: operation
monitor[3047] on ims-ip for client 3935, its parameters:
cidr_netmask=[24] nic=[bond1] crm_feature_set=[3.0.6]
ip=[192.168.141.13] iflabel=[ims] CRM_meta_name=[monitor]
CRM_meta_timeout=[20000] CRM_meta_interval=[15000]  cancelled
Jul 04 23:45:02 ims0 lrmd: [3931]: info: rsc:ims-ip stop[3053] (pid 12154)
Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM
operation ims-ip_monitor_15000 (call=3047, status=1, cib-update=0,
confirmed=true) Cancelled
Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating
action 72: notify ims:0_pre_notify_demote_0 on ims0 (local)
Jul 04 23:45:02 ims0 lrmd: [3931]: info: rsc:ims:0 notify[3054] (pid 12155)
Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating
action 74: notify ims:1_pre_notify_demote_0 on ims1
Jul 04 23:45:02 ims0 lrmd: [3931]: info: operation notify[3054] on ims:0
for client 3935: pid 12155 exited with return code 0
Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM
operation ims:0_notify_0 (call=3054, rc=0, cib-update=0, confirmed=true) ok
Jul 04 23:45:02 ims0 pengine: [3933]: notice: process_pe_message:
Transition 4037: PEngine Input stored in: /var/lib/pengine/pe-input-2820.bz2
Jul 04 23:45:02 ims0 lrmd: [3931]: info: RA output: (ims-ip:stop:stderr)
2013/07/04_23:45:02 INFO: IP status = ok, IP_CIP=

Jul 04 23:45:02 ims0 lrmd: [3931]: info: operation stop[3053] on ims-ip
for client 3935: pid 12154 exited with return code 0
Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM
operation ims-ip_stop_0 (call=3053, rc=0, cib-update=6233,
confirmed=true) ok
Jul 04 23:45:02 ims0 crmd: [3935]: info: handle_failcount_op: Removing
failcount for ims:0
Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-ims:0 (<null>)
Jul 04 23:45:02 ims0 cib: [3929]: info: cib_process_request: Operation
complete: op cib_delete for section
//node_state[@uname='ims0']//lrm_resource[@id='ims:0']/lrm_rsc_op[@id='ims:0_last_failure_0']
(origin=local/crmd/6234, version=0.717.11): ok (rc=0)
Jul 04 23:45:02 ims0 crmd: [3935]: info: abort_transition_graph:
te_update_diff:321 - Triggered transition abort (complete=0,
tag=lrm_rsc_op, id=ims:0_last_failure_0,
magic=0:7;7:4024:8:e3f096a7-4eb5-4810-9310-eb144f595e20, cib=0.717.11) :
Resource op removal
Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_perform_update: Sent
delete 4443: node=ims0, attr=fail-count-ims:0, id=<n/a>, set=(null),
section=status
Jul 04 23:45:02 ims0 crmd: [3935]: info: abort_transition_graph:
te_update_diff:194 - Triggered transition abort (complete=0,
tag=transient_attributes, id=ims0, magic=NA, cib=0.717.12) : Transient
attribute: removal
Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-ims:0 (<null>)
Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_perform_update: Sent
delete 4445: node=ims0, attr=last-failure-ims:0, id=<n/a>, set=(null),
section=status
Jul 04 23:45:02 ims0 crmd: [3935]: info: abort_transition_graph:
te_update_diff:194 - Triggered transition abort (complete=0,
tag=transient_attributes, id=ims0, magic=NA, cib=0.717.13) : Transient
attribute: removal
Jul 04 23:45:02 ims0 crmd: [3935]: notice: run_graph: ==== Transition
4037 (Complete=7, Pending=0, Fired=0, Skipped=28, Incomplete=19,
Source=/var/lib/pengine/pe-input-2820.bz2): Stopped
Jul 04 23:45:02 ims0 crmd: [3935]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=notify_crmd ]
Jul 04 23:45:02 ims0 pengine: [3933]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Start
ims-ip	(ims0)
Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Start
ims-ip-src	(ims0)
Jul 04 23:45:02 ims0 crmd: [3935]: notice: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Jul 04 23:45:02 ims0 crmd: [3935]: info: do_te_invoke: Processing graph
4038 (ref=pe_calc-dc-1372952702-11915) derived from
/var/lib/pengine/pe-input-2821.bz2
Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating
action 47: start ims-ip_start_0 on ims0 (local)
Jul 04 23:45:02 ims0 lrmd: [3931]: info: rsc:ims-ip start[3055] (pid 12197)
Jul 04 23:45:02 ims0 pengine: [3933]: notice: process_pe_message:
Transition 4038: PEngine Input stored in: /var/lib/pengine/pe-input-2821.bz2
Jul 04 23:45:02 ims0 lrmd: [3931]: info: RA output:
(ims-ip:start:stderr) 2013/07/04_23:45:02 INFO: Adding IPv4 address
192.168.141.13/24 with broadcast address 192.168.141.255 to device bond1
(with label bond1:ims)

Jul 04 23:45:02 ims0 lrmd: [3931]: info: RA output:
(ims-ip:start:stderr) 2013/07/04_23:45:02 INFO: Bringing device bond1 up

Jul 04 23:45:02 ims0 lrmd: [3931]: info: RA output:
(ims-ip:start:stderr) 2013/07/04_23:45:02 INFO:
/usr/lib64/heartbeat/send_arp -i 200 -r 5 -p
/var/run/resource-agents/send_arp-192.168.141.13 bond1 192.168.141.13
auto not_used not_used

Jul 04 23:45:02 ims0 lrmd: [3931]: info: operation start[3055] on ims-ip
for client 3935: pid 12197 exited with return code 0
Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM
operation ims-ip_start_0 (call=3055, rc=0, cib-update=6236,
confirmed=true) ok
Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating
action 48: monitor ims-ip_monitor_15000 on ims0 (local)
Jul 04 23:45:02 ims0 lrmd: [3931]: info: rsc:ims-ip monitor[3056] (pid
12255)
Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating
action 49: start ims-ip-src_start_0 on ims0 (local)
Jul 04 23:45:02 ims0 lrmd: [3931]: info: rsc:ims-ip-src start[3057] (pid
12256)
Jul 04 23:45:02 ims0 lrmd: [3931]: info: operation monitor[3056] on
ims-ip for client 3935: pid 12255 exited with return code 0
Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM
operation ims-ip_monitor_15000 (call=3056, rc=0, cib-update=6237,
confirmed=false) ok
Jul 04 23:45:02 ims0 lrmd: [3931]: info: operation start[3057] on
ims-ip-src for client 3935: pid 12256 exited with return code 0
Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM
operation ims-ip-src_start_0 (call=3057, rc=0, cib-update=6238,
confirmed=true) ok
Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating
action 50: monitor ims-ip-src_monitor_15000 on ims0 (local)
Jul 04 23:45:02 ims0 lrmd: [3931]: info: rsc:ims-ip-src monitor[3058]
(pid 12336)
Jul 04 23:45:02 ims0 lrmd: [3931]: info: operation monitor[3058] on
ims-ip-src for client 3935: pid 12336 exited with return code 0
Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM
operation ims-ip-src_monitor_15000 (call=3058, rc=0, cib-update=6239,
confirmed=false) ok
Jul 04 23:45:02 ims0 crmd: [3935]: notice: run_graph: ==== Transition
4038 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-2821.bz2): Complete
Jul 04 23:45:02 ims0 crmd: [3935]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Jul 04 23:46:02 ims0 crmd: [3935]: info: crm_timer_popped: PEngine
Recheck Timer (I_PE_CALC) just popped (60000ms)
Jul 04 23:46:02 ims0 crmd: [3935]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jul 04 23:46:02 ims0 crmd: [3935]: info: do_state_transition: Progressed
to state S_POLICY_ENGINE after C_TIMER_POPPED
Jul 04 23:46:02 ims0 pengine: [3933]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Jul 04 23:46:02 ims0 crmd: [3935]: notice: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Jul 04 23:46:02 ims0 crmd: [3935]: info: do_te_invoke: Processing graph
4039 (ref=pe_calc-dc-1372952762-11920) derived from
/var/lib/pengine/pe-input-2822.bz2
Jul 04 23:46:02 ims0 crmd: [3935]: notice: run_graph: ==== Transition
4039 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-2822.bz2): Complete
Jul 04 23:46:02 ims0 crmd: [3935]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Jul 04 23:46:02 ims0 pengine: [3933]: notice: process_pe_message:
Transition 4039: PEngine Input stored in: /var/lib/pengine/pe-input-2822.bz2
Jul 04 23:47:02 ims0 crmd: [3935]: info: crm_timer_popped: PEngine
Recheck Timer (I_PE_CALC) just popped (60000ms)
Jul 04 23:47:02 ims0 crmd: [3935]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jul 04 23:47:02 ims0 crmd: [3935]: info: do_state_transition: Progressed
to state S_POLICY_ENGINE after C_TIMER_POPPED
Jul 04 23:47:02 ims0 pengine: [3933]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Jul 04 23:47:02 ims0 pengine: [3933]: notice: process_pe_message:
Transition 4040: PEngine Input stored in: /var/lib/pengine/pe-input-2822.bz2
Jul 04 23:47:02 ims0 crmd: [3935]: notice: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Jul 04 23:47:02 ims0 crmd: [3935]: info: do_te_invoke: Processing graph
4040 (ref=pe_calc-dc-1372952822-11921) derived from
/var/lib/pengine/pe-input-2822.bz2
Jul 04 23:47:02 ims0 crmd: [3935]: notice: run_graph: ==== Transition
4040 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-2822.bz2): Complete
Jul 04 23:47:02 ims0 crmd: [3935]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]

corosync.log from ims1:
-----------------------
Jul 04 23:45:02 ims1 lrmd: [3913]: info: rsc:ims:1 notify[1424] (pid 25381)
Jul 04 23:45:02 ims1 lrmd: [3913]: info: operation notify[1424] on ims:1
for client 3917: pid 25381 exited with return code 0
Jul 04 23:45:02 ims1 crmd: [3917]: info: process_lrm_event: LRM
operation ims:1_notify_0 (call=1424, rc=0, cib-update=0, confirmed=true) ok
Jul 04 23:49:35 ims1 cib: [3911]: info: cib_stats: Processed 324
operations (92.00us average, 0% utilization) in the last 10min
Jul 04 23:59:35 ims1 cib: [3911]: info: cib_stats: Processed 295
operations (67.00us average, 0% utilization) in the last 10min
Jul 05 00:00:03 ims1 crmd: [3917]: info: process_lrm_event: LRM
operation ims:1_monitor_2000 (call=1423, rc=7, cib-update=778,
confirmed=false) not running
Jul 05 00:00:03 ims1 attrd: [3914]: notice: attrd_ais_dispatch: Update
relayed from ims0
Jul 05 00:00:03 ims1 attrd: [3914]: notice: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-ims:1 (1)
Jul 05 00:00:03 ims1 attrd: [3914]: notice: attrd_perform_update: Sent
update 2037: fail-count-ims:1=1
Jul 05 00:00:03 ims1 attrd: [3914]: notice: attrd_ais_dispatch: Update
relayed from ims0



-- 

Regards,

Martin Gazak
MicroStep-MIS, spol. s r.o.
System Development Manager
Tel.: +421 2 602 00 128
Fax: +421 2 602 00 180
martin.gazak at microstep-mis.sk
http://www.microstep-mis.com





