[ClusterLabs] BUG in crmd/ccm membership handling when crmd registers with cluster a bit late
Shyam
shyam.kaushik at gmail.com
Fri Nov 27 05:14:02 UTC 2015
Hi Folks,
I am using pacemaker-1.1.10 with heartbeat-3.0.5 on Ubuntu Trusty and hit
a bug in CRMD/CCM. The bug exists in mainline pacemaker as well.
The bug can be easily exposed in a 2-node cluster by adding a sleep(5)
within do_ha_control() just before it calls crm_cluster_connect(cluster),
and then rebooting one of the nodes. That node will remain in the pending
state and never join the cluster.
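To show exactly where the delay goes, here is a sketch (the function body is
elided; only the injected sleep() and the existing crm_cluster_connect() call
are from the report, everything else is illustrative):

```c
/* crmd: do_ha_control() -- sketch, body elided.  The sleep() is the
 * injected delay; it widens the window in which cib/ccm are already
 * signed on to heartbeat while crmd is not. */
static void
do_ha_control(/* ... FSA arguments ... */)
{
    /* ... */
    sleep(5);   /* injected: delay crmd's cluster sign-on */
    registered = crm_cluster_connect(cluster);
    /* ... */
}
```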
The bug is this: cib/ccm connect to the heartbeat cluster early, and crmd
connects to cib/ccm on this node. Because of this, the peer gets membership
events and considers the crmd on the incoming node online, even though that
crmd is not yet connected to the heartbeat cluster. So when the node that
remains DC sends join-offers, they are lost and the election doesn't
progress. Later, when the incoming node's crmd actually connects to the
heartbeat cluster, it generates node events that are wrongly treated by the
DC's crmd as offline events, and we remove the incoming node from the
membership. After this point, all messages from the peer are completely
ignored.
Can you please suggest the best way to fix this problem? Would having crmd
connect to heartbeat first, and only then to cib/ccm, help?
Here are detailed logs of the problem.
node1 comes up and starts heartbeat:
Nov 26 19:07:08 node1 heartbeat: [2937]: info: Configuration validated.
Starting heartbeat 3.0.5
As part of this, the running node2 (the DC) gets a link-up event and also a
CRM event indicating that the peer's heartbeat is online:
Nov 26 19:07:08 node2 heartbeat: [3393]: info: Link node1:eth0 up.
Nov 26 19:07:08 node2 heartbeat: [3393]: info: Status update for node
node1: status init
Nov 26 19:07:08 node2 crmd[3523]: notice: crmd_ha_status_callback: Status
update: Node node1 now has status [init]
Nov 26 19:07:08 node2 crmd[3523]: info: crm_update_peer_proc:
crmd_ha_status_callback: Node node1[0] - heartbeat is now online
Nov 26 19:07:08 node2 crmd[3523]: crit: peer_update_callback: Client
node1/peer now has status [offline] (DC=true)
Nov 26 19:07:08 node2 crmd[3523]: crit: peer_update_callback: No change
1 1000001 4000200
Nov 26 19:07:08 node2 crmd[3523]: notice: crmd_ha_status_callback: Status
update: Node node1 now has status [up]
Meanwhile, crmd and cib on node1 start up, and crmd connects to the cib:
Nov 26 19:07:09 node1 crmd[2969]: notice: main: CRM Git Version: 42f2063
Nov 26 19:07:09 node1 crmd[2969]: debug: crmd_init: Starting crmd
Nov 26 19:07:09 node1 cib[2965]: info: crm_get_peer: Node 0 has uuid
00000432-0432-0000-2b91-000000000000
Nov 26 19:07:09 node1 cib[2965]: info: register_heartbeat_conn:
Hostname: node1
Nov 26 19:07:09 node1 cib[2965]: info: register_heartbeat_conn: UUID:
00000432-0432-0000-2b91-000000000000
Nov 26 19:07:09 node1 cib[2965]: info: ccm_connect: Registering with
CCM...
Nov 26 19:07:09 node1 cib[2965]: debug: ccm_connect: CCM Activation
passed... all set to go!
Nov 26 19:07:09 node1 cib[2965]: info: cib_init: Requesting the list of
configured nodes
Nov 26 19:07:10 node1 crmd[2969]: debug: cib_native_signon_raw:
Connection to CIB successful
Nov 26 19:07:10 node1 cib[2965]: info: cib_process_request: Completed
cib_query operation for section 'all': OK (rc=0, origin=local/crmd/2,
version=1.3.0)
Nov 26 19:07:10 node1 cib[2965]: info: crm_client_new: Connecting
0x7fcbfab10480 for uid=0 gid=0 pid=2967
id=276f6f8e-14cc-426e-bf63-b79674f1bfaa
Nov 26 19:07:10 node1 cib[2965]: debug: handle_new_connection: IPC
credentials authenticated (2965-2967-12)
Nov 26 19:07:10 node1 cib[2965]: debug: qb_ipcs_shm_connect: connecting
to client [2967]
Nov 26 19:07:10 node1 cib[2965]: debug: qb_rb_open_2: shm size:524301;
real_size:528384; rb->word_size:132096
The cib on the DC is notified and marks the peer's cib online:
Nov 26 19:07:09 node2 cib[3519]: info: cib_client_status_callback:
Status update: Client node1/cib now has status [join]
Nov 26 19:07:09 node2 crmd[3523]: notice: crmd_ha_status_callback: Status
update: Node node1 now has status [active]
Nov 26 19:07:09 node2 crmd[3523]: notice: crmd_client_status_callback:
Status update: Client node1/crmd now has status [join] (DC=true)
Nov 26 19:07:09 node2 cib[3519]: info: crm_get_peer: Created entry
5886699f-30f8-4b7d-9d38-4e661d99b2d9/0x7fa985e33b70 for node node1/0 (2
total)
Nov 26 19:07:09 node2 crmd[3523]: notice: crmd_client_status_callback:
Status update: Client node1/crmd now has status [leave] (DC=true)
Nov 26 19:07:10 node2 cib[3519]: info: crm_get_peer: Node 0 has uuid
00000432-0432-0000-2b91-000000000000
Nov 26 19:07:10 node2 cib[3519]: info: crm_update_peer_proc:
cib_client_status_callback: Node node1[0] - cib is now online
Nov 26 19:07:10 node2 cib[3519]: info: cib_client_status_callback:
Status update: Client node1/cib now has status [leave]
Nov 26 19:07:10 node2 cib[3519]: info: crm_update_peer_proc:
cib_client_status_callback: Node node1[0] - cib is now offline
Nov 26 19:07:10 node2 cib[3519]: info: cib_client_status_callback:
Status update: Client node1/cib now has status [join]
Nov 26 19:07:10 node2 cib[3519]: info: crm_update_peer_proc:
cib_client_status_callback: Node node1[0] - cib is now online
Nov 26 19:07:10 node2 cib[3519]: info: cib_process_request: Completed
cib_modify operation for section status: OK (rc=0, origin=local/crmd/58,
version=1.3.31)
Nov 26 19:07:10 node2 cib[3519]: info: cib_process_request: Completed
cib_modify operation for section status: OK (rc=0, origin=local/crmd/59,
version=1.3.31)
Nov 26 19:07:10 node2 cib[3519]: info: cib_process_request: Completed
cib_modify operation for section status: OK (rc=0, origin=local/crmd/60,
version=1.3.31)
As part of CCM processing, the DC adds the peer node to the membership and
marks its crmd online:
Nov 26 19:07:12 node2 ccm: [3518]: debug: quorum plugin: majority
Nov 26 19:07:12 node2 ccm: [3518]: debug: cluster:linux-ha, member_count=2,
member_quorum_votes=200
Nov 26 19:07:12 node2 ccm: [3518]: debug: total_node_count=2,
total_quorum_votes=200
Nov 26 19:07:12 node2 crmd[3523]: info: crmd_ccm_msg_callback: Quorum
(re)attained after event=NEW MEMBERSHIP (id=2)
Nov 26 19:07:12 node2 cib[3519]: info: cib_ccm_msg_callback: Processing
CCM event=NEW MEMBERSHIP (id=2)
Nov 26 19:07:12 node2 crmd[3523]: info: ccm_event_detail: NEW
MEMBERSHIP: trans=2, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, old_idx=4
Nov 26 19:07:12 node2 crmd[3523]: info: ccm_event_detail: #011CURRENT:
node2 [nodeid=1, born=1]
Nov 26 19:07:12 node2 cib[3519]: notice: crm_update_peer_state:
crm_update_ccm_node: Node node1[0] - state is now member (was (null))
Nov 26 19:07:12 node2 crmd[3523]: info: ccm_event_detail: #011CURRENT:
node1 [nodeid=0, born=2]
Nov 26 19:07:12 node2 crmd[3523]: info: ccm_event_detail: #011NEW:
node1 [nodeid=0, born=2]
Nov 26 19:07:12 node2 crmd[3523]: notice: crm_update_peer_state:
crm_update_ccm_node: Node node1[0] - state is now member (was lost)
Nov 26 19:07:12 node2 cib[3519]: info: crm_update_peer_proc:
crm_update_ccm_node: Node node1[0] - heartbeat is now online
Nov 26 19:07:12 node2 cib[3519]: info: crm_update_peer_proc:
crm_update_ccm_node: Node node1[0] - crmd is now online
Nov 26 19:07:12 node2 crmd[3523]: crit: peer_update_callback: node1 is
now member (was lost)
Nov 26 19:07:12 node2 crmd[3523]: crit: peer_update_callback: Alive=0,
appear=1, down=(nil)
Nov 26 19:07:12 node2 crmd[3523]: crit: peer_update_callback: Other
(nil)
Nov 26 19:07:12 node2 crmd[3523]: info: crm_update_peer_proc:
crm_update_ccm_node: Node node1[0] - crmd is now online
Nov 26 19:07:12 node2 cib[3519]: info: cib_process_request: Completed
cib_modify operation for section status: OK (rc=0, origin=local/crmd/61,
version=1.3.31)
Nov 26 19:07:12 node2 crmd[3523]: crit: peer_update_callback: Client
node1/peer now has status [online] (DC=true)
Nov 26 19:07:12 node2 crmd[3523]: crit: peer_update_callback: Alive=1,
appear=1, down=(nil)
Nov 26 19:07:12 node2 crmd[3523]: crit: peer_update_callback: Other
(nil)
This subsequently triggers join processing, but since the peer node's crmd
has not yet connected to the heartbeat cluster, it cannot receive the
join-offers being sent by the DC:
Nov 26 19:07:12 node2 crmd[3523]: debug: post_cache_update: Updated
cache after membership event 2.
Nov 26 19:07:12 node2 crmd[3523]: debug: post_cache_update:
post_cache_update added action A_ELECTION_CHECK to the FSA
Nov 26 19:07:12 node2 crmd[3523]: crit: s_crmd_fsa: Processing
I_NODE_JOIN: [ state=S_IDLE cause=C_FSA_INTERNAL
origin=peer_update_callback ]
Nov 26 19:07:12 node2 crmd[3523]: notice: do_state_transition: State
transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
origin=peer_update_callback ]
Nov 26 19:07:12 node2 crmd[3523]: debug: crm_timer_start: Started
Integration Timer (I_INTEGRATED:180000ms), src=95
Nov 26 19:07:12 node2 crmd[3523]: debug: do_election_check: Ignore
election check: we not in an election
Nov 26 19:07:12 node2 crmd[3523]: crit: do_dc_join_offer_one: An
unknown node joined - (re-)offer to any unconfirmed nodes
Nov 26 19:07:12 node2 crmd[3523]: crit: join_make_offer: about to make
an offer to node2
Nov 26 19:07:12 node2 crmd[3523]: crit: join_make_offer: Making join
offers based on membership 2
Nov 26 19:07:12 node2 crmd[3523]: crit: join_make_offer: Skipping
node2: already known 4
Nov 26 19:07:12 node2 crmd[3523]: crit: join_make_offer: about to make
an offer to node1
Nov 26 19:07:12 node2 crmd[3523]: crit: join_make_offer: join-1:
Sending offer to node1
Nov 26 19:07:12 node2 crmd[3523]: info: crm_update_peer_join:
join_make_offer: Node node1[0] - join-1 phase 0 -> 1
Nov 26 19:07:12 node2 crmd[3523]: debug: check_join_state: Invoked by
do_dc_join_offer_one in state: S_INTEGRATION
Nov 26 19:07:12 node2 crmd[3523]: debug: do_te_invoke: Halting the
transition: inactive
Nov 26 19:07:12 node2 crmd[3523]: info: abort_transition_graph:
do_te_invoke:158 - Triggered transition abort (complete=1) : Peer Halt
Nov 26 19:07:12 node2 crmd[3523]: crit: s_crmd_fsa: Processing
I_PE_CALC: [ state=S_INTEGRATION cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Later, crmd on the incoming node connects to heartbeat:
Nov 26 19:07:15 node1 crmd[2969]: notice: crm_cluster_connect: Connecting
to cluster infrastructure: heartbeat
Nov 26 19:07:15 node1 crmd[2969]: debug: register_heartbeat_conn:
Signing in with Heartbeat
Nov 26 19:07:15 node1 crmd[2969]: info: crm_get_peer: Created entry
75b26770-ac15-461c-889d-190e6c8139ac/0x7f2fd194f870 for node node1/0 (1
total)
Nov 26 19:07:15 node1 crmd[2969]: crit: peer_update_callback: node1 is
now (null)
Nov 26 19:07:15 node1 crmd[2969]: info: crm_get_peer: Node 0 has uuid
00000432-0432-0000-2b91-000000000000
Nov 26 19:07:15 node1 crmd[2969]: info: register_heartbeat_conn:
Hostname: node1
Nov 26 19:07:15 node1 crmd[2969]: info: register_heartbeat_conn: UUID:
00000432-0432-0000-2b91-000000000000
At this point heartbeat delivers a notification to the DC's crmd, which
wrongly marks the incoming node offline and removes it from the membership:
Nov 26 19:07:15 node2 crmd[3523]: notice: crmd_client_status_callback:
Status update: Client node1/crmd now has status [join] (DC=true)
Nov 26 19:07:15 node2 crmd[3523]: info: crm_update_peer_proc:
crmd_client_status_callback: Node node1[0] - crmd is now join
Nov 26 19:07:15 node2 crmd[3523]: crit: peer_update_callback: Client
node1/peer now has status [offline] (DC=true)
Nov 26 19:07:15 node2 crmd[3523]: warning: match_down_event: No match for
shutdown action on 00000432-0432-0000-2b91-000000000000
Nov 26 19:07:15 node2 crmd[3523]: crit: peer_update_callback: Alive=0,
appear=0, down=(nil)
Nov 26 19:07:15 node2 crmd[3523]: notice: peer_update_callback:
Stonith/shutdown of node1 not matched
Nov 26 19:07:15 node2 crmd[3523]: info: crm_update_peer_join:
peer_update_callback: Node node1[0] - join-1 phase 1 -> 0
Nov 26 19:07:15 node2 crmd[3523]: debug: check_join_state: Invoked by
peer_update_callback in state: S_INTEGRATION
Nov 26 19:07:15 node2 crmd[3523]: debug: check_join_state: join-1:
Integration of 0 peers complete: peer_update_callback
Nov 26 19:07:15 node2 crmd[3523]: info: abort_transition_graph:
peer_update_callback:214 - Triggered transition abort (complete=1) : Node
failure
Bottom line: due to a timing issue around when crmd connects to the
heartbeat cluster, the DC wrongly marks the incoming node as not a member.
From this point on, any message from the incoming node is ignored, e.g.:
Nov 26 19:07:18 node2 crmd[3523]: debug: crmd_ha_msg_callback: Ignoring
HA message (op=noop) from node1: not in our membership list (size=1)
Nov 26 19:07:19 node2 crmd[3523]: warning: crmd_ha_msg_callback: Ignoring
HA message (op=join_announce) from node1: not in our membership list
(size=1)
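Regarding the ordering fix I suggested above, the change would look roughly
like this (sketch only — apart from crm_cluster_connect(), the function
names are illustrative, not actual pacemaker identifiers):

```c
/* Sketch of the proposed crmd startup ordering: sign on to the
 * heartbeat cluster first, and only then connect to cib/ccm, so the
 * DC never sees this node's cib/ccm before its crmd is reachable. */
static void crmd_startup(void)
{
    crm_cluster_connect(cluster);  /* 1. heartbeat sign-on first     */
    connect_to_cib();              /* 2. then cib  (illustrative)    */
    connect_to_ccm();              /* 3. then ccm  (illustrative)    */
}
```

Would reordering like this be safe, or does crmd depend on the cib
connection being established before the cluster sign-on?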
--Shyam
On Mon, Nov 23, 2015 at 9:30 PM, Shyam <shyam.kaushik at gmail.com> wrote:
> One note on this.
>
> This problem doesn't happen if
>
> Nov 19 08:36:30 node1 crmd[3298]: notice: crmd_client_status_callback:
> Status update: Client node2/crmd now has status [join] (DC=true)
> Nov 19 08:36:30 node1 crmd[3298]: notice: crmd_client_status_callback:
> Status update: Client node2/crmd now has status [leave] (DC=true)
> Nov 19 08:36:31 node1 crmd[3298]: notice: crmd_client_status_callback:
> Status update: Client node2/crmd now has status [join] (DC=true)
>
> the join happens before
> Nov 19 08:36:34 node1 crmd[3298]: notice: do_state_transition: State
> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
> origin=peer_update_callback ]
>
> I am not sure why the crmd join/leave-then-join sequence happens. With the
> previous version of heartbeat/pacemaker this doesn't happen.
>
> Here are more logs from a case where the problem doesn't happen:
>
> Nov 19 08:36:29 node1 heartbeat: [3143]: info: Link node2:eth0 up.
> Nov 19 08:36:29 node1 heartbeat: [3143]: info: Status update for node
> node2: status init
> Nov 19 08:36:29 node1 crmd[3298]: notice: crmd_ha_status_callback:
> Status update: Node node2 now has status [init]
> Nov 19 08:36:29 node1 crmd[3298]: notice: crmd_ha_status_callback:
> Status update: Node node2 now has status [up]
> Nov 19 08:36:30 node1 heartbeat: [3143]: debug: get_delnodelist:
> delnodelist=
> Nov 19 08:36:30 node1 heartbeat: [3143]: info: Status update for node
> node2: status active
> Nov 19 08:36:30 node1 crmd[3298]: notice: crmd_ha_status_callback:
> Status update: Node node2 now has status [active]
> Nov 19 08:36:30 node1 crmd[3298]: notice: crmd_client_status_callback:
> Status update: Client node2/crmd now has status [join] (DC=true)
> Nov 19 08:36:30 node1 crmd[3298]: notice: crmd_client_status_callback:
> Status update: Client node2/crmd now has status [leave] (DC=true)
> Nov 19 08:36:31 node1 crmd[3298]: notice: crmd_client_status_callback:
> Status update: Client node2/crmd now has status [join] (DC=true)
> Nov 19 08:36:34 node1 ccm: [3293]: debug: quorum plugin: majority
> Nov 19 08:36:34 node1 ccm: [3293]: debug: cluster:linux-ha,
> member_count=2, member_quorum_votes=200
> Nov 19 08:36:34 node1 ccm: [3293]: debug: total_node_count=2,
> total_quorum_votes=200
> Nov 19 08:36:34 node1 cib[3294]: notice: crm_update_peer_state:
> crm_update_ccm_node: Node node2[1] - state is now member (was (null))
> Nov 19 08:36:34 node1 crmd[3298]: notice: crm_update_peer_state:
> crm_update_ccm_node: Node node2[1] - state is now member (was lost)
> Nov 19 08:36:34 node1 crmd[3298]: notice: do_state_transition: State
> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
> origin=peer_update_callback ]
>
> After the above, resources are started fine:
> Nov 19 08:36:40 node1 attrd[3297]: notice: attrd_local_callback: Sending
> full refresh (origin=crmd)
> Nov 19 08:36:40 node1 attrd[3297]: notice: attrd_trigger_update: Sending
> flush op to all hosts for: probe_complete (true)
> Nov 19 08:36:40 node1 attrd[3297]: notice: attrd_trigger_update: Sending
> flush op to all hosts for: master-MYSQL:0 (100)
> Nov 19 08:36:41 node1 pengine[3299]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Nov 19 08:36:41 node1 pengine[3299]: notice: LogActions: Start
> IPADDR:1#011(node2)
>
> Any help/pointers greatly appreciated.
>
> Thanks.
>
> --Shyam
>
> On Mon, Nov 23, 2015 at 12:14 PM, Shyam <shyam.kaushik at gmail.com> wrote:
>
>> Hi All,
>>
>> I need help with a timing issue I hit when using pacemaker-1.1.10
>> with heartbeat-3.0.5 on Ubuntu Trusty.
>>
>> As seen in the logs below, the issue appears to be with the membership
>> that CRMD looks at: it wrongly moves to the INTEGRATION phase when the
>> peer CRMD hasn't fully joined the cluster yet.
>>
>> We have a 2-node cluster, and occasionally when one of the nodes comes up
>> after a reboot it remains in the S_PENDING state until
>> heartbeat/pacemaker on that node is restarted. This happens
>> intermittently, not always.
>>
>> The logs below show the problem clearly.
>>
>> node1 has been running and sees node2 coming up:
>>
>> Nov 20 09:09:10 node1 crmd[3905]: notice: crmd_ha_status_callback:
>> Status update: Node node2 now has status [init]
>> Nov 20 09:09:10 node1 crmd[3905]: notice: crmd_ha_status_callback:
>> Status update: Node node2 now has status [up]
>> Nov 20 09:09:10 node1 heartbeat: [2943]: info: Link node2:eth0 up.
>> Nov 20 09:09:10 node1 heartbeat: [2943]: info: Status update for node
>> node2: status init
>> Nov 20 09:09:10 node1 heartbeat: [2943]: info: Status update for node
>> node2: status up
>> Nov 20 09:09:10 node1 heartbeat: [2943]: debug: get_delnodelist:
>> delnodelist=
>> Nov 20 09:09:10 node1 heartbeat: [2943]: info: all clients are now paused
>> Nov 20 09:09:10 node1 crmd[3905]: notice: crmd_ha_status_callback:
>> Status update: Node node2 now has status [active]
>> Nov 20 09:09:10 node1 crmd[3905]: notice: crmd_client_status_callback:
>> Status update: Client node2/crmd now has status [join] (DC=true)
>> Nov 20 09:09:10 node1 heartbeat: [2943]: info: Status update for node
>> node2: status active
>> Nov 20 09:09:10 node1 crmd[3905]: notice: crmd_client_status_callback:
>> Status update: Client node2/crmd now has status [leave] (DC=true)
>> Nov 20 09:09:11 node1 heartbeat: [2943]: info: all clients are now resumed
>>
>>
>> As can be seen above, the peer's CRMD registered with heartbeat and then
>> de-registered (not sure why):
>> Nov 20 09:09:10 node1 crmd[3905]: notice: crmd_client_status_callback:
>> Status update: Client node2/crmd now has status [join] (DC=true)
>> Nov 20 09:09:10 node1 crmd[3905]: notice: crmd_client_status_callback:
>> Status update: Client node2/crmd now has status [leave] (DC=true)
>>
>>
>> While this is the state, CRMD on node1 moves to the INTEGRATION phase and
>> join requests flow through, but presumably they cannot be handled by the
>> peer:
>>
>> Nov 20 09:09:13 node1 cib[3901]: notice: crm_update_peer_state:
>> crm_update_ccm_node: Node node2[0] - state is now member (was (null))
>> Nov 20 09:09:13 node1 crmd[3905]: notice: crm_update_peer_state:
>> crm_update_ccm_node: Node node2[0] - state is now member (was lost)
>> Nov 20 09:09:13 node1 crmd[3905]: notice: do_state_transition: State
>> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
>> origin=peer_update_callback ]
>> Nov 20 09:09:13 node1 ccm: [3900]: debug: quorum plugin: majority
>> Nov 20 09:09:13 node1 ccm: [3900]: debug: cluster:linux-ha,
>> member_count=2, member_quorum_votes=200
>> Nov 20 09:09:13 node1 ccm: [3900]: debug: total_node_count=2,
>> total_quorum_votes=200
>> Nov 20 09:09:13 node1 crmd[3905]: notice: crmd_client_status_callback:
>> Status update: Client node2/crmd now has status [join] (DC=true)
>> Nov 20 09:09:13 node1 crmd[3905]: warning: match_down_event: No match
>> for shutdown action on 00000432-0432-0000-2b91-000000000000
>> Nov 20 09:09:13 node1 crmd[3905]: notice: peer_update_callback:
>> Stonith/shutdown of node2 not matched
>> Nov 20 09:09:13 node1 attrd[3904]: notice: attrd_local_callback:
>> Sending full refresh (origin=crmd)
>> Nov 20 09:09:13 node1 attrd[3904]: notice: attrd_trigger_update:
>> Sending flush op to all hosts for: probe_complete (true)
>> Nov 20 09:09:13 node1 attrd[3904]: notice: attrd_trigger_update:
>> Sending flush op to all hosts for: master-MYSQL:0 (100)
>> Nov 20 09:09:13 node1 pengine[3906]: notice: unpack_config: On loss of
>> CCM Quorum: Ignore
>> Nov 20 09:09:13 node1 pengine[3906]: warning: custom_action: Action
>> IPADDR:0_monitor_0 on node2 is unrunnable (pending)
>>
>>
>> As can be seen, we had:
>> Nov 20 09:09:13 node1 crmd[3905]: notice: do_state_transition: State
>> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
>>
>> and only then does the CRMD on the peer seem to be alright:
>> Nov 20 09:09:13 node1 crmd[3905]: notice: crmd_client_status_callback:
>> Status update: Client node2/crmd now has status [join] (DC=true)
>>
>> So pengine decided it could not run any resources, and we are stuck in
>> this state. Meanwhile, node2 has this trace:
>> Nov 20 09:09:13 node2 cib[2964]: notice: cib_server_process_diff: Not
>> applying diff 1.3.33 -> 1.3.34 (sync in progress)
>> Nov 20 09:09:13 node2 cib[2964]: notice: crm_update_peer_state:
>> crm_update_ccm_node: Node node1[1] - state is now member (was (null))
>> Nov 20 09:09:13 node2 cib[2964]: notice: crm_update_peer_state:
>> crm_update_ccm_node: Node node2[0] - state is now member (was (null))
>> Nov 20 09:09:13 node2 crmd[2968]: notice: crm_cluster_connect:
>> Connecting to cluster infrastructure: heartbeat
>> Nov 20 09:09:14 node2 heartbeat: [2938]: info: the send queue length from
>> heartbeat to client crmd is set to 1024
>> Nov 20 09:09:15 node2 crmd[2968]: notice: crmd_client_status_callback:
>> Status update: Client node2/crmd now has status [join] (DC=false)
>> Nov 20 09:09:15 node2 crmd[2968]: notice: crmd_client_status_callback:
>> Status update: Client node2/crmd now has status [online] (DC=false)
>> Nov 20 09:09:15 node2 crmd[2968]: notice: crmd_client_status_callback:
>> Status update: Client node1/crmd now has status [online] (DC=false)
>> Nov 20 09:09:16 node2 crmd[2968]: notice: crm_update_peer_state:
>> crm_update_ccm_node: Node node1[1] - state is now member (was (null))
>> Nov 20 09:09:16 node2 crmd[2968]: notice: crm_update_peer_state:
>> crm_update_ccm_node: Node node2[0] - state is now member (was (null))
>> Nov 20 09:09:16 node2 crmd[2968]: notice: do_started: The local CRM is
>> operational
>> Nov 20 09:09:16 node2 crmd[2968]: notice: do_state_transition: State
>> transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
>> origin=do_started ]
>> Nov 20 09:09:27 node2 crmd[2968]: warning: do_log: FSA: Input
>> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
>>
>>
>> i.e. it moves to the S_PENDING state and keeps receiving I_DC_TIMEOUT.
>>
>> I tried using the latest heartbeat, but that doesn't seem to be the
>> problem. Can anyone tell me whether this issue has already been fixed in
>> the latest pacemaker, or suggest other ways to debug it?
>>
>> If I enable a higher debug level (in both heartbeat and pacemaker), the
>> problem doesn't show up. Any help/pointers on how to go forward are
>> greatly appreciated.
>>
>> Thanks!
>>
>> --Shyam
>>
>
>