[Pacemaker] [Openais] Corosync goes into endless loop when same hostname is used on more than one node

Thu May 12 21:15:53 UTC 2011

On 05/12/2011 07:04 AM, Dan Frincu wrote:
> Hi,
> 
> When using the same hostname on 2 nodes (debian squeeze, corosync
> 1.3.0-3 from unstable) the following happens:
> 
> May 12 08:36:27 debian cib: [3125]: info: cib_process_request: Operation
> complete: op cib_sync for section 'all' (origin=local/crmd/84,
> version=0.5.1): ok (rc=0)
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
> has id: 620757002
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_FSA_INTERNAL origin=check_join_state ]
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: All 1
> cluster nodes responded to the join offer.
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_finalize: join-29:
> Syncing the CIB from debian to the rest of the cluster
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
> has id: 603979786
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_JOIN_REQUEST
> cause=C_HA_MESSAGE origin=route_message ]
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Unset DC debian
> May 12 08:36:27 debian cib: [3125]: info: cib_process_request: Operation
> complete: op cib_sync for section 'all' (origin=local/crmd/86,
> version=0.5.1): ok (rc=0)
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_offer_all:
> join-30: Waiting on 1 outstanding join acks
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Set DC to debian
> (3.0.1)
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
> has id: 620757002
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_FSA_INTERNAL origin=check_join_state ]
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: All 1
> cluster nodes responded to the join offer.
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_finalize: join-30:
> Syncing the CIB from debian to the rest of the cluster
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
> has id: 603979786
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_JOIN_REQUEST
> cause=C_HA_MESSAGE origin=route_message ]
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Unset DC debian
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_offer_all:
> join-31: Waiting on 1 outstanding join acks
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Set DC to debian
> (3.0.1)
> May 12 08:36:27 debian cib: [3125]: info: cib_process_request: Operation
> complete: op cib_sync for section 'all' (origin=local/crmd/88,
> version=0.5.1): ok (rc=0)
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
> has id: 620757002
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_FSA_INTERNAL origin=check_join_state ]
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: All 1
> cluster nodes responded to the join offer.
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_finalize: join-31:
> Syncing the CIB from debian to the rest of the cluster
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
> has id: 603979786
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_JOIN_REQUEST
> cause=C_HA_MESSAGE origin=route_message ]
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Unset DC debian
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_offer_all:
> join-32: Waiting on 1 outstanding join acks
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Set DC to debian
> (3.0.1)
> 
> Basically it goes into an endless loop. This is a misconfiguration,
> but it would help users if it were handled, or if a relevant message
> such as "duplicate hostname found" were printed in the logfile.
> 

Dan,

I believe this is a Pacemaker RFE.  Corosync operates entirely on IP
addresses and never does any hostname-to-IP resolution (because the
resolver can block and cause bad things to happen).
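
Until something like that exists, the condition can be spotted from the
log itself: crmd reports the same node name ("debian") flipping between
two ids. Below is a rough sketch, not anything shipped with Pacemaker or
Corosync, that scans log text for crm_get_peer lines and reports names
seen with more than one id. The function names are made up for
illustration. The second helper assumes the ids were auto-generated from
the ring0 IPv4 address and printed on a little-endian host; under that
assumption the two ids in the log decode to 10.0.0.37 and 10.0.0.36, but
treat the decoding as a guess if nodeid is set explicitly.

```python
import re
import socket
import struct
from collections import defaultdict

# Matches e.g. "crm_get_peer: Node debian now has id: 620757002"
PEER_RE = re.compile(r"crm_get_peer: Node (\S+) now\s+has id: (\d+)")


def find_duplicate_hostnames(log_text):
    """Return {name: [ids]} for every node name seen with more than one id.

    Hypothetical helper: it only looks at crm_get_peer log lines.
    """
    # Mail clients wrap long lines and prefix quotes with "> ",
    # so strip the prefixes and rejoin everything before matching.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())

    ids_by_name = defaultdict(list)
    for name, node_id in PEER_RE.findall(flat):
        node_id = int(node_id)
        if node_id not in ids_by_name[name]:
            ids_by_name[name].append(node_id)

    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


def nodeid_to_ipv4(node_id):
    """Decode a node id as an IPv4 address.

    ASSUMPTION: the id is the 32-bit ring0 address in network byte order,
    read as an integer on a little-endian machine.
    """
    return socket.inet_ntoa(struct.pack("<I", node_id))


if __name__ == "__main__":
    sample = """\
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
> has id: 620757002
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
> has id: 603979786
"""
    for name, ids in find_duplicate_hostnames(sample).items():
        addrs = ", ".join(nodeid_to_ipv4(i) for i in ids)
        print(f"duplicate hostname found: {name} ({addrs})")
```

Run against the log in Dan's mail, this flags "debian" as being claimed
by two different node ids, which is the kind of message he is asking for.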

> Regards.
> Dan
> 
> -- 
> Dan Frincu
> CCNA, RHCE
> 