[Pacemaker] HA across WDM Fibre link - Nodes won't rejoin after reboot
Darren.Mansell at opengi.co.uk
Mon Apr 2 13:53:53 UTC 2012
Hi everyone.
I have two nodes running on ESX hosts in two geographically separate
data centres. The link between the sites is a DWDM fibre link, which is
the only thing I can think of that might be causing this.
Both nodes run SLES 11 SP1 with HAE, with all the latest updates applied.
If Corosync is set to multicast on the default address, the Corosync
instances on the two nodes do not communicate at all. If I switch to
broadcast, they communicate and the nodes join.
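One way to check whether multicast crosses the link at all, independently
of Corosync, is a small probe like the sketch below. It is only a rough
test under stated assumptions: the group 226.94.1.1 and the test port 5415
are placeholders and are not taken from this setup, so both should be
adjusted to match corosync.conf. The function names are hypothetical.

```python
import ipaddress
import socket
import struct

# Assumptions, not taken from the post: 226.94.1.1 is used here as the
# multicast group, and 5415 is a hypothetical test port chosen to stay
# clear of Corosync's own traffic. Adjust both to match corosync.conf.
MCAST_GROUP = "226.94.1.1"
TEST_PORT = 5415


def membership_request(group, iface="0.0.0.0"):
    """Build the 8-byte ip_mreq structure used by IP_ADD_MEMBERSHIP."""
    if not ipaddress.ip_address(group).is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface))


def receive(group=MCAST_GROUP, port=TEST_PORT, timeout=10.0):
    """Join the group and wait for one datagram; return None on timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        membership_request(group))
        sock.settimeout(timeout)
        try:
            return sock.recvfrom(1024)
        except socket.timeout:
            return None
    finally:
        sock.close()


def send(payload=b"mcast-probe", group=MCAST_GROUP, port=TEST_PORT, ttl=1):
    """Send one datagram to the group; ttl=1 keeps it on the local segment."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        return sock.sendto(payload, (group, port))
    finally:
        sock.close()
```

The idea would be to call receive() on one node and send() on the other,
then swap roles. If receive() returns None in either direction, multicast
is probably being dropped somewhere on the path. On a host with several
interfaces, the iface argument of membership_request() may need to be the
address of the ring interface rather than 0.0.0.0.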
If I reboot node 2, it rejoins fine. If I reboot node 1, it sits in a
pending state for a while and then drops to offline. I can then clear
the config out again and let the nodes rejoin. Node 1 always seems to be
the DC.
Pending - logs from node 1, this loops every second:
-02: id=336371722 state=member (new) addr=r(0) ip(10.160.12.20) votes=1
born=7912 seen=7920 proc=00000000000000000000000000151312
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: crm_update_peer: Node
PPS-VMAIL-01: id=168599562 state=member (new) addr=r(0) ip(10.160.12.10)
(new) votes=1 (new) born=7920 seen=7920
proc=00000000000000000000000000151312 (new)
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: WARN: do_log: FSA: Input
I_SHUTDOWN from revision_check_callback() received in state S_STARTING
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_state_transition:
State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN
cause=C_FSA_INTERNAL origin=revision_check_callback ]
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_lrm_control:
Disconnected from the LRM
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_ha_control:
Disconnected from OpenAIS
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_cib_control:
Disconnecting CIB
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_exit: Performing
A_EXIT_0 - gracefully exiting the CRMd
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: free_mem: Dropping
I_NULL: [ state=S_STOPPING cause=C_FSA_INTERNAL
origin=register_fsa_error_adv ]
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: free_mem: Dropping
I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_exit: [crmd] stopped
(0)
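Because the loop repeats every second, the interesting crmd lines are easy
to lose in the noise. A minimal sketch for pulling out only the FSA state
transitions, with the input and origin that triggered them, might look like
this. The regular expression is written against the line format shown
above and tolerates the mail client's line wrapping; the names Transition
and find_transitions are hypothetical helpers, not part of Pacemaker.

```python
import re
from collections import namedtuple

# Hypothetical helper type: one crmd FSA state transition from the log.
Transition = namedtuple("Transition", "from_state to_state input cause origin")

# \s+ also matches newlines, so entries wrapped across lines still match.
_TRANSITION_RE = re.compile(
    r"State transition\s+(\S+)\s+->\s+(\S+)\s+"
    r"\[\s*input=(\S+)\s+cause=(\S+)\s+origin=(\S+)\s*\]"
)


def find_transitions(log_text):
    """Return every crmd state transition found in log_text, in order."""
    return [Transition(*m.groups()) for m in _TRANSITION_RE.finditer(log_text)]


# Sample copied from the node 1 log above, including the line wrapping.
SAMPLE = """Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_state_transition:
State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN
cause=C_FSA_INTERNAL origin=revision_check_callback ]"""

for t in find_transitions(SAMPLE):
    print(t.from_state, "->", t.to_state, "via", t.input, "from", t.origin)
```

Run against the full log, this would show whether every restart of crmd
ends the same way, that is, with I_SHUTDOWN raised from
revision_check_callback while still in S_STARTING.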
Offline - logs from node 1, loops every second:
Apr 2 14:38:06 PPS-VMAIL-01 cib: [3510]: info: cib_replace_notify:
Local-only Replace: 0.0.0 from PP2-VMAIL-02
Apr 2 14:38:06 PPS-VMAIL-01 attrd: [3512]: info: do_cib_replaced:
Sending full refresh
Apr 2 14:38:06 PPS-VMAIL-01 attrd: [3512]: info: attrd_trigger_update:
Sending flush op to all hosts for: probe_complete (<null>)
Apr 2 14:38:06 PPS-VMAIL-01 cib: [3510]: info: apply_xml_diff: Digest
mis-match: expected 0cf389141d344ca552679f9924d281c5, calculated
818a100a0e3b725068393624381c9d4f
Apr 2 14:38:06 PPS-VMAIL-01 cib: [3510]: notice: cib_process_diff: Diff
0.13.642 -> 0.0.0 not applied to 0.13.642: Failed application of an
update diff
Apr 2 14:38:06 PPS-VMAIL-01 cib: [3510]: info: cib_server_process_diff:
Requesting re-sync from peer
Apr 2 14:38:06 PPS-VMAIL-01 cib: [3510]: WARN: cib_diff_notify:
Local-only Change (client:attrd, call: 1221): 0.0.0 (Application of an
update diff failed, requesting a full refresh)
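The dotted numbers in the cib_process_diff line are CIB versions in
admin_epoch.epoch.num_updates order. As a rough sketch, the rejected diffs
can be extracted and compared as integer tuples to see which side is
behind. The helper names parse_version and rejected_diffs are hypothetical,
and the pattern is only written against the message format shown above.

```python
import re

# Matches "Diff A -> B not applied to C"; \s+ tolerates wrapped lines.
_DIFF_RE = re.compile(
    r"Diff\s+(\d+\.\d+\.\d+)\s+->\s+(\d+\.\d+\.\d+)\s+"
    r"not\s+applied\s+to\s+(\d+\.\d+\.\d+)"
)


def parse_version(text):
    """Turn 'admin_epoch.epoch.num_updates' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def rejected_diffs(log_text):
    """Return (diff_from, diff_to, local) version tuples for each rejection."""
    return [tuple(parse_version(v) for v in m.groups())
            for m in _DIFF_RE.finditer(log_text)]


# Sample copied from the node 1 log above, including the line wrapping.
SAMPLE = """Apr 2 14:38:06 PPS-VMAIL-01 cib: [3510]: notice: cib_process_diff: Diff
0.13.642 -> 0.0.0 not applied to 0.13.642: Failed application of an
update diff"""

for diff_from, diff_to, local in rejected_diffs(SAMPLE):
    # Tuples compare field by field, which matches the version ordering.
    print("diff target older than local CIB:", diff_to < local)
```

For the line above, the diff targets 0.0.0 while the local CIB is already
at 0.13.642, so the incoming update points at an older version than the
one node 1 holds.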
Offline - logs from node 2, loops every second:
Apr 2 14:39:05 PP2-VMAIL-02 corosync[3794]: [TOTEM ] Retransmit List:
29b7 29b8 29b9
Apr 2 14:39:05 PP2-VMAIL-02 corosync[3794]: [TOTEM ] Retransmit List:
29bb 29bc
Apr 2 14:39:05 PP2-VMAIL-02 cib: [3801]: info: cib_process_request:
Operation complete: op cib_sync_one for section 'all'
(origin=PPS-VMAIL-01/PPS-VMAIL-01/(null), version=0.13.1538): ok (rc=0)
Any ideas please?
Thanks.
Darren Mansell