[Pacemaker] Node fails to rejoin cluster
Tal Yalon
yalon at infinidat.com
Wed Feb 6 10:11:10 UTC 2013
Hi all,
I have a 2-node cluster in which node-2 got fenced. After rebooting, node-2
tries to rejoin the cluster but fails: it gets stuck in a loop for hours and
never joins back.
After another reboot it managed to rejoin, and the clocks on the two nodes
showed no time difference.
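For context, here is roughly what our corosync.conf looks like. I am
reconstructing this from the log below rather than pasting the exact file, so
treat bindnetaddr and mcastport as my best guesses; the unicast transport,
security settings, whitetank compatibility mode, and member addresses all
match what the log reports.

compatibility: whitetank

totem {
    version: 2
    # matches "transmit/receive security: libtomcrypt SOBER128/SHA1HMAC" in the log
    secauth: on
    # matches "Initializing transport (UDP/IP Unicast)" in the log
    transport: udpu
    interface {
        ringnumber: 0
        # bindnetaddr and mcastport are guesses; the member addresses are from the log
        bindnetaddr: 9.151.142.0
        mcastport: 5405
        member {
            memberaddr: 9.151.142.10
        }
        member {
            memberaddr: 9.151.142.20
        }
    }
}

# ver: 1 because pacemakerd starts separately and forks the child daemons itself
service {
    name: pacemaker
    ver: 1
}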
Below is the corosync/pacemaker log from node-2 (the node that was stuck in
the loop); the election cycle at the end kept repeating every 20 seconds. Any
help would be appreciated, as I have no clue what happened.
Thanks,
Tal
Feb 6 01:39:32 node-2 corosync[27428]: [MAIN ] Corosync Cluster Engine
('1.4.1'): started and ready to provide service.
Feb 6 01:39:32 node-2 corosync[27428]: [MAIN ] Corosync built-in
features: nss dbus rdma snmp
Feb 6 01:39:32 node-2 corosync[27428]: [MAIN ] Successfully read main
configuration file '/etc/corosync/corosync.conf'.
Feb 6 01:39:32 node-2 corosync[27428]: [TOTEM ] Initializing transport
(UDP/IP Unicast).
Feb 6 01:39:32 node-2 corosync[27428]: [TOTEM ] Initializing
transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Feb 6 01:39:32 node-2 corosync[27428]: [TOTEM ] The network interface
[9.151.142.20] is now up.
Feb 6 01:39:32 node-2 corosync[27428]: [pcmk ] Logging: Initialized
pcmk_startup
Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded:
Pacemaker Cluster Manager 1.1.6
Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded:
corosync extended virtual synchrony service
Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded:
corosync configuration service
Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded:
corosync cluster closed process group service v1.01
Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded:
corosync cluster config database access v1.01
Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded:
corosync profile loading service
Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded:
corosync cluster quorum service v0.1
Feb 6 01:39:32 node-2 corosync[27428]: [MAIN ] Compatibility mode set
to whitetank. Using V1 and V2 of the synchronization engine.
Feb 6 01:39:32 node-2 corosync[27428]: [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Feb 6 01:39:32 node-2 corosync[27428]: [CPG ] chosen downlist: sender
r(0) ip(9.151.142.20) ; members(old:0 left:0)
Feb 6 01:39:32 node-2 corosync[27428]: [MAIN ] Completed service
synchronization, ready to provide service.
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: crm_log_init_worker:
Changed active directory to /var/lib/heartbeat/cores/root
Feb 6 01:39:37 node-2 pacemakerd[27466]: notice: main: Starting
Pacemaker 1.1.7-6.el6 (Build: 148fccfd5985c5590cc601123c6c16e966b85d14):
generated-manpages agent-manpages ascii-docs publican-docs ncurses
trace-logging libqb corosync-plugin cman
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: main: Maximum core file
size is: 18446744073709551615
Feb 6 01:39:37 node-2 pacemakerd[27466]: notice: update_node_processes:
0xb31fe0 Node 2 now known as node-2, was:
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked
child 27470 for process cib
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked
child 27471 for process stonith-ng
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked
child 27472 for process lrmd
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked
child 27473 for process attrd
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked
child 27474 for process pengine
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked
child 27475 for process crmd
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: main: Starting mainloop
Feb 6 01:39:37 node-2 lrmd: [27472]: info: G_main_add_SignalHandler: Added
signal handler for signal 15
Feb 6 01:39:37 node-2 stonith-ng[27471]: info: crm_log_init_worker:
Changed active directory to /var/lib/heartbeat/cores/root
Feb 6 01:39:37 node-2 stonith-ng[27471]: info: get_cluster_type:
Cluster type is: 'openais'
Feb 6 01:39:37 node-2 stonith-ng[27471]: notice: crm_cluster_connect:
Connecting to cluster infrastructure: classic openais (with plugin)
Feb 6 01:39:37 node-2 stonith-ng[27471]: info:
init_ais_connection_classic: Creating connection to our Corosync plugin
Feb 6 01:39:37 node-2 stonith-ng[27471]: info:
init_ais_connection_classic: AIS connection established
Feb 6 01:39:37 node-2 stonith-ng[27471]: info: get_ais_nodeid: Server
details: id=2 uname=node-2 cname=pcmk
Feb 6 01:39:37 node-2 stonith-ng[27471]: info:
init_ais_connection_once: Connection to 'classic openais (with plugin)':
established
Feb 6 01:39:37 node-2 stonith-ng[27471]: info: crm_new_peer: Node
node-2 now has id: 2
Feb 6 01:39:37 node-2 stonith-ng[27471]: info: crm_new_peer: Node 2 is
now known as node-2
Feb 6 01:39:37 node-2 crmd[27475]: info: crm_log_init_worker: Changed
active directory to /var/lib/heartbeat/cores/hacluster
Feb 6 01:39:37 node-2 crmd[27475]: notice: main: CRM Hg Version:
148fccfd5985c5590cc601123c6c16e966b85d14
Feb 6 01:39:37 node-2 lrmd: [27472]: info: G_main_add_SignalHandler: Added
signal handler for signal 17
Feb 6 01:39:37 node-2 lrmd: [27472]: info: enabling coredumps
Feb 6 01:39:37 node-2 lrmd: [27472]: info: G_main_add_SignalHandler: Added
signal handler for signal 10
Feb 6 01:39:37 node-2 lrmd: [27472]: info: G_main_add_SignalHandler: Added
signal handler for signal 12
Feb 6 01:39:37 node-2 lrmd: [27472]: info: Started.
Feb 6 01:39:37 node-2 cib[27470]: info: crm_log_init_worker: Changed
active directory to /var/lib/heartbeat/cores/hacluster
Feb 6 01:39:37 node-2 cib[27470]: info: retrieveCib: Reading cluster
configuration from: /var/lib/heartbeat/crm/cib.xml (digest:
/var/lib/heartbeat/crm/cib.xml.sig)
Feb 6 01:39:37 node-2 attrd[27473]: notice: crm_cluster_connect:
Connecting to cluster infrastructure: classic openais (with plugin)
Feb 6 01:39:37 node-2 attrd[27473]: notice: main: Starting mainloop...
Feb 6 01:39:37 node-2 cib[27470]: info: validate_with_relaxng:
Creating RNG parser context
Feb 6 01:39:37 node-2 cib[27470]: info: startCib: CIB Initialization
completed successfully
Feb 6 01:39:37 node-2 cib[27470]: info: get_cluster_type: Cluster type
is: 'openais'
Feb 6 01:39:37 node-2 cib[27470]: notice: crm_cluster_connect:
Connecting to cluster infrastructure: classic openais (with plugin)
Feb 6 01:39:37 node-2 cib[27470]: info: init_ais_connection_classic:
Creating connection to our Corosync plugin
Feb 6 01:39:37 node-2 cib[27470]: info: init_ais_connection_classic:
AIS connection established
Feb 6 01:39:37 node-2 cib[27470]: info: get_ais_nodeid: Server
details: id=2 uname=node-2 cname=pcmk
Feb 6 01:39:37 node-2 cib[27470]: info: init_ais_connection_once:
Connection to 'classic openais (with plugin)': established
Feb 6 01:39:37 node-2 cib[27470]: info: crm_new_peer: Node node-2 now
has id: 2
Feb 6 01:39:37 node-2 cib[27470]: info: crm_new_peer: Node 2 is now
known as node-2
Feb 6 01:39:37 node-2 cib[27470]: info: cib_init: Starting cib mainloop
Feb 6 01:39:37 node-2 cib[27470]: info: ais_dispatch_message:
Membership 900: quorum still lost
Feb 6 01:39:37 node-2 cib[27470]: info: crm_update_peer: Node node-2:
id=2 state=member (new) addr=r(0) ip(9.151.142.20) (new) votes=1 (new)
born=0 seen=900 proc=00000000000000000000000000000000
Feb 6 01:39:37 node-2 cib[27470]: info: ais_dispatch_message:
Membership 904: quorum still lost
Feb 6 01:39:37 node-2 cib[27470]: info: crm_new_peer: Node <null> now
has id: 1
Feb 6 01:39:37 node-2 cib[27470]: info: crm_update_peer: Node (null):
id=1 state=member (new) addr=r(0) ip(9.151.142.10) votes=0 born=0 seen=904
proc=00000000000000000000000000000000
Feb 6 01:39:37 node-2 corosync[27428]: [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Feb 6 01:39:37 node-2 cib[27470]: notice: ais_dispatch_message:
Membership 904: quorum acquired
Feb 6 01:39:37 node-2 cib[27470]: info: crm_get_peer: Node 1 is now
known as node-1
Feb 6 01:39:37 node-2 cib[27470]: info: crm_update_peer: Node node-1:
id=1 state=member addr=r(0) ip(9.151.142.10) votes=1 (new) born=896
seen=904 proc=00000000000000000000000000000000
Feb 6 01:39:37 node-2 corosync[27428]: [CPG ] chosen downlist: sender
r(0) ip(9.151.142.10) ; members(old:1 left:0)
Feb 6 01:39:37 node-2 corosync[27428]: [MAIN ] Completed service
synchronization, ready to provide service.
Feb 6 01:39:38 node-2 cib[27470]: warning: cib_peer_callback: Discarding
cib_apply_diff message (30) from node-1: not in our membership
Feb 6 01:39:38 node-2 cib[27470]: warning: cib_peer_callback: Discarding
cib_apply_diff message (31) from node-1: not in our membership
Feb 6 01:39:38 node-2 cib[27470]: warning: cib_peer_callback: Discarding
cib_apply_diff message (32) from node-1: not in our membership
Feb 6 01:39:38 node-2 cib[27470]: warning: cib_peer_callback: Discarding
cib_apply_diff message (33) from node-1: not in our membership
Feb 6 01:39:38 node-2 cib[27470]: warning: cib_peer_callback: Discarding
cib_apply_diff message (34) from node-1: not in our membership
Feb 6 01:39:38 node-2 cib[27470]: warning: cib_peer_callback: Discarding
cib_apply_diff message (35) from node-1: not in our membership
Feb 6 01:39:38 node-2 cib[27470]: warning: cib_peer_callback: Discarding
cib_apply_diff message (36) from node-1: not in our membership
Feb 6 01:39:38 node-2 cib[27470]: warning: cib_peer_callback: Discarding
cib_apply_diff message (37) from node-1: not in our membership
Feb 6 01:39:38 node-2 cib[27470]: warning: cib_peer_callback: Discarding
cib_apply_diff message (38) from node-1: not in our membership
Feb 6 01:39:38 node-2 stonith-ng[27471]: notice: setup_cib: Watching for
stonith topology changes
Feb 6 01:39:38 node-2 stonith-ng[27471]: info: main: Starting
stonith-ng mainloop
Feb 6 01:39:38 node-2 crmd[27475]: info: do_cib_control: CIB
connection established
Feb 6 01:39:38 node-2 crmd[27475]: info: get_cluster_type: Cluster
type is: 'openais'
Feb 6 01:39:38 node-2 crmd[27475]: notice: crm_cluster_connect:
Connecting to cluster infrastructure: classic openais (with plugin)
Feb 6 01:39:38 node-2 crmd[27475]: info: init_ais_connection_classic:
Creating connection to our Corosync plugin
Feb 6 01:39:38 node-2 crmd[27475]: info: init_ais_connection_classic:
AIS connection established
Feb 6 01:39:38 node-2 crmd[27475]: info: get_ais_nodeid: Server
details: id=2 uname=node-2 cname=pcmk
Feb 6 01:39:38 node-2 crmd[27475]: info: init_ais_connection_once:
Connection to 'classic openais (with plugin)': established
Feb 6 01:39:38 node-2 crmd[27475]: info: crm_new_peer: Node node-2 now
has id: 2
Feb 6 01:39:38 node-2 crmd[27475]: info: crm_new_peer: Node 2 is now
known as node-2
Feb 6 01:39:38 node-2 crmd[27475]: info: ais_status_callback: status:
node-2 is now unknown
Feb 6 01:39:38 node-2 crmd[27475]: info: do_ha_control: Connected to
the cluster
Feb 6 01:39:38 node-2 crmd[27475]: info: do_started: Delaying start,
no membership data (0000000000100000)
Feb 6 01:39:38 node-2 crmd[27475]: notice: ais_dispatch_message:
Membership 904: quorum acquired
Feb 6 01:39:38 node-2 crmd[27475]: info: ais_status_callback: status:
node-2 is now member (was unknown)
Feb 6 01:39:38 node-2 crmd[27475]: info: crm_update_peer: Node node-2:
id=2 state=member (new) addr=r(0) ip(9.151.142.20) (new) votes=1 (new)
born=904 seen=904 proc=00000000000000000000000000000000
Feb 6 01:39:38 node-2 crmd[27475]: info: crm_new_peer: Node node-1 now
has id: 1
Feb 6 01:39:38 node-2 crmd[27475]: info: crm_new_peer: Node 1 is now
known as node-1
Feb 6 01:39:38 node-2 crmd[27475]: info: ais_status_callback: status:
node-1 is now unknown
Feb 6 01:39:38 node-2 crmd[27475]: info: ais_status_callback: status:
node-1 is now member (was unknown)
Feb 6 01:39:38 node-2 crmd[27475]: info: crm_update_peer: Node node-1:
id=1 state=member (new) addr=r(0) ip(9.151.142.10) votes=1 born=896
seen=904 proc=00000000000000000000000000000000
Feb 6 01:39:38 node-2 crmd[27475]: info: ais_dispatch_message:
Membership 904: quorum retained
Feb 6 01:39:38 node-2 crmd[27475]: notice: do_started: The local CRM is
operational
Feb 6 01:39:38 node-2 crmd[27475]: notice: do_state_transition: State
transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
origin=do_started ]
Feb 6 01:39:38 node-2 pacemakerd[27466]: notice: update_node_processes:
0xb3cdd0 Node 1 now known as node-1, was:
Feb 6 01:39:38 node-2 stonith-ng[27471]: info: crm_new_peer: Node
node-1 now has id: 1
Feb 6 01:39:38 node-2 stonith-ng[27471]: info: crm_new_peer: Node 1 is
now known as node-1
Feb 6 01:39:39 node-2 crmd[27475]: notice: crmd_peer_update: Status
update: Client node-2/crmd now has status [online] (DC=<null>)
Feb 6 01:39:39 node-2 crmd[27475]: notice: crmd_peer_update: Status
update: Client node-1/crmd now has status [online] (DC=<null>)
Feb 6 01:39:39 node-2 cib[27470]: info: cib_replace_notify: Replaced:
-1.-1.-1 -> 0.108.191 from node-1
Feb 6 01:39:59 node-2 crmd[27475]: info: crm_timer_popped: Election
Trigger (I_DC_TIMEOUT) just popped (20000ms)
Feb 6 01:39:59 node-2 crmd[27475]: warning: do_log: FSA: Input
I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Feb 6 01:39:59 node-2 crmd[27475]: notice: do_state_transition: State
transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Feb 6 01:39:59 node-2 crmd[27475]: info: do_election_count_vote:
Election 3 (owner: node-1) lost: vote from node-1 (Uptime)
Feb 6 01:39:59 node-2 crmd[27475]: notice: do_state_transition: State
transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
origin=do_election_count_vote ]
Feb 6 01:39:59 node-2 crmd[27475]: info: do_dc_release: DC role
released
Feb 6 01:39:59 node-2 crmd[27475]: info: do_te_control: Transitioner
is now inactive
Feb 6 01:40:19 node-2 crmd[27475]: info: crm_timer_popped: Election
Trigger (I_DC_TIMEOUT) just popped (20000ms)
Feb 6 01:40:19 node-2 crmd[27475]: warning: do_log: FSA: Input
I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Feb 6 01:40:19 node-2 crmd[27475]: notice: do_state_transition: State
transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Feb 6 01:40:19 node-2 crmd[27475]: info: do_election_count_vote:
Election 4 (owner: node-1) lost: vote from node-1 (Uptime)
Feb 6 01:40:19 node-2 crmd[27475]: notice: do_state_transition: State
transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
origin=do_election_count_vote ]
Feb 6 01:40:19 node-2 crmd[27475]: info: do_dc_release: DC role
released
Feb 6 01:40:19 node-2 crmd[27475]: info: do_te_control: Transitioner
is now inactive
Feb 6 01:40:39 node-2 crmd[27475]: info: crm_timer_popped: Election
Trigger (I_DC_TIMEOUT) just popped (20000ms)
Feb 6 01:40:39 node-2 crmd[27475]: warning: do_log: FSA: Input
I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Feb 6 01:40:39 node-2 crmd[27475]: notice: do_state_transition: State
transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Feb 6 01:40:39 node-2 crmd[27475]: info: do_election_count_vote:
Election 5 (owner: node-1) lost: vote from node-1 (Uptime)
Feb 6 01:40:39 node-2 crmd[27475]: notice: do_state_transition: State
transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
origin=do_election_count_vote ]
Feb 6 01:40:39 node-2 crmd[27475]: info: do_dc_release: DC role
released
Feb 6 01:40:39 node-2 crmd[27475]: info: do_te_control: Transitioner
is now inactive
Feb 6 01:40:59 node-2 crmd[27475]: info: crm_timer_popped: Election
Trigger (I_DC_TIMEOUT) just popped (20000ms)
Feb 6 01:40:59 node-2 crmd[27475]: warning: do_log: FSA: Input
I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Feb 6 01:40:59 node-2 crmd[27475]: notice: do_state_transition: State
transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Feb 6 01:40:59 node-2 crmd[27475]: info: do_election_count_vote:
Election 6 (owner: node-1) lost: vote from node-1 (Uptime)
Feb 6 01:40:59 node-2 crmd[27475]: notice: do_state_transition: State
transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
origin=do_election_count_vote ]
Feb 6 01:40:59 node-2 crmd[27475]: info: do_dc_release: DC role
released
Feb 6 01:40:59 node-2 crmd[27475]: info: do_te_control: Transitioner
is now inactive