[Pacemaker] unable to join cluster
Hisashi Osanai
osanai.hisashi at jp.fujitsu.com
Thu Mar 22 04:07:59 UTC 2012
Hello,
I have a three-node cluster using pacemaker/corosync. When I reboot one node,
the node is sometimes unable to rejoin the cluster. I see this kind of split
brain in about 10-20% of cases (reproduction rate) when I shut down a node.
What do you think of this problem?
My questions are:
- Is this a known problem?
- Is there any workaround to avoid it?
- How can I solve this problem?
[testserver001]
============
Last updated: Sat Mar 10 14:18:49 2012
Stack: openais
Current DC: NONE
3 Nodes configured, 3 expected votes
4 Resources configured.
============
OFFLINE: [ testserver001 testserver002 testserver003 ]
Migration summary:
[testserver002]
============
Last updated: Sat Mar 10 14:15:17 2012
Stack: openais
Current DC: testserver002 - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
3 Nodes configured, 3 expected votes
4 Resources configured.
============
Online: [ testserver002 testserver003 ]
OFFLINE: [ testserver001 ]
Resource Group: testgroup
testrsc (lsb:testmgr): Started testserver002
stonith-testserver002 (stonith:external/ipmi): Started testserver003
stonith-testserver003 (stonith:external/ipmi): Started testserver002
stonith-testserver001 (stonith:external/ipmi): Started testserver003
Migration summary:
* Node testserver003:
* Node testserver002:
[testserver003]
============
Last updated: Sat Mar 10 14:19:07 2012
Stack: openais
Current DC: testserver002 - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
3 Nodes configured, 3 expected votes
4 Resources configured.
============
Online: [ testserver002 testserver003 ]
OFFLINE: [ testserver001 ]
Resource Group: testgroup
testrsc (lsb:testmgr): Started testserver002
stonith-testserver002 (stonith:external/ipmi): Started testserver003
stonith-testserver003 (stonith:external/ipmi): Started testserver002
stonith-testserver001 (stonith:external/ipmi): Started testserver003
Migration summary:
* Node testserver003:
* Node testserver002:
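The disagreement between the node views above can also be checked mechanically. The following is a minimal Python sketch, not part of any Pacemaker tooling: the helper name `parse_status` and the shortened sample strings are made up for illustration, and it only reads the `Current DC:`, `Online:` and `OFFLINE:` lines in the form they appear above.

```python
import re

def parse_status(text):
    """Extract the DC and node lists from crm_mon-style text (hypothetical helper)."""
    status = {"dc": None, "online": [], "offline": []}
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"Current DC:\s*(\S+)", line)
        if m:
            # "NONE" means this node sees no DC at all
            status["dc"] = None if m.group(1) == "NONE" else m.group(1)
            continue
        m = re.match(r"(Online|OFFLINE):\s*\[(.*)\]", line)
        if m:
            status[m.group(1).lower()] = m.group(2).split()
    return status

# Shortened copies of the output shown above, keyed by the node that printed it
views = {
    "testserver001": "Current DC: NONE\n"
                     "OFFLINE: [ testserver001 testserver002 testserver003 ]",
    "testserver002": "Current DC: testserver002 - partition with quorum\n"
                     "Online: [ testserver002 testserver003 ]\n"
                     "OFFLINE: [ testserver001 ]",
}

parsed = {node: parse_status(text) for node, text in views.items()}
for node, st in parsed.items():
    print(node, "DC:", st["dc"], "online:", st["online"], "offline:", st["offline"])
```

With this input, testserver001 reports no DC and all three nodes offline, while testserver002 reports itself as DC with two nodes online.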
- Information I checked
+ https://bugzilla.redhat.com/show_bug.cgi?id=525589
It looks like the packages I am using already support this.
+ http://comments.gmane.org/gmane.linux.highavailability.user/36101
I checked the entries in /etc/hosts but did not find a wrong entry.
===
127.0.0.1 testserver001 localhost
::1 localhost6.localdomain6 localhost6
===
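A check of this kind can be scripted. The sketch below is only an illustration in Python: the function name `loopback_names` is made up, and treating "a host name mapped to a loopback address" as the thing to look for is an assumption about what a wrong entry would be, not something taken from the linked thread. It lists which names in a hosts file point at a loopback address. It cannot tell whether such a mapping is related to the join failure.

```python
import ipaddress

def loopback_names(hosts_text):
    """Return {name: address} for every name mapped to a loopback address."""
    found = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue                           # not an address line
        if addr.is_loopback:
            for name in fields[1:]:
                found[name] = fields[0]
    return found

# The two entries quoted above
sample = """127.0.0.1 testserver001 localhost
::1 localhost6.localdomain6 localhost6
"""

names = loopback_names(sample)
# Names other than the usual localhost aliases
non_local = {n: a for n, a in names.items() if not n.startswith("localhost")}
print(non_local)
```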
- Investigation with tcpdump
OK case: after MESSAGE_TYPE_ORF_TOKEN is received, pacemaker sends
MESSAGE_TYPE_MCAST.
I captured this information in a VMware environment.
+ MESSAGE_TYPE_ORF_TOKEN
No. Time Source Destination Protocol Length Info
119 2012-03-19 22:00:15.250310 172.27.4.1 172.27.4.2 UDP 112 Source port: 23489 Destination port: 23490
Frame 119: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
Ethernet II, Src: Vmware_6b:b9:9a (00:0c:29:6b:b9:9a), Dst: Vmware_8e:74:92 (00:0c:29:8e:74:92)
Internet Protocol Version 4, Src: 172.27.4.1 (172.27.4.1), Dst: 172.27.4.2 (172.27.4.2)
User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
Data (70 bytes)
0000 00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00 ..".............
0010 00 00 00 00 00 00 00 00 ac 1b 04 01 02 00 ac 1b ................
(snip)
+ MESSAGE_TYPE_MCAST
No. Time Source Destination Protocol Length Info
5141 2012-03-19 22:01:19.198346 172.27.4.2 226.94.16.16 UDP 1486 Source port: 23489 Destination port: 23490
Frame 5141: 1486 bytes on wire (11888 bits), 1486 bytes captured (11888 bits)
Ethernet II, Src: Vmware_8e:74:92 (00:0c:29:8e:74:92), Dst: IPv4mcast_5e:10:10 (01:00:5e:5e:10:10)
Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 226.94.16.16 (226.94.16.16)
User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
Data (1444 bytes)
0000 01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b ..".............
0010 04 02 08 00 02 00 ac 1b 04 02 08 00 04 00 ac 1b ................
(snip)
NG case: MESSAGE_TYPE_ORF_TOKEN is sent and received repeatedly, and I can
see the following messages in pacemaker.log.
+ MESSAGE_TYPE_ORF_TOKEN
No. Time Source Destination Protocol Length Info
39605 2012-03-10 14:18:13.826778 172.27.4.2 172.27.4.3 UDP 112 Source port: 23489 Destination port: 23490
Frame 39605: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
Ethernet II, Src: FujitsuT_98:79:4b (00:19:99:98:79:4b), Dst: FujitsuT_97:8d:15 (00:19:99:97:8d:15)
Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 172.27.4.3 (172.27.4.3)
User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
Data (70 bytes)
0000 00 00 22 ff ac 1b 04 01 00 00 00 00 01 00 00 00 ..".............
0010 ff ff ff ff ac 1b 04 01 ac 1b 04 01 02 00 ac 1b ................
(snip)
+ pacemaker.log
Mar 10 14:20:09 testserver001 crmd: [7551]: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped!
Mar 10 14:20:09 testserver001 crmd: [7551]: WARN: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Mar 10 14:20:09 testserver001 crmd: [7551]: info: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
Mar 10 14:22:09 testserver001 crmd: [7551]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_te_control: Registering TE UUID: b2bb3cc4-cead-475c-bb73-3adbb60142ae
Mar 10 14:22:09 testserver001 crmd: [7551]: WARN: cib_client_add_notify_callback: Callback already present
Mar 10 14:22:09 testserver001 crmd: [7551]: info: set_graph_functions: Setting custom graph functions
Mar 10 14:22:09 testserver001 crmd: [7551]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_dc_takeover: Taking over DC status for this partition
Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_readwrite: We are now in R/W mode
Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/6, version=0.143.0): ok (rc=0)
Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/7, version=0.143.0): ok (rc=0)
Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/9, version=0.143.0): ok (rc=0)
Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks
Mar 10 14:22:09 testserver001 crmd: [7551]: info: ais_dispatch: Membership 516: quorum still lost
Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/11, version=0.143.0): ok (rc=0)
Mar 10 14:22:09 testserver001 crmd: [7551]: info: crm_ais_dispatch: Setting expected votes to 3
Mar 10 14:22:09 testserver001 crmd: [7551]: info: config_query_callback: Checking for expired actions every 900000ms
Mar 10 14:22:09 testserver001 crmd: [7551]: info: config_query_callback: Sending expected-votes=3 to corosync
Mar 10 14:22:09 testserver001 crmd: [7551]: info: ais_dispatch: Membership 516: quorum still lost
Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/14, version=0.143.0): ok (rc=0)
Mar 10 14:22:09 testserver001 crmd: [7551]: info: crm_ais_dispatch: Setting expected votes to 3
Mar 10 14:22:09 testserver001 crmd: [7551]: info: te_connect_stonith: Attempting connection to fencing daemon...
Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/16, version=0.143.0): ok (rc=0)
Mar 10 14:22:10 testserver001 crmd: [7551]: info: te_connect_stonith: Connected
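To make the sequence in this log easier to follow, the crmd state transitions and the time between them can be pulled out with a small script. This is a rough Python sketch under stated assumptions: the regular expression only targets the `do_state_transition` line format seen in this log, each log entry is assumed to sit on one line (the mail wrapping removed), and the name `transitions` is made up for illustration.

```python
import re
from datetime import datetime

# The two do_state_transition entries from the log above, one entry per line
LOG = (
    "Mar 10 14:20:09 testserver001 crmd: [7551]: info: do_state_transition: "
    "State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT "
    "cause=C_TIMER_POPPED origin=crm_timer_popped ]\n"
    "Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_state_transition: "
    "State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC "
    "cause=C_TIMER_POPPED origin=crm_timer_popped ]\n"
)

PATTERN = re.compile(
    r"^\w{3}\s+\d+ (\d\d:\d\d:\d\d) \S+ crmd: .*"
    r"State transition (\S+) -> (\S+) \[ input=(\S+)"
)

def transitions(log_text):
    """Return (time, old_state, new_state, input) for each transition line."""
    out = []
    for line in log_text.splitlines():
        m = PATTERN.search(line)
        if m:
            # Only the clock time is parsed; entries are assumed to be on the same day
            when = datetime.strptime(m.group(1), "%H:%M:%S")
            out.append((when, m.group(2), m.group(3), m.group(4)))
    return out

result = transitions(LOG)
for when, old, new, cause in result:
    print(when.strftime("%H:%M:%S"), old, "->", new, "on", cause)

gap = (result[1][0] - result[0][0]).total_seconds()
print("seconds between the two transitions:", gap)
```

For the two entries above this shows testserver001 moving from S_PENDING to S_ELECTION and then, 120 seconds later, to S_INTEGRATION, both times because a timer popped.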
+ enum message_type {
MESSAGE_TYPE_ORF_TOKEN = 0, /* Ordering, Reliability, Flow (ORF) control Token */
MESSAGE_TYPE_MCAST = 1, /* ring ordered multicast message */
MESSAGE_TYPE_MEMB_MERGE_DETECT = 2, /* merge rings if there are available rings */
MESSAGE_TYPE_MEMB_JOIN = 3, /* membership join message */
MESSAGE_TYPE_MEMB_COMMIT_TOKEN = 4, /* membership commit token */
MESSAGE_TYPE_TOKEN_HOLD_CANCEL = 5, /* cancel the holding of the token */
};
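The packets in the captures above can be labelled with these values. The following Python sketch assumes that the first byte of the UDP payload carries the message_type value. That assumption is consistent with the dumps shown (00 at offset 0 of the token packets, 01 at offset 0 of the multicast packet), but it is not verified here against the corosync sources. The function name `classify` is made up for illustration.

```python
# Values copied from the enum above
MESSAGE_TYPES = {
    0: "MESSAGE_TYPE_ORF_TOKEN",
    1: "MESSAGE_TYPE_MCAST",
    2: "MESSAGE_TYPE_MEMB_MERGE_DETECT",
    3: "MESSAGE_TYPE_MEMB_JOIN",
    4: "MESSAGE_TYPE_MEMB_COMMIT_TOKEN",
    5: "MESSAGE_TYPE_TOKEN_HOLD_CANCEL",
}

def classify(hex_dump):
    """Map the first payload byte to a message type name (assumed layout)."""
    payload = bytes.fromhex(hex_dump)   # spaces between bytes are ignored
    return MESSAGE_TYPES.get(payload[0], "unknown")

# First 16 payload bytes of the three packets shown above
ok_token = "00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00"
ok_mcast = "01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b"
ng_token = "00 00 22 ff ac 1b 04 01 00 00 00 00 01 00 00 00"

for label, dump in (("OK token", ok_token), ("OK mcast", ok_mcast), ("NG token", ng_token)):
    print(label, "->", classify(dump))
```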
- packages on CentOS 5.6
+ pacemaker-1.0.10-1.4.el5
+ corosync-1.2.5-1.3.el5
Thank you in advance,
Hisashi Osanai