[Pacemaker] unable to join cluster
Andrew Beekhof
andrew at beekhof.net
Thu Mar 29 06:04:00 UTC 2012
On Thu, Mar 22, 2012 at 3:07 PM, Hisashi Osanai
<osanai.hisashi at jp.fujitsu.com> wrote:
>
> Hello,
>
> I have a three-node cluster using pacemaker/corosync. When I reboot one node,
> the node is unable to rejoin the cluster. I see this kind of split brain in
> 10-20% of cases (recall ratio) when I shut down a node.
>
> What do you think of this problem?
It depends on whether corosync sees all three nodes: if it does, it's a
pacemaker problem; if not, it's a corosync problem.
There are newer versions of both, perhaps try an upgrade?
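One way to see what corosync itself thinks, independently of pacemaker, is to
join a throwaway CPG group from every node at the same time and watch the
configuration-change callbacks. A rough, untested sketch is below; the group
name "join-check" is made up for this test, error handling is minimal, you
need to link with -lcpg, and the callback signature follows corosync 1.x's
corosync/cpg.h, so adjust it for your exact version.

    /*
     * Sketch only: join an arbitrary CPG group and print membership changes.
     * Run it on every node at once; each corosync member running this
     * program should show up in the others' callbacks.
     */
    #include <stdio.h>
    #include <string.h>
    #include <corosync/cpg.h>

    static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                           const struct cpg_address *members, size_t n_members,
                           const struct cpg_address *left, size_t n_left,
                           const struct cpg_address *joined, size_t n_joined)
    {
        size_t i;
        (void)handle; (void)left; (void)n_left; (void)joined; (void)n_joined;
        printf("group %.*s now has %zu member(s):\n",
               (int)group->length, group->value, n_members);
        for (i = 0; i < n_members; i++)
            printf("  nodeid %u pid %u\n", members[i].nodeid, members[i].pid);
    }

    int main(void)
    {
        cpg_handle_t handle;
        cpg_callbacks_t callbacks = { .cpg_deliver_fn = NULL,
                                      .cpg_confchg_fn = confchg_cb };
        struct cpg_name group;

        if (cpg_initialize(&handle, &callbacks) != CS_OK) {
            fprintf(stderr, "cpg_initialize failed - is corosync running?\n");
            return 1;
        }
        strcpy(group.value, "join-check");
        group.length = strlen(group.value);

        if (cpg_join(handle, &group) != CS_OK) {
            fprintf(stderr, "cpg_join failed\n");
            cpg_finalize(handle);
            return 1;
        }
        /* Block and report every membership change until interrupted. */
        cpg_dispatch(handle, CS_DISPATCH_BLOCKING);

        cpg_leave(handle, &group);
        cpg_finalize(handle);
        return 0;
    }

If the rebooted node's nodeid never appears in the other nodes' callbacks,
the problem is at the corosync/totem level; if it does appear but crm_mon
still reports the node OFFLINE, look at pacemaker.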
>
> My questions are:
> - Is this a known problem?
> - Is there any workaround to avoid it?
> - How can I solve this problem?
>
> [testserver001]
> ============
> Last updated: Sat Mar 10 14:18:49 2012
> Stack: openais
> Current DC: NONE
> 3 Nodes configured, 3 expected votes
> 4 Resources configured.
> ============
>
> OFFLINE: [ testserver001 testserver002 testserver003 ]
>
>
> Migration summary:
>
> [testserver002]
> ============
> Last updated: Sat Mar 10 14:15:17 2012
> Stack: openais
> Current DC: testserver002 - partition with quorum
> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 3 Nodes configured, 3 expected votes
> 4 Resources configured.
> ============
>
> Online: [ testserver002 testserver003 ]
> OFFLINE: [ testserver001 ]
>
> Resource Group: testgroup
>     testrsc                (lsb:testmgr):           Started testserver002
>     stonith-testserver002  (stonith:external/ipmi): Started testserver003
>     stonith-testserver003  (stonith:external/ipmi): Started testserver002
>     stonith-testserver001  (stonith:external/ipmi): Started testserver003
>
> Migration summary:
> * Node testserver003:
> * Node testserver002:
>
> [testserver003]
> ============
> Last updated: Sat Mar 10 14:19:07 2012
> Stack: openais
> Current DC: testserver002 - partition with quorum
> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 3 Nodes configured, 3 expected votes
> 4 Resources configured.
> ============
>
> Online: [ testserver002 testserver003 ]
> OFFLINE: [ testserver001 ]
>
> Resource Group: testgroup
>     testrsc                (lsb:testmgr):           Started testserver002
>     stonith-testserver002  (stonith:external/ipmi): Started testserver003
>     stonith-testserver003  (stonith:external/ipmi): Started testserver002
>     stonith-testserver001  (stonith:external/ipmi): Started testserver003
>
> Migration summary:
> * Node testserver003:
> * Node testserver002:
>
> - Checked information
> + https://bugzilla.redhat.com/show_bug.cgi?id=525589
> It looks like the packages I am using already include this fix.
> + http://comments.gmane.org/gmane.linux.highavailability.user/36101
> I checked the entries in /etc/hosts but could not find a wrong one.
> ===
> 127.0.0.1 testserver001 localhost
> ::1 localhost6.localdomain6 localhost6
> ===
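For what it is worth, the first line above does map testserver001 to
127.0.0.1; whether or not that is the issue here, a small resolver check
makes it easy to see what the node name actually resolves to on each host.
A minimal sketch using only standard gethostname()/getaddrinfo(), nothing
corosync-specific:

    /*
     * Minimal sketch: print every address the local hostname resolves to.
     * Worth a second look if the only answer that comes back is 127.0.0.1.
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        char name[256];
        char buf[INET6_ADDRSTRLEN];
        struct addrinfo hints, *res, *p;
        int rc;

        if (gethostname(name, sizeof(name)) != 0) {
            perror("gethostname");
            return 1;
        }

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;      /* both IPv4 and IPv6 */
        hints.ai_socktype = SOCK_DGRAM;

        rc = getaddrinfo(name, NULL, &hints, &res);
        if (rc != 0) {
            fprintf(stderr, "getaddrinfo(%s): %s\n", name, gai_strerror(rc));
            return 1;
        }

        for (p = res; p != NULL; p = p->ai_next) {
            const void *addr = (p->ai_family == AF_INET)
                ? (const void *)&((struct sockaddr_in *)p->ai_addr)->sin_addr
                : (const void *)&((struct sockaddr_in6 *)p->ai_addr)->sin6_addr;
            if (inet_ntop(p->ai_family, addr, buf, sizeof(buf)) != NULL)
                printf("%s resolves to %s\n", name, buf);
        }
        freeaddrinfo(res);
        return 0;
    }

Running it on all three nodes takes hostname resolution out of the list of
suspects (or puts it back on).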
>
> - Looking into this with tcpdump
> OK case: after a MESSAGE_TYPE_ORF_TOKEN is received, a MESSAGE_TYPE_MCAST is
> sent. I captured this trace in a VMware environment.
>
> + MESSAGE_TYPE_ORF_TOKEN
> No.   Time                        Source      Destination  Protocol  Length  Info
> 119   2012-03-19 22:00:15.250310  172.27.4.1  172.27.4.2   UDP       112     Source port: 23489  Destination port: 23490
>
> Frame 119: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
> Ethernet II, Src: Vmware_6b:b9:9a (00:0c:29:6b:b9:9a), Dst: Vmware_8e:74:92 (00:0c:29:8e:74:92)
> Internet Protocol Version 4, Src: 172.27.4.1 (172.27.4.1), Dst: 172.27.4.2 (172.27.4.2)
> User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
> Data (70 bytes)
>
> 0000  00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00   ..".............
> 0010  00 00 00 00 00 00 00 00 ac 1b 04 01 02 00 ac 1b   ................
> (snip)
>
> + MESSAGE_TYPE_MCAST
> No.   Time                        Source      Destination   Protocol  Length  Info
> 5141  2012-03-19 22:01:19.198346  172.27.4.2  226.94.16.16  UDP       1486    Source port: 23489  Destination port: 23490
>
> Frame 5141: 1486 bytes on wire (11888 bits), 1486 bytes captured (11888 bits)
> Ethernet II, Src: Vmware_8e:74:92 (00:0c:29:8e:74:92), Dst: IPv4mcast_5e:10:10 (01:00:5e:5e:10:10)
> Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 226.94.16.16 (226.94.16.16)
> User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
> Data (1444 bytes)
>
> 0000  01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b   ..".............
> 0010  04 02 08 00 02 00 ac 1b 04 02 08 00 04 00 ac 1b   ................
> (snip)
>
> NG case: MESSAGE_TYPE_ORF_TOKEN is sent and received repeatedly, and I can
> see the messages below in pacemaker.log.
>
> + MESSAGE_TYPE_ORF_TOKEN
> No.    Time                        Source      Destination  Protocol  Length  Info
> 39605  2012-03-10 14:18:13.826778  172.27.4.2  172.27.4.3   UDP       112     Source port: 23489  Destination port: 23490
>
> Frame 39605: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
> Ethernet II, Src: FujitsuT_98:79:4b (00:19:99:98:79:4b), Dst: FujitsuT_97:8d:15 (00:19:99:97:8d:15)
> Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 172.27.4.3 (172.27.4.3)
> User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
> Data (70 bytes)
>
> 0000  00 00 22 ff ac 1b 04 01 00 00 00 00 01 00 00 00   ..".............
> 0010  ff ff ff ff ac 1b 04 01 ac 1b 04 01 02 00 ac 1b   ................
> (snip)
>
> + pacemaker.log
> Mar 10 14:20:09 testserver001 crmd: [7551]: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped!
> Mar 10 14:20:09 testserver001 crmd: [7551]: WARN: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> Mar 10 14:20:09 testserver001 crmd: [7551]: info: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Mar 10 14:22:09 testserver001 crmd: [7551]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
> Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_te_control: Registering TE UUID: b2bb3cc4-cead-475c-bb73-3adbb60142ae
> Mar 10 14:22:09 testserver001 crmd: [7551]: WARN: cib_client_add_notify_callback: Callback already present
> Mar 10 14:22:09 testserver001 crmd: [7551]: info: set_graph_functions: Setting custom graph functions
> Mar 10 14:22:09 testserver001 crmd: [7551]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
> Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_dc_takeover: Taking over DC status for this partition
> Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_readwrite: We are now in R/W mode
> Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/6, version=0.143.0): ok (rc=0)
> Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/7, version=0.143.0): ok (rc=0)
> Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/9, version=0.143.0): ok (rc=0)
> Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks
> Mar 10 14:22:09 testserver001 crmd: [7551]: info: ais_dispatch: Membership 516: quorum still lost
> Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/11, version=0.143.0): ok (rc=0)
> Mar 10 14:22:09 testserver001 crmd: [7551]: info: crm_ais_dispatch: Setting expected votes to 3
> Mar 10 14:22:09 testserver001 crmd: [7551]: info: config_query_callback: Checking for expired actions every 900000ms
> Mar 10 14:22:09 testserver001 crmd: [7551]: info: config_query_callback: Sending expected-votes=3 to corosync
> Mar 10 14:22:09 testserver001 crmd: [7551]: info: ais_dispatch: Membership 516: quorum still lost
> Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/14, version=0.143.0): ok (rc=0)
> Mar 10 14:22:09 testserver001 crmd: [7551]: info: crm_ais_dispatch: Setting expected votes to 3
> Mar 10 14:22:09 testserver001 crmd: [7551]: info: te_connect_stonith: Attempting connection to fencing daemon...
> Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/16, version=0.143.0): ok (rc=0)
> Mar 10 14:22:10 testserver001 crmd: [7551]: info: te_connect_stonith: Connected
>
> + enum message_type {
>       MESSAGE_TYPE_ORF_TOKEN = 0,          /* Ordering, Reliability, Flow (ORF) control Token */
>       MESSAGE_TYPE_MCAST = 1,              /* ring ordered multicast message */
>       MESSAGE_TYPE_MEMB_MERGE_DETECT = 2,  /* merge rings if there are available rings */
>       MESSAGE_TYPE_MEMB_JOIN = 3,          /* membership join message */
>       MESSAGE_TYPE_MEMB_COMMIT_TOKEN = 4,  /* membership commit token */
>       MESSAGE_TYPE_TOKEN_HOLD_CANCEL = 5,  /* cancel the holding of the token */
>   };
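For reference, the first bytes of each captured payload can be matched
against that enum. Below is a small decoder sketch for the 8-byte header,
assuming the packed message_header layout used by totemsrp.c around this
release (type, encapsulated, endian detector, nodeid) and that with the
default configuration the nodeid bytes line up with the sender's IPv4
address; both are assumptions worth verifying against your exact corosync
source.

    /*
     * Sketch: decode the first 8 bytes of the captured UDP payloads above.
     * Layout is assumed to mirror struct message_header in totemsrp.c.
     */
    #include <stdio.h>
    #include <stdint.h>

    struct message_header {
        uint8_t  type;              /* enum message_type above */
        uint8_t  encapsulated;      /* 0 on the tokens above, 2 on the mcast */
        uint16_t endian_detector;   /* 0xff22 written in the sender's byte order */
        uint32_t nodeid;            /* sender's nodeid */
    } __attribute__((packed));

    static const char *type_name(uint8_t t)
    {
        static const char *names[] = {
            "ORF_TOKEN", "MCAST", "MEMB_MERGE_DETECT",
            "MEMB_JOIN", "MEMB_COMMIT_TOKEN", "TOKEN_HOLD_CANCEL"
        };
        return (t < 6) ? names[t] : "unknown";
    }

    static void decode(const char *label, const uint8_t raw[8])
    {
        const struct message_header *h = (const struct message_header *)raw;
        printf("%s: type=%u (%s), encapsulated=%u, nodeid bytes=%u.%u.%u.%u\n",
               label, h->type, type_name(h->type), h->encapsulated,
               raw[4], raw[5], raw[6], raw[7]);
    }

    int main(void)
    {
        /* First 8 bytes of the two OK-case payloads quoted above. */
        const uint8_t orf_token[8] = { 0x00, 0x00, 0x22, 0xff, 0xac, 0x1b, 0x04, 0x01 };
        const uint8_t mcast[8]     = { 0x01, 0x02, 0x22, 0xff, 0xac, 0x1b, 0x04, 0x02 };

        decode("OK case, frame 119 ", orf_token);   /* ORF_TOKEN */
        decode("OK case, frame 5141", mcast);       /* MCAST */
        return 0;
    }

Feeding in the NG-case bytes (00 00 22 ff ...) the same way simply confirms
that what keeps circulating there is the ORF token, with no MCAST following.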
>
> - packages on CentOS 5.6
> + pacemaker-1.0.10-1.4.el5
> + corosync-1.2.5-1.3.el5
>
> Thank you in advance,
> Hisashi Osanai
>
> Hisashi Osanai (osanai.hisashi at jp.fujitsu.com)
>
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org