[Pacemaker] Node remains offline (was Node remains online)
Andrew Beekhof
andrew at beekhof.net
Fri Mar 11 09:57:36 UTC 2011
On Thu, Mar 10, 2011 at 9:10 PM, Bart Coninckx <bart.coninckx at telenet.be> wrote:
> Hi all,
>
> I have a three node cluster and while introducing the third node, it
> remains offline no matter what I do.
Nothing you've shown here seems to indicate it's offline - what leads
you to that conclusion?
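
One way to check what the log itself reports is to scan it for the crmd status lines, such as the "Client xen3/crmd now has status [online]" entry in the excerpt quoted below. The following is only a minimal sketch: the function name last_crmd_status is hypothetical, and it assumes the log file holds one event per line (the excerpt in this mail is wrapped by the mail client) and that status lines follow exactly the format shown in the excerpt.

```python
import re

# Matches crmd status lines of the form seen in the quoted log, e.g.
#   ... crmd_peer_update: Status update: Client xen3/crmd now has status [online] (DC=<null>)
STATUS_RE = re.compile(
    r"crmd_peer_update: Status update: Client (?P<node>[^/\s]+)/crmd "
    r"now has status \[(?P<status>[^\]]+)\]"
)


def last_crmd_status(log_lines):
    """Return {node: status}, keeping the most recent status line per node.

    Hypothetical helper: assumes one unwrapped log event per line.
    """
    statuses = {}
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match:
            # Later lines overwrite earlier ones, so the last report wins.
            statuses[match.group("node")] = match.group("status")
    return statuses


# Example using the status line from the quoted log, joined onto one line.
sample = [
    "Mar 10 20:55:27 xen3 crmd: [10122]: notice: crmd_peer_update: Status "
    "update: Client xen3/crmd now has status [online] (DC=<null>)",
    "Mar 10 20:55:27 corosync [MAIN ] Completed service synchronization, "
    "ready to provide service.",
]
print(last_crmd_status(sample))
```

Running the same function over the full /var/log/ha-log on each node would show whether any node ever logs a status other than online for xen3.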
> Another symptom is that stopping
> openais takes forever on that node, while it is waiting for crmd to unload.
>
> The logfile shows this node (xen3) to be online however:
>
> Mar 10 20:55:26 corosync [pcmk ] info: pcmk_ipc: Recorded connection
> 0x6987c0 for attrd/10120
> Mar 10 20:55:26 corosync [pcmk ] info: pcmk_ipc: Recorded connection
> 0x69cb20 for cib/10118
> Mar 10 20:55:26 corosync [pcmk ] info: pcmk_ipc: Sending membership
> update 4100 to cib
> Mar 10 20:55:26 corosync [CLM ] CLM CONFIGURATION CHANGE
> Mar 10 20:55:26 corosync [CLM ] New Configuration:
> Mar 10 20:55:26 corosync [CLM ] r(0) ip(10.0.1.13) r(1)
> ip(10.0.2.13)
> Mar 10 20:55:26 corosync [CLM ] Members Left:
> Mar 10 20:55:26 corosync [CLM ] Members Joined:
> Mar 10 20:55:26 corosync [pcmk ] notice: pcmk_peer_update: Transitional
> membership event on ring 4104: memb=1, new=0, lost=0
> Mar 10 20:55:26 corosync [pcmk ] info: pcmk_peer_update: memb: xen3
> 218169354
> Mar 10 20:55:26 corosync [CLM ] CLM CONFIGURATION CHANGE
> Mar 10 20:55:26 corosync [CLM ] New Configuration:
> Mar 10 20:55:26 corosync [CLM ] r(0) ip(10.0.1.11) r(1)
> ip(10.0.2.11)
> Mar 10 20:55:26 corosync [CLM ] r(0) ip(10.0.1.12) r(1)
> ip(10.0.2.12)
> Mar 10 20:55:26 corosync [CLM ] r(0) ip(10.0.1.13) r(1)
> ip(10.0.2.13)
> Mar 10 20:55:26 corosync [CLM ] Members Left:
> Mar 10 20:55:26 corosync [CLM ] Members Joined:
> Mar 10 20:55:26 corosync [CLM ] r(0) ip(10.0.1.11) r(1)
> ip(10.0.2.11)
> Mar 10 20:55:26 corosync [CLM ] r(0) ip(10.0.1.12) r(1)
> ip(10.0.2.12)
> Mar 10 20:55:26 corosync [pcmk ] notice: pcmk_peer_update: Stable
> membership event on ring 4104: memb=3, new=2, lost=0
> Mar 10 20:55:26 corosync [pcmk ] info: update_member: Creating entry
> for node 184614922 born on 4104
> Mar 10 20:55:26 corosync [pcmk ] info: update_member: Node
> 184614922/unknown is now: member
> Mar 10 20:55:26 corosync [pcmk ] info: pcmk_peer_update: NEW:
> .pending. 184614922
> Mar 10 20:55:26 corosync [pcmk ] info: update_member: Creating entry
> for node 201392138 born on 4104
> Mar 10 20:55:26 corosync [pcmk ] info: update_member: Node
> 201392138/unknown is now: member
> Mar 10 20:55:26 corosync [pcmk ] info: pcmk_peer_update: NEW:
> .pending. 201392138
> Mar 10 20:55:26 corosync [pcmk ] info: pcmk_peer_update: MEMB:
> .pending. 184614922
> Mar 10 20:55:26 corosync [pcmk ] info: pcmk_peer_update: MEMB:
> .pending. 201392138
> Mar 10 20:55:26 corosync [pcmk ] info: pcmk_peer_update: MEMB: xen3
> 218169354
> Mar 10 20:55:26 corosync [pcmk ] info: send_member_notification:
> Sending membership update 4104 to 1 children
> Mar 10 20:55:26 corosync [pcmk ] info: update_member: 0x7f4268000c80
> Node 218169354 ((null)) born on: 4104
> Mar 10 20:55:26 corosync [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> Mar 10 20:55:26 corosync [pcmk ] info: update_member: 0x7f4268001120
> Node 201392138 (xen2) born on: 3800
> Mar 10 20:55:26 corosync [pcmk ] info: update_member: 0x7f4268001120
> Node 201392138 now known as xen2 (was: (null))
> Mar 10 20:55:26 corosync [pcmk ] info: update_member: Node xen2 now has
> process list: 00000000000000000000000000151312 (1381138)
> Mar 10 20:55:26 corosync [pcmk ] info: update_member: Node xen2 now has
> 1 quorum votes (was 0)
> Mar 10 20:55:26 corosync [pcmk ] info: send_member_notification:
> Sending membership update 4104 to 1 children
> Mar 10 20:55:26 corosync [pcmk ] WARN: route_ais_message: Sending
> message to local.crmd failed: ipc delivery failed (rc=-2)
> Mar 10 20:55:26 xen3 cib: [10118]: notice: ais_dispatch_message:
> Membership 4104: quorum acquired
> Mar 10 20:55:26 corosync [pcmk ] info: update_member: 0x7f4268000aa0
> Node 184614922 (xen1) born on: 3792
> Mar 10 20:55:26 corosync [pcmk ] info: update_member: 0x7f4268000aa0
> Node 184614922 now known as xen1 (was: (null))
> Mar 10 20:55:26 corosync [pcmk ] info: update_member: Node xen1 now has
> process list: 00000000000000000000000000151312 (1381138)
> Mar 10 20:55:26 corosync [pcmk ] info: update_member: Node xen1 now has
> 1 quorum votes (was 0)
> Mar 10 20:55:26 corosync [pcmk ] info: update_expected_votes: Expected
> quorum votes 2 -> 3
> Mar 10 20:55:26 corosync [pcmk ] info: send_member_notification:
> Sending membership update 4104 to 1 children
> Mar 10 20:55:26 corosync [pcmk ] WARN: route_ais_message: Sending
> message to local.crmd failed: ipc delivery failed (rc=-2)
> Mar 10 20:55:26 corosync [TOTEM ] Marking ringid 1 interface 10.0.2.13
> FAULTY - adminisrtative intervention required.
> Mar 10 20:55:26 corosync [pcmk ] WARN: route_ais_message: Sending
> message to local.crmd failed: ipc delivery failed (rc=-2)
> Mar 10 20:55:26 xen3 cib: [10118]: WARN: cib_diff_notify: Local-only
> Change (client:crmd, call: 1742): -1.-1.-1 (Application of an update
> diff failed, requesting a full refresh)
> Mar 10 20:55:27 corosync [pcmk ] info: pcmk_ipc: Recorded connection
> 0x7f4268002040 for crmd/10122
> Mar 10 20:55:27 corosync [pcmk ] info: pcmk_ipc: Sending membership
> update 4104 to crmd
> Mar 10 20:55:27 xen3 crmd: [10122]: notice: ais_dispatch_message:
> Membership 4104: quorum acquired
> Mar 10 20:55:27 xen3 crmd: [10122]: notice: crmd_peer_update: Status
> update: Client xen3/crmd now has status [online] (DC=<null>)
> Mar 10 20:55:27 corosync [MAIN ] Completed service synchronization,
> ready to provide service.
> Mar 10 20:55:27 xen3 cib: [10118]: WARN: cib_server_process_diff: Not
> applying diff 0.1672.12 -> 0.1672.13 (sync in progress)
> Mar 10 20:55:27 xen3 mgmtd: [10123]: debug: main: run the loop...
> Mar 10 20:55:27 xen3 mgmtd: [10123]: info: Started.
> Mar 10 20:55:27 xen3 lrmd: [10119]: info: setting max-children to 4
>
>
> ps afx shows all relevant processes in a normal state though:
>
> 10111 ? Ssl 0:00 /usr/sbin/corosync
> 10117 ? S 0:00 \_ /usr/lib64/heartbeat/stonithd
> 10118 ? S 0:00 \_ /usr/lib64/heartbeat/cib
> 10119 ? S 0:00 \_ /usr/lib64/heartbeat/lrmd
> 10120 ? S 0:00 \_ /usr/lib64/heartbeat/attrd
> 10121 ? S 0:00 \_ /usr/lib64/heartbeat/pengine
> 10122 ? S 0:00 \_ /usr/lib64/heartbeat/crmd
> 10123 ? S 0:00 \_ /usr/lib64/heartbeat/mgmtd
>
>
> I tried to remove the node with crm_node -R= to no avail.
>
> The versions in use are:
>
> corosync-1.2.6-0.2.2
> openais-1.1.3-0.2.3
> pacemaker-1.1.2-0.7.1
>
> corosync.conf looks like this:
>
> aisexec {
>     group: root
>     user: root
> }
> service {
>     use_mgmtd: yes
>     ver: 0
>     name: pacemaker
> }
> totem {
>     rrp_mode: passive
>     token_retransmits_before_loss_const: 10
>     join: 1000
>     max_messages: 20
>     vsftype: none
>     token: 5000
>     consensus: 7500
>     secauth: off
>     version: 2
>
>     interface {
>         bindnetaddr: 10.0.1.0
>         mcastaddr: 226.94.1.1
>         mcastport: 5405
>         ringnumber: 0
>     }
>     interface {
>         bindnetaddr: 10.0.2.0
>         mcastaddr: 226.84.2.1
>         mcastport: 5406
>         ringnumber: 1
>     }
>     clear_node_high_bit: yes
> }
> logging {
>     to_logfile: yes
>     logfile: /var/log/ha-log
>     timestamp: on
>     syslog_facility: daemon
>     to_syslog: no
>     debug: on
>     to_stderr: yes
>     fileline: off
> }
> amf {
>     mode: disable
> }
>
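
Since the log above marks the ring 1 interface (10.0.2.13) as FAULTY, it may help to list the interface blocks of corosync.conf from each node side by side and compare them. The sketch below is not a full corosync.conf parser: the function name totem_interfaces is hypothetical, and it assumes the simple "key: value" and brace layout used in the file quoted above.

```python
def totem_interfaces(conf_text):
    """Collect the key/value pairs of each `interface { ... }` block, in file order.

    Hypothetical helper: only understands the flat layout shown in the
    quoted corosync.conf, not the full configuration grammar.
    """
    interfaces = []
    current = None  # dict for the interface block being read, if any
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.startswith("interface") and line.endswith("{"):
            current = {}
        elif line == "}" and current is not None:
            interfaces.append(current)
            current = None
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return interfaces


# The two interface blocks from the configuration quoted above.
sample_conf = """
totem {
    rrp_mode: passive
    interface {
        bindnetaddr: 10.0.1.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
        ringnumber: 0
    }
    interface {
        bindnetaddr: 10.0.2.0
        mcastaddr: 226.84.2.1
        mcastport: 5406
        ringnumber: 1
    }
    clear_node_high_bit: yes
}
"""

for ring in totem_interfaces(sample_conf):
    print(ring["ringnumber"], ring["bindnetaddr"], ring["mcastaddr"], ring["mcastport"])
```

Running this against the file on xen1, xen2 and xen3 makes it easy to see whether all three nodes agree on the settings for each ring.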
>
> Does anyone have any suggestions on how to proceed?
>
> Thank you!!
>
> B.