[Pacemaker] Node name problems after upgrading to 1.1.9

Bernardo Cabezas Serra bcabezas at apsl.net
Thu Jun 27 06:20:08 EDT 2013


Hello,

Our cluster was working OK on the corosync stack, with corosync 2.3.0 and
pacemaker 1.1.8.

After upgrading (full versions and configs below), we began to have
problems with node names.
It's a two node cluster, with node names "turifel" (DC) and "selavi".

When selavi joins the cluster, we see this warning in selavi's log:

-----
Jun 27 11:54:29 selavi attrd[11998]:   notice: corosync_node_name:
Unable to get node name for nodeid 168385827
Jun 27 11:54:29 selavi attrd[11998]:   notice: get_node_name: Defaulting
to uname -n for the local corosync node name
-----

This is OK, and also happened with version 1.1.8.

At the corosync level, all seems OK:
----
Jun 27 11:51:18 turifel corosync[6725]:   [TOTEM ] A processor joined or
left the membership and a new membership (10.9.93.35:1184) was formed.
Jun 27 11:51:18 turifel corosync[6725]:   [QUORUM] Members[2]: 168385827
168385835
Jun 27 11:51:18 turifel corosync[6725]:   [MAIN  ] Completed service
synchronization, ready to provide service.
Jun 27 11:51:18 turifel crmd[19526]:   notice: crm_update_peer_state:
pcmk_quorum_notification: Node selavi[168385827] - state is now member
(was lost)
-------
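As a sanity check, I decoded the nodeids in the logs above. Assuming corosync 2.x's default behaviour of deriving automatic nodeids from the node's ring0 IPv4 address (we set no explicit nodeids), they map back to the expected addresses:

```python
import ipaddress

def nodeid_to_ip(nodeid):
    # An auto-generated corosync nodeid is just the ring0 IPv4 address
    # packed into a 32-bit integer, so it decodes directly.
    return str(ipaddress.ip_address(nodeid))

print(nodeid_to_ip(168385827))  # 10.9.93.35 (selavi, per the log above)
print(nodeid_to_ip(168385835))  # 10.9.93.43 (so this one must be turifel)
```

So the nodeids themselves look consistent with the membership; the problem seems to be purely at the name-resolution level.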

But when starting pacemaker on selavi (the new node), turifel's log shows
this:

----
Jun 27 11:54:28 turifel crmd[19526]:   notice: do_state_transition:
State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN
cause=C_FSA_INTERNAL origin=peer_update_callback ]
Jun 27 11:54:28 turifel crmd[19526]:  warning: crm_get_peer: Node
'selavi' and 'selavi' share the same cluster nodeid: 168385827
Jun 27 11:54:28 turifel crmd[19526]:  warning: crmd_cs_dispatch:
Recieving messages from a node we think is dead: selavi[0]
Jun 27 11:54:29 turifel crmd[19526]:  warning: crm_get_peer: Node
'selavi' and 'selavi' share the same cluster nodeid: 168385827
Jun 27 11:54:29 turifel crmd[19526]:  warning: do_state_transition: Only
1 of 2 cluster nodes are eligible to run resources - continue 0
Jun 27 11:54:29 turifel attrd[19524]:   notice: attrd_local_callback:
Sending full refresh (origin=crmd)
----

And selavi remains in the pending state. Sometimes turifel (the DC) fences
selavi, but other times it remains pending forever.

On the turifel node, all resources give warnings like this one:
 warning: custom_action: Action p_drbd_ha0:0_monitor_0 on selavi is
unrunnable (pending)

On both nodes, uname -n and crm_node -n give the correct node names (selavi
and turifel, respectively).

Do you think it's a configuration problem?


Below I give information about versions and configurations.

Best regards,
Bernardo.


-----
Versions (git/hg compiled versions):

corosync: 2.3.0.66-615d
pacemaker: 1.1.9-61e4b8f
cluster-glue: 1.0.11
libqb:  0.14.4.43-bb4c3
resource-agents: 3.9.5.98-3b051
crmsh: 1.2.5

The cluster also has drbd, dlm and gfs2, but I think those versions are
irrelevant here.

--------
Output of pacemaker configuration:
./configure --prefix=/opt/ha --without-cman \
    --without-heartbeat --with-corosync \
    --enable-fatal-warnings=no --with-lcrso-dir=/opt/ha/libexec/lcrso

pacemaker configuration:
  Version                  = 1.1.9 (Build: 61e4b8f)
  Features                 = generated-manpages ascii-docs ncurses
libqb-logging libqb-ipc lha-fencing upstart nagios  corosync-native snmp
libesmtp

  Prefix                   = /opt/ha
  Executables              = /opt/ha/sbin
  Man pages                = /opt/ha/share/man
  Libraries                = /opt/ha/lib
  Header files             = /opt/ha/include
  Arch-independent files   = /opt/ha/share
  State information        = /opt/ha/var
  System configuration     = /opt/ha/etc
  Corosync Plugins         = /opt/ha/lib

  Use system LTDL          = yes

  HA group name            = haclient
  HA user name             = hacluster

  CFLAGS                   = -I/opt/ha/include -I/opt/ha/include
-I/opt/ha/include/heartbeat    -I/opt/ha/include   -I/opt/ha/include
-ggdb  -fgnu89-inline -fstack-protector-all -Wall -Waggregate-return
-Wbad-function-cast -Wcast-align -Wdeclaration-after-statement
-Wendif-labels -Wfloat-equal -Wformat=2 -Wformat-security
-Wformat-nonliteral -Wmissing-prototypes -Wmissing-declarations
-Wnested-externs -Wno-long-long -Wno-strict-aliasing
-Wunused-but-set-variable -Wpointer-arith -Wstrict-prototypes
-Wwrite-strings
  Libraries                = -lgnutls -lcorosync_common -lplumb -lpils
-lqb -lbz2 -lxslt -lxml2 -lc -luuid -lpam -lrt -ldl  -lglib-2.0   -lltdl
-L/opt/ha/lib -lqb -ldl -lrt -lpthread
  Stack Libraries          =   -L/opt/ha/lib -lqb -ldl -lrt -lpthread
-L/opt/ha/lib -lcpg   -L/opt/ha/lib -lcfg   -L/opt/ha/lib -lcmap
-L/opt/ha/lib -lquorum

----
Corosync config:

totem {
        version: 2
        crypto_cipher: none
        crypto_hash: none
        cluster_name: fiestaha
        interface {
                ringnumber: 0
                ttl: 1
                bindnetaddr: 10.9.93.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}
logging {
        fileline: off
        to_stderr: yes
        to_logfile: no
        to_syslog: yes
        syslog_facility: local7
        debug: off
        timestamp: on
        logger_subsys {
                subsys: QUORUM
                debug: off
        }
}
quorum {
        provider: corosync_votequorum
        expected_votes: 2
        two_node: 1
        wait_for_all: 0
}
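
One thing I have not tried yet, in case it is relevant: we define no explicit
nodelist, so pacemaker has to fall back to uname -n for names. A variant I
could test (a sketch only, with the addresses deduced from the nodeids in the
logs above) would be:

```
nodelist {
        node {
                ring0_addr: 10.9.93.35
                name: selavi
        }
        node {
                ring0_addr: 10.9.93.43
                name: turifel
        }
}
```

If the problem is in name resolution rather than membership, pinning the names
this way might avoid the crm_get_peer warnings, but I'd welcome confirmation
before changing the running config.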


-- 
APSL
*Bernardo Cabezas Serra*
*Responsable Sistemas*
Camí Vell de Bunyola 37, esc. A, local 7
07009 Polígono de Son Castelló, Palma
Mail: bcabezas at apsl.net
Skype: bernat.cabezas
Tel: 971439771