[Pacemaker] compression with heartbeat doesn't seem to work

Fri Aug 19 14:02:24 CET 2011

Hi,
  We are running a two-node cluster using pacemaker 1.1.5-18.1 with heartbeat 3.0.4-41.1.  We are seeing what look like network issues and cannot get heartbeat to recover: it reports "message too long" errors, and the nodes can no longer sync.
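
For reference, "message too long" matches the usual error text for EMSGSIZE, which a UDP send can return when a datagram is too large to send. Below is a minimal Python sketch of why compression should matter here. It assumes each cluster message goes out as a single UDP datagram over IPv4 (an assumption, not checked against the heartbeat source), and the sample payload is made up for illustration:

```python
import errno
import os
import zlib

# Largest UDP payload over IPv4: 65535 - 20 (IP header) - 8 (UDP header).
MAX_UDP_PAYLOAD = 65535 - 20 - 8


def fits_in_datagram(payload: bytes, compress: bool = False) -> bool:
    """Return True if the (optionally zlib-compressed) payload fits in one datagram."""
    data = zlib.compress(payload) if compress else payload
    return len(data) <= MAX_UDP_PAYLOAD


# Hypothetical, highly repetitive stand-in for a large cluster message.
sample = b'<primitive id="res" class="ocf" type="Dummy"/>\n' * 3000

print("EMSGSIZE text on this system:", os.strerror(errno.EMSGSIZE))
print("raw size:", len(sample))
print("fits uncompressed:", fits_in_datagram(sample))
print("fits with zlib:", fits_in_datagram(sample, compress=True))
```

If compression were actually applied on the wire, a payload like this would shrink well below the datagram limit, which is why the error makes us suspect that compression is not taking effect.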

Our ha.cf is as follows:
autojoin none
use_logd false
logfacility daemon
debug 0

# use the v2 cluster resource manager
crm yes

# the cluster communication happens via unicast on bond0 and hb1
# hb1 is direct connect
ucast hb1 169.254.1.3
ucast hb1 169.254.1.4
ucast bond0 172.28.102.21
ucast bond0 172.28.102.51
compression zlib
compression_threshold 30

# msgfmt
msgfmt netstring

# a node will be flagged as dead if there is not response for 20 seconds
deadtime 30
initdead 30
keepalive 250ms
uuidfrom nodename

# these are the node names participating in the cluster
# the names should match "uname -n" output on the system
node usrv-qpr2
node usrv-qpr5
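
A rough sanity check of the settings above, as a Python sketch. It only splits each line into a directive and its value. The helper names (parse_ha_cf, to_seconds) are hypothetical, and it assumes time values are in seconds unless suffixed with "ms":

```python
from collections import defaultdict

# Relevant lines copied from the ha.cf above.
HA_CF = """\
ucast hb1 169.254.1.3
ucast hb1 169.254.1.4
ucast bond0 172.28.102.21
ucast bond0 172.28.102.51
compression zlib
compression_threshold 30
deadtime 30
initdead 30
keepalive 250ms
node usrv-qpr2
node usrv-qpr5
"""


def parse_ha_cf(text):
    """Map each directive to the list of values it appears with."""
    directives = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        key, _, value = line.partition(" ")
        directives[key].append(value.strip())
    return directives


def to_seconds(value):
    """Assumption: a bare number is seconds, an 'ms' suffix is milliseconds."""
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    return float(value)


cfg = parse_ha_cf(HA_CF)

# Group unicast peers by interface.
peers = defaultdict(list)
for entry in cfg["ucast"]:
    iface, addr = entry.split()
    peers[iface].append(addr)

deadtime = to_seconds(cfg["deadtime"][0])
keepalive = to_seconds(cfg["keepalive"][0])
missed = deadtime / keepalive  # keepalives missed before a node is declared dead

print("nodes:", cfg["node"])
print("ucast peers:", dict(peers))
print("compression:", cfg["compression"], "threshold:", cfg["compression_threshold"])
print("missed keepalives before dead:", missed)
```

With the values above, both interfaces list two unicast peers each, and a node is declared dead only after 120 missed keepalives.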

We can ping all interfaces from both nodes.  One of the bonded NICs had some trouble, but we believe we have enough redundancy built in that it should be fine.
The issue we see is that if we reboot the non-DC node, it can no longer sync with the DC.  The log on the non-DC node shows that the remote node cannot be reached.  crm_mon on the non-DC node shows:

Last updated: Fri Aug 19 07:39:05 2011
Stack: Heartbeat
Current DC: NONE
2 Nodes configured, 2 expected votes
26 Resources configured.
============

Node usrv-qpr2 (87df4a75-fa67-c05e-1a07-641fa79784e0): UNCLEAN (offline)
Node usrv-qpr5 (7fb57f74-fae5-d493-e2c7-e4eda2430217): UNCLEAN (offline)