[Pacemaker] 2-node cluster: one node refusing to join with "not in our membership"

Tue Jun 26 01:30:34 EDT 2012

This looks like bug http://bugs.clusterlabs.org/show_bug.cgi?id=5040 and an 
earlier thread: 
http://thread.gmane.org/gmane.linux.highavailability.pacemaker/13185/focus=13321

According to that bug, corosync 1.4.3 may have solved it, yet the bug is 
still open, with a comment from Andrew Beekhof saying he'd reproduced it 
again on 4/18. From the thread, pacemaker 1.1.7 with a commit by Andrew 
may help, but he apparently still sees some of the behavior.

OS: CentOS 5.7 x86_64
pacemaker 1.1.6
glue: 1.0.9
corosync 1.4.2
  - all RPMs were built from source and stored locally for deployment.

Nodes: omc1 and omc2, both virtual machines on CentOS 5.7.

Resources: mainly a floating IP, mysql and httpd along with a few custom 
services - seemed simple.  No shared storage.

This seems like a pretty critical bug.  I've not been able to reproduce 
it in the lab (of course not), but my production cluster is running on a 
single cylinder.  I do have logs from the event that seemed to cause it, 
if they'd help (prefer pastebin?  here on the list?).  I've tried to 
collect the logs with crm_report, but never seem to wind up with 
anything in the archives it creates.  I'm currently building and 
testing corosync 1.4.3, but since I can't reproduce the problem, I have 
little confidence that the upgrade will fix it.

Secondly - is there a recommended process for bringing the messed-up node 
back into the cluster?  I've probably horked it beyond recognition with 
shutdowns, crm commands, removing crm configs, editing the cib with 
cibadmin, and trying to replace it based on other threads and advice.  I 
currently have the pacemaker and corosync services shut off on that node; 
the thought of it killing my active node when it rejoins is too 
terrifying to contemplate.

Let me know what info would help...

Thanks,

Brent