[Pacemaker] 2-node cluster: one node refusing to join with "not in our membership"
Brent Harsh
pacemaker at brentharsh.com
Tue Jun 26 07:30:34 CEST 2012
Seems like bug http://bugs.clusterlabs.org/show_bug.cgi?id=5040 and an
earlier thread:
http://thread.gmane.org/gmane.linux.highavailability.pacemaker/13185/focus=13321
According to that bug, corosync 1.4.3 may have solved it, yet the bug is
still open, with a comment from Andrew Beekhof saying he'd reproduced it
again on 4/18. From the thread, pacemaker 1.1.7 with a commit by Andrew
might help, but he still sees some of the behavior.
OS: CentOS 5.7 x86_64
pacemaker 1.1.6
glue: 1.0.9
corosync 1.4.2
- all RPMs were built from source and stored locally for deployment.
nodes: omc1 and omc2: both virtual machines on CentOS 5.7.
Resources: mainly a floating IP, mysql and httpd along with a few custom
services - seemed simple. No shared storage.
This seems like a pretty critical bug. I've not been able to reproduce
it in the lab (of course not) but my production cluster is running on a
single cylinder. I do have logs from the event that seemed to cause it
if they'd help (prefer pastebin? here on the list?); I've tried to dump
the collection of logs with crm_report but never seem to wind up with
anything in the archives it creates. I'm currently building and
testing 1.4.3, but since I can't reproduce the problem, I'm less than
thrilled about the prospects and not feeling confident.
Secondly - any recommended process to bring the messed up node back into
the cluster game? I've probably horked it beyond recognition with
shutdowns/crm commands/rm crm configs, editing the cib with cibadmin and
trying to replace it based on other threads and advice. I currently
have the pacemaker and corosync services shut off on that node; it is
too terrifying to contemplate it killing my active node by rejoining.
Let me know what info would help...
Thanks,
Brent