[Pacemaker] 2-node cluster: one node refusing to join with "not in our membership"

Wed Jun 27 01:33:52 UTC 2012

On Tue, Jun 26, 2012 at 3:30 PM, Brent Harsh <pacemaker at brentharsh.com> wrote:
> Seems like bug http://bugs.clusterlabs.org/show_bug.cgi?id=5040 and an
> earlier thread:
> http://thread.gmane.org/gmane.linux.highavailability.pacemaker/13185/focus=13321

I believe we've finally got to the bottom of this one.
Looks like it was a symptom of this corosync bug:
   https://bugzilla.redhat.com/show_bug.cgi?id=820821

The good news is that it's been fixed, though I don't think it's in any
packages yet.

>
> According to that bug, corosync 1.4.3 may have solved it, yet the bug is
> still open, with a comment from Andrew Beekhof saying he'd reproduced it
> again on 4/18. From the thread, pacemaker 1.1.7 with a commit by Andrew
> may help, but he sees some behavior.
>
> OS: CentOS 5.7 x86_64
> pacemaker 1.1.6
> glue: 1.0.9
> corosync 1.4.2
>  - all RPMs were built from source and stored locally for deployment.
>
> nodes: omc1 and omc2: both virtual machines on CentOS 5.7.
>
> Resources: mainly a floating IP, mysql and httpd along with a few custom
> services - seemed simple.  No shared storage.
>
> This seems like a pretty critical bug.  I've not been able to reproduce it
> in the lab (of course not) but my production cluster is running on a single
> cylinder.  I do have logs from the event that seemed to cause it if they'd
> help (prefer pastebin?  here on the list?); I've tried to dump the
> collection of logs with crm_report but never seem to wind up with anything
> in the archives it creates.  I'm currently building and testing 1.4.3 but
> since I can't reproduce it, I'm less than thrilled about the prospects and
> not feeling confident.
>
> Secondly - any recommended process to bring the messed up node back into the
> cluster game?  I've probably horked it beyond recognition with shutdowns/crm
> commands/rm crm configs, editing the cib with cibadmin and trying to replace
> it based on other threads and advice.  I currently have pacemaker and
> corosync services shut off - too terrifying to contemplate it killing my
> active node by interacting with it.
>
> Let me know what info would help...
>
> Thanks,
>
> Brent