[Pacemaker] node is offline; can't bring online

Wed Nov 7 23:16:50 EST 2012

On Thu, Nov 8, 2012 at 2:56 PM, Paul Archer <paul at paularcher.org> wrote:
> I don't really know when the trouble started.
> I ended up restarting pacemaker on all nodes, and it cleared things
> up. I'm not sure why, though.

You /may/ have been experiencing a known membership issue in older
versions of pacemaker and corosync.
But I can't say for sure based on your email.

I'd highly encourage an upgrade of at least corosync.
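
To check whether nodes disagree about membership, as the two crm_mon outputs quoted further down do, it may help to reduce each node's `crm_mon -1` output to the two lines that matter. The following is only a rough sketch: the function name `summarize_view` is made up for illustration, and it assumes the "Current DC:" and "OFFLINE:" line formats shown in this thread, which other crm_mon versions may not share.

```shell
#!/bin/sh
# Summarise one node's view of the cluster from `crm_mon -1` output on stdin.
# Assumption: lines look like "Current DC: ..." and "OFFLINE: [ a b c ]",
# as in the output quoted in this thread. summarize_view is a hypothetical helper.
summarize_view() {
    awk '
        # Keep everything after the "Current DC:" label (e.g. "NONE").
        /^Current DC:/ { sub(/^Current DC:[ \t]*/, ""); dc = $0 }
        # Strip the label and the surrounding brackets from the OFFLINE list.
        /^OFFLINE:/    { gsub(/^OFFLINE:[ \t]*\[[ \t]*|[ \t]*\][ \t]*$/, ""); off = $0 }
        END {
            print "dc=" (dc == "" ? "unknown" : dc)
            print "offline=" (off == "" ? "none" : off)
        }
    '
}

# Usage on each node (requires crm_mon to be installed):
#   crm_mon -1 | summarize_view
```

Running it on every node and comparing the results would show at a glance whether one node sees no DC while the others have elected one.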

> If I have the same issue come up, I'll run the crm_report and open a bug.

Great.
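
For reference, a minimal sketch of collecting the report for a time window. It assumes the usual `-f`/`-t` (from/to) options and a destination argument, which both crm_report and hb_report accept; the wrapper name `collect_report`, the time window, and the destination path are placeholders, not values from this thread's logs.

```shell
#!/bin/sh
# Collect a cluster report for a time window with whichever tool is installed.
# Arguments: FROM and TO as "YYYY-MM-DD HH:MM:SS", DEST as the report base name.
# collect_report is a hypothetical wrapper, not part of pacemaker.
collect_report() {
    from="$1"; to="$2"; dest="$3"
    for tool in crm_report hb_report; do
        if command -v "$tool" >/dev/null 2>&1; then
            # Run the first tool found and pass its exit status back.
            "$tool" -f "$from" -t "$to" "$dest"
            return $?
        fi
    done
    echo "neither crm_report nor hb_report found in PATH" >&2
    return 1
}

# Example with a placeholder window and destination:
#   collect_report "2012-11-07 20:00:00" "2012-11-07 21:00:00" /tmp/vmhost2-offline
```

The resulting tarball is what would be attached to the bug.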

>
> Thanks,
>
> Paul
>
> On Wed, Nov 7, 2012 at 9:22 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>> On Thu, Nov 8, 2012 at 1:55 PM, Paul Archer <paul at paularcher.org> wrote:
>>> I'm fairly new to pacemaker, and this is hurting my head.
>>> I have a four-node cluster, and one of my nodes (for no reason that I
>>> can discern) has gone offline, and I can't get it to come back online.
>>>
>>> Offline node:
>>> root@vmhost2:/var/lib/heartbeat# crm_mon -1
>>> ============
>>> Last updated: Wed Nov  7 20:52:16 2012
>>> Last change: Wed Nov  7 20:28:06 2012 via cibadmin on vmhost2
>>> Stack: openais
>>> Current DC: NONE
>>> 4 Nodes configured, 4 expected votes
>>> 7 Resources configured.
>>> ============
>>>
>>> OFFLINE: [ vgs1 vgs2 vmhost1 vmhost2 ]
>>>
>>>
>>>
>>> One of the online nodes:
>>> root@vmhost1:/var/lib/heartbeat/crm# crm_mon -1
>>> ============
>>> Last updated: Wed Nov  7 20:45:32 2012
>>> Last change: Wed Nov  7 20:44:59 2012 via crm_attribute on vgs2
>>> Stack: openais
>>> Current DC: vgs1 - partition with quorum
>>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>>> 4 Nodes configured, 4 expected votes
>>> 7 Resources configured.
>>> ============
>>>
>>> Node vmhost2: standby
>>> Online: [ vgs1 vgs2 vmhost1 ]
>>>
>>>  focus  (ocf::heartbeat:VirtualDomain): Started vmhost1
>>>  logger (ocf::heartbeat:VirtualDomain): Started vmhost1
>>>  mother (ocf::heartbeat:VirtualDomain): Started vmhost2
>>>  vgsIP  (ocf::heartbeat:IPaddr2):       Started vgs2
>>>  vgsWebServer   (ocf::heartbeat:apache):        Started vgs2
>>>
>>>
>>> I don't know what's relevant as far as log files, so I will post as
>>> people ask for specifics, rather than just dumping everything here to
>>> start with.
>>
>> You should have crm_report and/or hb_report.
>> Use whichever you have to gather everything from around the time the
>> node went offline.
>> Probably best to open a bug at http://bugs.clusterlabs.org and attach
>> the resulting tarball there.
>>
>> If the cluster is still in this state, it would also be useful to see
>> the corosync-objctl -a output from vgs1 and vmhost2, as well as the
>> output of cibadmin -Ql from vgs1.
>>
>>>
>>>
>>> Thanks for any help,
>>>
>>> Paul
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org