[Pacemaker] Remove a "ghost" node

Sun Nov 10 18:27:49 EST 2013

On 8 Nov 2013, at 12:59 pm, Sean Lutner <sean at rentul.net> wrote:

> 
> On Nov 7, 2013, at 8:34 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
> 
>> 
>> On 8 Nov 2013, at 4:45 am, Sean Lutner <sean at rentul.net> wrote:
>> 
>>> I have a confusing situation that I'm hoping to get help with. Last night after configuring STONITH on my two node cluster, I suddenly have a "ghost" node in my cluster. I'm looking to understand the best way to remove this node from the config.
>>> 
>>> I'm using the fence_ec2 device for STONITH. I dropped the script on each node, registered the device with stonith_admin -R -a fence_ec2 and confirmed the registration with both
>>> 
>>> # stonith_admin -I
>>> # pcs stonith list
>>> 
>>> I then configured STONITH per the Clusters from Scratch doc
>>> 
>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html
>>> 
>>> Here are my commands:
>>> # pcs cluster cib stonith_cfg
>>> # pcs -f stonith_cfg stonith create ec2-fencing fence_ec2 ec2-home="/opt/ec2-api-tools" pcmk_host_check="static-list" pcmk_host_list="ip-10-50-3-122 ip-10-50-3-251" op monitor interval="300s" timeout="150s" op start start-delay="30s" interval="0"
>>> # pcs -f stonith_cfg stonith
>>> # pcs -f stonith_cfg property set stonith-enabled=true
>>> # pcs -f stonith_cfg property
>>> # pcs cluster push cib stonith_cfg
>>> 
>>> After that I saw that STONITH appears to be functioning, but a new node was listed in the pcs status output:
>> 
>> Do the EC2 instances have fixed IPs?
>> I didn't have much luck with EC2 because every time they came back up it was with a new name/address which confused corosync and created situations like this.
> 
> The IPs persist across reboots as far as I can tell. I thought the problem was due to stonith being enabled but not working, so I removed the stonith_id and disabled stonith. After that I restarted pacemaker and cman on both nodes and things started as expected, but the ghost node is still there.
> 
> Someone else working on the cluster exported the CIB, removed the node and then imported the CIB. They used this process http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html
> 
> Even after that, the ghost node is still there. Would running pcs cluster cib > /tmp/cib-temp.xml, editing the node out of the config, and then running pcs cluster push cib /tmp/cib-temp.xml work?

No. If it's coming back then pacemaker is holding it in one of its internal caches.
The only way to clear it out in your version is to restart pacemaker on the DC.
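
To act on that you first need to know which node is the DC. A minimal sketch, assuming the "Current DC: <name> - ..." line format shown in the pcs status output quoted in this thread; the current_dc helper is a hypothetical name, and the ssh/service invocation in the comment is an assumption about an EL6 setup, not something tested here:

```shell
#!/bin/sh
# Print the name of the current DC, given `pcs status` text on stdin.
# Assumes a line of the form "Current DC: <name> - partition with quorum".
current_dc() {
    awk '/Current DC:/ { print $3; exit }'
}

# Hypothetical usage (left commented out so nothing is restarted by accident):
#   dc=$(pcs status | current_dc)
#   ssh "$dc" service pacemaker restart
```

Run against the status output quoted further down in this message, this should print ip-10-50-3-122.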

Actually... are you sure someone didn't just slip while editing cluster.conf?  [...].1251 does not look like a valid IP :)
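
A quick sanity check for names like this, as a rough sketch only: it assumes the EC2-style ip-A-B-C-D naming used by the nodes in this thread, and valid_ip_name is a hypothetical helper, not part of pcs or pacemaker.

```shell
#!/bin/sh
# Succeed (exit 0) only if the argument looks like ip-A-B-C-D
# with every one of A..D a decimal number in the range 0..255.
valid_ip_name() {
    printf '%s\n' "$1" | awk -F- '
        # Must be exactly "ip" followed by four dash-separated fields.
        NF != 5 || $1 != "ip" { exit 1 }
        {
            for (i = 2; i <= 5; i++)
                if ($i !~ /^[0-9]+$/ || $i + 0 > 255)
                    exit 1
        }
    '
}

# Example: report any suspicious name before passing it to a cluster command.
for n in ip-10-50-3-122 ip-10-50-3-251 ip-10-50-3-1251; do
    valid_ip_name "$n" || echo "suspicious node name: $n"
done
```

Checking a name like this before handing it to --node would at least catch a stray extra digit.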

>>> Version: 1.1.8-7.el6-394e906

There is now an update to 1.1.10 available for 6.4, that _may_ help in the future.

> 
> I may have to go back to the drawing board on a fencing device for the nodes. Are there any other recommendations for a cluster on EC2 nodes?
> 
> Thanks very much
> 
>> 
>>> 
>>> # pcs status
>>> Last updated: Thu Nov  7 17:41:21 2013
>>> Last change: Thu Nov  7 04:29:06 2013 via cibadmin on ip-10-50-3-122
>>> Stack: cman
>>> Current DC: ip-10-50-3-122 - partition with quorum
>>> Version: 1.1.8-7.el6-394e906
>>> 3 Nodes configured, unknown expected votes
>>> 11 Resources configured.
>>> 
>>> 
>>> Node ip-10-50-3-1251: UNCLEAN (offline)
>>> Online: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>> 
>>> Full list of resources:
>>> 
>>> ClusterEIP_54.215.143.166      (ocf::pacemaker:EIP):   Started ip-10-50-3-122
>>> Clone Set: EIP-AND-VARNISH-clone [EIP-AND-VARNISH]
>>>    Started: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>    Stopped: [ EIP-AND-VARNISH:2 ]
>>> ec2-fencing    (stonith:fence_ec2):    Stopped 
>>> 
>>> I have no idea where the node that is marked UNCLEAN came from, though its name is clearly a typo of a proper cluster node's name.
>>> 
>>> The only command I ran with the bad node ID was:
>>> 
>>> # crm_resource --resource ClusterEIP_54.215.143.166 --cleanup --node ip-10-50-3-1251
>>> 
>>> Is there any possible way that could have caused the node to be added?
>>> 
>>> I tried running pcs cluster node remove ip-10-50-3-1251, but since there is no such node, and thus no pcsd, that failed. Is there a way I can safely remove this ghost node from the cluster? I can provide logs from pacemaker or corosync as needed.
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org