[Pacemaker] Remove a "ghost" node
Andrew Beekhof
andrew at beekhof.net
Thu Nov 14 23:47:38 UTC 2013
On 15 Nov 2013, at 10:24 am, Sean Lutner <sean at rentul.net> wrote:
>
> On Nov 14, 2013, at 6:14 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>
>>
>> On 14 Nov 2013, at 2:55 pm, Sean Lutner <sean at rentul.net> wrote:
>>
>>>
>>> On Nov 13, 2013, at 10:51 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>
>>>>
>>>> On 14 Nov 2013, at 1:12 pm, Sean Lutner <sean at rentul.net> wrote:
>>>>
>>>>>
>>>>> On Nov 10, 2013, at 8:03 PM, Sean Lutner <sean at rentul.net> wrote:
>>>>>
>>>>>>
>>>>>> On Nov 10, 2013, at 7:54 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>
>>>>>>>
>>>>>>> On 11 Nov 2013, at 11:44 am, Sean Lutner <sean at rentul.net> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 10, 2013, at 6:27 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 8 Nov 2013, at 12:59 pm, Sean Lutner <sean at rentul.net> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Nov 7, 2013, at 8:34 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 8 Nov 2013, at 4:45 am, Sean Lutner <sean at rentul.net> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I have a confusing situation that I'm hoping to get help with. Last night after configuring STONITH on my two node cluster, I suddenly have a "ghost" node in my cluster. I'm looking to understand the best way to remove this node from the config.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm using the fence_ec2 device for STONITH. I dropped the script on each node, registered the device with stonith_admin -R -a fence_ec2, and confirmed the registration with both
>>>>>>>>>>>>
>>>>>>>>>>>> # stonith_admin -I
>>>>>>>>>>>> # pcs stonith list
>>>>>>>>>>>>
>>>>>>>>>>>> I then configured STONITH per the Clusters from Scratch doc
>>>>>>>>>>>>
>>>>>>>>>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html
>>>>>>>>>>>>
>>>>>>>>>>>> Here are my commands:
>>>>>>>>>>>> # pcs cluster cib stonith_cfg
>>>>>>>>>>>> # pcs -f stonith_cfg stonith create ec2-fencing fence_ec2 ec2-home="/opt/ec2-api-tools" pcmk_host_check="static-list" pcmk_host_list="ip-10-50-3-122 ip-10-50-3-251" op monitor interval="300s" timeout="150s" op start start-delay="30s" interval="0"
>>>>>>>>>>>> # pcs -f stonith_cfg stonith
>>>>>>>>>>>> # pcs -f stonith_cfg property set stonith-enabled=true
>>>>>>>>>>>> # pcs -f stonith_cfg property
>>>>>>>>>>>> # pcs cluster push cib stonith_cfg
>>>>>>>>>>>>
>>>>>>>>>>>> After that I saw that STONITH appears to be functioning, but a new node is listed in the pcs status output:
>>>>>>>>>>>
>>>>>>>>>>> Do the EC2 instances have fixed IPs?
>>>>>>>>>>> I didn't have much luck with EC2 because every time the instances came back up it was with a new name/address, which confused corosync and created situations like this.
>>>>>>>>>>
>>>>>>>>>> The IPs persist across reboots as far as I can tell. I thought the problem was due to stonith being enabled but not working, so I removed the stonith_id and disabled stonith. After that I restarted pacemaker and cman on both nodes and things started as expected, but the ghost node is still there.
>>>>>>>>>>
>>>>>>>>>> Someone else working on the cluster exported the CIB, removed the node and then imported the CIB. They used this process http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html
>>>>>>>>>>
>>>>>>>>>> Even after that, the ghost node is still there. Would running pcs cluster cib > /tmp/cib-temp.xml, editing the node out of the config, and then pcs cluster push cib /tmp/cib-temp.xml work?
>>>>>>>>>
>>>>>>>>> No. If it's coming back, then pacemaker is holding it in one of its internal caches.
>>>>>>>>> The only way to clear it out in your version is to restart pacemaker on the DC.
>>>>>>>>>
>>>>>>>>> Actually... are you sure someone didn't just slip while editing cluster.conf? [...].1251 does not look like a valid IP :)
>>>>>>>>
>>>>>>>> In the end this fixed it
>>>>>>>>
>>>>>>>> # pcs cluster cib > /tmp/cib-tmp.xml
>>>>>>>> # vi /tmp/cib-tmp.xml # remove bad node
>>>>>>>> # pcs cluster push cib /tmp/cib-tmp.xml
>>>>>>>>
>>>>>>>> Followed by restarting pacemaker and cman on both nodes. The ghost node disappeared, so it was cached as you mentioned.
>>>>>>>>
>>>>>>>> I also tracked the bad IP down to stray non-printing characters in the initial command line used to configure the fence_ec2 stonith device. I'd put the command together from the github README and some mailing list posts and laid it out in an external editor. Go me. :)
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> Version: 1.1.8-7.el6-394e906
>>>>>>>>>
>>>>>>>>> There is now an update to 1.1.10 available for 6.4, which _may_ help in the future.
>>>>>>>>
>>>>>>>> That's my next task. I believe I'm hitting the bug where failure-timeout doesn't clear the failcount, and I want to upgrade to 1.1.10. Is it safe to yum update pacemaker after stopping the cluster? I see there is also an updated pcs in CentOS 6.4; should I update that as well?
>>>>>>>
>>>>>>> yes and yes
>>>>>>>
>>>>>>> you might want to check if you're using any OCF resource agents that didn't make it into the first supported release though.
>>>>>>>
>>>>>>> http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
>>>>>>
>>>>>> Thanks, I'll give that a read. All the resource agents are custom so I'm thinking I'm okay (I'll back them up before upgrading).
>>>>>>
>>>>>> One last question related to the fence_ec2 script. Should crm_mon -VW show it running on both nodes or just one?
>>>>>
>>>>> I just went through the upgrade to pacemaker 1.1.10 and pcs. After running the yum update for those I ran a crm_verify and I'm seeing errors related to my order and colocation constraints. Did the behavior of these change from 1.1.8 to 1.1.10?
>>>>>
>>>>> # crm_verify -L -V
>>>>> error: unpack_order_template: Invalid constraint 'order-ClusterEIP_54.215.143.166-Varnish-mandatory': No resource or template named 'Varnish'
>>>>
>>>> Is that true?
>>>
>>> No, it's not. The resource exists and the script for the resource exists.
>>>
>>> I rolled back to 1.1.8 and the cluster started up without issue.
>>
>> Can you send us your config? (cibadmin -Ql)
>>
>> Is Varnish in a group or cloned? That might also explain things.
>
> The cibadmin output is attached.
>
> Yes, the varnish resources are in a group, which is then cloned.
-EDONTDOTHAT
You can't refer to the things inside a clone.
1.1.8 will have just been ignoring those constraints.
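The usual fix is to delete those constraints and, where you still need them, recreate them against the clone. A rough sketch only (the constraint IDs are taken from your crm_verify errors and the clone name from your earlier status output; pcs syntax differs a little between versions, so double-check the IDs with "pcs constraint --full" before removing anything):

# pcs constraint remove order-Varnish-Varnishlog-mandatory
# pcs constraint remove order-Varnishlog-Varnishncsa-mandatory
# pcs constraint remove colocation-Varnishlog-Varnish-INFINITY
# pcs constraint remove colocation-Varnishncsa-Varnishlog-INFINITY
# pcs constraint remove order-ClusterEIP_54.215.143.166-Varnish-mandatory
# pcs constraint remove colocation-Varnish-ClusterEIP_54.215.143.166-INFINITY

A group already implies ordering and colocation between its members, so (assuming the group lists Varnish, Varnishlog and Varnishncsa in that order) the first four constraints are redundant anyway. If you still want the EIP tied to the varnish stack, express that against the clone, e.g.:

# pcs constraint order ClusterEIP_54.215.143.166 then EIP-AND-VARNISH-clone
# pcs constraint colocation add ClusterEIP_54.215.143.166 with EIP-AND-VARNISH-clone INFINITY

That keeps the EIP on a node with an active clone instance and starts it before the clone; if you meant the dependency the other way around, swap the resources accordingly.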
>
> <cluster-config.out>
>
>
>>
>>>
>>>>
>>>>> error: unpack_order_template: Invalid constraint 'order-Varnish-Varnishlog-mandatory': No resource or template named 'Varnish'
>>>>> error: unpack_order_template: Invalid constraint 'order-Varnishlog-Varnishncsa-mandatory': No resource or template named 'Varnishlog'
>>>>> error: unpack_colocation_template: Invalid constraint 'colocation-Varnish-ClusterEIP_54.215.143.166-INFINITY': No resource or template named 'Varnish'
>>>>> error: unpack_colocation_template: Invalid constraint 'colocation-Varnishlog-Varnish-INFINITY': No resource or template named 'Varnishlog'
>>>>> error: unpack_colocation_template: Invalid constraint 'colocation-Varnishncsa-Varnishlog-INFINITY': No resource or template named 'Varnishncsa'
>>>>> Errors found during check: config not valid
>>>>>
>>>>> The cluster doesn't start. I'd prefer to figure out how to fix this rather than roll back to 1.1.8. Any help is appreciated.
>>>>>
>>>>> Thanks
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I may have to go back to the drawing board on a fencing device for the nodes. Are there any other recommendations for a cluster on EC2 nodes?
>>>>>>>>>>
>>>>>>>>>> Thanks very much
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> # pcs status
>>>>>>>>>>>> Last updated: Thu Nov 7 17:41:21 2013
>>>>>>>>>>>> Last change: Thu Nov 7 04:29:06 2013 via cibadmin on ip-10-50-3-122
>>>>>>>>>>>> Stack: cman
>>>>>>>>>>>> Current DC: ip-10-50-3-122 - partition with quorum
>>>>>>>>>>>> Version: 1.1.8-7.el6-394e906
>>>>>>>>>>>> 3 Nodes configured, unknown expected votes
>>>>>>>>>>>> 11 Resources configured.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Node ip-10-50-3-1251: UNCLEAN (offline)
>>>>>>>>>>>> Online: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>>>>>>>>>>
>>>>>>>>>>>> Full list of resources:
>>>>>>>>>>>>
>>>>>>>>>>>> ClusterEIP_54.215.143.166 (ocf::pacemaker:EIP): Started ip-10-50-3-122
>>>>>>>>>>>> Clone Set: EIP-AND-VARNISH-clone [EIP-AND-VARNISH]
>>>>>>>>>>>> Started: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>>>>>>>>>> Stopped: [ EIP-AND-VARNISH:2 ]
>>>>>>>>>>>> ec2-fencing (stonith:fence_ec2): Stopped
>>>>>>>>>>>>
>>>>>>>>>>>> I have no idea where the node that is marked UNCLEAN came from, though it's clearly a typo of a proper cluster node's name.
>>>>>>>>>>>>
>>>>>>>>>>>> The only command I ran with the bad node ID was:
>>>>>>>>>>>>
>>>>>>>>>>>> # crm_resource --resource ClusterEIP_54.215.143.166 --cleanup --node ip-10-50-3-1251
>>>>>>>>>>>>
>>>>>>>>>>>> Is there any possible way that could have caused the node to be added?
>>>>>>>>>>>>
>>>>>>>>>>>> I tried running pcs cluster node remove ip-10-50-3-1251, but since there is no such node (and thus no pcsd on it) that failed. Is there a way I can safely remove this ghost node from the cluster? I can provide logs from pacemaker or corosync as needed.