[Pacemaker] Remove a "ghost" node
Sean Lutner
sean at rentul.net
Fri Nov 15 03:28:36 UTC 2013
On Nov 14, 2013, at 7:19 PM, Sean Lutner <sean at rentul.net> wrote:
>
>
>> On Nov 14, 2013, at 6:47 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>
>>
>>> On 15 Nov 2013, at 10:24 am, Sean Lutner <sean at rentul.net> wrote:
>>>
>>>
>>>> On Nov 14, 2013, at 6:14 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>
>>>>
>>>>> On 14 Nov 2013, at 2:55 pm, Sean Lutner <sean at rentul.net> wrote:
>>>>>
>>>>>
>>>>>> On Nov 13, 2013, at 10:51 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>
>>>>>>
>>>>>>> On 14 Nov 2013, at 1:12 pm, Sean Lutner <sean at rentul.net> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Nov 10, 2013, at 8:03 PM, Sean Lutner <sean at rentul.net> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Nov 10, 2013, at 7:54 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 11 Nov 2013, at 11:44 am, Sean Lutner <sean at rentul.net> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Nov 10, 2013, at 6:27 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On 8 Nov 2013, at 12:59 pm, Sean Lutner <sean at rentul.net> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 7, 2013, at 8:34 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 8 Nov 2013, at 4:45 am, Sean Lutner <sean at rentul.net> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a confusing situation that I'm hoping to get help with. Last night, after configuring STONITH on my two-node cluster, a "ghost" node suddenly appeared in the cluster. I'm looking to understand the best way to remove this node from the config.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm using the fence_ec2 device for STONITH. I dropped the script on each node, registered the device with stonith_admin -R -a fence_ec2, and confirmed the registration with both
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # stonith_admin -I
>>>>>>>>>>>>>> # pcs stonith list
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I then configured STONITH per the Clusters from Scratch doc
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here are my commands:
>>>>>>>>>>>>>> # pcs cluster cib stonith_cfg
>>>>>>>>>>>>>> # pcs -f stonith_cfg stonith create ec2-fencing fence_ec2 ec2-home="/opt/ec2-api-tools" pcmk_host_check="static-list" pcmk_host_list="ip-10-50-3-122 ip-10-50-3-251" op monitor interval="300s" timeout="150s" op start start-delay="30s" interval="0"
>>>>>>>>>>>>>> # pcs -f stonith_cfg stonith
>>>>>>>>>>>>>> # pcs -f stonith_cfg property set stonith-enabled=true
>>>>>>>>>>>>>> # pcs -f stonith_cfg property
>>>>>>>>>>>>>> # pcs cluster push cib stonith_cfg
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> After that I saw that STONITH appears to be functioning, but a new node is listed in the pcs status output:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do the EC2 instances have fixed IPs?
>>>>>>>>>>>>> I didn't have much luck with EC2 because every time they came back up it was with a new name/address which confused corosync and created situations like this.
>>>>>>>>>>>>
>>>>>>>>>>>> The IPs persist across reboots as far as I can tell. I thought the problem was due to stonith being enabled but not working, so I removed the stonith_id and disabled stonith. After that I restarted pacemaker and cman on both nodes and things started as expected, but the ghost node is still there.
>>>>>>>>>>>>
>>>>>>>>>>>> Someone else working on the cluster exported the CIB, removed the node and then imported the CIB. They used this process http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html
>>>>>>>>>>>>
>>>>>>>>>>>> Even after that, the ghost node is still there. Would pcs cluster cib > /tmp/cib-temp.xml, editing the node out of the config, and then pcs cluster push cib /tmp/cib-temp.xml work?
>>>>>>>>>>>
>>>>>>>>>>> No. If it's coming back then pacemaker is holding it in one of its internal caches.
>>>>>>>>>>> The only way to clear it out in your version is to restart pacemaker on the DC.
>>>>>>>>>>>
>>>>>>>>>>> Actually... are you sure someone didn't just slip while editing cluster.conf? [...].1251 does not look like a valid IP :)
>>>>>>>>>>
>>>>>>>>>> In the end this fixed it
>>>>>>>>>>
>>>>>>>>>> # pcs cluster cib > /tmp/cib-tmp.xml
>>>>>>>>>> # vi /tmp/cib-tmp.xml # remove bad node
>>>>>>>>>> # pcs cluster push cib /tmp/cib-tmp.xml
>>>>>>>>>>
>>>>>>>>>> Followed by restarting pacemaker and cman on both nodes. The ghost node disappeared, so it was cached as you mentioned.
>>>>>>>>>>
>>>>>>>>>> I also tracked the bad IP down to stray non-printing characters in the command line I used when configuring the fence_ec2 stonith device. I'd put the command together from the github README and some mailing list posts and laid it out in an external editor. Go me. :)
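>>>>>>>>>>
>>>>>>>>>> (In case anyone else hits this: something along these lines should expose stray control characters in a command saved to a file. The filename is just an example.)
>>>>>>>>>>
>>>>>>>>>> # cat -A stonith-create-cmd.txt          # shows control chars as ^X / M-X notation
>>>>>>>>>> # grep -n '[^[:print:]]' stonith-create-cmd.txt   # flag lines with non-printable bytes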
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>> Version: 1.1.8-7.el6-394e906
>>>>>>>>>>>
>>>>>>>>>>> There is now an update to 1.1.10 available for 6.4, which _may_ help in the future.
>>>>>>>>>>
>>>>>>>>>> That's my next task. I believe I'm hitting the bug where failure-timeout doesn't clear the failcount, and I want to upgrade to 1.1.10. Is it safe to yum update pacemaker after stopping the cluster? I see there is also an updated pcs in CentOS 6.4; should I update that as well?
>>>>>>>>>
>>>>>>>>> yes and yes
>>>>>>>>>
>>>>>>>>> you might want to check if you're using any OCF resource agents that didn't make it into the first supported release though.
>>>>>>>>>
>>>>>>>>> http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
>>>>>>>>
>>>>>>>> Thanks, I'll give that a read. All the resource agents are custom so I'm thinking I'm okay (I'll back them up before upgrading).
>>>>>>>>
>>>>>>>> One last question related to the fence_ec2 script. Should crm_mon -VW show it running on both nodes or just one?
>>>>>>>
>>>>>>> I just went through the upgrade to pacemaker 1.1.10 and pcs. After running the yum update for those, I ran crm_verify and I'm seeing errors related to my order and colocation constraints. Did the behavior of these change from 1.1.8 to 1.1.10?
>>>>>>>
>>>>>>> # crm_verify -L -V
>>>>>>> error: unpack_order_template: Invalid constraint 'order-ClusterEIP_54.215.143.166-Varnish-mandatory': No resource or template named 'Varnish'
>>>>>>
>>>>>> Is that true?
>>>>>
>>>>> No, it's not. The resource exists and the script for the resource exists.
>>>>>
>>>>> I rolled back to 1.1.8 and the cluster started up without issue.
>>>>
>>>> Can you send us your config? (cibadmin -Ql)
>>>>
>>>> Is Varnish in a group or cloned? That might also explain things.
>>>
>>> The cibadmin output is attached.
>>>
>>> Yes, the varnish resources are in a group, which is then cloned.
>>
>> -EDONTDOTHAT
>>
>> You can't refer to the things inside a clone.
>> 1.1.8 will have just been ignoring those constraints.
>
> So the implicit order and colocation constraints in a group and clone will take care of those?
>
> Which means remove the constraints and retry the upgrade?
I was able to get the upgrade done. I also had to upgrade the libqb package; I know that's been mentioned in other threads, but I think it should either be a dependency of the pacemaker package or be explicitly documented.
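For the archives, the sequence I used was roughly the following, reconstructed from memory, so treat it as a sketch rather than an exact transcript (run on each node):

# service pacemaker stop
# service cman stop
# yum update pacemaker pcs libqb
# service cman start
# service pacemaker start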
Second order of business: failover is no longer working as expected. Because the order and colocation constraints are gone, if one of the varnish resources fails, the EIP resource does not move to the other node like it used to.
Is there a way I can re-create that behavior?
The resource group EIP-AND-VARNISH holds the three varnish services and is cloned, so it runs on both nodes. If any of them fail, I want the EIP resource to move to the other node.
Any advice for doing this?
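My rough thought was to point the constraints at the clone rather than at the resources inside it, something like this (untested sketch; I'm not certain of the exact pcs constraint syntax in this version, and the score is a guess):

# pcs constraint colocation add ClusterEIP_54.215.143.166 EIP-AND-VARNISH-clone INFINITY
# pcs constraint order EIP-AND-VARNISH-clone then ClusterEIP_54.215.143.166

But I'm not sure that gives the behavior I'm after, i.e. the EIP moving away when a single clone instance fails on a node, and the ordering may need to go the other way around.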
Thanks
>
>>
>>>
>>> <cluster-config.out>
>>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>> error: unpack_order_template: Invalid constraint 'order-Varnish-Varnishlog-mandatory': No resource or template named 'Varnish'
>>>>>>> error: unpack_order_template: Invalid constraint 'order-Varnishlog-Varnishncsa-mandatory': No resource or template named 'Varnishlog'
>>>>>>> error: unpack_colocation_template: Invalid constraint 'colocation-Varnish-ClusterEIP_54.215.143.166-INFINITY': No resource or template named 'Varnish'
>>>>>>> error: unpack_colocation_template: Invalid constraint 'colocation-Varnishlog-Varnish-INFINITY': No resource or template named 'Varnishlog'
>>>>>>> error: unpack_colocation_template: Invalid constraint 'colocation-Varnishncsa-Varnishlog-INFINITY': No resource or template named 'Varnishncsa'
>>>>>>> Errors found during check: config not valid
>>>>>>>
>>>>>>> The cluster doesn't start. I'd prefer to figure out how to fix this rather than roll back to 1.1.8. Any help is appreciated.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I may have to go back to the drawing board on a fencing device for the nodes. Are there any other recommendations for a cluster on EC2 nodes?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks very much
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # pcs status
>>>>>>>>>>>>>> Last updated: Thu Nov 7 17:41:21 2013
>>>>>>>>>>>>>> Last change: Thu Nov 7 04:29:06 2013 via cibadmin on ip-10-50-3-122
>>>>>>>>>>>>>> Stack: cman
>>>>>>>>>>>>>> Current DC: ip-10-50-3-122 - partition with quorum
>>>>>>>>>>>>>> Version: 1.1.8-7.el6-394e906
>>>>>>>>>>>>>> 3 Nodes configured, unknown expected votes
>>>>>>>>>>>>>> 11 Resources configured.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Node ip-10-50-3-1251: UNCLEAN (offline)
>>>>>>>>>>>>>> Online: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Full list of resources:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ClusterEIP_54.215.143.166 (ocf::pacemaker:EIP): Started ip-10-50-3-122
>>>>>>>>>>>>>> Clone Set: EIP-AND-VARNISH-clone [EIP-AND-VARNISH]
>>>>>>>>>>>>>> Started: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>>>>>>>>>>>> Stopped: [ EIP-AND-VARNISH:2 ]
>>>>>>>>>>>>>> ec2-fencing (stonith:fence_ec2): Stopped
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have no idea where the node marked UNCLEAN came from, though it's clearly a typo of one of the proper cluster node names.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The only command I ran with the bad node ID was:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # crm_resource --resource ClusterEIP_54.215.143.166 --cleanup --node ip-10-50-3-1251
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there any possible way that could have caused the node to be added?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried running pcs cluster node remove ip-10-50-3-1251, but since there is no such node (and thus no pcsd to contact), that failed. Is there a way I can safely remove this ghost node from the cluster? I can provide logs from pacemaker or corosync as needed.