Tell us about your request
What do you want us to build?
Support for upgrading an existing EKS cluster provisioned by CloudFormation rather than requiring replacement
Which service(s) is this request for?
EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
What outcome are you trying to achieve, ultimately, and why is it hard/impossible to do right now? What is the impact of not having this problem solved? The more details you can provide, the better we'll be able to understand and solve the problem.
I'm trying to upgrade an EKS cluster between Kubernetes versions without replacing the cluster. Replacement introduces risk because its behavior is not well defined (i.e., is the etcd state migrated? Backups? What about requests that are in-flight when the changeover happens?). The existing behavior would also likely require rolling the worker nodes, since the cluster API endpoint would change, unless you put it behind a CNAME or something.
Instead, CloudFormation should simply upgrade the cluster via the API that is already available for doing so, which both the AWS CLI and Terraform support.
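For reference, the in-place upgrade is already a single call with the AWS CLI (the cluster name and target version below are placeholders):

```
# Kick off an in-place control-plane upgrade via the existing EKS API.
aws eks update-cluster-version \
  --name my-cluster \
  --kubernetes-version 1.12

# The call is asynchronous; poll the update it returns until it succeeds.
aws eks describe-update \
  --name my-cluster \
  --update-id <update-id-from-previous-output>
```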
Are you currently working around this issue?
Yes
How are you currently solving this problem?
Managing the EKS cluster with Terraform
Additional context
Anything else we should know?
Attachments
If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)
@dcherman EKS supports in-place cluster upgrades via the EKS API (https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html) and worker node updates via CloudFormation (https://docs.aws.amazon.com/eks/latest/userguide/update-stack.html) - shipped as part of #21.
Does this resolve your issue or are you thinking of additional/different functionality for cluster upgrades?
@tabern So part of what you can do with CloudFormation is specify the Kubernetes version that you want. If you change that in your template and re-apply the stack, it currently requires replacement of the resource.
What I'm proposing is that CloudFormation should use the EKS API internally to perform these upgrades rather than replacing the resource.
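For example, here is a sketch of how the replacement shows up before anything is touched, using a change set (the stack, change-set, and template names are placeholders):

```
# Create a change set instead of updating directly so the plan is visible.
aws cloudformation create-change-set \
  --stack-name my-eks-stack \
  --change-set-name bump-k8s-version \
  --template-body file://cluster.yaml \
  --capabilities CAPABILITY_IAM

# The EKS cluster resource is listed with Replacement: True
# when only the Kubernetes version changed.
aws cloudformation describe-change-set \
  --stack-name my-eks-stack \
  --change-set-name bump-k8s-version
```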
Got it - so the idea is you can update an entire cluster, including nodes, with a single CF stack update?
Exactly; I want to avoid creating and updating clusters using different methods, since the CloudFormation template is no longer the source of truth if you're updating the cluster outside of it.
+1 for this
Ok - good info. Thanks!
@dcherman if you have used eksctl to provision your clusters, we're actively working on the list of features necessary to make this a little easier; there may be some overlap where you could use the APIs being built (or already built) to help reduce some of this time.
Check out - https://github.com/weaveworks/eksctl/issues/348 for more details.
@christopherhein is the goal/recommendation from Amazon for people to use eksctl to create and manage EKS clusters instead of CloudFormation?
@christopherhein I'm actively monitoring eksctl and am actually looking for ways to contribute there :)
That said, eksctl builds clusters using CloudFormation internally, and this issue was actually filed as a result of a discussion we were having in the #eks Slack channel about using eksctl with GitOps. So this is actually a prerequisite for implementing cluster upgrades (correctly) in eksctl, without having it draw outside the lines of CloudFormation and hit the upgrade API directly.
@christopherhein is the goal/recommendation from Amazon for people to use eksctl to create and manage EKS clusters instead of CloudFormation?
It's an option; we have contributed a handful of things to eksctl and have been working closely with Weaveworks and @errordeveloper on the project. There are other ways too: for example, if your organization uses Terraform, there are HashiCorp-supported ways of deploying.
@dcherman check out https://github.com/weaveworks/eksctl/issues/19 if you haven't seen it; at one point we were discussing using the ClusterAPI functions to support this style of deployment. There's still a lot to do if you want to help. :)
@christopherhein understood that it's an option; I just wanted to confirm it isn't meant as a replacement. The OP was about a better way to do cluster updates in CloudFormation, and I just didn't want that to get lost. I'm very intrigued by how much AWS has invested in eksctl; it might be more of a preferred way than even using CF directly anymore... something to ponder over 🤔 thanks as always!
CloudFormation now supports cluster upgrades!
What's new: https://aws.amazon.com/about-aws/whats-new/2019/03/amazon-eks-now-supports-kubernetes-version-1-12-and-cluster-vers/
Documentation: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-eks-cluster.html
@tabern - To clarify, do you mean that if I am managing my EKS cluster in CloudFormation, and it is 1.11 right now... then I update the template to 1.12, and also update my EKS ASG nodegroup AMI to the latest version, it will be Kubernetes-aware, and will upgrade the nodes and add taints like NoExecute automatically, so Pods drop off the terminating instances automatically?
Or will it just be the equivalent of pressing the 'upgrade cluster' button in the console, where it just controls the upgrade/rollout on the masters?
Because this ticket was specifically intended (at least in my reading?) to be the former. The latter is great and all... but there's still a huge pain point in rolling out the AMI update, because we basically have to build our own automation to cleanly and safely upgrade the EKS NodeGroup.
@geerlingguy pretty sure the CF upgrade only applies to the control plane. I notice when AWS says 'cluster' they are often only thinking of the bit they manage! 😄 And I think that was what this ticket was about, because before with CF, if you changed the EKS version, CF would delete and recreate the control plane. Not what you want! 😢
For the worker nodes, users can do anything they want, including any custom AMIs, so it wouldn't easily be possible for CF to identify the AMI to use to upgrade nodes in the general case. CF/EKS doesn't actually know which ASGs are relevant to the cluster, just which instances have registered, which further complicates any possible upgrade.
One option, not just for EKS, is to bring up a new, upgraded node group ASG. Then once it is stable, drain the old node group nodes, and then delete that node group ASG. If you are using eksctl you can do this with roughly:
```
eksctl create nodegroup --cluster foo --name new
eksctl drain nodegroup --cluster foo --name old
eksctl delete nodegroup --cluster foo --name old
```
There is also discussion of adding a --replaces option to create nodegroup so that can all be a one-step process. weaveworks/eksctl#443
If you just want to update the AMI in your ASG and let it roll the update out, then you can run an auto-drain DaemonSet like kube-aws uses. It watches for ASG and Spot Fleet terminations and auto-drains the node before it actually gets terminated. With that in place you can do a regular ASG rolling update of the AMI.
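Roughly, the per-node drain loop amounts to something like this sketch; it assumes the node can see spot termination notices via instance metadata and has a kubeconfig allowed to drain itself (the real kube-aws DaemonSet covers more cases, e.g. ASG lifecycle hooks):

```
#!/bin/sh
# Hypothetical auto-drain loop; assumes hostname matches the node name.
NODE_NAME=$(hostname)
while true; do
  # IMDS returns 200 on this path once a spot termination is scheduled.
  CODE=$(curl -s -o /dev/null -w '%{http_code}' \
    http://169.254.169.254/latest/meta-data/spot/termination-time)
  if [ "$CODE" = "200" ]; then
    kubectl drain "$NODE_NAME" --ignore-daemonsets --delete-local-data --force
    break
  fi
  sleep 5
done
```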
@geerlingguy The feature that we shipped today is the latter. When you update the version via CloudFormation, it triggers the updateClusterVersion API to begin the cluster update process.
Instead, CloudFormation should simply upgrade the cluster via the API that is already available for doing so, which both the AWS CLI and Terraform support.
What you describe makes a lot of sense:
then I update the template to 1.12, and also update my EKS ASG nodegroup AMI to the latest version, it will be Kubernetes-aware, and will upgrade the nodes and add taints like NoExecute automatically, so Pods drop off the terminating instances automatically
The functionality you (and @whereisaaron) are describing is a bit more complex and is most similar to https://github.com/aws/containers-roadmap/issues/139
I'm having an issue with the current implementation of this feature.
_Scenario 1_
The default behaviour of the CloudFormation EKSCluster resource is to create a cluster with the latest available Kubernetes version if no version property is specified. If you then explicitly specify that same version in a later version of the CloudFormation template, the stack update fails: because the cluster is already on the version being specified, the EKSCluster resource errors with "No Updated To be Performed". The error itself is correct, but it means we are unable to lock down the version in a newer CloudFormation template.
_CF Templates: https://gist.github.com/vincentheet/e826e39d0c47cdb79310866cccce2acd_
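Condensed, the repro is just two deploys of the templates in the gist (the stack and file names here are placeholders):

```
# Template v1: AWS::EKS::Cluster with no Version property -> gets latest (1.12).
aws cloudformation deploy --stack-name eks-repro \
  --template-file v1.yaml --capabilities CAPABILITY_IAM

# Template v2: identical, except Version: "1.12" is now pinned.
# The cluster is already on 1.12, so the resource errors instead of no-oping.
aws cloudformation deploy --stack-name eks-repro \
  --template-file v2.yaml --capabilities CAPABILITY_IAM
```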
_Scenario 2_
If you initially create an EKSCluster with the version property set to 1.11 and want to update this cluster to 1.12 with a new CloudFormation template, the CloudFormation stack can end up in an erroneous/deadlocked state if another resource in the new template causes the whole CF stack to roll back. When the EKSCluster resource is successfully upgraded from 1.11 to 1.12 but another resource in the same CF stack fails to update, the EKSCluster then tries to roll back. The rollback on the EKSCluster fails with the error "Update failed because of Kubernetes version is required". Since this rollback is not supported by EKS, the CF stack ends up in an error state. When you then try to roll out a fixed/correct CF template, the EKSCluster update fails because it has already been updated, with the error: "The following resources failed to update: EksCluster".
_CF Templates: https://gist.github.com/vincentheet/f4047c3bb1461d9f05430cea1b74d681_
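For what it's worth, the only way I've found to get the stack out of this state (my own workaround, not an official fix) is to resume the rollback while skipping the cluster resource:

```
# Resume the failed rollback but skip the EKSCluster logical resource,
# accepting that its real version now differs from the stack's view.
aws cloudformation continue-update-rollback \
  --stack-name my-eks-stack \
  --resources-to-skip EksCluster
```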
_Suggested solution_
When an EKSCluster resource is asked to update its version from CloudFormation, please verify whether the EKSCluster is already on the requested version. For example, if the EKSCluster is already on 1.12, ignore the update request and report a successful state to CloudFormation instead of an error. That way, other resources in the same CloudFormation stack can still be updated.
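Sketched with the CLI for illustration (the cluster name and DESIRED_VERSION are placeholders), the check would amount to:

```
# Only call the update API when the running version actually differs.
CURRENT=$(aws eks describe-cluster --name my-cluster \
  --query 'cluster.version' --output text)
if [ "$CURRENT" = "$DESIRED_VERSION" ]; then
  echo "already on $DESIRED_VERSION - report success to CloudFormation"
else
  aws eks update-cluster-version --name my-cluster \
    --kubernetes-version "$DESIRED_VERSION"
fi
```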
FYI: I opened a case with support, but they mentioned it would be good if I posted my issue here as well.
It would be great if this issue can be fixed.
@vincentheet: "I'm having an issue with the current implementation of this feature."
This is also impacting our automated rollout of EKS 1.12 (and controlling our version deployment automation). We are experiencing the same issue as shown above.
+1
I was able to reproduce @vincentheet's issue #2, and you cannot update the stack anymore once it is in this state.
Here is the error I see.
Update failed because of Unsupported Kubernetes minor version update from 1.12 to 1.12 (Service: AmazonEKS; Status Code: 400; Error Code: InvalidParameterException)
@tabern Do we need to create a new issue since this one is closed for it to be addressed?
@qthuy sorry for the delay on this - we are taking a look at this
How is this still a thing, wth? It's a major bug, reported almost a year ago now, still not fixed.
This breaks CF stacks that launch EKS Clusters in a really bad way.
Leaving this unfixed for a year is just forcing your customers to look for alternatives to CF such as Terraform.
I was also able to reproduce @vincentheet's issue #2, and cannot update the stack anymore once it is in this state.
Update failed because of Unsupported Kubernetes minor version update from 1.13 to 1.13 (Service: AmazonEKS; Status Code: 400; Error Code: InvalidParameterException)
This error should not be thrown to fail stack update.
I contacted support, and they pretty much told me to use other tools to manage EKS 🤦‍♂️ I guess that's how much AWS is going to focus on CFN; time to drop it completely.
@tabern Any updates? Can this issue please be reopened while you/AWS are/is investigating?
@iAnomaly @vincentheet @jia2 can you please open a new issue to track this? My understanding here is that the CFN template may not be looking at the patch version during the update and is thus failing. I want to apologize for any anguish this has caused; we want CFN to be a first-class citizen for EKS, and we have work lined up for end of 2019/early 2020 to address this and other areas where we can improve CFN's ability to manage EKS clusters.
@tabern I opened a new issue as requested: https://github.com/aws/containers-roadmap/issues/497 It's good to hear that CFN support is going to be improved.
Thanks @vincentheet - I'll pull that onto the roadmap and we can track status there.
Hello @tabern,
I am planning to deploy an EKS cluster with the Quick Start, but I want to know about future upgrade-related problems and changes in the environment. How do I do further upgrades and migrations?