Managed Kubernetes worker nodes will allow you to provision, scale, and update groups of EC2 worker nodes through EKS.
This feature fulfills https://github.com/aws/containers-roadmap/issues/57
As a quick note, can we make sure that this will interact/play nicely with cluster-autoscaler? If we can get managed, autoscaling worker nodes, this would be amazing.
Along with draining nodes in an upgrade situation.
Shouldn't this one be solved as part of implementing Fargate for EKS? https://github.com/aws/containers-roadmap/issues/32
You could use Virtual Kubelet
Fargate for EKS is a different thing @dnobre, with Fargate there are no worker nodes to manage. So this issue about managing worker nodes is not relevant to Fargate.
@tabern being able to support cluster-autoscaler is important to people. At the moment it expects to manipulate ASGs through that API. But if you add an EKS API, then please contribute a patch or fork to cluster-autoscaler so it can use the new API.
If you provide your own autoscaling instead, it has to be aware of the cluster workload and all ASGs; you'd need a k8s service or daemonset to provide custom metrics to the ASG, and, when there are many ASGs, some way to choose which one to scale up/down next, as cluster-autoscaler does.
cluster-autoscaler has to use some tricks to scale to/from zero nodes in an ASG, because it doesn't know what a node would look like when there are none. An improvement the EKS API could provide would be to expose what a node would be (instance type, AZ, node labels, node taints, tags) when the node group is scaled to zero (a sketch of the current tag-based workaround follows below).
cluster-autoscaler also has trouble when scaling up multi-AZ ASGs, because it can't specify which AZ the new node will be in (e.g. when the un-scheduled workload is AZ-specific). The ability to specify an AZ when scaling up an EKS node group would be great.
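For anyone hitting the scale-from-zero limitation today, cluster-autoscaler can already read node-template hints from ASG tags; a rough CloudFormation sketch (the launch configuration, subnet parameter, cluster name, label, and taint below are illustrative placeholders):

```yaml
NodeGroupASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "0"
    MaxSize: "10"
    VPCZoneIdentifier: !Ref NodeSubnets              # hypothetical List<AWS::EC2::Subnet::Id> parameter
    LaunchConfigurationName: !Ref NodeLaunchConfig   # hypothetical launch configuration
    Tags:
      # Opt this group into auto-discovery by cluster-autoscaler
      - Key: k8s.io/cluster-autoscaler/enabled
        Value: "true"
        PropagateAtLaunch: true
      - Key: k8s.io/cluster-autoscaler/my-cluster
        Value: owned
        PropagateAtLaunch: true
      # Tell cluster-autoscaler what a node would look like while the group sits at zero
      - Key: k8s.io/cluster-autoscaler/node-template/label/workload
        Value: batch
        PropagateAtLaunch: true
      - Key: k8s.io/cluster-autoscaler/node-template/taint/dedicated
        Value: batch:NoSchedule
        PropagateAtLaunch: true
```

This only covers labels and taints, though; having an EKS node group API expose instance type, AZ, and tags natively, as described above, would be much cleaner.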
@whereisaaron interesting, that's exactly what I would consider Fargate for EKS to be. Since it was never released two years ago, its implementation is pretty hypothetical, but theoretically I would expect an endpoint for your kubeconfig and the ability to deploy via kubectl apply or helm install.
I wouldn't expect things like "task" definitions because that's what Fargate for ECS already is.
How would "managed worker nodes" be any different?
Granted, the fact that "Fargate for EKS" was never released means we are all just spitballing here.
With Fargate, whether ECS Fargate or EKS Fargate, there are no worker nodes. That's why you use a Fargate solution: so you do not have to manage worker nodes. So this issue has no overlap with a Fargate product.
@cdenneen not sure I understand, but what you describe sounds correct, just like EKS endpoint, except no (real) worker nodes, just a virtual-kubelet running as a sidecar to the Pod. The virtual-kubelet and Pod run who-knows-where, because the instances they run on are not our problem with a Fargate solution.
Any updates on this issue?
@groodt coming soon.... we'll be sure to update when there are updates to share!
Will this feature add the capability to create worker node groups from the AWS console (UI)?
@ejlp12 yes.
@tabern will there be an option to add a userdata script or otherwise modify the instances?
I am curious about logging aggregation as well for managed workers. Any details on how we can aggregate logs as part of this feature?
@lilley2412 not at launch, but we plan to add this in the future.
@pfremm yes. You'll be able to use EC2 Autoscaling for reporting group-level metrics. Since managed nodes are standard EC2 instances that run in your account, you will be able to implement any log forwarding/aggregation tooling that you are using today, such as FluentBit/S3 and Fluentd/CloudWatch.
@tabern will this support windows worker nodes?
Who manages security patches or addresses CVEs on these managed worker nodes? Will this still fall under the "Security in the Cloud" customer responsibility?
Released GA 11/18 👍
Can we have a link to the docs?
Hi! The documentation is deploying now. It should be available shortly, and I'll update with a link here when it is.
We're excited to announce that Amazon EKS Managed Node Groups are now generally available!
With Amazon EKS managed node groups you don’t need to separately provision or connect the EC2 instances that provide compute capacity to run your Kubernetes applications. You can create, update, or terminate nodes for your cluster with a single command. Nodes run using the latest EKS-optimized AMIs in your AWS account while node updates and terminations gracefully drain nodes to ensure your applications stay available.
Today, EKS managed node groups are available for new Amazon EKS clusters running Kubernetes version 1.14 with platform version eks.3. You can also update clusters (1.13 or lower) to version 1.14 to take advantage of this feature. Support for existing version 1.14 clusters is coming soon.
Learn more
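For those using eksctl, a minimal config sketch that creates a cluster with a managed node group looks roughly like this (cluster name, region, and sizes are illustrative; assumes an eksctl release with managed node group support):

```yaml
# cluster.yaml -- apply with `eksctl create cluster -f cluster.yaml`
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster        # illustrative
  region: us-west-2       # illustrative
managedNodeGroups:
  - name: managed-ng-1
    instanceType: m5.large
    minSize: 2
    maxSize: 4
    desiredCapacity: 2
    labels:
      role: workers
```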
@tabern congrats on the release!
Is CF support in a future release or is doco just pending updates?
Ready to use this but can't use without CF support :/
@tabern and the entire EKS team, thanks for working hard on this; it's a very good step in the right direction. I am assuming I can still run my user data (bootstrap). Also, as @robgott mentioned, CF needs to support it. And since someone mentioned cluster-autoscaler, I'm assuming that should continue to work. We have the problem of "how do I keep a node hot to provision additional pods" and were thinking of using https://github.com/helm/charts/tree/master/stable/cluster-overprovisioner.
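For anyone curious, the overprovisioner boils down to low-priority placeholder pods that real workloads preempt, which in turn makes cluster-autoscaler add a node; a rough sketch of that pattern (names and resource requests are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                     # lower than the default (0), so these pods are preempted first
globalDefault: false
description: "Placeholder pods that keep spare capacity warm."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: kube-system
spec:
  replicas: 2                 # roughly how much headroom to keep
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: k8s.gcr.io/pause:3.1
          resources:
            requests:
              cpu: "1"        # sized to reserve the headroom you want per replica
              memory: 500Mi
```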
I'm not seeing anything in the docs regarding user data. Is this available right now with the managed worker nodes?
Does something need to be done to enable this on existing clusters?
Latest EKS 1.14

Also, as @nxf5025 mentions, doesn't look like any ability to pass in userdata or kubelet flags?
Also, will there be support for spot instances?
Thanks all! We're pretty excited to introduce this new feature.
@robgott @pc-rshetty CloudFormation support for managed node groups is there today, it's just that the documentation is taking a bit longer to publish than we had originally expected.
Specifically, EKS managed node groups introduce a new resource type "AWS::EKS::Nodegroup" and an update to the existing resource type "AWS::EKS::Cluster" to add ClusterSecurityGroupId in CloudFormation. The documentation updates for these changes will be published by 11/21.
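For example, a managed node group can be declared in a template roughly like this (the role, subnets, and sizes below are illustrative placeholders):

```yaml
Resources:
  ManagedNodeGroup:
    Type: AWS::EKS::Nodegroup
    Properties:
      ClusterName: !Ref EksCluster            # hypothetical AWS::EKS::Cluster resource
      NodeRole: !GetAtt NodeInstanceRole.Arn  # hypothetical IAM role with the worker node policies
      Subnets:
        - !Ref SubnetA                        # hypothetical subnets
        - !Ref SubnetB
      InstanceTypes:
        - m5.large
      AmiType: AL2_x86_64
      ScalingConfig:
        MinSize: 2
        DesiredSize: 2
        MaxSize: 4
```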
@pc-rshetty Cluster Autoscaler should continue to work just like it does today. The biggest change from our end is that we tag every node for auto discovery by cluster autoscaler. Overprovisioner should work. Seems like a helm chart that basically implements the method described here?
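For reference, the hookup on the cluster-autoscaler side is the usual tag-based auto-discovery flag; the relevant fragment of its Deployment spec looks roughly like this (cluster name and image tag are illustrative):

```yaml
containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/cluster-autoscaler:v1.14.7   # tag illustrative; match your cluster's minor version
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
```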
@nxf5025 @MarcusNoble today you cannot pass this to managed node groups. However! we're planning to add this in the future as part of support for EC2 Launch Templates https://github.com/aws/containers-roadmap/issues/585
Yes, we also will be working on spot support - tracking in https://github.com/aws/containers-roadmap/issues/583
The other feature we're currently tracking on the roadmap is Windows Support (https://github.com/aws/containers-roadmap/issues/584) but feel free to add more if there are important features you think we should be looking at.
Are managed Ubuntu node groups also being worked on or should that be added to the roadmap? That was mentioned in the blog post when comparing EKS API with eksctl, it's a feature we need.
In addition to spot instances, being able to utilise mixed instances policy, as per https://github.com/kubernetes/autoscaler/pull/1886, i.e. t3.large and t3a.large or m5.large and m5d.large etc. This is to increase the probability of a successful instance fulfilment. We are currently using this functionality to good effect and would need to have the same ability with managed worker nodes, along with the ability to specify userdata.
In the UI, this could simply be represented by being able to select multiple instance types and, preferably, to sort them in order of preference, which is how launch template mixed instances policy and overrides currently work.
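In CloudFormation terms, the self-managed equivalent we use today looks roughly like this (the launch template and subnet references are hypothetical placeholders):

```yaml
MixedInstanceNodeASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "1"
    MaxSize: "10"
    VPCZoneIdentifier: !Ref NodeSubnets              # hypothetical subnet list parameter
    MixedInstancesPolicy:
      LaunchTemplate:
        LaunchTemplateSpecification:
          LaunchTemplateId: !Ref NodeLaunchTemplate  # hypothetical launch template
          Version: !GetAtt NodeLaunchTemplate.LatestVersionNumber
        Overrides:                                   # candidate instance types
          - InstanceType: t3.large
          - InstanceType: t3a.large
      InstancesDistribution:
        OnDemandBaseCapacity: 0
        OnDemandPercentageAboveBaseCapacity: 0       # 100% Spot above the base
        SpotAllocationStrategy: lowest-price
```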
Just one question: can this feature utilise spot instances? I could not find it in the documentation.
@omerfsen we're tracking spot support in https://github.com/aws/containers-roadmap/issues/583
@drewhemm we're considering mixed instance groups to be part of spot support; agreed that without them spot will be difficult to use properly.
@JanneEN that's a good call out, we'd love to make this happen. Thanks for adding as https://github.com/aws/containers-roadmap/issues/588
CloudFormation documentation for EKS Managed Node Groups is now published - https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-eks-nodegroup.html
@tabern Doesn't look like docs have been updated still. That link redirects me to: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html
Interesting. The link worked when I clicked it 12 hours ago...
It's working for me now! :+1:
@tabern How are rolling updates supposed to work? Draining nodes actually works, but apparently leads to downtime.
Let's say I have an existing node group and want to rotate the nodes. To do this (manually), I would replace the node group by creating a new one, waiting for it to become available, and then deleting the old one afterwards. When doing this, I can see that the nodes get drained before the instances get terminated. However, the running pods are more or less terminated simultaneously, which leads to downtime.
In terraform, the mechanism is basically the same, leading to the same result.
Am I doing something wrong?
Edit:
I can also see this behavior when just scaling an existing node group (e.g. by scaling from 3 nodes to 6 and back from 6 to 3).
2nd edit:
Downtime in this case means that I can see some failing requests.
@splieth look into pod disruption budgets; that's what you need to keep all pods from terminating at once.
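A minimal sketch of such a budget (the PDB API is still policy/v1beta1 on Kubernetes 1.14; the app name and threshold are placeholders):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2          # never evict below 2 ready pods during a drain
  selector:
    matchLabels:
      app: my-app
```

With a budget in place, the drain evicts pods gradually instead of all at once, as long as the deployment has enough replicas to satisfy it.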
When will we see support for existing 1.14 clusters? My clusters are currently stuck on platform version eks.2.
@groodt I don't think there's much hope there. You could just create a new cluster and move your workloads there.
Hopefully everyone realizes that you're not supposed to build pet clusters either.
I don't see why it couldn't support existing clusters. A little more involved maybe, but a cluster can have multiple ASGs associated with it, so the new managed nodes could be brought up alongside the existing self-managed ones, and the self-managed ones removed once the new nodes are stable.
This isn't a new Kubernetes version. Presumably it's just some additional process running in the control plane that is aware of the ASGs, and that's it.
I saw this in the original announcement:
Today, EKS managed node groups are available for new Amazon EKS clusters running Kubernetes version 1.14 with platform version eks.3. You can also update clusters (1.13 or lower) to version 1.14 to take advantage of this feature. Support for existing version 1.14 clusters is coming soon.
So presumably they do plan to upgrade existing clusters, I'm curious on the timelines. If it's too long, sure I can create a new cluster and migrate workloads easily enough, but it's still annoying to do without downtime.
My 1.14 clusters are still stuck in platform version eks.2. The newest platform version is eks.7 - seems that the rollout of new platform versions for existing clusters is really slow.
While it's fair to expect new clusters for new Kubernetes versions, it seems a bit excessive to create new clusters for new platform versions.
What expectations can we have on the timeline for updates of platform version for existing clusters?
FWIW, my clusters weren’t updating either, but when I updated all my workers to a newer AMI ahead of the control plane version, all my control planes updated within 48 hours. Coincidence?
Been trying out managed worker nodes and, unless I am missing something, I have no ability to see kubelet-related logs unless I provision with an SSH key?
@pfremm I didn't find another method apart from deploying the SSM agent as a DaemonSet and accessing the logs via SSM rather than SSH. But IMHO that's the better option, since no SSH key needs to be shared.
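A very rough sketch of that DaemonSet approach, not a hardened setup, assuming the public amazon/amazon-ssm-agent image and that the node IAM role already has the AmazonSSMManagedInstanceCore policy attached:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ssm-agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: ssm-agent
  template:
    metadata:
      labels:
        name: ssm-agent
    spec:
      hostNetwork: true              # let the agent register using the node's instance identity
      containers:
        - name: ssm-agent
          image: amazon/amazon-ssm-agent:latest
          securityContext:
            privileged: true
          volumeMounts:
            - name: host-var-log     # expose node logs to SSM sessions (read-only)
              mountPath: /host/var/log
              readOnly: true
      volumes:
        - name: host-var-log
          hostPath:
            path: /var/log
```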
@pfremm I suggest you set up container insights to ship logs and metrics into CloudWatch Logs, that worked great for me.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-EKS.html
All logs from pods, the kubelet, and kube-proxy are then shipped and viewable in CloudWatch. You can ship them further into Elasticsearch as well, so that's also an option if you don't like CloudWatch queries.
@tabern will there be an option to add a userdata script or otherwise modify the instances?
Is there any update on this?
I couldn't find any reference in https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-eks-nodegroup.html
This is quite critical for setting up nodes for different purposes.
Hi @pdonorio, that feature request is being tracked in this issue #596
Out of curiosity, does the current managed node group setup specify any kind of flags for --kube-reserved and friends? If so, what are the values based on? I know we'll be able to control those values once #596 has been addressed, but I'm wondering if managed nodegroups would be usable for us today.
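For context, the flags in question map onto kubelet configuration like the following (values are purely illustrative, not a claim about what managed node groups actually set):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:               # capacity held back for the kubelet, container runtime, etc.
  cpu: 100m
  memory: 300Mi
  ephemeral-storage: 1Gi
systemReserved:             # capacity held back for OS daemons
  cpu: 100m
  memory: 200Mi
evictionHard:
  memory.available: 200Mi
```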
Is there any support for setting taints on a nodegroup?
Is there any support for customizing the bootstrap script that is run on the nodegroup instances?