Terraform-aws-eks: Extend current worker_group ASG creation behavior (1 AZ per ASG)

Created on 15 Apr 2019 · 17Comments · Source: terraform-aws-modules/terraform-aws-eks

I have issues

I'm submitting a...

[ ] bug report
[x] feature request
[ ] support request
[ ] kudos, thank you, warm fuzzy

What is the current behavior?

Currently, worker_groups will take a list of subnets and will create a single ASG spreading across all the requested subnets (availability zones) .

While this is a useful behavior in some conditions, it's incompatible with the use of cluster-autoscaler that does not support Auto Scaling Groups which span multiple Availability Zones

Assuming that a fair number of people will use Kubernetes alongside with the cluster-autoscaler, will it be possible to consider extending the feature?

What is the desired behavior?

I'm suggesting switching the current behavior under a flag like:

single_asg = true

And implement the new default behavior to create as many ASG as the number of declared subnets.

Environment details

Affected module version: v2.3.1
OS: Linux
Terraform version: 0.11.13

Notes:

This is not currently impossible to achieve that with the existing code but it leads to a lot of code duplication for make it happen.

Example:

worker_groups = [
    # Multi AZ ASG
    {
      subnets              = "${join(",",data.aws_subnet_ids.private.ids)}"
      asg_desired_capacity = 1
      asg_max_size         = 1
      asg_min_size         = 1
      instance_type        = "t3.small"
    },
    # Single AZ ASGs
    {
      subnets              = "${data.aws_subnet_ids.private.ids[0]}"
      asg_desired_capacity = 0
      asg_max_size         = 10
      asg_min_size         = 0
      instance_type        = "t3.small"
    },
    {
      subnets              = "${data.aws_subnet_ids.private.ids[1]}"
      asg_desired_capacity = 0
      asg_max_size         = 10
      asg_min_size         = 0
      instance_type        = "t3.small"
    },
    {
      subnets              = "${data.aws_subnet_ids.private.ids[2]}"
      asg_desired_capacity = 0
      asg_max_size         = 10
      asg_min_size         = 0
      instance_type        = "t3.small"
    },

Source

jphuynh

👍8

Most helpful comment

This is not currently impossible to achieve that with the existing code but it leads to a lot of code duplication for make it happen.

To avoid code duplication in worker_groups, we ended up doing this:

worker_groups = [
    for subnet in data.aws_subnet_ids.private.ids :
    {
      subnets              = [subnet],
      asg_desired_capacity = 3,
      instance_type        = "m5.large",
      ami_id               = data.aws_ami.eks_worker.id
    }
  ]

It is also useful in case you don't know how many subnets you're going to have initially.

mgar on 22 Oct 2019

👍6

All 17 comments

Interesting issue!

Assuming that a fair number of people will use Kubernetes alongside with the cluster-autoscaler

I would say almost everyone will be running the cluster-autoscaler.

cluster-autoscaler that does not support Auto Scaling Groups which span multiple Availability Zones

This is the first I've heard of this and I have always run the cluster-autoscaler with multi-AZ ASGs. But I've never seen rebalancing happening or an instance terminated by the ASG (unless EC2 healthcheck fails).

take a list of subnets and will create a single ASG spreading across all the requested subnets (availability zones)

Interesting idea!

I'm keen to get other opinions here as I've never seen or heard of this issue before.

Also, couldn't this problem simply be avoided by passing suspended_processes=["AZRebalance"]? What are the implications of that?

max-rocket-internet on 15 Apr 2019

Also, couldn't this problem simply be avoided by passing suspended_processes=["AZRebalance"]? What are the implications of that?

This is a good point. I guess yes that would prevent the issues with rebalancing.

Like I said, it's not blocking at the moment but I still think that adding the flexibility could be interesting for the users of the module.

jphuynh on 15 Apr 2019

👍3

I would say this problem isn't avoided just by passing suspended_processes=["AZRebalance"], but it is in certain cases.

If you have AZ aware workloads, CA can request an ASG to schedule an instance, but the ASG will schedule it into a zone that does not meet the requirements of the pod.

The issue has been explained/addressed by people who are much more intimate with CA and the use cases.

https://github.com/kubernetes-retired/contrib/pull/1552#discussion_r75532949
https://github.com/weaveworks/eksctl/pull/647#issuecomment-475923438

mascah on 25 Apr 2019

👍3

If you have AZ aware workloads, CA can request an ASG to schedule an instance, but the ASG will schedule it into a zone that does not meet the requirements of the pod.

I think this is helped in 1.12: https://kubernetes.io/blog/2018/10/11/topology-aware-volume-provisioning-in-kubernetes/

An common example is pods that have an EBS volume.

But I don't think it's related to this issue? Whether we have an ASG per AZ or a single ASG, I don't think CA or pre 1.12 k8s can help with this.

max-rocket-internet on 26 Apr 2019

@mascah is correct, use of the cluster-autoscaler with AutoScaling Groups which span more than one Availability Zone is unsupported.

A separate ASG should be created per zone. As noted, this is primarily because EBS are scoped to a single zone.

But I don't think it's related to this issue? Whether we have an ASG per AZ or a single ASG, I don't think CA or pre 1.12 k8s can help with this.

When there is an ASG per AZ, cluster-autoscaler can increase the desired capacity in the group which is located in the appropriate zone. I believe it goes something like this:

Pod with PVC created
Volume Provisioner creates EBS in random zone
kube-scheduler schedules Pod into zone matching the volume
insufficient capacity in the group for that zone ==> cluster-autoscaler increases desired capacity.

The 1.12 _Topology-Aware Volume Provisioning_ comes at this from the other direction which primarily helps when the cluster-autoscaler is not in use.

_This new setting instructs the volume provisioner to not create a volume immediately, and instead, wait for a pod using an associated PVC to run through scheduling_

The default remains as before:

By default, the Immediate mode indicates that volume binding and dynamic provisioning occurs once the PersistentVolumeClaim is created

BrianChristie on 16 May 2019

A separate ASG should be created per zone.

Unless you suspend AZRebalance (which is now the default in this module), and don't use PVCs. Which I suspect is most use cases.

We have a large PVC for prometheus in 1 cluster so we defined a second worker group like this with a label for scheduling in k8s:

  worker_groups_launch_template = [
    {
      name                     = "prom-1"
      instance_type       = "r5.4xlarge"
      kubelet_extra_args       = "--node-labels=kubernetes.io/prometheus=true"
      subnets                  = "${element(module.vpc1.private_subnets, 1)}"
    }
  ]

max-rocket-internet on 16 May 2019

👍4

I'm suggesting switching the current behavior under a flag like: single_asg = true

Maybe do the reverse so as not to break compatibility and see if there's a clean way of doing it in TF? All the worker group related code is quite complicated at the moment 😬

max-rocket-internet on 16 May 2019

👍1

  worker_groups_launch_template = [
    {
      name                     = "prom-1"
      instance_type       = "r5.4xlarge"
      kubelet_extra_args       = "--node-labels=kubernetes.io/prometheus=true"
      subnets                  = "${element(module.vpc1.private_subnets, 1)}"
    }
  ]

That's a reasonable work around for the time being. I'm also optimistic that CA supports multi AZ ASGs before we need to dump a bunch more worker group related code here to be able to handle it 😄

mascah on 16 May 2019

🚀2

It is required to have a separate ASG for each available zone configured in an EKS cluster. If you like, try to kill a pod or increate your workloads, to see whether the cluster would autoscale.

Stephanzf on 2 Oct 2019

This is still needed, as during autoscaling for resources pinned to AZ-A causes the autoscaler to have a chance of spawning the additional capacity in another AZ, as it does not have the ability to scale based on zone when multiple AZs are in an ASG.

eperdeme on 8 Oct 2019

👍1

This is not currently impossible to achieve that with the existing code but it leads to a lot of code duplication for make it happen.

To avoid code duplication in worker_groups, we ended up doing this:

worker_groups = [
    for subnet in data.aws_subnet_ids.private.ids :
    {
      subnets              = [subnet],
      asg_desired_capacity = 3,
      instance_type        = "m5.large",
      ami_id               = data.aws_ami.eks_worker.id
    }
  ]

It is also useful in case you don't know how many subnets you're going to have initially.

mgar on 22 Oct 2019

👍6