Currently, worker_groups will take a list of subnets and will create a single ASG spreading across all the requested subnets (availability zones) .
While this is a useful behavior in some conditions, it's incompatible with the use of cluster-autoscaler that does not support Auto Scaling Groups which span multiple Availability Zones
Assuming that a fair number of people will use Kubernetes alongside with the cluster-autoscaler, will it be possible to consider extending the feature?
I'm suggesting switching the current behavior under a flag like:
single_asg = true
And implement the new default behavior to create as many ASG as the number of declared subnets.
This is not currently impossible to achieve that with the existing code but it leads to a lot of code duplication for make it happen.
Example:
worker_groups = [
# Multi AZ ASG
{
subnets = "${join(",",data.aws_subnet_ids.private.ids)}"
asg_desired_capacity = 1
asg_max_size = 1
asg_min_size = 1
instance_type = "t3.small"
},
# Single AZ ASGs
{
subnets = "${data.aws_subnet_ids.private.ids[0]}"
asg_desired_capacity = 0
asg_max_size = 10
asg_min_size = 0
instance_type = "t3.small"
},
{
subnets = "${data.aws_subnet_ids.private.ids[1]}"
asg_desired_capacity = 0
asg_max_size = 10
asg_min_size = 0
instance_type = "t3.small"
},
{
subnets = "${data.aws_subnet_ids.private.ids[2]}"
asg_desired_capacity = 0
asg_max_size = 10
asg_min_size = 0
instance_type = "t3.small"
},
Interesting issue!
Assuming that a fair number of people will use Kubernetes alongside with the cluster-autoscaler
I would say almost everyone will be running the cluster-autoscaler.
cluster-autoscaler that does not support Auto Scaling Groups which span multiple Availability Zones
This is the first I've heard of this and I have always run the cluster-autoscaler with multi-AZ ASGs. But I've never seen rebalancing happening or an instance terminated by the ASG (unless EC2 healthcheck fails).
take a list of subnets and will create a single ASG spreading across all the requested subnets (availability zones)
Interesting idea!
I'm keen to get other opinions here as I've never seen or heard of this issue before.
Also, couldn't this problem simply be avoided by passing suspended_processes=["AZRebalance"]? What are the implications of that?
Also, couldn't this problem simply be avoided by passing
suspended_processes=["AZRebalance"]? What are the implications of that?
This is a good point. I guess yes that would prevent the issues with rebalancing.
Like I said, it's not blocking at the moment but I still think that adding the flexibility could be interesting for the users of the module.
I would say this problem isn't avoided just by passing suspended_processes=["AZRebalance"], but it is in certain cases.
If you have AZ aware workloads, CA can request an ASG to schedule an instance, but the ASG will schedule it into a zone that does not meet the requirements of the pod.
The issue has been explained/addressed by people who are much more intimate with CA and the use cases.
https://github.com/kubernetes-retired/contrib/pull/1552#discussion_r75532949
https://github.com/weaveworks/eksctl/pull/647#issuecomment-475923438
If you have AZ aware workloads, CA can request an ASG to schedule an instance, but the ASG will schedule it into a zone that does not meet the requirements of the pod.
I think this is helped in 1.12: https://kubernetes.io/blog/2018/10/11/topology-aware-volume-provisioning-in-kubernetes/
An common example is pods that have an EBS volume.
But I don't think it's related to this issue? Whether we have an ASG per AZ or a single ASG, I don't think CA or pre 1.12 k8s can help with this.
@mascah is correct, use of the cluster-autoscaler with AutoScaling Groups which span more than one Availability Zone is unsupported.
A separate ASG should be created per zone. As noted, this is primarily because EBS are scoped to a single zone.
But I don't think it's related to this issue? Whether we have an ASG per AZ or a single ASG, I don't think CA or pre 1.12 k8s can help with this.
When there is an ASG per AZ, cluster-autoscaler can increase the desired capacity in the group which is located in the appropriate zone. I believe it goes something like this:
The 1.12 _Topology-Aware Volume Provisioning_ comes at this from the other direction which primarily helps when the cluster-autoscaler is not in use.
_This new setting instructs the volume provisioner to not create a volume immediately, and instead, wait for a pod using an associated PVC to run through scheduling_
The default remains as before:
By default, the Immediate mode indicates that volume binding and dynamic provisioning occurs once the PersistentVolumeClaim is created
A separate ASG should be created per zone.
Unless you suspend AZRebalance (which is now the default in this module), and don't use PVCs. Which I suspect is most use cases.
We have a large PVC for prometheus in 1 cluster so we defined a second worker group like this with a label for scheduling in k8s:
worker_groups_launch_template = [
{
name = "prom-1"
instance_type = "r5.4xlarge"
kubelet_extra_args = "--node-labels=kubernetes.io/prometheus=true"
subnets = "${element(module.vpc1.private_subnets, 1)}"
}
]
I'm suggesting switching the current behavior under a flag like: single_asg = true
Maybe do the reverse so as not to break compatibility and see if there's a clean way of doing it in TF? All the worker group related code is quite complicated at the moment 馃槵
worker_groups_launch_template = [ { name = "prom-1" instance_type = "r5.4xlarge" kubelet_extra_args = "--node-labels=kubernetes.io/prometheus=true" subnets = "${element(module.vpc1.private_subnets, 1)}" } ]
That's a reasonable work around for the time being. I'm also optimistic that CA supports multi AZ ASGs before we need to dump a bunch more worker group related code here to be able to handle it 馃槃
It is required to have a separate ASG for each available zone configured in an EKS cluster. If you like, try to kill a pod or increate your workloads, to see whether the cluster would autoscale.
This is still needed, as during autoscaling for resources pinned to AZ-A causes the autoscaler to have a chance of spawning the additional capacity in another AZ, as it does not have the ability to scale based on zone when multiple AZs are in an ASG.
This is not currently impossible to achieve that with the existing code but it leads to a lot of code duplication for make it happen.
To avoid code duplication in worker_groups, we ended up doing this:
worker_groups = [
for subnet in data.aws_subnet_ids.private.ids :
{
subnets = [subnet],
asg_desired_capacity = 3,
instance_type = "m5.large",
ami_id = data.aws_ami.eks_worker.id
}
]
It is also useful in case you don't know how many subnets you're going to have initially.
we ended up doing this:
Nice 馃槂
Good to see this thread. Any plan to natively integrate that in the module without any need for "for loops"?
Do we need to? That solution by @mgar is very elegant?
I'd also need this, but I'll (hopefully) try @mgar's solution over the next few days and report back. :+1:
Quick update, @mgar's seems to be working well for us 馃
Great. I think we can close this issue then, right?
Most helpful comment
To avoid code duplication in
worker_groups, we ended up doing this:It is also useful in case you don't know how many subnets you're going to have initially.