Terraform-aws-eks: Changing worker instance type

Created on 9 Jan 2019 · 7 Comments · Source: terraform-aws-modules/terraform-aws-eks

I want to change the instance type of a worker group

I'm submitting a...

[*] bug report
[*] support request

What is the current behavior?

After running terraform apply, terraform reports:

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place
-/+ destroy and then create replacement

Terraform will perform the following actions:

  ~ module.eks.aws_autoscaling_group.workers
      launch_configuration:                      "test-eks-cluster-02019010915335535610000000c" => "${element(aws_launch_configuration.workers.*.id, count.index)}"

-/+ module.eks.aws_launch_configuration.workers (new resource required)
      id:                                        "test-eks-cluster-02019010915335535610000000c" => <computed> (forces new resource)
      associate_public_ip_address:               "false" => "false"
      ebs_block_device.#:                        "0" => <computed>
      ebs_optimized:                             "true" => "true"
      enable_monitoring:                         "true" => "true"
      iam_instance_profile:                      "test-eks-cluster20190109153353893600000008" => "test-eks-cluster20190109153353893600000008"
      image_id:                                  "ami-0a9006fb385703b54" => "ami-0a9006fb385703b54"
      instance_type:                             "t3.micro" => "t3.large" (forces new resource)
      key_name:                                  "" => <computed>
      name:                                      "test-eks-cluster-02019010915335535610000000c" => <computed>
      name_prefix:                               "test-eks-cluster-0" => "test-eks-cluster-0"
      root_block_device.#:                       "1" => "1"
      root_block_device.0.delete_on_termination: "true" => "true"
      root_block_device.0.iops:                  "0" => "0"
      root_block_device.0.volume_size:           "100" => "100"
      root_block_device.0.volume_type:           "gp2" => "gp2"
      security_groups.#:                         "1" => "1"
      security_groups.1557716484:                "sg-0000ab3ece5ffdca6" => "sg-0000ab3ece5ffdca6"
      user_data_base64:                          "base64 data" => "base 64 data"


Plan: 1 to add, 1 to change, 1 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

module.eks.aws_launch_configuration.workers: Creating...
  associate_public_ip_address:               "" => "false"
  ebs_block_device.#:                        "" => "<computed>"
  ebs_optimized:                             "" => "true"
  enable_monitoring:                         "" => "true"
  iam_instance_profile:                      "" => "test-eks-cluster20190109153353893600000008"
  image_id:                                  "" => "ami-0a9006fb385703b54"
  instance_type:                             "" => "t3.large"
  key_name:                                  "" => "<computed>"
  name:                                      "" => "<computed>"
  name_prefix:                               "" => "test-eks-cluster-0"
  root_block_device.#:                       "" => "1"
  root_block_device.0.delete_on_termination: "" => "true"
  root_block_device.0.iops:                  "" => "0"
  root_block_device.0.volume_size:           "" => "100"
  root_block_device.0.volume_type:           "" => "gp2"
  security_groups.#:                         "" => "1"
  security_groups.1557716484:                "" => "sg-0000ab3ece5ffdca6"
  user_data_base64:                          "" => "base63data"
module.eks.aws_launch_configuration.workers: Creation complete after 1s (ID: test-eks-cluster-020190109162625609700000001)
module.eks.aws_autoscaling_group.workers: Modifying... (ID: test-eks-cluster-02019010915341192810000000e)
  launch_configuration: "test-eks-cluster-02019010915335535610000000c" => "test-eks-cluster-020190109162625609700000001"
module.eks.aws_autoscaling_group.workers: Modifications complete after 0s (ID: test-eks-cluster-02019010915341192810000000e)
module.eks.aws_launch_configuration.workers.deposed: Destroying... (ID: test-eks-cluster-02019010915335535610000000c)
module.eks.aws_launch_configuration.workers.deposed: Destruction complete after 1s

If this is a bug, how to reproduce? Please include a code sample if relevant.

module "eks" {
  source       = "terraform-aws-modules/eks/aws"
  cluster_name = "test-eks-cluster"
  subnets      = ["subnet-1", "subnet-2"]

  tags = {
    Environment = "test"
  }

  worker_groups = [
    {
      instance_type = "t3.large" # changed from "t3.micro"
      asg_max_size  = 3
    }
  ]

  vpc_id = "vpc"
}

What's the expected behavior?

To change the EC2 instance type

Are you able to fix this problem and submit a PR? Link here if you have already.

Most likely no... 😁

Environment details

Affected module version: v2.0.0
OS: macOS Mojave
Terraform version: v0.11.10

Any other relevant info

This may be useful:

```
$ kubectl get pods --all-namespaces
NAMESPACE             NAME                                    READY   STATUS    RESTARTS   AGE
gitlab-managed-apps   install-helm                            0/1     Pending   0          1h
kube-system           aws-node-8qf8c                          1/1     Running   0          2h
kube-system           coredns-7554568866-fbv4x                1/1     Running   0          2h
kube-system           coredns-7554568866-nbfsl                1/1     Running   0          2h
kube-system           kube-proxy-v45wx                        1/1     Running   0          2h
kube-system           kubernetes-dashboard-5dd89b9875-f7wf8   0/1     Pending   0          1h
```

I think the Pending status is because of this:

```
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  3m51s (x362 over 63m)  default-scheduler  0/1 nodes are available: 1 Insufficient pods.
```

martimarkov

Most helpful comment

Ah OK, I'll try to explain then!

When you change the instance type and run TF, a new Launch Configuration (LC) is created with the new type and the Autoscaling Group (ASG) is updated to use the new LC. Nothing else changes at this point: existing instances keep running. Only when the ASG creates a new instance is the new LC used, and that instance will be of the new type.

If you want to replace all instances in the cluster with the new type, you need to scale up the number of instances in the ASG, then drain the old, smaller instances, then terminate them.

If you have pending pods that can't run because there is not enough CPU, you could also run the cluster-autoscaler. It will increase the number of instances in the cluster to accommodate new pods.

max-rocket-internet on 10 Jan 2019

👍3

All 7 comments

Micro is super tiny. It likely was never given enough resources for that pod to begin with. Start with a medium.

Also, changing the instance type of the existing ASG is likely going to do things you don't like. The better way to do it would be to make a second worker group with the different size, then either get rid of the first worker group or scale it to 0.

RothAndrew on 9 Jan 2019

This module does not re-create the ASG when making changes to the Launch Configuration. After making such changes you then need to recycle the instances.

Whether this is desired behaviour will likely get mixed responses. It gives more control over draining and migrating pods in a k8s-friendly manner without needing to run blue/green deployments. Unfortunately, as you've found, it also means that changes to the Terraform configuration create drift in the running environment.

> then either get rid of the first worker group or scale it to 0.

Removing or inserting items, apart from the last one, in an indexed resource in TF causes really bad things to happen: every later item shifts to a new index, so Terraform plans to modify or recreate the resources at those indices.

dpiddockcmp on 10 Jan 2019

👍1
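Putting the two comments above together, a second worker group can be appended to the end of the list while the first stays at index 0. The following is only a sketch in Terraform 0.11 syntax against module v2.0.0, reusing the placeholder values from the report. It assumes the `worker_group_count`, `asg_min_size` and `asg_desired_capacity` inputs as documented for that module line, so check the inputs of the version you actually run.

```hcl
# Sketch only, not the module authors' recommended procedure.
# Subnet and VPC IDs are the placeholders from the original report.
module "eks" {
  source       = "terraform-aws-modules/eks/aws"
  version      = "2.0.0"
  cluster_name = "test-eks-cluster"
  subnets      = ["subnet-1", "subnet-2"]
  vpc_id       = "vpc"

  tags = {
    Environment = "test"
  }

  # Assumption: module 2.x needs the list length passed explicitly.
  worker_group_count = "2"

  worker_groups = [
    {
      # Index 0: the original group is kept in the list, not removed.
      # Apply these zeros in a second step, after the old nodes are drained.
      instance_type        = "t3.micro"
      asg_min_size         = 0
      asg_desired_capacity = 0
      asg_max_size         = 0
    },
    {
      # Index 1: the new group, appended last so no existing index shifts.
      instance_type = "t3.large"
      asg_max_size  = 3
    }
  ]
}
```

A cautious order would be: apply with only the new group added, wait for its nodes to join, run `kubectl drain` on the old nodes, and only then apply the zeros for the first group.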

I don't understand. What's the actual problem here? The fact that you have pods in Pending state?

max-rocket-internet on 10 Jan 2019

The problem I was experiencing was that after changing the instance_type, the worker would still stay at the old type. I think the pods being in a Pending state (my best guess is memory constraints) is what kept the new Launch Configuration from spinning up a new EC2 instance. I'd expected a new instance to spin up so that the current worker could be drained and its pods migrated to the new one.

I just wasn't sure what the expected behavior is, or how to change the instance type in the future without downtime.

So maybe it's more a question of what the "battle tested" best practice is for changing the instance_type in the future.

martimarkov on 10 Jan 2019

Ah OK, I'll try to explain then!

If you want to replace all instances in the cluster with the new type, you need to scale up the number of instances in the ASG, then drain the old, smaller instances, then terminate them.

max-rocket-internet on 10 Jan 2019

👍3

Thanks!! Got it. This helps a lot. :)

—
MM
