Hey, I am getting an issue while updating an ECS service using Terraform.
When I release a new version of a task through Terraform, the service is updated with the new task definition, but when I view the service, the currently running task shows as [INACTIVE]; the service does not roll the running task over to the new version of the task. I'm wondering why it behaves like this, because when we update a service through the UI, it rolls the running tasks over to the new version one by one. The issue is in kicking off the ECS service rolling deploy process.
Any idea why it is behaving like this?
My Terraform script is below:
```
# service for ECS cluster
resource "aws_ecs_service" "ecsservice" {
  name            = "${var.aws_ecs_service_name}"
  cluster         = "${aws_ecs_cluster.ecscluster.id}"
  task_definition = "${aws_ecs_task_definition.app.arn}"
  desired_count   = 1
  iam_role        = "${aws_iam_role.elb_role.arn}"
  depends_on      = ["aws_iam_role_policy.elb_policy"]

  load_balancer {
    elb_name       = "${aws_elb.elb.name}"
    container_name = "app"
    container_port = 8081
  }
}

# Task definition for ECS service
resource "aws_ecs_task_definition" "app" {
  family                = "app"
  container_definitions = "${file("task-definitions/app.json")}"
}
```
I found a similar issue #4378, but mine is a different case, as I am on the latest version of Terraform, i.e. v0.8.4.
@grubernaut Awaiting response.. I am stuck due to this issue..
Hello,
Same for me.
I have Terraform v0.8.4 installed.
As soon as I change anything in the Task Definition and then apply the changes with terraform apply, I get the service in the "INACTIVE" state, and the only option to fix it is:
1. Remove the aws_ecs_service section
2. terraform apply to destroy the service
3. Add the aws_ecs_service section back
4. terraform apply to create the service again

@OleksandrUA hey.. thanks for this solution. It works, but this way we have downtime, right? What if we don't want any downtime, just like ECS provides when we update the service manually?
+1
Same behavior in 0.8.5. This prevents those of us who use Terraform from leveraging the rolling task commission/decommission that ECS provides.
@kailashv since Terraform is unstable in this particular case, I only use it to prepare the infrastructure, and the deployment in my case is done with the aws ecs CLI tool. Hopefully I will switch completely to Terraform as soon as it works reliably.
Just wanted to confirm that in 0.8.5 it works as I expect if I manually concat family:revision rather than using the task definition ARN in the service definition.
so
```
task_definition = "${aws_ecs_task_definition.my_task.family}:${aws_ecs_task_definition.my_task.revision}"
```
works, but
```
task_definition = "${aws_ecs_task_definition.my_task.arn}"
```
doesn't
Hello –
I'm trying to track this down using Terraform v0.8.6, and I'm unable to reproduce this so far. I'm using the configuration files from here (a demo repo of mine):
With that setup, when I change the task_definition (e.g. memory from 128 -> 256), both Terraform and ECS behave as expected:
The only problem I hit was in early testing: I didn't have enough memory/compute when I bumped the memory from 128 -> 512, as the instance size I chose couldn't run 3 tasks of that size.
Please let me know if you're still hitting this issue!
I should note: when only 1 of my new tasks booted up, the "Events" tab explained that my lack of memory was preventing other tasks from booting, so maybe the "Events" tab has information here?
@catsby hey, I had a look at your repo and made the required changes in the hope that this issue would be resolved, but I am still getting it. ECS doesn't drain the existing running task; it just changes it to [INACTIVE] mode and also doesn't spin up the new task.
Hey @kailashv when this happens, do you see any errors or comments in the events history? Does it mention instance capacity or anything similar?
@catsby I am not getting any errors or comments on the Events tab. I am using a c4.large instance, so I don't think it's a memory-related problem.
Please find the task definition below for more details:
app.json:
```
[
  {
    "name": "abc",
    "image": "12345.dkr.ecr.us-east-1.amazonaws.com/abc:1.0",
    "memoryReservation": 500,
    "essential": true,
    "environment": [
      { "name": "ENV", "value": "xyz" }
    ],
    "portMappings": [
      {
        "containerPort": 8065,
        "hostPort": 8065
      }
    ]
  }
]
```
@kailashv deregistering task definitions does not lead to downtime. Inactive tasks will still be alive and services can even autoscale.
http://docs.aws.amazon.com/AmazonECS/latest/developerguide/deregister-task-definition.html
And yes, running code like this:
resource "aws_ecs_service" "ecs_service" {
name = "${var.name}"
cluster = "${var.cluster_name}"
task_definition = "${var.task_definition_arn}"
desired_count = "${var.desired_count}"
iam_role = "${var.iam_role_arn}"
}
and modifying the container definitions in the task definition leads to this:
```
The Terraform execution plan has been generated and is shown below.
Resources are shown in alphabetical order for quick scanning. Green resources
will be created (or destroyed and then created if an existing resource
exists), yellow resources are being changed in-place, and red resources
will be destroyed. Cyan entries are data sources to be read.

Note: You didn't specify an "-out" parameter to save this plan, so when
"apply" is called, Terraform can't guarantee this is what will execute.

+ aws_ecs_task_definition.ecs_task_definition
    arn:                   "<computed>"
    container_definitions: "353b9d1d18a009b8471b526e3df812c7cc6367ee"
    family:                "stg-api"
    network_mode:          "<computed>"
    revision:              "<computed>"

- aws_ecs_task_definition.ecs_task_definition

Plan: 1 to add, 0 to change, 1 to destroy.
```
I was expecting the service to be modified to reflect the task definition changes.
Hello all –
I've updated some sample code of mine to try and reproduce this issue, but so far I've had no luck:
If anyone can reproduce with that, or modify it to demo the reproduction, that would be very helpful. If not, I'll likely close this issue soon.
@pseudobeer Yes, when I update the task definition I get the same output, but the service does not roll to the new task definition. The older task keeps running in inactive mode, yet when I make a new release I expect the new task to be running. There should not be any manual effort like stopping the [INACTIVE] task so that the ECS service starts a new task with the updated task definition.
@catsby Hey, I cross-checked my code with yours but could not find any difference. The IAM role, policy, task definition and service are all written the same, but I am still getting this issue, which is blocking a one-click deployment. If this issue persists I will have to move my release process to CloudFormation, as I tried that and did not hit any issue there. If anyone else is getting this type of issue, please post your findings here.
@kailashv what I do in my deploy scripts after terraform apply is:
```
REVISION=$(terraform output task_revision)
aws ecs update-service --cluster ${CLUSTER_NAME} --service ${SERVICE_NAME} --task-definition ${APP_NAME}:${REVISION}
```
I force ECS to update the service with the new task definition made by Terraform. Then ECS does a rolling update, and once it finishes there are no more [INACTIVE]s. (This should really be done by Terraform.)
The problem is that Terraform does not update the service with the new task definition, even though the new task definition is clearly created.
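For reference, a minimal sketch of the Terraform output that the `terraform output task_revision` call above assumes (the resource name is illustrative):
```
# Hypothetical output exposing the task definition revision to the deploy script
output "task_revision" {
  value = "${aws_ecs_task_definition.app.revision}"
}
```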
Providing more details of my code:
resource "aws_iam_role" "ecs_service" {
name = "EcsServiceRoleTest"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {"AWS": "*"},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
resource "aws_iam_role_policy" "ecs_service" {
name = "EcsServicePolicyTest"
role = "${aws_iam_role.ecs_service.name}"
policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"elasticloadbalancing:*",
"ec2:*",
"ecs:*"
],
"Resource": [
"*"
]
}
]
}
EOF
}
@pseudobeer uploading a screenshot of the issue I am facing.
[screenshot: ECS service Tasks tab showing the old task as INACTIVE]
After updating the task definition, the service is updated with the new revision of the task, but it does not roll the old [INACTIVE] task over to the new version. I have to manually stop the running task so that the service starts a new task with the latest task definition.
@pseudobeer As I am using port forwarding in my container, I am getting a port conflict error while using the solution you provided, i.e.
```
REVISION=$(terraform output task_revision)
aws ecs update-service --cluster ${CLUSTER_NAME} --service ${SERVICE_NAME} --task-definition ${APP_NAME}:${REVISION}
```
@kailashv Are you by any chance using static port mapping, i.e. 80:80? If so, your new tasks are failing because there is no EC2 instance with a free port for your container. You can check that in the Events tab.
By manually stopping the task, your only port is freed, and with a free port available the new task can be placed. After updating the Service with a new Task Definition, ECS does a rolling update by first creating new containers and killing old ones after creation finishes. As creation happens before destruction, ECS fails to allocate the port.
I would recommend dynamic port mapping, i.e. 0:80, which maps a random host port to a static container port. If a Service is configured with a Target Group and dynamic ports, the container entry with its random port will be added to the Target Group automatically, so you don't have to worry about autoscaling.
TL;DR: You can try adding an extra EC2 instance to your cluster and see if it helps.
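A minimal sketch of the dynamic port mapping recommended above, assuming an illustrative task definition; hostPort 0 lets ECS pick a free host port for each task:
```
resource "aws_ecs_task_definition" "app" {
  family = "app"

  # hostPort = 0 => ECS assigns a random free host port, avoiding port
  # collisions during rolling deploys on a single instance.
  container_definitions = <<EOF
[
  {
    "name": "app",
    "image": "12345.dkr.ecr.us-east-1.amazonaws.com/abc:1.0",
    "memoryReservation": 500,
    "essential": true,
    "portMappings": [
      { "containerPort": 8065, "hostPort": 0 }
    ]
  }
]
EOF
}
```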
@pseudobeer yes, I am using static port mapping, but I tried the same setup using CloudFormation (static port 80:80 and one instance) and it rolled the task successfully; even when I update the task manually using the AWS console UI, it works fine.
All of my applications' release processes currently use static ports only. I am planning to change that, but for now I have to find a way to streamline the release process, i.e. one-click deployment with no manual effort.
Please let me know if i need to provide more info in this issue.
Thank you :-)
@kailashv dynamic ports are less painful, but if you have to be on static, you could fix the deployment by adding extra EC2 instances to ECS as I mentioned above.
Even with free resources, though, if you have 99 tasks and reservable resources to allocate only 1 task, during a deploy ECS will create 1 new task, delete 1 old task, create 1 new, and so on. After terraform apply, I double resources before aws ecs update-service. This way ECS can allocate 99 more tasks, so it creates the new 99 and deletes the old 99 at once. Here is code to double an Auto Scaling Group:
```
# Describe the Auto Scaling Group backing the ECS cluster
ASG_DESCRIPTION=$(aws autoscaling describe-auto-scaling-groups | jq --arg matcher "${ASG_NAME}" '.AutoScalingGroups[] | select(.AutoScalingGroupARN | contains($matcher))')
# ASG tags carry the group name as ResourceId, which set-desired-capacity accepts
ASG_ARN=$(echo -n "${ASG_DESCRIPTION}" | jq -r '.Tags[0].ResourceId')
CURRENT_DESIRED_COUNT=$(echo -n "${ASG_DESCRIPTION}" | jq -r '.DesiredCapacity')
DESIRED_COUNT=$((CURRENT_DESIRED_COUNT * 2))
aws autoscaling set-desired-capacity --auto-scaling-group-name ${ASG_ARN} --desired-capacity ${DESIRED_COUNT}
```
In your case, you could add 1 instead of multiplying by 2; that is up to you.
To summarize: Terraform does not update the Service with the new Task Definition and does not provide extra resources for a smooth deploy, which would be pretty cool (and then I would not have to mess with the AWS API).
+1 @catsby @pseudobeer @kailashv I'm getting the same issue while updating an existing ECS service using Terraform v0.8.6. The task definition shows the status as INACTIVE, and we have to manually stop the task to complete our deployment. It seems like you're facing the same issue.
Thank you for posting the screenshot, @kailashv . Can you possibly share or post a shot of the Events tab, right next to the Tasks tab, on that same screen?
@tavisca-IdrisW – can you share what's happening on the Events tab in the AWS Console UI?
Some ops people I have discussed this issue with only provision with Terraform, and deploy with a tool called ecs-formation.
@catsby hey, please find the screenshot of my Events tab below:
[screenshot: Events tab showing a port conflict error]
Here I am getting port conflict errors, but when I release using CloudFormation it automatically rolls the task even if I have only one container instance and a static port. I need a solution for the case where I am using static port forwarding.
@kailashv there is no way without an extra instance AFAIK. You can accomplish that with count = ${var.somevar} in your instance definition, and generate terraform.tfvars / use -var.
This is the kind of issue I was suspecting when I mentioned the events tab a few comments back. It seems this is an issue with available compute power/abilities, as your instances don't have a free port, and you don't have spare instances with that port available. ECS spins up new tasks before powering down old tasks, thus the collision here.
When you mention releasing with CloudFormation, are you using CloudFormation to create the service and cluster? Or just releasing the new Task Definition? If the former, are you creating an AppAutoScalingGroup?
Hey all –
I've done some tinkering and reading and found a possible solution: the deployment_minimum_healthy_percent attribute of aws_ecs_service. This attribute controls the lower limit (as a percentage of the service's desiredCount) on the number of tasks that must remain running and healthy during a deployment. The default here is 100, so the service is not allowed to have fewer than all of its tasks running during a deployment and still be considered healthy.
Given the above constraint of a 1:1 port-to-host mapping with static port mapping, if you don't have extra capacity available, or your ECS instances are not configured in an Auto Scaling group with CloudWatch, you won't be able to do a rolling deploy with Terraform without this argument.
To demonstrate, I created a new example setup here:
That will create an example setup using a Classic ELB and static port mapping to the host. In resource "aws_ecs_service" "outyet" you'll see I've added deployment_minimum_healthy_percent = 50. You can then modify the task_definition (switching the memory/cpu from 128 to 256), and with deployment_minimum_healthy_percent the service will do a rolling deployment. deployment_minimum_healthy_percent is a percentage, so if you have 4 instances you could use 75 and still get a rolling deployment with only 1 service instance out at a time.
Without deployment_minimum_healthy_percent, the above example will just sit in the above described [INACTIVE] state.
Please let me know if deployment_minimum_healthy_percent helps here
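As a minimal sketch (names and values are illustrative, based on the configuration at the top of this thread), the only change needed is the extra argument on the service:
```
resource "aws_ecs_service" "ecsservice" {
  name            = "${var.aws_ecs_service_name}"
  cluster         = "${aws_ecs_cluster.ecscluster.id}"
  task_definition = "${aws_ecs_task_definition.app.arn}"
  desired_count   = 2
  iam_role        = "${aws_iam_role.elb_role.arn}"
  depends_on      = ["aws_iam_role_policy.elb_policy"]

  # Allow the deployment to drop to half of desired_count so the statically
  # mapped host port is freed before the replacement task starts.
  deployment_minimum_healthy_percent = 50

  load_balancer {
    elb_name       = "${aws_elb.elb.name}"
    container_name = "app"
    container_port = 8081
  }
}
```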
@catsby hey.. I am not able to access the example setup "https://gist.github.com/catsby/5dc25ba9d7b5a02b2fd3ed7cd80b3c74a" you provided; I'm getting a page-not-found error. Could you please check that? I will try to implement it using deployment_minimum_healthy_percent.
Thanks
Sorry about that @kailashv – the correct link is https://gist.github.com/catsby/5dc25ba9d7b5a02b2fd3ed7cd80b3c74 (corrected above too)
@catsby hey.. this worked :-) Thank you so much for this solution.
I updated the Terraform scripts with deployment_minimum_healthy_percent = 50 under the ECS service and changed the desired container instances to 2. Now, after releasing a new version of the task, the service automatically rolls to the new task.
closing this issue.
Thank you @catsby @pseudobeer for your great responses :-)
@catsby in order to perform live no-downtime deployments we actually have to have deployment_minimum_healthy_percent = 100. But because we use dynamic ports, that is not an issue...
In our case, the service simply ignores task definition changes, even though the new task definition is created. We rely on AWS CLI hacks right now -- can I ask for help?
Hey @pseudobeer – from reading the documentation, I would think 50% would still result in no-downtime deployments:
The minimum healthy percent represents a lower limit on the number of your service's tasks that must remain in the RUNNING state during a deployment, as a percentage of the desired number of tasks (rounded up to the nearest integer). This parameter enables you to deploy without using additional cluster capacity. For example, if your service has a desired number of four tasks and a minimum healthy percent of 50%, the scheduler may stop two existing tasks to free up cluster capacity before starting two new tasks. Tasks for services that do not use a load balancer are considered healthy if they are in the RUNNING state; tasks for services that do use a load balancer are considered healthy if they are in the RUNNING state and the container instance it is hosted on is reported as healthy by the load balancer. The default value for minimum healthy percent is 50% in the console and 100% for the AWS CLI, the AWS SDKs, and the APIs.
As you mentioned too, using dynamic ports should also alleviate this. Nonetheless, if you're having issues with task definition changes, can you please open a new issue and ping me on it? It's likely a separate issue from this one, and worth its own issue to track. Thanks!
@catsby sure, will do.
TESTING ONLY
If your service has only 1 replica (desired count) and you want to see the update ASAP, set deployment_minimum_healthy_percent to 0.
Also, update the deregistration_delay attribute of aws_alb_target_group.test to something lower than 300 to get a faster draining state.
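A minimal sketch of those two testing-only tweaks, assuming an illustrative target group and service (resource names and values are examples):
```
resource "aws_alb_target_group" "test" {
  name     = "test"
  port     = 80
  protocol = "HTTP"
  vpc_id   = "${aws_vpc.main.id}"

  # Default is 300 seconds; lowering it shortens the draining state.
  deregistration_delay = 30
}

resource "aws_ecs_service" "test" {
  name            = "test"
  cluster         = "${aws_ecs_cluster.main.id}"
  task_definition = "${aws_ecs_task_definition.app.arn}"
  desired_count   = 1

  # Testing only: the single task may be stopped before its replacement starts.
  deployment_minimum_healthy_percent = 0
}
```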
@catsby I'm interested in how you got it to work with Auto Scaling groups. I have a similar requirement but am finding another problem:
Basically I want to do a rolling deployment, and I can actually use dynamic ports. So with extra capacity in my ECS instances I can actually run a new task and remove an old one: all good.
The problem I'm having right now is that I also have an autoscaling group configured and attached to my target group with an "ELB" health check type:
resource "aws_autoscaling_group" "app" {
name = "as_group"
vpc_zone_identifier = ["${aws_subnet.main.*.id}"]
min_size = "${var.asg_min}"
max_size = "${var.asg_max}"
desired_capacity = "${var.asg_desired}"
launch_configuration = "${aws_launch_configuration.app.name}"
health_check_type = "ELB"
target_group_arns = ["${aws_alb_target_group.lb-target-group.arn}"]
}
This is the same group I use in my ecs_service:
resource "aws_ecs_service" "app" {
name = "ecs_service"
cluster = "${aws_ecs_cluster.app_cluster.id}"
task_definition = "${aws_ecs_task_definition.app.arn}"
desired_count = 1
iam_role = "${aws_iam_role.ecs_service.name}"
load_balancer {
target_group_arn = "${aws_alb_target_group.lb-target-group.arn}"
container_name = "api"
container_port = "3000"
}
}
and this is my ALB target group configuration:
resource "aws_alb_target_group" "lb-target-group" {
name = "lb-target-group"
port = 80
protocol = "HTTP"
vpc_id = "${aws_vpc.default.id}"
health_check {
path = "/health"
}
}
The tasks register with the target group correctly with dynamic ports, and I can do rolling deployments as well without issues.
The problem I have at the moment is that because the target group definition requires a port, and I have no idea which dynamic port will be assigned, it initially creates a target pointing to port 80, which obviously doesn't pass the health check, so the ALB marks that target as unhealthy. So even when my task registers as another target with the right port and gets marked as healthy, the ASG will eventually kill my instance (as it thinks it's unhealthy) and spin up another one. The new one has the same behaviour, so I get an infinite loop of instance creation and destruction.
Any ideas here? Has anyone been able to solve this health check issue either with dynamic ports, or managed to do rolling deployments otherwise when only running one instance of a task?
Thanks for any help here, much appreciated.
I was able to solve the inactive task definition issue with the example in the ECS task definition data source documentation. You set up the ECS service resource to use the max revision of either what your Terraform resource has created or what is already registered in AWS, which the data source retrieves.
The one downside to this is that if someone changes the task definition outside of Terraform, Terraform will not realign it to what's defined in code.
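A minimal sketch of that pattern, adapted from the aws_ecs_task_definition data source documentation (resource names are illustrative):
```
# Look up the latest ACTIVE revision already registered in the account
data "aws_ecs_task_definition" "app" {
  task_definition = "${aws_ecs_task_definition.app.family}"
}

resource "aws_ecs_service" "app" {
  name          = "app"
  cluster       = "${aws_ecs_cluster.main.id}"
  desired_count = 1

  # Use whichever revision is newer: the one Terraform just created or the
  # one currently registered in AWS.
  task_definition = "${aws_ecs_task_definition.app.family}:${max("${aws_ecs_task_definition.app.revision}", "${data.aws_ecs_task_definition.app.revision}")}"
}
```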
While this may or may not help others, I found this issue when searching for a similar problem with regards to the "inactive" tasks not cycling.
I had incorrectly added this to my aws_ecs_service definition:
```
placement_constraints {
  type = "distinctInstance"
}
```
If you are only running one instance, then there is no way for multiple tasks for the same service to exist on the same instance, and the new tasks will never rotate in. So if you have the distinctInstance placement constraint set, remove it and try again.
Additionally you'll want to make sure that your parameters for desired_count, deployment_minimum_healthy_percent, and deployment_maximum_percent are sufficient to allow extra containers to rotate in while not incurring any downtime.
For example:
```
desired_count                      = 2
deployment_minimum_healthy_percent = 50
deployment_maximum_percent         = 200

placement_strategy {
  type  = "spread"
  field = "instanceId"
}
```
The above will try to keep the service's 2 tasks spread across instances, and when you deploy a new revision of your task it will allow the number of tasks on your previous revision to drop to 1 (or the total number of service tasks to go up to 4), which leaves room for the new task revision to become active. If you're on a t2.micro instance you'll likely run out of memory pretty quickly if you go with a higher desired count.
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.