Amazon-ecs-agent: ECS failing to execute upstart-job from user data

Created on 13 Dec 2018 · 17 Comments · Source: aws/amazon-ecs-agent

Summary

EC2 user data is not being properly parsed, resulting in the ECS task not starting.

Description

We are attempting to upgrade from the ECS Optimized Amazon Linux AMI (version 1) to the ECS Optimized Amazon Linux 2 AMI, and our ECS task is not getting started. Our EC2 user data is multi-part MIME encoded text, as outlined in the AWS documentation. The upstart-job part, which is what starts our task, does not appear to be executed: the task does not start, and there is no output from the script contained in this upstart-job. Looking in the logs at /var/log/ecs/ we see this output:

2018-12-12T22:23:59Z [INFO] Loading configuration
2018-12-12T22:23:59Z [INFO] Unable to parse user data: invalid character 'C' looking for beginning of value
2018-12-12T22:23:59Z [INFO] Amazon ECS agent Version: 1.22.0, Commit: 26518174

which, looking at the source, would indicate this is trying to interpret the user data as JSON, which it is not. I've been unable to find documentation describing a differing format for the user data than that of the multipart MIME format.
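To illustrate why the parser complains about 'C': a JSON document would begin with `{`, while multipart MIME user data begins with the `Content-Type:` header. The following is a minimal sketch of telling these formats apart by their first line. The function name `userdata_kind` and its labels are hypothetical and are not part of the ECS agent or cloud-init.

```shell
# Hypothetical helper (not part of the ECS agent): guess the kind of user
# data from the first line read on stdin.
userdata_kind() {
    first_line=""
    # read returns non-zero if the input has no trailing newline, so ignore it
    IFS= read -r first_line || true
    case "$first_line" in
        "{"*)             echo "json" ;;
        "Content-Type:"*) echo "mime-multipart" ;;
        "#!"*)            echo "shell-script" ;;
        "#cloud-config"*) echo "cloud-config" ;;
        *)                echo "unknown" ;;
    esac
}

# The first line of our user data starts with 'C', so it is not JSON
printf 'Content-Type: multipart/mixed; boundary="==BOUNDARY=="\n' | userdata_kind
# prints: mime-multipart
```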

Here's our user data (truncated):

Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
Content-Type: text/upstart-job; charset="us-ascii"

#upstart-job
description "Amazon EC2 Container Service (start task on instance boot)"
author "Us"
start on started ecs

script
    mkdir -p /var/log/startup
    exec 1>>/var/log/startup/ecs-start-task.log 2>&1
    set -x
    until curl -s http://localhost:51678/v1/metadata
    do
        sleep 1
    done

        [snip]

    aws ecs start-task --cluster "${ECS_CLUSTER}" --task-definition "${ECS_TASK_DEFINITION}" --container-instances "${AWS_INSTANCE_ARN}" --started-by "${AWS_INSTANCE_ARN}" --region "${AWS_REGION}" --overrides "${CONTAINER_OVERRIDES}"

    echo "End EC2 task startup script: $(date)"
end script
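
For reference, the multipart layout used above consists of a header, parts introduced by `--==BOUNDARY==` lines, and a closing `--==BOUNDARY==--` line. A minimal sketch of assembling that layout from separate files follows. The function name `make_user_data` and the file names in the usage line are hypothetical, and the `us-ascii` charset is simply copied from the user data above.

```shell
# Hypothetical helper: print multipart user data to stdout.
# Arguments come in pairs: <content-type> <file>.
make_user_data() {
    boundary="==BOUNDARY=="
    printf 'Content-Type: multipart/mixed; boundary="%s"\n' "$boundary"
    printf 'MIME-Version: 1.0\n\n'
    while [ "$#" -ge 2 ]; do
        # Each part starts with the boundary line and its own header
        printf '%s\n' "--$boundary"
        printf 'Content-Type: %s; charset="us-ascii"\n\n' "$1"
        cat "$2"
        printf '\n'
        shift 2
    done
    # The closing delimiter carries two extra trailing dashes
    printf '%s\n' "--$boundary--"
}

# Usage (file names are placeholders):
# make_user_data text/cloud-boothook boothook.sh text/x-shellscript setup.sh > user-data.txt
```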

Expected Behavior

We expect the user data to be properly parsed and executed.

Observed Behavior

The ECS Task fails to start since the upstart-job fails to execute.

Environment Details

[ec2-user@ip-172-31-0-92 ecs]$ curl http://localhost:51678/v1/metadata
{"Cluster":"default","ContainerInstanceArn":"arn:aws:ecs:us-east-1:760528713078:container-instance/46d67ed6-44f9-42be-8998-87ab9baf4d2c","Version":"Amazon ECS Agent - v1.22.0 (26518174)"}
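
The ContainerInstanceArn in this output carries both the region and the container instance ID, which is what a start-task script needs. Here is a minimal sketch of splitting it, assuming the usual colon-separated ARN layout (arn:partition:service:region:account:resource). The variable names are my own.

```shell
# Sample value copied from the metadata output above
arn="arn:aws:ecs:us-east-1:760528713078:container-instance/46d67ed6-44f9-42be-8998-87ab9baf4d2c"

# Fields 4 and 5 of the colon-separated ARN are the region and account ID
region=$(printf '%s\n' "$arn" | cut -d: -f4)
account=$(printf '%s\n' "$arn" | cut -d: -f5)

# The instance ID is whatever follows the last slash of the resource part
instance_id="${arn##*/}"

echo "region=$region account=$account instance_id=$instance_id"
```

Taking everything after the last slash also works if the resource part has more than one slash-separated segment.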

Supporting Log Snippets

[ec2-user@ip-172-31-0-92 ecs]$ cat ecs-agent.log.2018-12-12-22 
2018-12-12T22:23:59Z [INFO] Loading configuration
2018-12-12T22:23:59Z [INFO] Unable to parse user data: invalid character 'C' looking for beginning of value
2018-12-12T22:23:59Z [INFO] Amazon ECS agent Version: 1.22.0, Commit: 26518174
2018-12-12T22:23:59Z [INFO] Creating root ecs cgroup: /ecs
2018-12-12T22:23:59Z [INFO] Creating cgroup /ecs
2018-12-12T22:23:59Z [INFO] Loading state! module="statemanager"
2018-12-12T22:23:59Z [INFO] Event stream ContainerChange start listening...
2018-12-12T22:23:59Z [INFO] Registering Instance with ECS
2018-12-12T22:23:59Z [INFO] Remaining mem: 985
2018-12-12T22:23:59Z [INFO] Registered container instance with cluster!
2018-12-12T22:23:59Z [INFO] Registration completed successfully. I am running as 'arn:aws:ecs:us-east-1:760528713078:container-instance/46d67ed6-44f9-42be-8998-87ab9baf4d2c' in cluster 'default'
2018-12-12T22:23:59Z [INFO] Saving state! module="statemanager"
2018-12-12T22:23:59Z [INFO] Beginning Polling for updates
2018-12-12T22:23:59Z [INFO] Event stream DeregisterContainerInstance start listening...
2018-12-12T22:23:59Z [INFO] Initializing stats engine
2018-12-12T22:23:59Z [INFO] Establishing a Websocket connection to https://ecs-a-2.us-east-1.amazonaws.com/ws?agentHash=26518174&agentVersion=1.22.0&clusterArn=default&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A760528713078%3Acontainer-instance%2F46d67ed6-44f9-42be-8998-87ab9baf4d2c&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=true&seqNum=1
2018-12-12T22:23:59Z [INFO] NO_PROXY set:169.254.169.254,169.254.170.2,/var/run/docker.sock
2018-12-12T22:23:59Z [INFO] Establishing a Websocket connection to https://ecs-t-2.us-east-1.amazonaws.com/ws?cluster=default&containerInstance=arn%3Aaws%3Aecs%3Aus-east-1%3A760528713078%3Acontainer-instance%2F46d67ed6-44f9-42be-8998-87ab9baf4d2c
2018-12-12T22:23:59Z [INFO] Connected to ACS endpoint
2018-12-12T22:23:59Z [INFO] Connected to TCS endpoint
2018-12-12T22:24:09Z [INFO] Saving state! module="statemanager"

Most helpful comment

For anyone suffering from a lack of documentation, I've managed to get a working solution. Without updates to the official docs it's unclear if this is the "correct" way to launch ECS tasks from EC2 user data, but I hope this helps illustrate what I am trying to do, and perhaps helps someone in a similar situation.

This is my rewrite of the user data samples given at https://docs.aws.amazon.com/AmazonECS/latest/developerguide/start_task_at_launch.html and https://docs.aws.amazon.com/AmazonECS/latest/developerguide/bootstrap_container_instance.html such that instead of a MIME part for upstart we write out a systemd service unit which executes a bash script, similar to what the examples in the aforementioned documentation do, to start the ECS task.

NOTE: Edited March 12, 2019 to include workarounds for issues which prevented the originally posted user data from operating as intended. Specifically, a workaround for #1707 and a wait loop to wait for the ecs service to be responsive (even though the systemd ecs.service has started).

NOTE: Edited March 13, 2019 to further refine workaround for #1707. Specifically we can avoid the need to edit the ecs.service and simply use --no-block when starting our service so the user data script exits and allows the system to boot.

Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii"

# Set Docker daemon options
cloud-init-per once docker_options echo 'OPTIONS="${OPTIONS} --foo bar"' >> /etc/sysconfig/docker

--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
# Specify the cluster that the container instance should register into
cluster=your_cluster_name

# Write the cluster configuration variable to the ecs.config file
# (add any other configuration variables here also)
echo "ECS_CLUSTER=$cluster" >> /etc/ecs/ecs.config

# Install the AWS CLI and the jq JSON parser
yum install -y aws-cli jq

START_TASK_SCRIPT_FILE="/etc/ecs/ecs-start-task.sh"
cat <<- 'EOF' > ${START_TASK_SCRIPT_FILE}
    exec 2>>/var/log/ecs/ecs-start-task.log
    set -x
    # Wait for the ecs service to be responsive
    until curl -s http://localhost:51678/v1/metadata
    do
        sleep 1
    done

    # Grab the container instance ID, cluster name, and AWS region from the ECS agent introspection endpoint
    instance_arn=$(curl -s http://localhost:51678/v1/metadata | jq -r '. | .ContainerInstanceArn' | awk -F/ '{print $NF}' )
    cluster=$(curl -s http://localhost:51678/v1/metadata | jq -r '. | .Cluster' | awk -F/ '{print $NF}' )
    region=$(curl -s http://localhost:51678/v1/metadata | jq -r '. | .ContainerInstanceArn' | awk -F: '{print $4}')

    # Specify the task definition to run at launch
    task_definition=my_task_def

    # Run the AWS CLI start-task command to start your task on this container instance
    aws ecs start-task --cluster "$cluster" --task-definition "$task_definition" --container-instances "$instance_arn" --started-by "$instance_arn" --region "$region"
EOF

# Write systemd unit file
UNIT="ecs-start-task.service"
cat <<- EOF > /etc/systemd/system/${UNIT}
    [Unit]
    Description=ECS Start Task
    Requires=ecs.service
    After=ecs.service

    [Service]
    # Restart only on failure: the script exits 0 once start-task succeeds
    Restart=on-failure
    ExecStart=/usr/bin/bash ${START_TASK_SCRIPT_FILE}

    [Install]
    WantedBy=default.target
EOF

# Enable our ecs.service dependent service with `--no-block` to prevent systemd deadlock
# See https://github.com/aws/amazon-ecs-agent/issues/1707
systemctl enable --now --no-block "${UNIT}"
--==BOUNDARY==--
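
The wait loop in the script above polls forever if the agent never becomes responsive. A bounded variant is sketched below. It assumes you would rather give up after a fixed number of attempts. The helper name `wait_until` and the limits in the usage line are hypothetical.

```shell
# Hypothetical helper: run a command until it succeeds, giving up after
# max_attempts tries. Returns 0 on success, 1 if every attempt failed.
wait_until() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Discard the command's output; only its exit status matters here
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage in the start-task script (assumed limits: 120 tries, 1 second apart):
# wait_until 120 1 curl -sf http://localhost:51678/v1/metadata || exit 1
```

Exiting non-zero on timeout makes the failure visible in the service status instead of leaving the script hanging.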

All 17 comments

Thank you for reporting this issue. The user data parse error message ("unable to parse user data") isn't actually an error message, and we have cleaned up this log entry.
This may be related to the fact that AL2 doesn't use upstart but systemd. I will research more and update.

According to https://aws.amazon.com/amazon-linux-2/faqs/ , systemd replaced upstart in AL2. Could you please try again after migrating your upstart script to systemd?

Closing this since we haven't heard back. Feel free to reopen or update the issue if needed.

Hi, Sorry for the delay but I've been trying to follow your advice without success. I do not see any updated documentation on how to start the ECS task using an EC2 Launch type. I've been using the user data approach documented here https://docs.aws.amazon.com/AmazonECS/latest/developerguide/start_task_at_launch.html to configure and start the ECS task from the EC2 user data.

Can you please point me at documentation which replaces this approach?

Thank you.

@yunhee-l I see you made similar comments about user data not parsing (see https://github.com/aws/amazon-ecs-agent/pull/1758 and related issue https://github.com/aws/amazon-ecs-agent/issues/1753 ). Your comments imply the user data is now in some JSON format specific to the agent config, but all the documentation I have seen to date references the ability to have various forms of user data, such as multipart MIME or shell script (see https://docs.aws.amazon.com/AmazonECS/latest/developerguide/start_task_at_launch.html , https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/amazon-linux-ami-basics.html#amazon-linux-cloud-init , https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html and https://aws.amazon.com/premiumsupport/knowledge-center/execute-user-data-ec2/ ). I would love to migrate to the new AMI, but not understanding how to properly configure and start my tasks from EC2 is blocking my progress.
Thanks for your insight and assistance.

Passing agent config (in JSON format) through user data isn't required. It's a convenience feature we enabled; you do not have to configure the ECS agent through the EC2 user data of the ECS instance. PR #1758 fixed a superfluous log message that was cluttering the log as a result of enabling this feature.

Are you using ECS Optimized AMI or your own custom AMI?

@yunhee-l Thanks for your reply. I'm using Amazon ECS-Optimized Amazon Linux 2 AMI (ami-0a6b7e0cc0b1f464f).

The error message, as with issue #1753, is the error reported by the JSON parser trying to interpret the content of the user data as JSON. I'm sure you're familiar with the code more than I, as I just read through it trying to understand what was going wrong. If you look back at my original post here you'll see the error refers to an "invalid character 'C'" which is the first character of the user data we have configured.

This started as an attempt to migrate from Amazon ECS-Optimized Amazon Linux AMI to Amazon ECS-Optimized Amazon Linux 2 AMI, which we had configured based off of the document https://docs.aws.amazon.com/AmazonECS/latest/developerguide/start_task_at_launch.html which indicates setting up a multi-part user data with an upstart job (and other unix scripts). If this has been replaced by something better, I'm happy to use it, but I would love your help locating documentation which allows us to properly configure and start my tasks.

@levigroker as I have said above, the error message regarding JSON parsing is not relevant for AL to AL2 migration.
Have you tried migrating your upstart script to systemd?

@yunhee-l I would very much appreciate some indication of how I might migrate from upstart to systemd. I'm not sure how to proceed because the documentation I've been able to locate does not have any indication on how I should be starting tasks other than the approach I've been taking with the user data scripts.

Migrating from upstart to systemd isn't particular to ECS, and I am not sure whether there's a public AWS ECS doc on this. I did a quick search and found some pages that might be a good place to start:
https://wiki.ubuntu.com/SystemdForUpstartUsers
https://cloudonaut.io/migrating-to-amazon-linux-2/

@yunhee-l Many thanks... I feel like I must not be communicating clearly.

How would you suggest I use the AL2 ECS image to start my ECS task?

@levigroker

Glad that you got it working, and thanks for sharing the info with everyone.

Thanks for posting your solution @levigroker!

AWS Team: Can you please update this official documentation about starting a task at instance launch?

It's currently misleading as the method specified there does not work for Amazon Linux 2, while customers are actively being encouraged to switch to the new OS version. We learned about this the hard way when things started breaking after upgrading the AMI.

Thanks for the feedback, I am working with the internal team on this.

@ellenthsu I see that the documentation has been updated. Thanks for the quick action. :)
