Containers within the server are unable to access credentials from the ECS Agent
Containers within the server are unable to access credentials from the ECS Agent, so Boto (among other things) cannot authenticate from within the container
2019-01-08T12:26:40Z [INFO] CredentialsV2Request: ID not found. Request IP Address: 172.17.0.3:22252
2019-01-08T12:26:40Z [WARN] Unknown eventType: GetCredentialsInvalidRoleType
Amazon ECS Agent - v1.21.0 (3d368554)
Docker Version - 18.06.1-ce
Filesystem Size Used Avail Use% Mounted on
devtmpfs 3.9G 104K 3.9G 1% /dev
tmpfs 3.9G 0 3.9G 0% /dev/shm
/dev/xvda1 20G 1.1G 19G 6% /
Please provide a way to provide the supporting log files in a private manner
feel free to send logs to adnkha at amazon dot com with a reference to this issue. to help with debugging - do you have steps for a minimal repro? are you able to repro this in a constrained way?
I don't have exact steps to reproduce this. We faced this issue on only one ECS service; it was working for our other ECS services. The issue affected our ability to access S3 from within the container, as it could not fetch the IAM role. The HTTP request to fetch the IAM role gave the error above.
Things we tried:
We tried recreating the ECS service with the different images and tasks that did not help
We also tried upgrading the agent version to v1.23 as well, but that did not help.
Eventually the issue resolved itself in a few hours with the original ECS definition and ECR image
I've shared the logs, hope that can provide some insight.
@adnxn What does CredentialsV2Request: ID not found mean?
I'm getting the same error message and this issue is the first result Google gives me.
The underlying issue might be different, but I really want to know what the error message means so that I can have some clue.
I realized that the error message CredentialsV2Request: ID not found means the ID in AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is problematic. However, I have no idea how to find out why. It also happens in v1.24.
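For anyone wanting to see which ID their container is actually requesting, here is a minimal sketch in Python. It assumes the credentials endpoint is http://169.254.170.2 (the address that appears in the error messages in this thread); the function names are hypothetical, not part of any SDK. It is meant to be run inside the affected container and prints only the non-secret fields of the response.

```python
import json
import os
import urllib.error
import urllib.request

# Assumption: the task credentials endpoint seen in this thread's error messages.
ECS_CREDENTIALS_HOST = "http://169.254.170.2"


def credentials_url(environ=os.environ):
    """Build the full credentials URL from AWS_CONTAINER_CREDENTIALS_RELATIVE_URI.

    Returns None when the variable is not set in the container.
    """
    relative_uri = environ.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI")
    if not relative_uri:
        return None
    return ECS_CREDENTIALS_HOST + relative_uri


def fetch_credentials(url, timeout=2.0):
    """Return (status_code, parsed_json_body) for a GET on the credentials URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Non-2xx answers (such as the 400 in this thread) carry a JSON error body.
        return err.code, json.loads(err.read().decode("utf-8"))


if __name__ == "__main__":
    url = credentials_url()
    if url is None:
        print("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is not set")
    else:
        status, body = fetch_credentials(url)
        # Deliberately avoid printing AccessKeyId, SecretAccessKey or Token.
        print(url, status, body.get("code"), body.get("message"))
```

A 400 with code InvalidIdInRequest here would confirm that the agent does not know the ID, but it does not by itself explain why.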
Hi @adnxn
Adding to @superprat 's comments:
Faced the issue again today. Our production systems partially went down since they could not talk to S3 and Kinesis.
ERROR 2019-01-25 13:46:35,298 kinesis put_record_to_stream exception: Error when retrieving credentials from container-role: Error retrieving metadata: Received error when attempting to retrieve ECS metadata: Connect timeout on endpoint URL: "http://169.254.170.2/v2/credentials/XXXXXXXXX",
CredentialRetrievalError: Error when retrieving credentials from container-role: Error retrieving metadata: Received error when attempting to retrieve ECS metadata: Connect timeout on endpoint URL: "http://169.254.170.2/v2/credentials/XXXXXXXXX"
After 10 minutes this exception automatically went away and the systems started working normally.
Please help us here.
@ranvijayj: are you seeing the same errors in the agent logs that @superprat referenced?
specifically, see below:
2019-01-08T12:26:40Z [INFO] CredentialsV2Request: ID not found. Request IP Address: 172.17.0.3:22252
2019-01-08T12:26:40Z [WARN] Unknown eventType: GetCredentialsInvalidRoleType
also, the logs that you referenced - where are these logs originating from? they don't look like agent logs.
@superprat: so the set of logs you've sent are not at the debug level so our visibility is limited. i suspect the agent for some reason hadn't been relayed the container credentials from our backend by the time your application went looking for them. i think this is the case since you mention the issue is transient.
the [WARN] Unknown eventType: GetCredentialsInvalidRoleType entry is interesting, though I realised we can't see what role type was actually received. we should add more detailed logging for this failure mode.
are you still running into this regularly? i've tried to reproduce this but haven't had any luck.
@adnxn We are also seeing this issue regularly.
on one of our EC2 instances in the cluster, if we run docker logs ecs-agent we see:
2019-02-05T23:37:54Z [INFO] CredentialsV2Request: ID not found. Request IP Address: 10.0.100.189:33724
2019-02-05T23:37:54Z [WARN] Unknown eventType: GetCredentialsInvalidRoleType
2019-02-05T23:37:54Z 400 10.0.100.189:33724 "/v2/credentials" "aws-sdk-go/1.12.66 (go1.10.3; linux; amd64)" -
docker ps for the ecs agent looks like fwiw:
433399f72a24 amazon/amazon-ecs-agent:latest "/agent" 20 minutes ago Up 20 minutes ecs-agent
it appears the ID not found error happens when the API response on /v2/credentials is not successful which leads to services failing. It's only happening on a particular task for us as well. Other tasks are running just fine for us on the same instance in the same cluster.
Let me know if there is more information I can provide for you.
Did you notice any Docker timeout or other Docker errors in agent logs when this issue happened?
My theory is that Docker operation like inspect would have failed on the task's container, due to which agent would have moved the task to STOPPED. So the task's credentials are cleaned up as well. But the container is actually running and is now requesting for creds, which fails due to ID not being found.
Not really:
2019-02-05T23:35:35Z [INFO] Managed task [arn:aws:ecs:us-west-2:390...:task/45b5e47...]: redundant container state change. style-survey to RUNNING, but already RUNNING
2019-02-05T23:37:54Z [INFO] Handling http requestmethodGETfrom10.0.100.189:33724
2019-02-05T23:37:54Z [INFO] CredentialsV2Request: ID not found. Request IP Address: 10.0.100.189:33724
2019-02-05T23:37:54Z [WARN] Unknown eventType: GetCredentialsInvalidRoleType
2019-02-05T23:37:54Z 400 10.0.100.189:33724 "/v2/credentials" "aws-sdk-go/1.12.66 (go1.10.3; linux; amd64)" -
...
...
2019-02-06T00:07:12Z [INFO] TCS Websocket connection closed for a valid reason
2019-02-06T00:07:12Z [INFO] Establishing a Websocket connection to https://ecs-t-3.us-west-2.amazonaws.com/ws?cluster=production&containerInstance=arn%3Aaws%3Aecs%3Aus-west-2%3A3909...56%3Acontainer-instance%2Fproduction%2Fbefa...2f5
2019-02-06T00:07:12Z [INFO] Connected to TCS endpoint
If I view the /var/log/docker log on the same instance there is
time="2019-02-05T23:36:13.125150996Z" level=error msg="Error setting up exec command in container ecs-laravel-production-12-...01: Container 5c566...4bbcab15 is not running"
time="2019-02-05T23:36:15.106024184Z" level=error msg="Error setting up exec command in container ecs-laravel-production-12...01: Container 5c566...4bbcab15 is not running"
time="2019-02-05T23:37:54.524648464Z" level=error msg="Failed to create log stream" errorCode=CredentialsEndpointError logGroupName=/ecs/laravel-production logStreamName=ecs-laravel-fpm/laravel-fpm/771f8....05 message="failed to load credentials" origError="InvalidIdInRequest: CredentialsV2Request: ID not found"
time="2019-02-05T23:37:54.524758929Z" level=error msg="Handler for GET /v1.38/containers/ecs-......./logs returned error: failed to create Cloudwatch log stream: CredentialsEndpointError: failed to load credentials\ncaused by: InvalidIdInRequest: CredentialsV2Request: ID not found"
time="2019-02-05T23:48:27.334867542Z" level=error msg="stream copy error: reading from a closed fifo"
Not sure if that's helpful or not. If there are other log files I can poke at please let me know.
@jrichard0725 please feel free to send the full set of logs to sharanyd at amazon.com
If you can reproduce this with log level as DEBUG and obtain those, it would be really helpful.
The task role, the task execution role, and the EC2 instance role all have the CloudWatchLogsFullAccess policy.
Error response from daemon: failed to create Cloudwatch log stream: CredentialsEndpointError: failed to load credentials caused by: InvalidIdInRequest: CredentialsV2Request: ID not found
Anyone have any ideas?
I think I know why this error message is coming up now.
Read "Enabling the awslogs Log Driver for Your Containers":
https://docs.docker.com/config/containers/logging/awslogs/
Read "Credentials":
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_awslogs.html
No idea what "AWS_SESSION_TOKEN" is, but I went to IAM, added a user with the policy awslogs, and got the access and secret keys. I added them as environment variables to the container but still got the same error message.
From my point of view, I think this is a BUG. The AWS ECS agent should be providing the EC2 instance role permissions to the docker containers so they can do logging.
I am going to stop using awslogs - too much hassle.
Tried this, still not working
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_cloudwatch_logs.html
Hey guys, by chance has someone come up with a solution to this? I've just run into this problem. I'm deploying 2 services with the exact same deployment script but it's only the second one that runs into this problem
I'm seeing this as well, only happening to one out of the 5 services I'm currently running in ECS.
I've just hit this issue on one of my services when I activated "Auto configure Cloudwatch logs" on my container definition. Going back to the container definition, I could see that the options were still there, but the auto-configure checkbox was now unticked. Re-ticking it doesn't fix the issue, but I was at least able to disable logging temporarily and get my service back up. Weirdly the logs were still getting through to CloudWatch =/
I'm unable to reproduce the issue. If anyone still has this issue, please send the following information to ecs-agent-external AT amazon.com:
Thanks.
Hi,
Sorry you're facing this issue. Currently, the ECS Agent does not persist the credentials information, for security reasons.
So, when the agent restarts, the credentials information for the tasks is streamed by the ECS service to the agent. It is possible for the message containing the credentials information to get lost in transit. During this time period, if the task's container requests the credentials, the agent will not hold this information and could return an ID not found response. Hence the 400 HTTP error response.
We will make this error message clearer and work on a server side fix to detect this state sooner.
For now, as a workaround, if such an error occurs, we suggest you restart the agent manually and see if that works.
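Since the failure described above can be transient, an application could also tolerate a short window in which the agent has no credentials yet. The following is a rough sketch in Python, not an official workaround: wait_for_credentials is a hypothetical helper, and it only helps in the case where the credentials do eventually reach the agent.

```python
import time


def wait_for_credentials(fetch, attempts=5, initial_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns HTTP 200 or the attempts run out.

    fetch is any callable returning (status_code, body_dict).
    Returns the body on success, or None if every attempt failed.
    """
    delay = initial_delay
    for attempt in range(attempts):
        status, body = fetch()
        if status == 200:
            return body
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2  # exponential backoff between attempts
    return None
```

The sleep parameter is injectable only so the loop can be exercised without real waiting. If the function returns None, the situation is probably not transient and restarting the agent, as suggested above, is the next step.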
We also suggest sending the instance debug logs as mentioned in the above comment by @fenxiong.
Thanks,
Sharanya
Closing. Please re-open if you face this issue. As a prerequisite to reopening, please send instance debug logs as mentioned in the comment above by @fenxiong
Hello,
Lately we started facing a similar issue. As per the logs in CloudWatch, this is the error we're running into. Any help would be appreciated. Thanks
```
botocore.exceptions.CredentialRetrievalError: Error when retrieving credentials from container-role: Error retrieving metadata: Received non 200 response (400) from ECS metadata:
{
  "code": "InvalidIdInRequest",
  "message": "CredentialsV2Request: Credentials not found",
  "HTTPErrorCode": 400
}
```
Hi,
Sorry you're facing this issue. For the current workaround, please refer to @sharanyad's comment above.
We also suggest sending the instance debug logs as mentioned in the above comment by @fenxiong.
Thanks,
Meghna
Hi all,
We have deployed a service side change for handling this. This should be fixed now. I'm closing this issue. Please feel free to re-open/send task info and agent level debug logs if you face this issue again to ecs-agent-external AT amazon.com.
Thanks,
Sharanya
I am running into the same issue. I am using the AWS SDK inside my container to get some sensitive data from Secrets Manager, and in Secrets Manager I am giving read access only to the IAM role of the container.
Because of this error, the container cannot connect to Secrets Manager or even make a simple STS call.
On the EC2 instance, I run aws sts get-caller-identity and get the response back.
Inside the container, I run aws sts get-caller-identity and it fails.
Error inside the container:
Error when retrieving credentials from container-role: Error retrieving metadata: Received non 200 response (400) from ECS metadata: {"code":"InvalidIdInRequest","message":"CredentialsV2Request: Credentials not found","HTTPErrorCode":400}
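To tell this failure apart from the connect timeouts reported earlier in the thread, the error body can be checked for its code field. A small sketch in Python, assuming the body has the shape shown in the message above; the function name is hypothetical. Both message variants seen in this thread ("ID not found" and "Credentials not found") are reported with the same InvalidIdInRequest code.

```python
import json


def is_missing_task_credentials(response_text):
    """Return True if the text is an InvalidIdInRequest error body from the agent."""
    try:
        body = json.loads(response_text)
    except ValueError:
        # Not JSON at all, e.g. an empty body or a proxy error page.
        return False
    return isinstance(body, dict) and body.get("code") == "InvalidIdInRequest"
```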
Same log lines from docker logs ecs-agent of the instance:
level=warn time=2021-01-14T23:09:48Z msg="Unknown eventType: GetCredentialsInvalidRoleType" module=entry_types.go
level=error time=2021-01-14T23:09:48Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
2021-01-14T23:09:48Z 400 172.17.0.2:45020 "/v2/credentials" "" -
I ran curl http://169.254.169.254/latest/meta-data/identity-credentials/ec2/info and it properly returns AccountId. So does curl http://169.254.169.254/latest/meta-data/identity-credentials/ec2/security-credentials/ec2-instance, and I see AccessKeyId and SecretAccessKey values for the temporary token that expires in 6 hours. So, there is a valid token actually created.
When I run aws secretsmanager get-secret-value --secret-id <my-secret-id-name>, it properly returns AccessDeniedException for the "Instance Role", as expected. I can lift that by giving read access to the instance role. So there is nothing wrong between the Instance and secretsmanager in terms of connection.
[...] ecs-optimized) and tried to run a task on the EC2 Instances it created.
[...] ECS_ENABLE_TASK_IAM_ROLE, ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST) and none of the ones I tried helped. In fact, it resulted in connection timeouts between the container and the instance, so I dropped the instance and started fresh again, and still had the same problem. I assume the default ones should work when I go through the wizard and just pick the instance type to be t3.medium.
After a full day of investigation, I found the root cause of my case, and I'm left with a big "how-to" question.
I had 3 containers that I wanted to run together as a single task (3 containers one instance). I had marked all of them as essential. Apparently one of them (container A) was dying on start-up (error on my side), which was causing B and C to also stop. And I was solely looking at C and didn't know why it dies right after starting up with no error from the container or in the UI.
So, what I was doing to further debug was to manually, SSH into the EC2 Instance, and docker start <container-id> of container C to see what's going on. That's why I was getting this error, after the manual start!
My guess is, the IAM roles and some configs/credentials are passed to the container only when the container is started (not just created, also started) by ECS Agent.
That should be why the manual docker start <container-id> of the dead container would leave the container in this weird state: after going into the container via docker exec -u 0 -it <container-id> bash, I would see the above errors when trying to make any call to AWS services (e.g. aws sts get-caller-identity).
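One way to look at what a manually started container was configured with is to read its environment from docker inspect. This is a rough sketch in Python with hypothetical helper names; it assumes the credentials URI reaches the container as the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable. Note that finding the variable only shows which ID the container will ask for, not whether the agent still holds credentials for it.

```python
import json
import subprocess

VARIABLE = "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"


def relative_uri_from_inspect(inspect_output):
    """Extract the credentials relative URI from `docker inspect` JSON output.

    Returns None if the variable is absent from the container's environment.
    """
    containers = json.loads(inspect_output)
    env_entries = containers[0].get("Config", {}).get("Env") or []
    for entry in env_entries:
        name, _, value = entry.partition("=")
        if name == VARIABLE:
            return value
    return None


def inspect_container(container_id):
    """Run `docker inspect` for one container and return its raw JSON output."""
    result = subprocess.run(
        ["docker", "inspect", container_id],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Usage on the instance would be relative_uri_from_inspect(inspect_container("<container-id>")), with the real container ID substituted.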
I understand that starting/restarting docker containers manually is not how things should be handled via ECS. But in case someone needs that (e.g. you are time-crunched to make a minor change to the container and restart it), how would they be able to do this? What's the equivalent of docker start/restart <container-id> in an ecs-agent-controlled instance?
PS. Cross-reference to StackOverFlow: https://stackoverflow.com/a/65743014