Amazon-ecs-agent: Windows Credential Proxy Unavailable after EC2 Stop/Start

Created on 18 Jul 2019 · 12 Comments · Source: aws/amazon-ecs-agent

Summary

I recently tried upgrading from the Windows Server 2016 ECS Optimized AMI to the Server 2019 AMI: Windows_Server-2019-English-Full-ECS_Optimized-2019.05.10

One issue I'm encountering is that the credential proxy on 169.254.170.2:80 is available on first launch of the EC2 instance, but if I stop and start the instance, the credential proxy does not come back up.

I've tried upgrading to agent version 1.29.0 and the problem persists.

Description

I noticed that the ECS Agent itself is still listening on 127.0.0.1:51679. I also verified that the portproxy config is still intact after the stop/start operation and that the IPHelper service is running.

I'm not able to find logs or any more details about why the proxy is not coming up. I do know that this works fine using the Server 2016 AMI.
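
The three checks described above (agent listener, portproxy rule, IP Helper service) can be gathered in one pass. This is a minimal sketch rather than anything from the original report; the addresses and ports are the ones quoted in this issue, and the property names on the output object are made up for illustration.

```powershell
# Diagnostic sketch. Addresses and ports are taken from this issue;
# the output property names are hypothetical.
$proxyAddress = '169.254.170.2'
$proxyPort    = 80
$agentPort    = 51679

# Is the agent's local listener accepting TCP connections?
$agent = Test-NetConnection -ComputerName 127.0.0.1 -Port $agentPort -WarningAction SilentlyContinue

# Is the proxied credential address accepting TCP connections?
$proxy = Test-NetConnection -ComputerName $proxyAddress -Port $proxyPort -WarningAction SilentlyContinue

# Is the portproxy rule still configured, and is IP Helper running?
$rule     = netsh interface portproxy show all | Select-String -SimpleMatch $proxyAddress
$ipHelper = Get-Service -Name iphlpsvc

[pscustomobject]@{
    AgentListening = $agent.TcpTestSucceeded
    ProxyReachable = $proxy.TcpTestSucceeded
    PortProxyRule  = [bool]$rule
    IpHelperStatus = $ipHelper.Status
}
```

In the failure described here, one would expect the agent listener, rule, and service checks to look healthy while the proxy check does not.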

Expected Behavior

The credential proxy works after stop/start of the EC2

Observed Behavior

The credential proxy does not work after stop/start of the EC2

kind/bug · os/windows · scope/ECS AMI · workaround available

Most helpful comment

Hi @somujay,

I hit the same issue running Windows_Server-2019-English-Full-ECS_Optimized (ami-0941852617e614977) in the eu-west-1 (Ireland) region. The full error is below:

[INFO] TaskHandler: batching container event: arn:aws:ecs:eu-west-1:ACCOUNT_ID:task/8a9709a3-98da-479b-a426-69959b10b54d windows -> STOPPED, Reason CannotStartContainerError: Error response from daemon: failed to initialize logging driver: failed to create Cloudwatch log stream: CredentialsEndpointError: failed to load credentials
caused by: RequestError: send request failed
caused by: Get http://169.254.170.2/v2/credentials/12f7f583-f11f-45f9-9a07-e7b237b72a72: dial tcp 169.254.170.2:80: connectex: A socket operation was attempted to an unreachable network., Known Sent: NONE

Please find below the output of some commands run after the instance restart:

> netsh interface portproxy show all

Listen on ipv4:             Connect to ipv4:

Address         Port        Address         Port
--------------- ----------  --------------- ----------
169.254.170.2   80          127.0.0.1       51679

> netsh interface portproxy show all was followed by:
> netstat -an | select-string 169.254.170.2

  TCP    169.254.170.2:80       0.0.0.0:0              LISTENING

> ping 169.254.170.2

Pinging 169.254.170.2 with 32 bytes of data:
PING: transmit failed. General failure.
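
To check the rule from a script instead of reading the table by eye, the portproxy output can be parsed into objects. This is a rough sketch; `ConvertFrom-PortProxyOutput` is a hypothetical helper name, and it only recognizes IPv4 rows in the layout shown above.

```powershell
# Hypothetical helper: turns the rows of `netsh interface portproxy show all`
# into objects. Only rows of the form "<ipv4> <port> <ipv4> <port>" are matched;
# headers, separators, and blank lines are skipped.
function ConvertFrom-PortProxyOutput {
    param([string[]]$Lines)

    $pattern = '^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+(\d+)\s+(\d{1,3}(?:\.\d{1,3}){3})\s+(\d+)\s*$'
    foreach ($line in $Lines) {
        if ($line -match $pattern) {
            [pscustomobject]@{
                ListenAddress  = $Matches[1]
                ListenPort     = [int]$Matches[2]
                ConnectAddress = $Matches[3]
                ConnectPort    = [int]$Matches[4]
            }
        }
    }
}
```

Usage on an instance would look like `ConvertFrom-PortProxyOutput -Lines (netsh interface portproxy show all)`.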

As a workaround, I ran these two commands and the instance started working again:

netsh interface portproxy delete v4tov4 80 169.254.170.2
Initialize-ECSAgent -Cluster Windows -EnableTaskIAMRole -LoggingDrivers '["json-file","awslogs"]'

For a persistent workaround, add the following user data to the instance:

<powershell>
C:\ProgramData\Amazon\EC2-Windows\Launch\Scripts\InitializeInstance.ps1 -Schedule
netsh interface portproxy delete v4tov4 80 169.254.170.2 | out-null
[Environment]::SetEnvironmentVariable('ECS_DISABLE_METRICS', 'false', 'Machine')
[Environment]::SetEnvironmentVariable('ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE', $TRUE, 'Machine')
Initialize-ECSAgent -Cluster Windows -EnableTaskIAMRole -LoggingDrivers '["json-file","awslogs"]'
</powershell>

Using this user data, I was able to restart the instance, continue to use ECS, and start my tasks.

I think it would be good if this fix were included in the Initialize-ECSAgent command.

Please let me know your thoughts.

All 12 comments

We have the same problem. It is also referenced in https://github.com/aws/amazon-ecs-agent/issues/2105
We tried adding the routes for Windows tasks, updating the agent to 1.29.1, and whatever else we could think of, but weren't able to get around this.
Removing the task role seems to fix it, but then we cannot retrieve secret values from Parameter Store.

We'll look into this.

@Crusad For your issue -- does this only happen after you restart EC2 instances?

The first manual agent startup (Initialize-ECSAgent) looks OK, but the task never starts anyway. If I run the command again, there is an error about 169.254.170.2:80 not listening, but this might be expected, since it's the second time the command has run?
Nevertheless, the task never starts and fails with:
Status reason: CannotStartContainerError: Error response from daemon: failed to initialize logging driver: CredentialsEndpointError: failed to load credentials caused by: RequestError: send request failed caused by: Get http://169.254.170.2/v2/credentials/3f4f1371-a42f-

I suspect the logging driver error is just a consequence and the real problem is in the credentials endpoint.
We are also setting these two environment variables. The execution role override is necessary; otherwise the task fails with an error that the instance is missing a required attribute.
[Environment]::SetEnvironmentVariable("ECS_TASK_METADATA_RPS_LIMIT", "100,150", "Machine")
[Environment]::SetEnvironmentVariable("ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE", "true", "Machine")

Hi @petderek, any updates on this issue?

I'm happy to provide additional info or help in any way I can.

Sorry for the delayed response. I investigated this issue, and it looks like a Windows bug (yet to be confirmed with Microsoft): port forwarding is not working. To work around the issue, please restart the Windows service "IP Helper".
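
A minimal sketch of that restart, followed by a reachability check so the result is visible right away. The address and port are the ones quoted in this issue; the output strings are illustrative only, and other reports in this thread say the restart alone was not enough on the Server 2019 AMI.

```powershell
# Restart the IP Helper service (service name: iphlpsvc).
# -Force also restarts any services that depend on it.
Restart-Service -Name iphlpsvc -Force

# Check whether the credential endpoint accepts TCP connections afterwards.
$result = Test-NetConnection -ComputerName 169.254.170.2 -Port 80 -WarningAction SilentlyContinue

if ($result.TcpTestSucceeded) {
    'Credential proxy reachable on 169.254.170.2:80'
} else {
    'Credential proxy still unreachable on 169.254.170.2:80'
}
```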

Hi @somujay, thanks for the response.

I've tried restarting the "IP Helper" service, but it did not fix this issue for me on the Server 2019 AMI.

That workaround used to be adequate back on the Server 2016 AMI where I actually avoided this problem by making the IP Helper start after the ECS Agent service. As long as the ECS Agent started first, or the IP Helper was restarted following the ECS Agent starting, the proxy worked fine. But that workaround doesn't work anymore on the Server 2019 AMI.

I've opened a ticket with Microsoft for this issue. They have acknowledged it and are working on a fix. However, it is solved by restarting the IP Helper service (iphlpsvc), so I'm surprised to hear that it's not working for you. What's the best time to reach you?

This workaround only works when the instance starts with a clean vswitch interface.
In some situations, such as stopping the instance for a while and then starting it, the old vswitch interface might not be destroyed, and the new interface ends up with a different name.
I hope AWS can find a real solution soon. I might try creating the tunnel on my real network interface and using the firewall to block it from other services. I might also try using static credentials instead.
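
One way to see whether a stale or renamed interface is involved is to list which interface, if any, currently holds the proxy address, alongside the virtual adapters present on the instance. This is a sketch under assumptions: it presumes the virtual switch adapters carry the usual "Hyper-V Virtual Ethernet" interface description, which may not hold on every AMI.

```powershell
# Which interface, if any, currently holds the credential proxy address?
Get-NetIPAddress -IPAddress 169.254.170.2 -ErrorAction SilentlyContinue |
    Select-Object InterfaceAlias, InterfaceIndex, AddressState

# List virtual adapters, including hidden ones, to spot a leftover interface.
# Assumption: the adapters of interest use the "Hyper-V Virtual Ethernet" description.
Get-NetAdapter -IncludeHidden |
    Where-Object { $_.InterfaceDescription -like 'Hyper-V Virtual Ethernet*' } |
    Select-Object Name, InterfaceDescription, Status, ifIndex
```

If the first command returns nothing, or returns an alias that does not match any adapter that is up, that would be consistent with the renamed-interface situation described above.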

We've fixed the issue. The March Windows AMI will include the fixes for the credential issue.

Hello,

I had this problem with Windows_Server-2019-English-Full-ECS_Optimized-2020.01.15, and I can confirm it works with the latest Windows_Server-2019-English-Full-ECS_Optimized-2020.03.18 (ami-01df996ea00078d0c).

Thanks!

Thank you for the update. Resolving, as the issue is fixed in the latest Windows AMI.
