Amazon-ecs-agent: Websocket errors starting 1.17

Created on 15 Feb 2018 · 12 Comments · Source: aws/amazon-ecs-agent

Summary

After we upgraded to 1.17, we observe websocket errors constantly in the logs.

Environment Details

Docker: 17.09.0-ce
Operating System: Ubuntu 16.04.3 LTS
Amazon ECS Agent: v1.17.0 (761937f7)

Supporting Log Snippets

2018-02-15T11:01:25Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1] 
2018-02-15T11:01:25Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2018-02-15T11:03:28Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1] 
2018-02-15T11:03:28Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2018-02-15T11:05:24Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1] 
2018-02-15T11:05:24Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel

Every 2 minutes.
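The roughly two-minute cadence can be checked directly from the timestamps in the snippet above. This is a minimal sketch: the three timestamps and the error text are copied from the log lines, while the function name `idle_close_intervals` and the variable names are made up for illustration.

```python
from datetime import datetime

# Error text shared by the three [ERROR] lines in the snippet above.
SUFFIX = (
    " [ERROR] Error getting message from ws backend: error: "
    "[websocket: close 1002 (protocol error): Channel long idle: "
    "No message is received, close the channel], messageType: [-1]"
)

# Timestamps copied from the report.
LOG_LINES = [
    ts + SUFFIX
    for ts in ("2018-02-15T11:01:25Z", "2018-02-15T11:03:28Z", "2018-02-15T11:05:24Z")
]


def idle_close_intervals(lines):
    """Return the gaps, in seconds, between consecutive 'Channel long idle' errors."""
    stamps = [
        datetime.strptime(line.split(" ", 1)[0], "%Y-%m-%dT%H:%M:%SZ")
        for line in lines
        if "[ERROR]" in line and "Channel long idle" in line
    ]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(stamps, stamps[1:])]


print(idle_close_intervals(LOG_LINES))  # [123, 116]
```

Both gaps are close to 120 seconds, which matches the "every 2 minutes" observation.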

kind/bug

Source

roman-vynar

All 12 comments

@roman-vynar, the websocket 1002 close code indicates that an endpoint is terminating the connection due to a protocol error. I'm not able to reproduce this error on my end, so some more logging data would be very helpful for debugging this behavior. Would you be able to send me logs from the problematic instance? If this is easily reproducible for you, I'd also suggest running the agent with ECS_LOGLEVEL=debug.

You can capture logs using our log collection tool and email them to me directly at adnkha at amazon dot com. If you end up sending the logs to my email, please update this issue so I don't miss them. Thanks!

Also, FWIW, since this appears to be the tcs connection, it shouldn't affect task life cycles or scheduling. The only expected side effect should be erroneous metrics.

Also, are you using a proxy with this instance?

adnxn on 16 Feb 2018

Hi @adnxn Thanks for looking into this! I have emailed you more details.

We don't have a proxy in front of ecs-agent.

roman-vynar on 19 Feb 2018

The cause was ECS_DISABLE_METRICS=true.
I removed this option, and the errors disappeared.

roman-vynar on 19 Feb 2018

Sorry, I decided to follow up on this.

Since we don't use CloudWatch metrics, we set ECS_DISABLE_METRICS=true.
Another reason for ECS_DISABLE_METRICS=true is a potential load spike, per https://github.com/aws/amazon-ecs-agent/issues/588

We are still getting those errors on v1.17.3:

May 8 12:48:50 ecs-10-0-7-184.*** ecs-agent ERROR Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
May 8 12:49:06 ecs-10-0-7-69.*** ecs-agent ERROR Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
May 8 12:49:17 ecs-10-0-7-140.*** ecs-agent ERROR Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
May 8 12:49:28 ecs-10-0-7-220.*** ecs-agent ERROR Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
May 8 12:49:51 ecs-10-0-7-230.*** ecs-agent ERROR Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]

Is there any relation between those errors and the fact that CloudWatch metrics are disabled?

Thanks.

roman-vynar on 8 May 2018

@roman-vynar

Is there any relation between those errors and the fact that CloudWatch metrics are disabled?

Yes, those errors occur because there is no activity on the connection. This connection is used to publish resource usage metrics and container health metrics. So, if ECS_DISABLE_METRICS=true is set and no containers are using the container health check feature, nothing is sent and the connection is closed periodically.

richardpen on 9 May 2018

Thanks @richardpen !
Is it possible to suppress such errors since it is obvious there is no activity when metrics are disabled?

roman-vynar on 9 May 2018

Unfortunately, there is currently no way to suppress this error. I'll mark this as a bug, and we will make a change so that both the resource usage metrics and the container health metrics can be disabled. We will update this issue when we make progress.

thanks,
Peng

richardpen on 9 May 2018

Any update on this? I see it's marked "more info needed", do you need more information?

rhuddleston on 1 Nov 2018

Can we remove "more info needed" from this issue?

rhuddleston on 12 Jan 2019

What's more info needed? :)

roman-vynar on 14 Jan 2019

We are looking into this issue.

shubham2892 on 15 Jan 2019

We have introduced the flag ECS_DISABLE_DOCKER_HEALTH_CHECK in Agent version 1.25.0. ECS_DISABLE_DOCKER_HEALTH_CHECK disables the Docker container health check.
After updating to 1.25.0, setting that flag to true along with ECS_DISABLE_METRICS=true will suppress the websocket errors in the logs.
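To check an instance against this advice, the agent's environment-style configuration can be scanned for both flags. This is a rough sketch only: the two flag names come from this thread, but the sample file contents, the helper names `parse_env_file` and `idle_errors_suppressed`, and the assumption that the settings live in a KEY=VALUE file are illustrative. It does not verify the agent version, which must be 1.25.0 or later for the second flag to have any effect.

```python
# Flags named in this thread; both must be true to silence the idle-channel errors.
REQUIRED_FLAGS = ("ECS_DISABLE_METRICS", "ECS_DISABLE_DOCKER_HEALTH_CHECK")


def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def idle_errors_suppressed(settings):
    """Return True only if every required flag is set to 'true'."""
    return all(settings.get(flag, "").lower() == "true" for flag in REQUIRED_FLAGS)


# Hypothetical configuration contents, for illustration only.
SAMPLE_CONFIG = """
# metrics are not used on this cluster
ECS_DISABLE_METRICS=true
ECS_DISABLE_DOCKER_HEALTH_CHECK=true
"""

print(idle_errors_suppressed(parse_env_file(SAMPLE_CONFIG)))  # True
```

With only ECS_DISABLE_METRICS=true set, the function returns False, which corresponds to the situation reported earlier in this thread.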

Closing the issue, please re-open if you see the issue with the flag set or have any more questions.

shubham2892 on 24 Jan 2019
