After upgrading to 1.17, we constantly see websocket errors in the logs.
Docker: 17.09.0-ce
Operating System: Ubuntu 16.04.3 LTS
Amazon ECS Agent: v1.17.0 (761937f7)
2018-02-15T11:01:25Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
2018-02-15T11:01:25Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2018-02-15T11:03:28Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
2018-02-15T11:03:28Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2018-02-15T11:05:24Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
2018-02-15T11:05:24Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
They recur roughly every 2 minutes.
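For anyone who wants to confirm the interval on their own instance, here is a minimal Python sketch that measures the gap between consecutive [ERROR] lines. It assumes the timestamp format shown in the log excerpt above; the `error_intervals` helper is a hypothetical name, not part of the agent, and the sample lines are shortened copies of the ones in this report.

```python
from datetime import datetime

# Shortened copies of the [ERROR] and [INFO] lines from the excerpt above.
LOG_LINES = [
    "2018-02-15T11:01:25Z [ERROR] Error getting message from ws backend",
    "2018-02-15T11:01:25Z [INFO] Error from tcs; backing off",
    "2018-02-15T11:03:28Z [ERROR] Error getting message from ws backend",
    "2018-02-15T11:03:28Z [INFO] Error from tcs; backing off",
    "2018-02-15T11:05:24Z [ERROR] Error getting message from ws backend",
    "2018-02-15T11:05:24Z [INFO] Error from tcs; backing off",
]


def error_intervals(lines):
    """Return the seconds elapsed between consecutive [ERROR] lines."""
    stamps = [
        # The timestamp is the first space-separated field of each line.
        datetime.strptime(line.split(" ", 1)[0], "%Y-%m-%dT%H:%M:%SZ")
        for line in lines
        if "[ERROR]" in line
    ]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


print(error_intervals(LOG_LINES))  # [123, 116]
```

On the lines above this gives gaps of 123 and 116 seconds, which matches the "every 2 minutes" observation.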
@roman-vynar, the websocket 1002 error code indicates that an endpoint is terminating the connection due to a protocol error. I'm not able to reproduce this error on my end and some more logging data would be very helpful for debugging this behavior. Would you be able to send me logs from the problematic instance? If this is easily reproducible for you, I'd also suggest running the agent with ECS_LOGLEVEL=debug.
You can capture logs using our log collection tool and email them to me directly at adnkha at amazon dot com. If you end up sending the logs to my email, please update this issue so I don't miss them. Thanks!
Also fwiw, since this appears to be the tcs connection it shouldn't be affecting the task life cycles and scheduling. The only expected side effect should be erroneous metrics.
Also - Are you using a proxy with this instance?
Hi @adnxn Thanks for looking into this! I have emailed you more details.
We don't have a proxy in front of ecs-agent.
The cause was ECS_DISABLE_METRICS=true.
Removing this option made the errors disappear.
Sorry to reopen this, but I wanted to follow up.
Since we don't use CloudWatch metrics, we keep ECS_DISABLE_METRICS=true.
Another reason for ECS_DISABLE_METRICS=true is a potential load spike per https://github.com/aws/amazon-ecs-agent/issues/588
We are still getting those errors on v1.17.3:
May 8 12:48:50 ecs-10-0-7-184.*** ecs-agent ERROR Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
May 8 12:49:06 ecs-10-0-7-69.*** ecs-agent ERROR Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
May 8 12:49:17 ecs-10-0-7-140.*** ecs-agent ERROR Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
May 8 12:49:28 ecs-10-0-7-220.*** ecs-agent ERROR Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
May 8 12:49:51 ecs-10-0-7-230.*** ecs-agent ERROR Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
Is there any relation between those errors and the fact that CloudWatch metrics are disabled?
Thanks.
@roman-vynar
> Is there any relation between those errors and the fact that CloudWatch metrics are disabled?
Yes, those errors occur because there is no activity on the connection. This connection is used to publish resource usage metrics and container health metrics. So if ECS_DISABLE_METRICS=true is set and no containers use the container health check feature, nothing is sent and the connection is closed periodically for being idle.
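To illustrate why the errors show up at a steady interval, here is a toy Python model of an idle timeout. It is only a sketch under stated assumptions: the 120 second timeout is inferred from the roughly 2 minute spacing in the logs, not a documented backend value; the client is assumed to reconnect immediately after each close; and `idle_closes` is a hypothetical helper, not agent code.

```python
def idle_closes(send_times, idle_timeout, duration):
    """Return the times (in seconds) at which an idle channel gets closed.

    send_times:   sorted times at which the client sends a message
    idle_timeout: seconds of silence after which the server closes the channel
    duration:     total number of seconds to simulate
    """
    closes = []
    last_activity = 0
    pending = list(send_times)
    while True:
        deadline = last_activity + idle_timeout
        if deadline > duration:
            break
        if pending and pending[0] < deadline:
            # A message arrived before the deadline, so the idle timer resets.
            last_activity = pending.pop(0)
        else:
            # Nothing was sent in time: the server closes the channel and the
            # client reconnects, which starts a new idle window.
            closes.append(deadline)
            last_activity = deadline
    return closes


# Metrics disabled and no health checks: nothing is ever sent.
print(idle_closes([], idle_timeout=120, duration=360))  # [120, 240, 360]

# Metrics published every 60 seconds: the channel never goes idle.
print(idle_closes(list(range(0, 360, 60)), idle_timeout=120, duration=360))  # []
```

With no traffic the model closes the channel once per timeout window, which is the pattern seen in the logs; with regular traffic it never closes.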
Thanks @richardpen !
Is it possible to suppress such errors since it is obvious there is no activity when metrics are disabled?
Unfortunately, there is currently no way to suppress this error. I'll mark this as a bug, and we will make a change so that both the resource usage metrics and the container health metrics can be disabled. We will update this issue as we make progress.
thanks,
Peng
Any update on this? I see it's marked "more info needed", do you need more information?
Can we remove "more info needed" from this issue?
What's more info needed? :)
We are looking into this issue.
We have introduced the flag ECS_DISABLE_DOCKER_HEALTH_CHECK in Agent version 1.25.0. ECS_DISABLE_DOCKER_HEALTH_CHECK disables Docker container health checks.
After updating to 1.25.0, setting that flag to true along with setting ECS_DISABLE_METRICS to true will suppress the websocket errors in the logs.
Closing the issue. Please re-open it if you still see the errors with the flag set, or if you have any more questions.
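As a quick sanity check for the setup described above, here is a small Python sketch that reads KEY=VALUE agent settings and reports whether both flags are set on an agent at or above 1.25.0. The helper names `parse_env` and `idle_errors_suppressed` are hypothetical, the sample config text is illustrative, and comparing against the literal string "true" is a simplification that may not match how the agent itself parses booleans.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def idle_errors_suppressed(env, agent_version):
    """True when both flags are 'true' and the agent is at least 1.25.0."""
    version = tuple(int(part) for part in agent_version.split("."))
    flags = ("ECS_DISABLE_METRICS", "ECS_DISABLE_DOCKER_HEALTH_CHECK")
    both_set = all(env.get(flag) == "true" for flag in flags)
    return version >= (1, 25, 0) and both_set


# Illustrative config text, not taken from a real instance.
SAMPLE_CONFIG = """
# agent settings
ECS_DISABLE_METRICS=true
ECS_DISABLE_DOCKER_HEALTH_CHECK=true
"""

env = parse_env(SAMPLE_CONFIG)
print(idle_errors_suppressed(env, "1.25.0"))  # True
print(idle_errors_suppressed(env, "1.17.3"))  # False
```

The version check is there because, per the comment above, the second flag only exists from 1.25.0 onward, so setting it on an older agent would not be expected to help.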