When a container exits unexpectedly, what went wrong should be communicated more clearly.
This issue points out specific cases where that could be improved.
1) OOMKilled - If a container is killed due to memory constraints, this should be communicated.
2) "no such image" should instead give the error from pulling (e.g. registry auth).
3) Anything docker 1.5 puts in the .State.Error field (executable file not found in $PATH, etc) should be bubbled up.
Additional suggestions are welcome.
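To make cases 1 and 3 concrete, here is a minimal sketch of how a stop reason might be derived from the `State` object that `docker inspect` prints. The function name `explain_exit` and the sample data are hypothetical, and this is not how the agent itself does it. Case 2 (the pull error) is not covered, since that error is not part of the container state.

```python
import json


def explain_exit(inspect_entry):
    """Return a short reason string for one element of `docker inspect` output.

    Hypothetical helper for illustration only; not taken from the agent.
    """
    state = inspect_entry.get("State", {})
    # Case 1: the kernel killed the container for exceeding its memory limit.
    if state.get("OOMKilled"):
        return "OOMKilled: container exceeded its memory limit"
    # Case 3: Docker recorded an error (e.g. executable not found in $PATH).
    if state.get("Error"):
        return "Docker error: " + state["Error"]
    # Otherwise fall back to the plain exit code.
    return "Exited with code {}".format(state.get("ExitCode"))


# Hand-written sample shaped like `docker inspect` output, not real output.
sample = json.loads(
    '[{"State": {"Running": false, "OOMKilled": true, "Error": "", "ExitCode": 137}}]'
)
print(explain_exit(sample[0]))
```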
Sometimes when I run RunTask the response will have 'failures' with reason 'AGENT'. What does this reason mean? (I assume this issue is a catch-all reason that will be fixed in this issue?)
That's actually separate from this issue. This refers to the 'reason' field that can be populated when a task moves to stop. This will show up in 'describe-tasks', not run/start task.
The run/start task 'failure' response means that the backend will not pass the request onto an agent for some reason. The reason 'AGENT' means that the container instance that it tried to place on does not have a healthy agent running on it. It also references the container instance arn in the failure output.
You can verify this by checking the 'agentConnected' field in the output of 'describe-container-instances'.
These 'ghost' container instances could exist for a few reasons:
ECS_CHECKPOINT=true
ECS_DATADIR between runs of the agent (e.g. stopping the agent on boot and starting it again with different configuration)
Looking at the EC2 instance id in the describe output and figuring out why it's either not running or ran twice with two different container instance arns should help.
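As a rough sketch of that check, the following takes the parsed JSON output of 'describe-container-instances' and reports container instances whose agent is not connected, plus any EC2 instance id that appears under more than one container instance arn. The function name `find_suspect_instances` and the sample arns and ids are made up for illustration.

```python
from collections import defaultdict


def find_suspect_instances(describe_output):
    """Return (disconnected_arns, duplicated) from describe-container-instances output.

    `duplicated` maps an EC2 instance id to the list of container instance
    arns registered for it, for ids that appear more than once.
    Hypothetical helper for illustration only.
    """
    instances = describe_output.get("containerInstances", [])

    # Container instances without a connected agent cannot accept tasks.
    disconnected = [
        ci["containerInstanceArn"]
        for ci in instances
        if not ci.get("agentConnected", False)
    ]

    # Group arns by EC2 instance id to spot an instance that registered twice.
    by_ec2 = defaultdict(list)
    for ci in instances:
        by_ec2[ci.get("ec2InstanceId")].append(ci["containerInstanceArn"])
    duplicated = {ec2: arns for ec2, arns in by_ec2.items() if len(arns) > 1}

    return disconnected, duplicated


# Hand-written sample data; the arns and instance ids are placeholders.
sample = {
    "containerInstances": [
        {"containerInstanceArn": "arn:example:ci/aaa", "ec2InstanceId": "i-0001", "agentConnected": False},
        {"containerInstanceArn": "arn:example:ci/bbb", "ec2InstanceId": "i-0001", "agentConnected": True},
        {"containerInstanceArn": "arn:example:ci/ccc", "ec2InstanceId": "i-0002", "agentConnected": True},
    ]
}
disconnected, duplicated = find_suspect_instances(sample)
print("agent not connected:", disconnected)
print("registered more than once:", duplicated)
```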
Good luck,
Euan
Edit:
I agree that output isn't very clear and we should improve it or document it better.
Thanks for the detail! I have some digging to do...
@euank It's not clear to me what could be wrong, as the 'reason: AGENT' problem seems sporadic. I have an ECS Cluster with 1 ECS Instance. The ECS instance is running what I believe is the latest AMI (amzn-ami-2015.03.a-amazon-ecs-optimized (ami-ecd5e884)). I don't think this is necessarily a 'ghost' container because if I retry RunTask a couple of times it will work. I haven't done anything custom with the agent or the container instance... Just running everything default from the AMI. So there must be something wrong with the agent itself. This problem seems to happen fairly regularly. Let me know if this issue should be moved to a separate thread.
By 'ghosts', I meant container instances which appear in list-container-instances but do not have a corresponding agent. If you only have a single (or small number of) container instances per that API, then my response above was off the mark.
One possibility that occurs to me now is that the agent does a disconnect/reconnect every once in a while and it's being seen as an invalid placement target at that time.
If you want to see if this is what caused it, you can find the times that this disconnect/reconnect occurred with the following command: grep "Creating poll dialer" /var/log/ecs/ecs-agent.log | awk '{print $1}'.
You could then try to correlate the times that run-task failed and see if they were immediately before that message was printed each time.
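A minimal sketch of that correlation, assuming the timestamps printed by the command look like `2015-04-01T10:00:30Z` and that you recorded the run-task failure times yourself in the same UTC format. The function name, the 60-second window, and the sample times are all arbitrary choices for illustration.

```python
from datetime import datetime, timedelta

# Assumed timestamp format of the first field in ecs-agent.log.
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def failures_before_reconnect(failure_times, reconnect_times, window_seconds=60):
    """Return the failure times that fall within `window_seconds` before a reconnect."""
    window = timedelta(seconds=window_seconds)
    failures = [datetime.strptime(t, TS_FORMAT) for t in failure_times]
    reconnects = [datetime.strptime(t, TS_FORMAT) for t in reconnect_times]

    matched = []
    for failed_at in failures:
        # A match means some reconnect happened at or shortly after the failure.
        if any(timedelta(0) <= r - failed_at <= window for r in reconnects):
            matched.append(failed_at.strftime(TS_FORMAT))
    return matched


# Made-up sample times, not taken from a real log.
reconnects = ["2015-04-01T10:00:30Z", "2015-04-01T12:15:00Z"]
failures = ["2015-04-01T10:00:00Z", "2015-04-01T11:00:00Z"]
print(failures_before_reconnect(failures, reconnects))
```

If most of the failures show up in the result, the disconnect/reconnect explanation becomes more plausible; if none do, it is probably something else.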
If you want to discuss this issue further, please do create a new one.
Best,
Euan
The errors, as reported in the reason field, are much clearer in the v1.1.0 agent (see https://github.com/aws/amazon-ecs-agent/commit/7c02e04606ad44a951115d757da134d6d94885cf).
There's still room for improvement, but we can track those improvements separately as needed.
What does the failure reason "ATTRIBUTE" mean? That certainly doesn't qualify as a helpful error message! Issue 535 is related to fluentd but there's nothing in this message or that message that clearly indicated it's fluentd or any other thing!
@aboarya A failure reason of ATTRIBUTE means that one or more requiresAttributes of the task definition failed to match the attributes of the container instance.
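A rough sketch of how you could find the mismatch yourself, given the parsed `taskDefinition` object from 'describe-task-definition' and one entry of `containerInstances` from 'describe-container-instances'. The function name is hypothetical, the sample data is hand-written, and comparing by attribute name only is a simplifying assumption; the actual placement logic may also consider values.

```python
def missing_attributes(task_definition, container_instance):
    """Return required attribute names that the container instance does not advertise.

    Hypothetical helper for illustration; compares attribute names only.
    """
    available = {a["name"] for a in container_instance.get("attributes", [])}
    required = [a["name"] for a in task_definition.get("requiresAttributes", [])]
    return [name for name in required if name not in available]


# Hand-written sample: the task needs the fluentd log driver capability,
# but the instance only advertises json-file.
task_definition = {
    "requiresAttributes": [
        {"name": "com.amazonaws.ecs.capability.logging-driver.fluentd"},
    ]
}
container_instance = {
    "attributes": [
        {"name": "com.amazonaws.ecs.capability.logging-driver.json-file"},
    ]
}
print(missing_attributes(task_definition, container_instance))
```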
@JonCubed is it for sure that ATTRIBUTE is resolved through requiresAttributes that are not in the task definition?