EDIT: as @ronkorving mentioned, image caching is available for EC2 backed ECS. I've updated this request to be specifically for Fargate.
**What do you want us to build?**
I've deployed scheduled Fargate tasks and been clobbered with high data transfer fees pulling the image down from ECR. Additionally, configuring a VPC endpoint for ECR is not for the faint of heart. The doc is horrific.
It would be much more pleasant if there were a resource (hidden is fine) local to the instance where my containers run that could be used to load my Docker images.
**Which service(s) is this request for?**
Fargate and ECR.
**Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?**
I shouldn't be charged for pulling a Docker image every time my scheduled Fargate task runs. It leaves a bad taste in my mouth. :)
In all honesty, I feel like I'm being ripped off. I love Fargate, but this is unpleasant. On that note, the VPC endpoint doc should be better too. These are the kinds of usability issues that destroy the whole notion of "serverless" (Fargate being a serverless container orchestrator). I really don't want to have to deal with these kinds of details.
**Are you currently working around this issue?**
This was for a personal project, I instead just deployed an EC2 instance running a cron job, which is not my preference. I would prefer being able to use Docker and the ECS/Fargate ecosystem.
@matthewcummings can you clarify which doc you're talking about ("The doc is horrific")? Can you also clarify which regions your Fargate tasks and your ECR images are in?
@jtoberon The second question is not a question I want to be asked! :)
Can't we have these kinds of things in _every_ region? I generally use us-east-1 and us-west-2 these days.
OK... I'm not going to check the doc all the way back in time; it seems better now: https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html
However... this is a good example of a leaky abstraction. Should I really need to know and think about S3 in this case? I'd argue that I shouldn't. Nowhere else in the ECS/EKS/ECR ecosystem do we really see mention of S3.
tl;dr when I last tried to do this, the doc was bad; I distinctly remember it not being clear that S3 configuration was needed. I haven't tried this again, and I'd be happy to give it a shot sometime soon. But... it would be great if the S3 part could be "abstracted away".
Regarding regions, I'm really asking whether you're doing cross-region pulls.
You're right: this is a leaky abstraction. The client (e.g. docker) doesn't care, but from a networking perspective you need to poke a hole to S3 right now.
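For anyone wiring this up in the meantime, here's a minimal sketch (boto3) of the three endpoints a task in a private subnet needs in order to pull from ECR. All IDs and the region are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

common = dict(
    VpcId="vpc-0123456789abcdef0",               # placeholder
    VpcEndpointType="Interface",
    SubnetIds=["subnet-0123456789abcdef0"],      # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],   # must allow 443 from the tasks
    PrivateDnsEnabled=True,
)

# ECR API calls (GetAuthorizationToken etc.) and the Docker registry endpoint.
ec2.create_vpc_endpoint(ServiceName="com.amazonaws.us-east-1.ecr.api", **common)
ec2.create_vpc_endpoint(ServiceName="com.amazonaws.us-east-1.ecr.dkr", **common)

# The leaky part: image layers are actually served from S3, so a gateway
# endpoint on the private subnets' route table is required as well.
ec2.create_vpc_endpoint(
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    RouteTableIds=["rtb-0123456789abcdef0"],     # placeholder
)
```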
Regarding making all of this easier, we plan to build cross-region replication, and we plan to simplify the registry URL so that you don't have to think as much about which region you're pulling from. https://github.com/aws/containers-roadmap/issues/140 has more details and some discussion.
Ha ha, thanks. Excuse my snarkiness... I am not doing cross-region pulls right now but that is something I may need to do.
Thank you!
Look at that, I know the guy who posted #140. Small world.
@jtoberon your call on whether this should be a separate request or folded into the other one.
Wait, aren't you really asking for `ECS_IMAGE_PULL_BEHAVIOR` control?
This was added (it seems) to ECS EC2 in 2018:
https://aws.amazon.com/about-aws/whats-new/2018/05/amazon-ecs-adds-options-to-speed-up-container-launch-times/
Agent config docs.
I get the impression Fargate does not give control over that, and does not have it set to `prefer-cached` or `once`. This is what we really need, isn't it?
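For reference, on EC2-backed ECS this is just an agent flag in `/etc/ecs/ecs.config`. A minimal sketch of baking it into a launch template via user data (boto3; the template name and AMI ID are placeholders):

```python
import base64

import boto3

# Written to /etc/ecs/ecs.config on boot; prefer-cached skips the pull
# when the image is already present on the instance.
user_data = """#!/bin/bash
echo ECS_CLUSTER=my-cluster >> /etc/ecs/ecs.config
echo ECS_IMAGE_PULL_BEHAVIOR=prefer-cached >> /etc/ecs/ecs.config
"""

ec2 = boto3.client("ec2")
ec2.create_launch_template(
    LaunchTemplateName="ecs-prefer-cached",  # placeholder name
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",  # placeholder: an ECS-optimized AMI
        "InstanceType": "t3.medium",
        "UserData": base64.b64encode(user_data.encode()).decode(),
    },
)
```

Nothing equivalent is exposed for Fargate, which is the point of this issue.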
@ronkorving yes, that's exactly what I've requested. I wasn't aware of the ECS/EC2 feature... thanks for pointing me to that. However, a Fargate option would be great. I'm going to update the request.
This caching option for Fargate is indeed much needed.
I would like to upvote this feature too.
I'm using Fargate at work and our images are ~1 GB, so it takes a very long time to start the task because it needs to re-download the image from ECR every time. If there were some way to cache the image, just like it's possible with ECS on EC2, it would be extremely beneficial.
How's this evolving?
There are many use cases where what you need is just a Lambda with unrestricted access to a kernel / filesystem. Having Fargate with cached / hot images perfectly fits this use case.
@jtoberon @samuelkarp I realize that this is a more involved feature to build than it was on ECS with EC2 since the instances are changing underneath across AWS accounts, but are you able to provide any timeline on if and when this image caching would be available in Fargate? Lambda eventually fixed this same cold start issue with the short-term cache. This request is for the direct analog in Fargate.
Our use case: we run containers on-demand when our customers initiate an action and connect them to the container that we spin up. So, it's a real-time use case. Right now, we run these containers on ECS with EC2 and the launch times are perfectly acceptable (~1-3 seconds) because we cache the image on the EC2 box with `PULL_BEHAVIOR`.
We'd really like to move to Fargate but our testing shows our Fargate containers spend ~70 seconds in the `PENDING` state before moving to the `RUNNING` state. ECR reports our container at just under 900 MB. Both ECR and the ECS cluster are in the same region, us-east-1.
We have to make some investments in the area soon so I am trying to get a sense for how much we should invest into optimizing our current EC2-based setup because we absolutely want to move to Fargate as soon as this cold start issue is resolved. As always, thank you for your communication.
I wish Fargate could have some sort of caching. Due to a missing environment variable, my task kept failing all weekend, and every restart meant a new image was downloaded from Docker Hub. In the end I faced horrible data transfer usage, since Fargate had been deployed within a private VPC.
Of course there are endpoints (Fargate requires both ECR and S3, as I understand), but some sort of caching would still be a much cheaper and more predictable option.
@Brother-Andy For this use-case, I built cdk-ecr-sync which syncs specific images from DockerHub to ECR. Doesn't solve the caching part but might reduce your bill.
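The underlying idea is simple enough to sketch by hand if you don't want CDK. This is not what cdk-ecr-sync does internally, just the general pattern (image and repo names are placeholders, and the ECR repo must already exist):

```python
import base64
import subprocess

import boto3

ecr = boto3.client("ecr", region_name="us-east-1")

# ECR auth tokens are base64-encoded "AWS:<password>".
auth = ecr.get_authorization_token()["authorizationData"][0]
user, password = base64.b64decode(auth["authorizationToken"]).decode().split(":")
registry = auth["proxyEndpoint"][len("https://"):]

subprocess.run(["docker", "login", "-u", user, "-p", password, registry], check=True)

# Pull once from Docker Hub, then push into ECR so Fargate restarts
# never touch Docker Hub (and never hit its rate limits).
subprocess.run(["docker", "pull", "redis:6"], check=True)
subprocess.run(["docker", "tag", "redis:6", f"{registry}/mirror/redis:6"], check=True)
subprocess.run(["docker", "push", f"{registry}/mirror/redis:6"], check=True)
```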
Ditto on the feature. We use containers to spin up cyber ranges for students. Usage can fluctuate from 0 to thousands; Fargate is the best solution for ease of management, but the launch time is a challenge even with ECR. Caching is a much-needed feature.
+1
+1
Same here, I need to run multiple Fargate tasks cross-region and it takes around a minute to pull the image. Once pulled, the task only takes 4 seconds to run. This completely stops us from using Fargate.
We had the same problem; the Fargate task should take only 10 seconds to run, but it takes about a minute to pull the image :(
Would it be possible to use an EFS file system to store the image and have the task just run it from there? Or is that the same problem, only pulling from EFS to the host storing the container?
Azure is solving this problem in their platform:
https://stevelasker.blog/2019/10/29/azure-container-registry-teleportation/
+1, we run a very large number of tasks with a 1 GB image. This would significantly speed up our deploys and would be a super helpful feature. We're considering moving to EC2 due to Fargate deployment slowness, and this is one of the factors.
Currently using the GitLab Runner Fargate driver, which is great, except for the spin-up time of ~1-2 minutes for our image (>1 GB) because it has to pull it from ECR for every job. Not super great.
Would really like to see some sort of image caching.
I have 1 GB containers with no way of reducing their size.
It takes a very long time to start up on Fargate.
We really need caching features
+1 on this, we really need this feature.
The amount of time wasted by this not being a thing is no doubt staggering, and it continues to grow as AWS does not address this.
AWS, we could really do with some communication here. I thought that was the point of this repo.
This is one of those weird cases where we are paying for poor performance: bandwidth usage plus a 3-minute image pull on every restart/deploy.
We have work in progress on image pull performance, in particular for images stored in ECR. In the meantime, our metrics and performance testing are showing more consistent image pull performance with platform version 1.4 compared to platform version 1.3, especially looking at p90 and above.
When it comes to image caching specifically, could you expand a little bit on what you would like to see? How would you like to control which images should be cached, for example?
@mlanner-aws
Personally, I just want to see quick boot-up times in Fargate (which are currently overshadowed by image pull time). I don't have a strong desire to control the details, though that may be different for other people on this thread. I just want it to be fast by default.
TL;DR Ability to set `ECS_IMAGE_PULL_BEHAVIOR` to `prefer-cached` in Fargate. Right now it's effectively `always` by design, a limitation of Fargate that we want a workaround for.
> When it comes to image caching specifically, could you expand a little bit on what you would like to see?
@mlanner-aws, the expectation I had in mind was that we essentially get EC2-like caching where there is perhaps some common cache that Fargate tasks already have access to and so when they download an image from ECR or otherwise, they are only downloading the Docker layers that have changed since the previous image.
> How would you like to control which images should be cached, for example?
I think any image a task uses (or at least the largest, to begin with) would use the above-like functionality, where the pull creates a local cache of the image for future pulls. If the image gets completely invalidated by a very early Docker layer change and takes a long time, that's expected and would be the same on EC2 as well.
As @amunhoz pointed out above, Azure has been able to implement this (https://stevelasker.blog/2019/10/29/azure-container-registry-teleportation/).
In our case, it's something pretty similar to what @fitzn described.
> Our use case: we run containers on-demand when our customers initiate an action and connect them to the container that we spin up. So, it's a real-time use case. Right now, we run these containers on ECS with EC2 and the launch times are perfectly acceptable (~1-3 seconds) because we cache the image on the EC2 box with `PULL_BEHAVIOR`.
>
> We'd really like to move to Fargate but our testing shows our Fargate containers spend ~70 seconds in the `PENDING` state before moving to the `RUNNING` state. ECR reports our container at just under 900 MB. Both ECR and the ECS cluster are in the same region, us-east-1.
As a workaround, we are currently launching an EC2 instance from an AMI that has Docker installed and our Docker image already baked in, instead of using Fargate.
EC2 startup is a bit faster than waiting for a Fargate container to start, because of the image download time.
@nakulpathak3 the problem is more complex than setting `ECS_IMAGE_PULL_BEHAVIOR` because, as you noted, the instances backing the tasks are recycled with the task, so caching won't apply here. Altering this behavior would have deep ramifications for how Fargate works. We are exploring decoupling the lifecycle of the instances backing the tasks from the storage they use to host the images to achieve this, but there are some mechanics that need to be considered for that to work properly. We hear you loud and clear and we would like to solve this ASAP.
Seeing as many people have the same issue where images take too long to pull, let's talk strategies you guys use to reduce this time, at least until AWS addresses this.
I've tried to slim down the image as much as I can, but are there any network tips I can use to make downloading faster? I have a very basic VPC with a single public subnet, no inbound security group rules, and an attached internet gateway. I don't want any inbound access, only outbound, so I found this setup to be okay for me.
I've also tried storing the image in ECR instead of Docker Hub, but this does not reduce the download time for me. It takes around 55 seconds to pull the image (`PENDING` -> `RUNNING`) from Docker Hub. Docker Hub reports a 300 MB compressed size.
What are some tricks you guys use to reduce the download time?
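Not a trick as such, but it helps to measure where the time actually goes before tuning anything. ECS exposes pull timestamps on the task, so you can separate pull time from the rest of `PENDING`. A small sketch (cluster name and task ARN are placeholders):

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

task = ecs.describe_tasks(
    cluster="my-cluster",                                        # placeholder
    tasks=["arn:aws:ecs:us-east-1:123456789012:task/example"],   # placeholder ARN
)["tasks"][0]

# pullStartedAt/pullStoppedAt are datetimes, present once the pull has run.
pull = (task["pullStoppedAt"] - task["pullStartedAt"]).total_seconds()
print(f"image pull took {pull:.1f}s")
```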
Fargate caching will really round out the Fargate offering: the FG capacity provider is amazing, but the lack of caching really cuts into the responsiveness of the CP, and the significantly increased pipeline deployment times are a disincentive for a number of my clients to fully adopt FG.
Big +1 to this - we heavily use Fargate and it's a somewhat embarrassing experience to wait 2 minutes for a 5-second script to run. We've tried Lambda, but due to memory (and time) constraints, we were unable to stick with it.
We would LOVE any caching that could be done.
Caching is definitely something I would like to have and be able to choose, but not manage. I'd love to use it in Fargate instead of figuring out how that compares with ECS on EC2 or having to deploy a custom container solution.
This issue is made even more salient by Docker implementing the Hub pull rate limits for anonymous and free-tier users. That alone pretty much makes caching with Fargate essential now.
> When it comes to image caching specifically, could you expand a little bit on what you would like to see? How would you like to control which images should be cached, for example?
@mlanner-aws
Sure, we can't expect it to cache every image out there, right? I imagine it would be configurable per cluster. First of all, any images from any currently running tasks should be in cache and ready to use if the service auto-scales. Next, allow the cluster to specify an ECR repo that could be watched for new pushes and eagerly cache them. If a new image is pushed, it's likely there will soon be a request to launch a task with that image.
> When it comes to image caching specifically, could you expand a little bit on what you would like to see? How would you like to control which images should be cached, for example?
@mlanner-aws
Actually, I don't want to spend too much time configuring that cache. Based on the registered task definitions and recently launched tasks, ECS/Fargate should be smart enough to cache the right images.
I could even see a parameter on the task definition related to required capabilities. But the caching would be awesome, especially since Fargate already adds overhead to the launch time due to the awsvpc network mode when compared to bridge mode and "classic" ECS.
I had a case of a task that never stabilized, and I failed to notice it for a month, only to end the month with 15 TB of transfers of the exact same image (compounded with the costs of TGW and PrivateLink traffic due to the network design).
While I can understand the origin of the problem with Fargate, at the very minimum an image that is being drained and relaunched should be cached. Not having to download again the layers that are already running on the cluster (or were running up to x minutes ago) deserves an optimized solution.
With regards to the behavior we would like to see for how caching is handled, I LOVE the idea @markmelville specified, where any ECR repository could be configured to be watched cluster-wide from ECS. Whether that's an automatic watch based on recently used images or a manual configuration, that would rock.
A note: we update the same tags over and over (e.g. `dev` and `released`), so we would want ECR actions to be the trigger that informs ECS to update the local cache (as opposed to ECS long-polling the repository).
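A hedged sketch of what that trigger could look like today: ECR already emits an EventBridge event on push, so the missing piece is only the cache to warm. The target Lambda below is hypothetical, and its invoke permissions are omitted:

```python
import json

import boto3

events = boto3.client("events", region_name="us-east-1")

# Fire on successful pushes to the watched repo.
events.put_rule(
    Name="warm-image-cache-on-push",  # placeholder rule name
    EventPattern=json.dumps({
        "source": ["aws.ecr"],
        "detail-type": ["ECR Image Action"],
        "detail": {
            "action-type": ["PUSH"],
            "result": ["SUCCESS"],
            "repository-name": ["my-app"],  # placeholder repo
        },
    }),
)

# Hypothetical target: whatever ECS/Fargate would use to refresh the cache.
events.put_targets(
    Rule="warm-image-cache-on-push",
    Targets=[{
        "Id": "cache-warmer",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:cache-warmer",
    }],
)
```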
Under the covers, I could see creating an FSx for Lustre filesystem and storing the Docker cache there, then attaching that dynamically to whatever node(s) are running your task.
In order to reduce my bill, I set a container to run several times a day and then shut down, saving vCPU and memory costs, but now I get data transfer costs in return.
I suppose I'll change to ECS on EC2 so I can set `ECS_IMAGE_PULL_BEHAVIOR`, until we at least have a workaround.