Break out from https://github.com/bazelbuild/bazel/issues/4870.
Bazel can use a local directory as a remote cache via the --disk_cache flag.
We want it to also be able to automatically clean the cache after a size threshold
has been reached. It probably makes sense to clean based on least recently used
semantics.
@RNabel would you want to work on this?
@RNabel @davido
I will look into implementing this, unless someone else is faster than me.
I don't have time to work on this right now. @davido, if you don't get around to working on this in the next 2-3 weeks, I'm happy to pick this up.
Hi, I would also very much like to see this feature implemented! @davido , @RNabel did you get anywhere with your experiments?
Not finished, but had an initial stab: https://github.com/RNabel/bazel/compare/baseline-0.16.1...RNabel:feature/5139-implement-disk-cache-size (this is mostly plumbing and figuring out where to put the logic; it definitely doesn't work)
I figured the simplest solution is an LRU relying on the file system for access times and modification times. Unfortunately, access times are not available on Windows through Bazel's file system abstraction. One alternative would be a simple database, but that feels like overkill here. @davido, what do you think is the best solution here? Also happy to write up a brief design doc for discussion.
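To make the file-system-based LRU idea above concrete, here is a minimal sketch in Python. It is an illustration only, not Bazel code: the function name `evict_lru` and the choice to treat the later of access time and modification time as "last used" are assumptions made for this example.

```python
import os
from pathlib import Path


def evict_lru(cache_dir: str, max_bytes: int) -> list[str]:
    """Delete least recently used files until the cache fits in max_bytes.

    Hypothetical helper for illustration. Returns the removed paths,
    least recently used first.
    """
    entries = []
    total = 0
    for path in Path(cache_dir).rglob("*"):
        if not path.is_file():
            continue
        st = path.stat()
        # Assumption: "last used" is the later of atime and mtime, so the
        # ordering still works where access times are not maintained.
        last_used = max(st.st_atime, st.st_mtime)
        entries.append((last_used, st.st_size, str(path)))
        total += st.st_size

    removed = []
    for _, size, path in sorted(entries):
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(path)
    return removed
```

A sketch like this does not coordinate with a running build, so it could delete an entry that is in use; a real implementation would have to handle that.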
What do you guys think about just running a local proxy service that has this functionality already implemented? For example: https://github.com/Asana/bazels3cache or https://github.com/buchgr/bazel-remote? One could then point Bazel to it using --remote_http_cache=http://localhost:XXX. We could even think about Bazel automatically launching such a service if it is not running already.
I think @aehlig solved this problem for the repository cache. Maybe you can
borrow his implementation here as well.
@buchgr, I feel this is core Bazel functionality and in my humble opinion
outsourcing it isn’t the right direction. People at my company are often
amazed Bazel doesn’t have this fully supported out of the box.
"I think @aehlig solved this problem for the repository cache. Maybe you can borrow his implementation here as well."
@ittaiz, what solution are you talking about? What we have so far for the repository cache is that the file gets touched on every cache hit (see e0d80356eed), so that deleting the oldest files would be a cleanup; the latter, however, is not yet implemented, for lack of time.
For the repository cache, it is also a slightly different story, as cleanup should always be manual; upstream might have disappeared, so the cache might be the last copy of the archive available to the user, and we don't want to remove that on the fly.
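The touch-on-hit approach described above can be sketched as follows. This is a hypothetical Python illustration, not the code from the referenced commit; the name `read_and_touch` and its `now` parameter are invented for the example. Bumping the timestamps on every hit means the modification time doubles as a "last used" time, so a later cleanup pass can delete the oldest files first.

```python
import os
import time
from typing import Optional


def read_and_touch(path: str, now: Optional[float] = None) -> bytes:
    """Read a cache entry and mark it as recently used.

    Hypothetical helper for illustration. `now` lets callers (and tests)
    supply the timestamp; by default the current time is used.
    """
    with open(path, "rb") as f:
        data = f.read()
    stamp = time.time() if now is None else now
    # Set both atime and mtime so the entry sorts as most recently used,
    # even on file systems that do not maintain access times.
    os.utime(path, (stamp, stamp))
    return data
```

Because the timestamp is written explicitly, this does not depend on the file system recording access times on its own.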
"outsourcing it isn’t the right direction"
I would be interested to learn more about why you think so.
@aehlig sorry, my bad. You are indeed correct.
@buchgr,
I think so because I think a disk cache is a really basic feature of Bazel
and the fact that it doesn’t work like this by default is IMHO a leaky
abstraction (of how exactly the caches work) and influenced greatly by the
fact that googlers work mainly (almost exclusively?) with remote execution.
I’ve explained bazel to tens maybe hundreds of people. All of them were
surprised disk cache isn’t out of the box (eviction wise and also plans
wise like we discussed).
@ittaiz
the disk cache is indeed a leaky abstraction that was mainly added because
it was easy to do so. I agree that if Bazel should have a disk cache in the long
term, then it should also support read/write through to a remote cache and
garbage collection.
However, I am not convinced that Bazel should have a disk cache built in but
instead this functionality could also be handled by another program running
locally. So I am trying to better understand why this should be part of Bazel.
Please note that there are no immediate plans to remove it and we will not do
so without a design doc of an alternative. I am mainly interested in kicking off
a discussion.
Thanks for the clarification and I appreciate the discussion.
I think that users don’t want to operate many different tools and servers
locally. They want a build tool that works.
The main disadvantage I see is that it sounds like you’re offering a
cleaner design at the user’s expense.
"I think that users don’t want to operate many different tools and servers locally."
I partly agree. I'd argue in many companies that would change as you would typically have an IT department configuring workstations and laptops.
"The main disadvantage I see is that it sounds like you’re offering a cleaner design at the user’s expense."
I think that also depends. I'd say that if one only wants to use the local disk cache then I agree that providing two flags is as frictionless as it gets. However, I think it's possible that most disk cache users will also want to do remote caching/execution and that for them this might not be noteworthy additional work.
So I think there are two possible future scenarios for the disk cache:
I think 1) makes sense if we think that the disk cache will be a standalone feature that a lot of people will find useful on its own, and if so I think it's worth the effort to implement this in Bazel. For 2) I am not so sure, as I can see several challenges that might be better solved in a separate process:
So I think it might make sense for us to think about having a standard local caching proxy that's a separate process, that can be operated independently, and/or that Bazel can launch automatically for improved usability.
Is there any plan to roll out the "virtual remote filesystem" soon? I am interested to learn more about it and can help if needed. We are hitting network speed bottleneck.
yep, please follow https://github.com/bazelbuild/bazel/issues/6862
any plan of implementing the max size feature or a garbage collector for the local cache?
This is a much needed feature in order to use Remote Builds without the Bytes, since naively cleaning up the disk cache results in build failures.
Any updates on this?
+1 We would like to be able to set the max size for the cache. Currently we rely on users doing this manually. We could add a script to do this but it feels like it would be a good feature for Bazel to have.
+1 on this, I had to write a script to keep my local disk from filling up.
(by doing this I also discovered that something creates non-writable directories in .cache/bazel which seems bad in general)
+1 on this feature request. I need it so I can run it inside a Docker container.
+1.