Black: Suggestion: Cache already formatted files

Created on 5 Apr 2018  ·  15 Comments  ·  Source: psf/black

We have a decently sized Python code base and running black is "slow" (~10 seconds), which can be a bit annoying when run often. A solution for this would be to cache already formatted files (eg by storing a checksum of them in a file somewhere).

I've implemented a hacky solution for this, and for my project a fully black-ed + cached run takes ~1s vs the ~10s for a fully black-ed + uncached run. The initial run (fully non-black + uncached) takes significantly longer, so clearly more work is needed should this feature be desired, but before I continue working on it I want to know whether this is something that is desired and could eventually be merged in.

Labels: design, enhancement

All 15 comments

If this is in the interest of speeding things up then yeah, you have my full attention.

Playing devil's advocate here for a while:

  1. Maybe you could look into a solution like pre-commit which essentially gives you what you want: it will only reformat files that are staged and doesn't touch anything else?
  2. Maybe you need integration with your text editor? That way you only format the files you're working on, not running it on everything every time.
  3. Where would Black look for this cache you're speaking of? The point would be to make stat-ing such a file as fast as possible.

Take your time to answer. Also, there are two big optimizations that I'm planning to do myself. First is cythonizing everything, which should already cut down startup time and maybe 10% runtime. Second is blackd with a simple Unix socket interface that is going to circumvent startup time entirely for text editor integration.

If any of those sounds interesting then I'd be thrilled to work together on those things.

If this is in the interest of speeding things up then yeah, you have my full attention.

The basic idea here is simply to do less work. If we know a file is already "good", there's no need to look at it. Ideally, doing less work results in better speed.

Maybe you could look into a solution like pre-commit which essentially gives you what you want: it will only reformat files that are staged and doesn't touch anything else?
Maybe you need integration with your text editor? That way you only format the files you're working on, not running it on everything every time.
Where would Black look for this cache you're speaking of? The point would be to make statting for such file as fast as possible.

I actually started looking into this because I wanted to set up a git pre-commit hook to ensure files are always checked in after being run through black. However, my hook naively runs it on the whole project rather than only on changed files, so I could probably solve my problem with a more sophisticated hook.

I avoid tying it too tightly to my editor because I'd prefer something that "just works" for everyone on my team, regardless of which editor they use. Handing off the task of running black to the editor could certainly be a solution for some users, though.

Right now, I store the cache in a file in <CWD>/.cache/black/cache. Each line has a SHA1 checksum, followed by a space, followed by the full path to a file. When black starts up with --cache set, it reads that file and ignores any files where the checksum matches the one from the cache. After black is done, it then writes back the file, updating all hashes of files that changed.
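In rough Python, that per-file cache could look like this (a sketch with hypothetical helper names; the actual proof of concept is on the branch linked just below):

import hashlib
from pathlib import Path

CACHE_FILE = Path(".cache/black/cache")  # i.e. <CWD>/.cache/black/cache as described above

def file_checksum(path: Path) -> str:
    return hashlib.sha1(path.read_bytes()).hexdigest()

def read_cache() -> dict:
    # One "checksum<space>path" entry per line.
    if not CACHE_FILE.exists():
        return {}
    entries = {}
    for line in CACHE_FILE.read_text().splitlines():
        checksum, _, path = line.partition(" ")
        entries[path] = checksum
    return entries

def is_unchanged(path: Path, cache: dict) -> bool:
    return cache.get(str(path.resolve())) == file_checksum(path)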

I have a proof-of-concept implementation which you can find at https://github.com/ojii/black/tree/cached. I've not put any thought into robustness or efficiency yet, but initial "benchmarking" with time gives me these results:

  • black --cache black.py (initial): 1.2s
  • black --cache black.py (cached): 0.3s
  • black black.py: 1.1s

For my own project (~400 files/50KSLOC):

  • black --cache (initial): ~10s
  • black --cache (cached): ~1s
  • black (uncached): ~8s

There's clearly some overhead, but in most cases this should give a dramatic speedup [citation needed].

Also, there are two big optimizations that I'm planning to do myself. First is cythonizing everything, which should already cut down startup time and maybe 10% runtime.

This sounds exciting, but complementary to what I'm proposing. I myself don't know cython well enough to help with this.

Second is blackd with a simple Unix socket interface that is going to circumvent startup time entirely for text editor integration.

This is something I might possibly tackle, and it could even make caching a lot simpler if we just store an in-memory cache within blackd (possibly with an option to persist to disk).

The reason I asked about the cache location is that if subsequent runs are unable to find it (because they were run from a different directory), then it's not very useful. And if Black starts walking directory trees, that will also impact its speed (hence, for example, no configuration file for now).

If you haven't tried that yet, try pre-commit.com, which only acts on changed files, so you get the best of both worlds.

Working this into an in-memory cache in blackd seems like the best option for me.

I've actually thought more about this, and the way I cache things now (a checksum per file path) is the wrong approach. Instead, there should be a "global" file (somewhere in ~/.cache/black on Linux, for example) which simply stores checksums. If two files have the same content, we don't care which paths they're at: if either of them is properly formatted, so is the other.

The naive approach would be to just add the checksums of all properly formatted files to that cache, but the cache file would then grow indefinitely. Instead, it should probably be an LRU cache with a fixed maximum size.

An eventual blackd could then also pick up this cache file, since the current directory no longer matters.
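A sketch of that path-independent, size-bounded cache, assuming a plain text file with one digest per line and an OrderedDict for the LRU bookkeeping (hypothetical names, not the branch code):

import hashlib
from collections import OrderedDict
from pathlib import Path

MAX_ENTRIES = 10_000  # arbitrary cap so the cache file cannot grow without bound

def load_cache(cache_file: Path) -> OrderedDict:
    lines = cache_file.read_text().splitlines() if cache_file.exists() else []
    return OrderedDict((digest, None) for digest in lines)

def is_already_formatted(path: Path, cache: OrderedDict) -> bool:
    digest = hashlib.sha1(path.read_bytes()).hexdigest()
    if digest in cache:
        cache.move_to_end(digest)  # refresh recency on a hit
        return True
    return False

def mark_formatted(path: Path, cache: OrderedDict) -> None:
    cache[hashlib.sha1(path.read_bytes()).hexdigest()] = None
    while len(cache) > MAX_ENTRIES:
        cache.popitem(last=False)  # evict the least recently used digest

def save_cache(cache_file: Path, cache: OrderedDict) -> None:
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text("\n".join(cache))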

I have updated my branch with a new, global cache and the overhead got a lot lower:

$ rm ~/Library/Caches/black/already-formatted 
$ time black black.py
black.py already well formatted, good job.

real    0m1.026s
user    0m0.974s
sys 0m0.044s
$ time black --cache black.py
black.py already well formatted, good job.

real    0m1.056s
user    0m1.003s
sys 0m0.045s
$ time black --cache black.py
black.py already well formatted, good job.

real    0m0.334s
user    0m0.281s
sys 0m0.044s

My project:

$ rm ~/Library/Caches/black/already-formatted 
$ time black src/ tests/
[...]
All done! ✨ 🍰 ✨
372 files left unchanged.

real    0m7.299s
user    0m50.681s
sys 0m0.638s
$ time black --cache src/ tests/
[...]
All done! ✨ 🍰 ✨
372 files left unchanged.

real    0m7.295s
user    0m50.724s
sys 0m0.636s
$ time black --cache src/ tests/
[...]
All done! ✨ 🍰 ✨
372 files left unchanged.

real    0m0.784s
user    0m0.764s
sys 0m0.236s

Why do you need to compute the hashes? git status, for example, works just by looking at file modification date and size. Then you don't even have to open the file, which is much faster.
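For reference, that cheap fingerprint only needs a single stat call per file, e.g.:

import os

st = os.stat("black.py")
fingerprint = (st.st_mtime, st.st_size)  # no need to open or read the file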

The hashes + ignore filepaths approach has a few benefits over modtime + size:

  • Moving a file won't invalidate the cache
  • Switching branches in your VCS won't invalidate the cache (since multiple versions of a single file can be cached trivially)
  • Having the same file in multiple different locations will be picked up as already formatted.
  • It's probably easier; the on-disk format is trivial with constant-sized hashes.

Of course, as you say, not having to read the file at all would be faster. It's a tradeoff, and I'm not sure which side wins; it highly depends on a user's workflow/setup, I think.

I've given the "cache path mtime + size" strategy a try, and also implemented a version that caches both mtime + size and checksums. The "dual cache" version is not significantly slower (benchmarking with time shows no difference) and gives the benefits of both, though it is of course slightly more complex.

Implementations can be found at https://github.com/ojii/black/tree/fastcache and https://github.com/ojii/black/tree/dual-cache if you want to give them a try.

Fully cached, the dual-cache version reduces a run to about 0.3 seconds, even for my somewhat large project.
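Roughly, the dual-cache lookup could work like this (a sketch with hypothetical names, not the code on those branches): check the cheap mtime + size key first and only hash the file on a miss.

import hashlib
import os
from pathlib import Path

def dual_cache_hit(path: Path, stat_cache: dict, checksum_cache: set) -> bool:
    resolved = str(path.resolve())
    st = os.stat(resolved)
    if stat_cache.get(resolved) == (st.st_mtime, st.st_size):
        return True  # untouched since the last run, skip it outright
    digest = hashlib.sha1(path.read_bytes()).hexdigest()
    if digest in checksum_cache:
        # Same content under a new mtime or path: refresh the cheap key.
        stat_cache[resolved] = (st.st_mtime, st.st_size)
        return True
    return False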

I think the dual-cache version is the best solution. @ambv if you agree I'll make a clean version of it and open a PR.

I don't see the benefit of dual-cache compared to the simple mtime + size. Let's do the simple one first. A few pointers for your PR:

  • Don't make a config option for it.
  • We will never cache in stdin/stdout mode.
  • We will never cache in --diff mode.
  • We will always cache in regular file mode and --check mode.
  • Using pickle for this is fine; we are already using it for the lib2to3-generated parser. That removes the need to worry about the on-disk format ourselves. Just handle the cases where reading fails or the version of Black doesn't match.
  • Invalidating the entire cache on Black version change makes sense because a new version of Black might format files differently. We want all files to be retried.
  • Let's not skip the files in gen_python_files_in_dir(); I'd prefer to keep the "already well formatted" message for all files found. Bonus points if the message states that the "file didn't change since last run".
  • Let's only write the cache once per entire run.
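Putting those pointers together, the cache could end up looking roughly like this, a sketch assuming the appdirs dependency discussed just below (its version argument gives the per-version invalidation) and a pickled path -> (mtime, size) mapping. Hypothetical names, not the final implementation:

import pickle
from pathlib import Path
from appdirs import user_cache_dir

VERSION = "x.y"  # stand-in for Black's own version string
CACHE_FILE = Path(user_cache_dir("black", version=VERSION)) / "cache.pickle"

def read_cache() -> dict:
    # path -> (mtime, size) of files known to be well formatted.
    try:
        with CACHE_FILE.open("rb") as f:
            return pickle.load(f)
    except (OSError, pickle.UnpicklingError, EOFError, ValueError):
        return {}  # a missing or broken cache just means "check everything again"

def get_cache_info(path: Path) -> tuple:
    st = path.stat()
    return st.st_mtime, st.st_size

def filter_cached(cache: dict, sources) -> tuple:
    # Split sources into (todo, done) without skipping them during discovery.
    todo, done = [], []
    for src in sources:
        src = src.resolve()
        if cache.get(str(src)) == get_cache_info(src):
            done.append(src)
        else:
            todo.append(src)
    return todo, done

def write_cache(cache: dict, formatted) -> None:
    # Called once at the end of the run, never in stdin/stdout or --diff mode.
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    for src in formatted:
        cache[str(src.resolve())] = get_cache_info(src)
    with CACHE_FILE.open("wb") as f:
        pickle.dump(cache, f)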

Are you okay with the extra dependency on appdirs and using their user_cache_dir to determine the location?

Yes, I like appdirs. You can even use their "per-version isolation" functionality to make the "per-version invalidation" implicit by using the file system :-) And what happens on Windows when "appauthor" is not provided?

Oh, and you're probably already doing that but make sure to .resolve() paths under cache :-)

Oh, and maybe also unicodedata.normalize() will be required on paths, unless you know better.
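Something along these lines for normalizing the paths used as cache keys (a sketch; the NFC choice is an assumption here):

import unicodedata
from pathlib import Path

def cache_key(path: Path) -> str:
    # Resolve symlinks and relative segments, then pick one Unicode normal form
    # so e.g. macOS file names (which may come back decomposed) still match.
    return unicodedata.normalize("NFC", str(path.resolve()))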

Solved! ✨ 🍰 ✨

A few pointers for your PR:

  • We will never cache in --diff mode.

Is there any particular reason to skip the cache in --diff mode? (It does not seem to be discussed here.)

Not caching in --diff mode can cause black to behave differently with --diff than in the other modes (normal, --fast, --check) when a file was reformatted with --fast but could not be reformatted with plain black; see #1145.

How strong is this 'never'? Is there a chance of getting a PR merged that enables caching in --diff mode, or was it disabled for a particular reason?
