Pants: Improve MyPy performance

Created on 25 Sep 2020 · 10 Comments · Source: pantsbuild/pants

This is non-trivial because MyPy's cache is not append-only and mutates itself. We need a safe way to use its cache that still works with remote execution and with how we partition MyPy runs.

Q42020-idea performance

Eric-Arellano

All 10 comments

Improve MyPy performance by leveraging its cache

Or by parenting instances of its daemon: see https://docs.google.com/document/d/1n_MVVGjrkTKTPKHqRPlyfFzQyx2QioclMG_Q3DMUgYk/edit#bookmark=id.rla7dsf86tib

stuhood on 25 Sep 2020

👍2

It's possible this could be accomplished with an extension of the proposal in #10870 to allow non-append-only caches, by re-snapshotting the cache directory on each process execution. This would make it remote-friendly without any further work after #10870 is implemented. I have noted this use case as a possible extension in that issue.

I'm less familiar with how we plan to implement parenting instances of the mypy daemon, but it seems clear that we would want to support persistent processes somehow, and this looks like exactly the right use case to try that on. It's possible that the two approaches (parenting the daemon, and snapshotting the cache directory) are 100% complementary. I can't tell which approach would be more immediately useful.

cosmicexplorer on 27 Sep 2020

❤1

> It's possible this could be accomplished with an extension of the proposal in #10870 to allow non-append-only caches, by re-snapshotting the cache directory on each process execution. This would make it remote-friendly without any further work after #10870 is implemented. I have noted this use case as a possible extension in that issue.

The non-append-only-ness has more to do with not being concurrency-safe/shareable, for a few reasons:

- It expects to be located inside the buildroot, and uses absolute paths. Locating it outside of the buildroot in a non-buildroot-specific absolute path might require rewriting files (similar to what we did for Zinc, which was ... painful/error-prone).
- It doesn't expect to have multiple writers (which is probably the key differentiator for an "append-only" or well-behaved cache), but we do expect to have multiple concurrent mypy processes (although perhaps we could constrain to one instance per interpreter-constraint at a time).

A few pieces that don't really fit together into a complete story:

- Maybe possible to invoke recursively (per-target), and do lots of re-writing (both paths and timestamps) of process cache CAS entries (nb: not mutable caches) in wrapper code during per-process setup/teardown to allow them to be used at relative paths within the sandbox directory, but be stored elsewhere.
- Keep instance(s) of the mypy daemon (or a Bazel-worker-API-like wrapper around the basic CLI API) up in dedicated/owned working directories, and then "sync" relevant files into the workspace (if they use file watching... or maybe FUSE otherwise?)

stuhood on 27 Sep 2020

I had totally missed the absolute paths part of mypy. Either of the bullet points at the bottom is quite exciting to consider. I really like the idea of invoking recursively, as I suspect we can make the path rewriting extremely fast as well as generic enough to apply to other tools that write absolute paths.

I would also really like to dive into using FUSE, like so, so much. I created this project which makes it super easy (ideally) to virtualize all of the I/O of a JVM process: https://github.com/cosmicexplorer/upc/blob/7dc4d0ae3219e014b00de62adfb415432f6c0850/local/virtual-cli/client/MainWrapper.scala#L108-L122, but I think possibly the entire thing could be deleted if we could instead have pants create very-high-performance FUSE instances. Especially given @eed3si9n's success with virtualizing zinc's I/O (http://eed3si9n.com/cached-compilation-for-sbt), I think that there is reason to believe FUSE would be a winning strategy that could be generic enough to avoid tons of separate incidental complexity.

I'll probably look into reviving brfs and incorporating some of the ideas from https://github.com/cosmicexplorer/upc.

cosmicexplorer on 27 Sep 2020

Hm. So after investigating a .mypy_cache/ directory from the pex repo a bit, I found something somewhat surprising: it appears that the only absolute paths it contains are for types from the stdlib (which I believe should not change at all). The rest actually seem to be...relative to the working directory? I'm not 100% sure how to fit this together yet, but here are the results of my quick investigation:

# The paths are all located at a top-level key 'path'.
> <.mypy_cache/3.5/uuid.meta.json jq '.' | grep '/Users'
(standard input):62:  "path": "/Users/dmcclanahan/tools/pex/.tox/typecheck/lib/python3.8/site-packages/mypy/typeshed/stdlib/2and3/uuid.pyi"
# The only files in the cache are '.json' files, with a lone '.gitignore'.
> find .mypy_cache -type f | sed -re 's#.*(\.[^\.]+)#\1#g' | sort -u
.gitignore
.json
# And it appears all of the paths are relative to the working directory, except ones from the stdlib!
> find .mypy_cache -type f | parallel "echo -n '{}:' && jq -r '.path' <{}" | head -n3
.mypy_cache/3.5/test_resolver.meta.json:tests/test_resolver.py
.mypy_cache/3.5/atexit.meta.json:/Users/dmcclanahan/tools/pex/.tox/typecheck/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/atexit.pyi
.mypy_cache/3.5/pex/testing.data.json:pex/testing.py

It _feels_ like we might be able to make use of a mutable cache here for the .mypy_cache directory, without having to rewrite any files. It _would_ end up reaching out to somewhere on the disk to get the stdlib typeshed stubs, but I believe that can also be configured with a command-line option (so we could resolve mypy into a pex, extract the .deps/, then point it into there).
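The shell investigation above can be approximated in Python. This is only a sketch: it assumes, as observed above, that every cache JSON file carries a top-level `path` key, and the function name `classify_cache_paths` and the demo file contents are made up for illustration.

```python
import json
import tempfile
from pathlib import Path, PurePosixPath


def classify_cache_paths(cache_dir):
    """Split the top-level 'path' values of cache JSON files into absolute and relative."""
    absolute, relative = [], []
    for json_file in sorted(Path(cache_dir).rglob("*.json")):
        path_value = json.loads(json_file.read_text()).get("path")
        if path_value is None:
            continue  # No 'path' key in this file; nothing to classify.
        if PurePosixPath(path_value).is_absolute():
            absolute.append(path_value)
        else:
            relative.append(path_value)
    return absolute, relative


# Demo on a throwaway directory with two made-up entries (not real mypy output).
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / ".mypy_cache" / "3.5"
    cache.mkdir(parents=True)
    (cache / "uuid.meta.json").write_text(
        json.dumps({"path": "/opt/typeshed/stdlib/2and3/uuid.pyi"})
    )
    (cache / "test_resolver.meta.json").write_text(
        json.dumps({"path": "tests/test_resolver.py"})
    )
    absolute_paths, relative_paths = classify_cache_paths(Path(tmp) / ".mypy_cache")
```

Any entry that lands in `absolute_paths` is a candidate for the typeshed materialization discussed above.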

I think that materializing the typeshed stubs would definitely be a (smallish) slowdown -- but we could consider something like #8905 in that case, extending the symlinks we add to include specific large directories that we don't want to keep materializing. The API could be similar to the existing mutable cache API.

Let me know if something's missing from that.

cosmicexplorer on 28 Sep 2020

Ok, after some more investigation, there are some other unstable fields. For posterity, I'm going to post a gist containing the cache contents for the uuid module: https://gist.github.com/cosmicexplorer/258a9947589fb8d3500da2bd6ec5cea5.

At first glance, the only bad entry in uuid.data.json is path, which is only absolute for typeshed types. That could be mitigated by materializing them into the execution directory as per https://github.com/pantsbuild/pants/issues/10864#issuecomment-699696539.

The bad entries in uuid.meta.json are more numerous, but the file is also much smaller. path is still there with the same caveats, but in addition we have data_mtime and mtime, which presumably contain modification times (as far as I can tell, mtime tracks the source file and data_mtime tracks uuid.data.json). I believe that everything else should remain constant if the mypy version and command-line args are kept constant. There's a platform key, but that should be fine since we are discussing a local cache.

The files aren't that large, so while we could attempt to rewrite them with a regex, it's likely to be safer and not _much_ slower to rewrite them by parsing some json (this could be way wrong). I'm not quite sure yet where we would rewrite them -- if someone could give some pointers as to what we need to be making deterministic here, I would love a tip. This should be enough info to go ahead and implement this, though.
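A minimal sketch of the parse-and-rewrite option follows. It assumes the unstable keys are exactly the `mtime` and `data_mtime` fields named above; the helper name `normalize_meta` and the sample values are hypothetical, and the sketch does not check whether mypy still treats a zeroed entry as fresh.

```python
import json

# Keys observed above to vary between otherwise identical runs.
UNSTABLE_KEYS = ("mtime", "data_mtime")


def normalize_meta(meta_text):
    """Return the meta JSON with unstable timestamp keys reset to 0."""
    meta = json.loads(meta_text)
    for key in UNSTABLE_KEYS:
        if key in meta:
            meta[key] = 0
    # sort_keys keeps the serialized bytes stable regardless of key order.
    return json.dumps(meta, sort_keys=True)


# Two made-up entries that differ only in their timestamps.
first_run = json.dumps(
    {"path": "pex/testing.py", "mtime": 1601250000, "data_mtime": 1601250001, "platform": "darwin"}
)
second_run = json.dumps(
    {"path": "pex/testing.py", "mtime": 1601260000, "data_mtime": 1601260002, "platform": "darwin"}
)
normalized_first = normalize_meta(first_run)
normalized_second = normalize_meta(second_run)
```

After normalization the two runs serialize to identical bytes, which is the property a content-addressed digest would need.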

cosmicexplorer on 28 Sep 2020

Also, the mypy cache partitions itself by python version. I believe that should satisfy (from the OP):

> how we partition MyPy runs.

I think that the fully recursive method described in https://github.com/pantsbuild/pants/issues/10864#issuecomment-699689828 should possibly be doable now? And I think I'm getting the idea -- if we can reset the data_mtime and mtime to 0 for all json files in the cache after all invocations, and materialize typeshed into the dir instead of letting it pick an absolute path, I think we should be able to possibly just digest the "cache directory" in each invocation for each target and merge the results? That seems 100% remotable too.

There is a concern there about having to scan all the json files in the cache after each mypy invocation. If we could just look for the module owned by the current target, that would involve only scanning two files (the *.data.json and the *.meta.json) after each invocation, resetting their modified timestamps, then digesting them. I think that's what we would do if we did a fully recursive invocation as Stu described above, but I would want someone to check my work on that. If we do really have to check all the files in the cache on each invocation, we should consider using a mutable cache, or we should consider using regex from rust instead of parsing and rewriting json from python.

If we strictly merge cache directories from leaf to root, we should be able to avoid redoing a ton of work. However, since the cache directory is (I believe) supposed to be consistent across the whole repo, it's possible we could be a little more intense, and at each depth of a breadth-first search of the dependency closure, merge the cache digests and use them for all mypy invocations at the next depth level of the BFS. I think the latter might be necessary, but would be a little bit (I don't think too much) more complex to implement.
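The level-by-level variant could look roughly like the sketch below. It is not Pants code: plain dicts stand in for cache digests, `run_mypy` is a hypothetical callable representing one mypy process, and the dependency graph is assumed to be acyclic with every target listed as a key.

```python
def dependency_levels(deps):
    """Group targets into levels: leaves first, each level depending only on earlier ones."""
    remaining = {target: set(target_deps) for target, target_deps in deps.items()}
    done = set()
    levels = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown target detected")
        levels.append(ready)
        done.update(ready)
        for target in ready:
            del remaining[target]
    return levels


def check_by_level(deps, run_mypy):
    """Run each level against the merged cache of all earlier levels."""
    merged_cache = {}
    for level in dependency_levels(deps):
        # Every target in a level sees the same input cache, so they could run concurrently.
        outputs = [run_mypy(target, dict(merged_cache)) for target in level]
        for output in outputs:
            merged_cache.update(output)
    return merged_cache


def fake_run_mypy(target, cache_in):
    # Stand-in for a mypy process: records which cache entries were visible to it.
    return {f"{target}.meta.json": sorted(cache_in)}


deps = {"app": ["lib", "util"], "lib": ["util"], "util": []}
levels = dependency_levels(deps)
final_cache = check_by_level(deps, fake_run_mypy)
```

In this toy graph `app` runs last and sees the cache entries produced for both `lib` and `util`, which is the reuse the merge is meant to provide.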

cosmicexplorer on 28 Sep 2020

To be clear, I'm thinking of:

- [ ] Put typeshed types into the input_digest for each mypy execution, and use the CLI option to make it look for types there.
  - [ ] Optimization: make these into symlinks and only materialize them once as per #8905.
- [ ] Run mypy recursively, once per python target (which I believe is now once per python source file, plus any stub).
- [ ] After every invocation, read all the files (?) in the "cache dir" (which is just a normal process exec output dir), and rewrite their json contents to ensure data_mtime and mtime are set to 0 in all *.meta.json files.
- [ ] In the mypy recursive invocation, merge digests for the mypy "cache dir" at each depth level of a BFS of all targets and their transitive dependencies (see details above).

To make it remotable, we need one more thing (I think):

- [ ] Rewrite the platform key in all *.meta.json files to point to the correct platform.

cosmicexplorer on 28 Sep 2020

> I think that the fully recursive method described in #10864 (comment) should possibly be doable now? And I think I'm getting the idea -- if we can reset the data_mtime and mtime to 0, and materialize typeshed into the dir instead of letting it pick an absolute path, I think we should be able to possibly just digest the "cache directory" in each invocation for each target and merge the results? That seems 100% remotable too.

Yep, that's what I was thinking. As mentioned in slack, the main issue with it is just that it's hard to know how fast or slow it would be (relative to daemonization approaches) without prototyping it.

stuhood on 28 Sep 2020

> This is non-trivial because MyPy's cache is not append-only and mutates itself.

This is mostly false if we use the sqlite cache:
https://github.com/python/mypy/blob/538d36481526135c44b90383663eaa177cfc32e3/mypy/metastore.py#L140-L223

Actually ... ditto for the FS-based cache, which uses an atomic rename for inserts, depending on how mypy uses its own store API.

Currently entries are keyed on (path, mtime) so we'd need to get in a patch to make this hash-based. With that though, afaict, we could store a sqlite database in a Process.append_only_caches entry.
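To illustrate what hash-based keying buys, here is a small sketch using Python's sqlite3 module. It does not reproduce mypy's metastore schema: the `entries` table, the helper names, and the key format are all hypothetical. The point is only that a key derived from file content, combined with insert-if-absent writes, never needs to overwrite a row.

```python
import hashlib
import sqlite3


def content_key(path, source_bytes):
    """Key an entry on the file's path and content hash instead of (path, mtime)."""
    return f"{path}:{hashlib.sha256(source_bytes).hexdigest()}"


def open_store(db_path=":memory:"):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries (key TEXT PRIMARY KEY, data TEXT NOT NULL)"
    )
    return conn


def put(conn, key, data):
    # INSERT OR IGNORE never replaces an existing row, so the store only grows.
    with conn:
        conn.execute("INSERT OR IGNORE INTO entries (key, data) VALUES (?, ?)", (key, data))


def get(conn, key):
    row = conn.execute("SELECT data FROM entries WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


store = open_store()
key_v1 = content_key("pex/testing.py", b"x = 1\n")
key_v2 = content_key("pex/testing.py", b"x = 2\n")
put(store, key_v1, "cache for v1")
put(store, key_v2, "cache for v2")
put(store, key_v1, "ignored duplicate")
entry_count = store.execute("SELECT COUNT(*) FROM entries").fetchone()[0]
```

Editing a file produces a new key rather than mutating the old entry, which is what would make such a database plausible as an append-only cache.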

jsirois on 11 Feb 2021
