This is non-trivial because MyPy's cache is not append-only and mutates itself. We need a way to use its cache safely that still works with remote execution and with how we partition MyPy runs.
Improve MyPy performance by leveraging its cache
Or by parenting instances of its daemon: see https://docs.google.com/document/d/1n_MVVGjrkTKTPKHqRPlyfFzQyx2QioclMG_Q3DMUgYk/edit#bookmark=id.rla7dsf86tib
It's possible this could be accomplished with an extension of the proposal in #10870 to allow non-append-only caches, by re-snapshotting the cache directory on each process execution. This would make it remote-friendly without any further work after #10870 is implemented. I have noted this use case as a possible extension in that issue.
I'm less familiar with how we plan to implement parenting instances of the mypy daemon, but it seems clear that we would want to support persistent processes somehow, and this looks like exactly the right use case to try that on. It's possible that the two approaches (parenting the daemon and snapshotting the cache directory) are fully complementary. I can't tell which would be more immediately useful.
The non-append-only-ness has more to do with not being concurrency-safe/shareable, for a few reasons:
A few pieces that don't really fit together into a complete story:
I had totally missed the absolute paths part of mypy. Either of the bullet points at the bottom is quite exciting to consider. I really like the idea of invoking recursively, as I suspect we can make the path rewriting extremely fast as well as generic enough to apply to other tools that write absolute paths.
I would also really like to dive into using FUSE, like so, so much. I created this project which makes it super easy (ideally) to virtualize all of the I/O of a JVM process: https://github.com/cosmicexplorer/upc/blob/7dc4d0ae3219e014b00de62adfb415432f6c0850/local/virtual-cli/client/MainWrapper.scala#L108-L122, but I think possibly the entire thing could be deleted if we could instead have pants create very-high-performance FUSE instances. Especially given @eed3si9n's success with virtualizing zinc's I/O (http://eed3si9n.com/cached-compilation-for-sbt), I think that there is reason to believe FUSE would be a winning strategy that could be generic enough to avoid tons of separate incidental complexity.
I'll probably look into reviving brfs and incorporating some of the ideas from https://github.com/cosmicexplorer/upc.
Hm. So after investigating a .mypy_cache/ directory from the pex repo a bit, I found something somewhat surprising: it appears that the only absolute paths it contains are for types from the stdlib (which I believe should not change at all). The rest actually seem to be...relative to the working directory? I'm not 100% sure how to fit this together yet, but here are the results of my quick investigation:
# The paths are all located at a top-level key 'path'.
> <.mypy_cache/3.5/uuid.meta.json jq '.' | grep '/Users'
(standard input):62: "path": "/Users/dmcclanahan/tools/pex/.tox/typecheck/lib/python3.8/site-packages/mypy/typeshed/stdlib/2and3/uuid.pyi"
# The only files in the cache are '.json' files, with a lone '.gitignore'.
> find .mypy_cache -type f | sed -re 's#.*(\.[^\.]+)#\1#g' | sort -u
.gitignore
.json
# And it appears all of the paths are relative to the working directory, except ones from the stdlib!
> find .mypy_cache -type f | parallel "echo -n '{}:' && jq -r '.path' <{}" | head -n3
.mypy_cache/3.5/test_resolver.meta.json:tests/test_resolver.py
.mypy_cache/3.5/atexit.meta.json:/Users/dmcclanahan/tools/pex/.tox/typecheck/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/atexit.pyi
.mypy_cache/3.5/pex/testing.data.json:pex/testing.py
It _feels_ like we might be able to make use of a mutable cache here for the .mypy_cache directory, without having to rewrite any files. It _would_ end up reaching out to somewhere on the disk to get the stdlib typeshed stubs, but I believe that can also be configured with a command-line option (so we could resolve mypy into a pex, extract the .deps/, then point it into there).
I think that materializing the typeshed stubs would definitely be a (smallish) slowdown -- but we could consider something like #8905 in that case, extending the symlinks we add to include specific large directories that we don't want to keep materializing. The API could be similar to the existing mutable cache API.
Let me know if something's missing from that.
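As a rough sketch of the "point it at a materialized typeshed" idea: the flags `--cache-dir`, `--custom-typeshed-dir`, and `--python-version` are real mypy options, but the helper name `mypy_argv` and the directory layout are made up for illustration. Whether mypy then records relative paths for the stubs in the cache is untested here.

```python
from pathlib import Path
from typing import List


def mypy_argv(
    sources: List[str],
    cache_dir: Path,
    typeshed_dir: Path,
    python_version: str,
) -> List[str]:
    """Build a mypy command line that keeps the cache and typeshed inside the sandbox.

    Hypothetical helper: the paths are assumed to be relative to the
    execution directory so that nothing outside it is referenced.
    """
    return [
        "mypy",
        f"--cache-dir={cache_dir.as_posix()}",
        f"--custom-typeshed-dir={typeshed_dir.as_posix()}",
        f"--python-version={python_version}",
        *sources,
    ]


argv = mypy_argv(
    ["pex/testing.py"],
    cache_dir=Path(".mypy_cache"),
    typeshed_dir=Path("typeshed"),  # assumed location of the materialized stubs
    python_version="3.5",
)
```

The resulting list could be handed to `subprocess.run` (or to whatever process abstraction is in use) unchanged.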
Ok, after some more investigation, there are some other unstable fields. For posterity, I'm going to post a gist containing the cache contents for the uuid module: https://gist.github.com/cosmicexplorer/258a9947589fb8d3500da2bd6ec5cea5.
At first glance, the only bad entry in uuid.data.json is path, which is only absolute for typeshed types, which could be mitigated by materializing them into the execution directory as per https://github.com/pantsbuild/pants/issues/10864#issuecomment-699696539.
The bad entries in uuid.meta.json are more numerous, but the file is also much smaller. path is still there with the same caveats, but in addition we have data_mtime and mtime, which hold modification times (as far as I can tell from mypy's source, mtime is that of the source file and data_mtime is that of uuid.data.json). I believe that everything else should remain constant if the mypy version and command-line args are kept constant. There's a platform key, but that should be fine since we are discussing a local cache.
The files aren't that large, so while we could attempt to rewrite them with a regex, it's likely to be safer and not _much_ slower to rewrite them by parsing some json (this could be way wrong). I'm not quite sure yet where we would rewrite them -- if someone could give some pointers as to what we need to be making deterministic here, I would love a tip. This should be enough info to go ahead and implement this, though.
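A minimal sketch of the "parse some json" variant, assuming the only volatile keys are the top-level `mtime` and `data_mtime` seen in the gist. The function names are hypothetical, and whether mypy still accepts a cache entry whose timestamps were zeroed is an assumption that would need checking.

```python
import json
from pathlib import Path

# Assumption: these are the only non-deterministic top-level keys.
VOLATILE_KEYS = ("mtime", "data_mtime")


def normalize_meta(meta_path: Path) -> bool:
    """Zero the volatile timestamps in one *.meta.json file.

    Returns True if the file was rewritten.
    """
    meta = json.loads(meta_path.read_text())
    changed = False
    for key in VOLATILE_KEYS:
        if key in meta and meta[key] != 0:
            meta[key] = 0
            changed = True
    if changed:
        # sort_keys keeps the serialized output stable across runs.
        meta_path.write_text(json.dumps(meta, sort_keys=True))
    return changed


def normalize_cache(cache_dir: Path) -> int:
    """Normalize every *.meta.json under cache_dir; return how many were rewritten."""
    return sum(normalize_meta(p) for p in sorted(cache_dir.rglob("*.meta.json")))
```

Running `normalize_cache(Path(".mypy_cache"))` after a mypy invocation would touch only the files whose timestamps are non-zero, so a second pass is a no-op.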
Also, the mypy cache partitions itself by python version. I believe that should satisfy this part of the OP: "how we partition MyPy runs."
I think that the fully recursive method described in https://github.com/pantsbuild/pants/issues/10864#issuecomment-699689828 should possibly be doable now? And I think I'm getting the idea -- if we can reset the data_mtime and mtime to 0 for all json files in the cache after all invocations, and materialize typeshed into the dir instead of letting it pick an absolute path, I think we should be able to possibly just digest the "cache directory" in each invocation for each target and merge the results? That seems 100% remotable too.
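To make "digest the cache directory and merge the results" concrete, here is a small stand-in that uses plain SHA-256 hashes instead of the real engine digests. `snapshot` and `merge` are hypothetical names, and refusing to merge conflicting entries is a design choice of this sketch, not something the thread settles.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Maps a cache-relative file path to the SHA-256 hex digest of its contents.
Snapshot = Dict[str, str]


def snapshot(cache_dir: Path) -> Snapshot:
    """Fingerprint every file under cache_dir."""
    return {
        p.relative_to(cache_dir).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(cache_dir.rglob("*"))
        if p.is_file()
    }


def merge(*snapshots: Snapshot) -> Snapshot:
    """Union several snapshots, failing if two disagree about the same file."""
    merged: Snapshot = {}
    for snap in snapshots:
        for rel_path, digest in snap.items():
            if merged.setdefault(rel_path, digest) != digest:
                raise ValueError(f"conflicting cache entries for {rel_path}")
    return merged
```

If the timestamps were not normalized first, two invocations that both wrote `uuid.meta.json` would likely hit the conflict branch, which is the reason for resetting them before digesting.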
There is a concern there about having to scan all the json files in the cache after each mypy invocation. If we could just look for the module owned by the current target, that would involve only scanning two files (the *.data.json and the *.meta.json) after each invocation, resetting their modified timestamps, then digesting them. I think that's what we would do if we did a fully recursive invocation as Stu described above, but I would want someone to check my work on that. If we do really have to check all the files in the cache on each invocation, we should consider using a mutable cache, or we should consider using regex from rust instead of parsing and rewriting json from python.
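For the "only two files per module" case, the cache layout observed above (`.mypy_cache/3.5/pex/testing.data.json` for `pex/testing.py`) suggests the paths can be computed directly. This is a sketch under that assumption; the helper name is made up, and packages (`__init__.py`) are not handled.

```python
from pathlib import Path
from typing import Tuple


def cache_files_for_module(
    cache_dir: Path, python_version: str, module: str
) -> Tuple[Path, Path]:
    """Return the (meta, data) cache paths for a dotted module name.

    Assumes the layout <cache_dir>/<python_version>/<pkg>/<mod>.{meta,data}.json
    seen in the investigation above.
    """
    base = cache_dir / python_version / Path(*module.split("."))
    meta = base.with_name(base.name + ".meta.json")
    data = base.with_name(base.name + ".data.json")
    return meta, data


meta_path, data_path = cache_files_for_module(Path(".mypy_cache"), "3.5", "pex.testing")
```

Only these two paths would then need their timestamps reset and their contents digested after an invocation for that module.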
If we strictly merge cache directories from leaf to root, we should be able to avoid redoing a ton of work. However, since the cache directory is (I believe) supposed to be consistent across the whole repo, it's possible we could be a little more intense, and at each depth of a breadth-first search of the dependency closure, merge the cache digests and use them for all mypy invocations at the next depth level of the BFS. I think the latter might be necessary, but would be a little bit (I don't think too much) more complex to implement.
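The level-by-level idea could be sketched like this: group targets so that each level depends only on earlier levels, run mypy for one level, merge the resulting caches, and feed the merged cache to the next level. The function name and the input shape are assumptions; every target must appear as a key, and import cycles (which real Python code can have) are simply rejected here.

```python
from typing import Dict, List, Set


def levels_leaf_to_root(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group targets into levels, leaves first.

    deps maps each target to the targets it depends on. A target lands in
    the first level at which all of its dependencies are already done.
    """
    remaining = {target: set(d) for target, d in deps.items()}
    done: Set[str] = set()
    levels: List[List[str]] = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        levels.append(ready)
        done.update(ready)
        for target in ready:
            del remaining[target]
    return levels


example_levels = levels_leaf_to_root(
    {"app": {"lib", "util"}, "lib": {"util"}, "util": set()}
)
```

All targets within one level could run concurrently against the same merged cache from the previous level.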
To be clear, I'm thinking of:
1. Materialize typeshed into the input_digest for each mypy execution, and use the CLI option to make it look for types there.
2. After each execution, ensure data_mtime and mtime are set to 0 in all *.meta.json files.
To make it remotable, we need one more thing (I think):
3. Rewrite the platform key in all *.meta.json files to point to the correct platform.
I think that the fully recursive method described in #10864 (comment) should possibly be doable now? And I think I'm getting the idea -- if we can reset the data_mtime and mtime to 0, and materialize typeshed into the dir instead of letting it pick an absolute path, I think we should be able to possibly just digest the "cache directory" in each invocation for each target and merge the results? That seems 100% remotable too.
Yep, that's what I was thinking. As mentioned in slack, the main issue with it is just that it's hard to know how fast or slow it would be (relative to daemonization approaches) without prototyping it.
This is non-trivial because MyPy's cache is not append-only and mutates itself.
This is mostly false if we use the sqlite cache:
https://github.com/python/mypy/blob/538d36481526135c44b90383663eaa177cfc32e3/mypy/metastore.py#L140-L223
Actually ... ditto for the FS based cache which uses an atomic rename for inserts depending on how they use their own store API.
Currently entries are keyed on (path, mtime) so we'd need to get in a patch to make this hash-based. With that though, afaict, we could store a sqlite database in a Process.append_only_caches entry.
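To illustrate what "hash-based" keys would buy us, here is a toy store that keys entries on the file's relative path plus its content hash and only ever inserts. The table schema and function names are invented for this sketch and are not mypy's metastore schema.

```python
import hashlib
import sqlite3
from typing import Optional


def content_key(relative_path: str, source: bytes) -> str:
    """Key that depends on the path and file contents, never on mtime."""
    h = hashlib.sha256()
    h.update(relative_path.encode("utf-8"))
    h.update(b"\0")  # separator so path and content cannot run together
    h.update(source)
    return h.hexdigest()


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS entries (key TEXT PRIMARY KEY, data BLOB)")
    return db


def put(db: sqlite3.Connection, key: str, data: bytes) -> None:
    # INSERT OR IGNORE never overwrites an existing row, so the store is append-only.
    db.execute("INSERT OR IGNORE INTO entries (key, data) VALUES (?, ?)", (key, data))
    db.commit()


def get(db: sqlite3.Connection, key: str) -> Optional[bytes]:
    row = db.execute("SELECT data FROM entries WHERE key = ?", (key,)).fetchone()
    return None if row is None else row[0]
```

Because the key changes whenever the source changes, an edited file produces a new row instead of mutating an old one, which is the property an append-only cache needs.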