Extract the cache from Parcel 1 and move it into its own package: @parcel/cache in packages/core/cache. It should also be modified to suit Parcel 2.
TODO: details
https://github.com/parcel-bundler/parcel/pull/2375#issuecomment-446067098:
The current v2 work is in the v2-work-so-far branch, which does include a new cache already but not the server.
Should Parcel 1 use that as well or is this resolved?
@mischnic the Parcel 1 and Parcel 2 caches are not compatible, so this is only for Parcel 2. The Parcel 1 cache should remain as is.
Discussed with @padmaia:
- @parcel/transformer-babel should store versions of babel plugins in the cache and be able to check that they are the same when checking the cache.
- .babelrc.js and babel.config.js might return functions directly. If those files are seen, don't cache at all.
- A .babelrc may be created further down the tree than before, so that files would resolve to a different config. Three options: 1) find all the babelrcs ahead of time. 2) check git to see what changed. 3) just resolve the config every time.
I was thinking about caching a bit more, and started wondering if perhaps we can serialize the asset graph to disk after a build and then run a fast comparison when parcel starts a rebuild to determine what nodes to invalidate. This way we can potentially avoid running resolvers and transformers at all for files that do not change, similar to how we do it for rebuilds using info from the watcher.
In order to make comparisons faster, we can also use stat information similar to how git does it. If the creation time, modification time, or file size changes, then we know the file is modified and there is no need to actually read the file to compare hashes. It is possible, however, for a file to be modified but for none of these pieces of metadata to change. This can happen if a file is modified more than once in a single second, for example, since many filesystems don't offer high precision modification times. There is some information about how git solves this problem here which we could potentially implement as well.
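The stat comparison described above could look roughly like the following TypeScript sketch. The names `StatSnapshot`, `compareStat`, `cacheWrittenAtMs`, and `timestampResolutionMs` are hypothetical and not part of Parcel; the fields mirror those on Node's `fs.Stats`. The "needs-hash" verdict is one possible way to handle same-second modifications, loosely modeled on the idea of distrusting entries whose modification time is too close to the time the cache was written. It is an assumption, not necessarily how git or Parcel does it.

```typescript
// Hypothetical shape: a subset of the fields available on Node's fs.Stats.
interface StatSnapshot {
  birthtimeMs: number; // creation time
  mtimeMs: number; // modification time
  size: number; // file size in bytes
}

type Verdict = "modified" | "unchanged" | "needs-hash";

// Compare the stat data stored in the cache with a fresh stat of the file.
// cacheWrittenAtMs is when the cached snapshot was saved (assumed to be known).
function compareStat(
  cached: StatSnapshot,
  current: StatSnapshot,
  cacheWrittenAtMs: number,
  timestampResolutionMs: number = 1000
): Verdict {
  // Any metadata difference means the file changed; no need to read it.
  if (
    cached.birthtimeMs !== current.birthtimeMs ||
    cached.mtimeMs !== current.mtimeMs ||
    cached.size !== current.size
  ) {
    return "modified";
  }

  // Identical metadata is only trustworthy if the cache was written after the
  // filesystem's timestamp had a chance to tick over. Otherwise a later write
  // could have produced the same mtime, so fall back to comparing hashes.
  if (cacheWrittenAtMs < cached.mtimeMs + timestampResolutionMs) {
    return "needs-hash";
  }

  return "unchanged";
}
```

Only files that come back as "needs-hash" would have to be read and hashed; everything else is decided from metadata alone.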
I don't think we can use git itself to detect changed files since git does not track some files (like node_modules, or anything else in .gitignore), and also some users may not be using git.
One other issue to solve is how to correctly cache the results of resolvers and configs (e.g. .babelrc). These often rely on filesystem hierarchy to determine which file to read. The resolved config may change if a new file is added further down the file tree, e.g. if a .babelrc file is added in a directory closer to the transformed file. In order to effectively cache the resolver, we need some way to check for this case in order to invalidate it.
One possibility would be to store a reference to each of the files that were searched and did not exist in the graph. That way, if any of those files are created before the next build, we could easily check and invalidate the config. The result of this would be quite large though, since the resolver typically looks for many different file extensions and through lots of directories to find a module.
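As a rough illustration of the "remember the files that did not exist" idea, here is a minimal TypeScript sketch. Everything in it (`CachedResolution`, `resolveConfig`, `isResolutionStale`, the `exists` callback) is hypothetical and only assumes POSIX-style paths; it is not Parcel's resolver.

```typescript
// Hypothetical cache entry: the config that was found (if any), plus every
// path that was probed on the way there and did not exist.
interface CachedResolution {
  configPath: string | null;
  missingPaths: string[];
}

// Walk upward from fromDir to rootDir looking for fileName, recording misses.
// Assumes POSIX-style paths and that fromDir is inside rootDir.
function resolveConfig(
  fromDir: string,
  rootDir: string,
  fileName: string,
  exists: (path: string) => boolean
): CachedResolution {
  const missingPaths: string[] = [];
  let dir = fromDir;
  while (true) {
    const candidate = `${dir}/${fileName}`;
    if (exists(candidate)) {
      return { configPath: candidate, missingPaths };
    }
    missingPaths.push(candidate);
    if (dir === rootDir) {
      return { configPath: null, missingPaths };
    }
    const parent = dir.slice(0, dir.lastIndexOf("/"));
    if (parent.length < rootDir.length) {
      // Walked out of the root; stop rather than loop forever.
      return { configPath: null, missingPaths };
    }
    dir = parent;
  }
}

// The cached result is stale if any previously missing file now exists.
// (A full check would also confirm the found config is unchanged.)
function isResolutionStale(
  entry: CachedResolution,
  exists: (path: string) => boolean
): boolean {
  return entry.missingPaths.some(exists);
}
```

For a single config file name the list of misses is short, one entry per directory level. The size concern raised above comes from module resolution, where each directory is probed with many extensions and file names.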
Another way could be to build up a graph of the entire file hierarchy at the start of a build and serialize this to disk in the cache as well. This graph would include all files on the filesystem, not just ones that are actually included in the asset graph. On subsequent builds we could quickly see which files were added/removed/updated compared to the previous build, similar to git. With this metadata stored once, resolvers could access the cached file graph instead of the actual filesystem to resolve modules/configs so that the filesystem is only accessed once, and files that don't exist (most of the stat calls we do) are never even tried. This could potentially be quite fast compared to the current resolution mechanism which makes stat calls for lots of files which don't exist. One problem is to determine what the "entire file hierarchy" means to us. A build may include files that are outside the root directory hierarchy, so it may be hard to know which files to diff.
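The comparison between two serialized file graphs could be sketched as below. This is only an illustration in TypeScript under the assumption that a snapshot is a flat map from absolute path to stat metadata; `FileMeta`, `FileSnapshot`, and `diffSnapshots` are made-up names, and how the snapshot is collected (and which directories it covers) is left open, as in the discussion above.

```typescript
// Hypothetical per-file metadata stored in the cached file graph.
interface FileMeta {
  mtimeMs: number;
  size: number;
}

// Hypothetical snapshot: absolute path -> metadata, flattened for simplicity.
type FileSnapshot = Record<string, FileMeta>;

interface SnapshotDiff {
  added: string[];
  removed: string[];
  updated: string[];
}

// Compare the snapshot from the previous build with a fresh one.
function diffSnapshots(
  previous: FileSnapshot,
  current: FileSnapshot
): SnapshotDiff {
  const has = (snapshot: FileSnapshot, path: string): boolean =>
    Object.prototype.hasOwnProperty.call(snapshot, path);
  const diff: SnapshotDiff = { added: [], removed: [], updated: [] };

  for (const path of Object.keys(current)) {
    if (!has(previous, path)) {
      diff.added.push(path);
    } else if (
      previous[path].mtimeMs !== current[path].mtimeMs ||
      previous[path].size !== current[path].size
    ) {
      diff.updated.push(path);
    }
  }

  for (const path of Object.keys(previous)) {
    if (!has(current, path)) {
      diff.removed.push(path);
    }
  }

  return diff;
}
```

With a diff like this, a newly created `.babelrc` would show up in `added`, which is the signal needed to invalidate cached config resolutions for files beneath that directory, while `updated` and `removed` map onto asset graph nodes to invalidate.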
That's a good write up of the Racy Git problem. I think we should definitely be able to use stat to speed up cache checks. Keeping a graph to represent the entire file system seemed a bit extreme at first, but now I'm thinking it could really help. I'll experiment with that.
Yeah one other issue with that is that unlike git, parcel builds may not only include files in one subtree. There may be dependencies outside the project root directory, symlinks, etc. So it might be hard to determine which files to include in the cached filesystem graph.