Pinging @mgsloan, @borsboom, and @sol. This is a discussion issue related to the current precompiled cache, nix-style builds, and implicit snapshots.
__Goal__ Maximize the amount of binary caching Stack is able to do.
Two approaches considered previously:

__Approach 1__ Implicit snapshots. Today, adding an extra-dep package at the base of a large dependency hierarchy forces everything that depends on it out of the snapshot. We could instead derive an implicit snapshot from the extra-deps in the stack.yaml file. This will promote many more packages in the configuration from local to snapshot packages. Besides implementation complexity, the biggest downside is disk space usage: each time you tweak extra-deps and rebuild, you'll get binary builds in ~/.stack which last forever.

__Approach 2__ Nix-style builds/package databases. GHC 8.0 (IIRC) added the ability for package databases to allow more than one copy of a package/version combo in a database. cabal new-build uses this to get greater sharing of binary builds than Stack. I'm not familiar with some of the details of how this is implemented (such as how this would work with a Git-based package installation), but at least three downsides are:

- we can no longer trivially use runghc like we can with Stack, since a single database is no longer a consistent view

Now I'm going to offer a new approach, approach 3. We have both _immutable_ and _mutable_ package sources. Local file paths are mutable. We first state that anything mutable, and anything that depends on a mutable package source, cannot be cached. (We _could_ get into arguments about caching based on the file hashes, but I think it's overall not worth it.) That leaves us with immutable package sources, where the package contents come from Hackage, some archive, or a Git/Mercurial repository.
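The "mutable taints everything that depends on it" rule can be sketched as a small predicate. This is an illustration only; `Source`, `cacheable`, and the assumption of an acyclic dependency graph are mine, not Stack's actual types:

```haskell
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Hypothetical package sources; local file paths are the mutable case.
data Source = FromHackage | FromGit | FromArchive | LocalFilePath
  deriving (Eq, Show)

immutableSource :: Source -> Bool
immutableSource LocalFilePath = False
immutableSource _             = True

-- A package is cacheable only if its own source is immutable and every
-- (transitive) dependency is cacheable too. Assumes an acyclic graph.
cacheable :: Map String (Source, [String]) -> String -> Bool
cacheable graph name =
  case Map.lookup name graph of
    Nothing          -> False -- unknown package: be conservative
    Just (src, deps) -> immutableSource src && all (cacheable graph) deps
```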
We no longer care if these packages are in snapshots or local databases, that detail is totally irrelevant. Every time we build one of these packages, we install its library and other data files into a directory under ~/.stack, let's say ~/.stack/immutable, with a directory structure that uses hashes to fully encapsulate all of the settings for this package (package source itself, dependencies, flags, etc). In addition, we register the package into its own package database inside that new directory.
We keep snapshot database and local database logic the same. But now, instead of rebuilding packages in local databases, or having special logic for the precompiled cache, we have a simple algorithm we follow every time we build an immutable package:
1. Check whether this exact build (source, flags, dependencies) already exists under ~/.stack/immutable.
2. If not, build it and install it under ~/.stack/immutable.
3. Copy the registration from the ~/.stack/immutable package database to the appropriate snapshot or local database.

There's also some bikeshedding around whether we would even need a snapshot database after this, or if we can simply always use local databases. I'm not too terribly invested in that, but I have a feeling that due to the slowness of ghc-pkg's registering capabilities we'd want to keep the snapshot database concept.
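The per-package algorithm might look like the following pure decision core; the `~/.stack/immutable/<hash>` layout and all names here are assumptions drawn from this discussion, not a final design:

```haskell
-- What to do for one immutable package, given whether a build with the
-- same settings hash already exists under the immutable cache.
data CacheAction
  = UseCached FilePath         -- copy registration from this directory
  | BuildThenRegister FilePath -- build here first, then copy registration
  deriving (Eq, Show)

planImmutable :: FilePath -- ^ stack root, e.g. "~/.stack"
              -> String   -- ^ hash of package source, flags, dependencies
              -> Bool     -- ^ does the cache directory already exist?
              -> CacheAction
planImmutable root settingsHash alreadyBuilt
  | alreadyBuilt = UseCached dir
  | otherwise    = BuildThenRegister dir
  where
    dir = root ++ "/immutable/" ++ settingsHash
```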
The motivation for this came up when I was planning how to rewrite the Stackage build process to maximize sharing by switching over to Stack as the build tool, and realized this change would greatly help the Stackage build process.
Without thinking much about the details, this makes sense to me. I do think it's worth discussing:
> We first state that anything mutable, and anything that depends on a mutable package source, cannot be cached.
A common complaint is that switching between profiling enabled and disabled causes too many recompiles. With this proposal, at least hackage/git extra-deps won't have to be recompiled, but if your project has a lot of local packages there would still be a lot of recompilation. Some kind of solution to that would be great, but not sure if it should be part of this proposal or not.
I think the profiling issue is just a bug in Stack right now. While switching from no profiling to profiling should cause a recompile, the other way around shouldn't. That should reduce at least some of the friction in going back and forth with profiling builds. We could in principle have the same kind of logic at this cache layer.
Seems good to me! This seems very similar to how I was thinking implicit snapshots would work. Is the main difference that initially they get installed to their own package DB? This way they aren't stored under the path of a particular snapshot. If so, then I grok this and am :+1: on it.
I guess you could look at it that way, sure. Sounds like this should move ahead, no one's raising any concerns. If I get a chance to get started, I'll comment here. If someone else wants to take a stab at it instead, I'll be very happy to advise :).
This has really nice implications for intero's binary. 👍
I think overall this makes sense.
> Local file paths are mutable. We first state that anything mutable, and anything that depends on a mutable package source, cannot be cached.
Hmm, tinc caches local dependencies too. As I understand it, stack differentiates between project packages (packages) and dependencies (extra-deps). If a local dependency is specified through extra-deps, then why not cache it? Wouldn't that be easier, both in terms of code and documentation?
For completeness, here is how tinc calculates hashes for local dependencies:
- Run `cabal sdist` (this is to exclude uninteresting files and increase cache hits).
- For each file, compute `hash(hash(name), hash(content))` (the name is included to detect renaming and file concatenation or splitting) (https://github.com/sol/tinc/blob/8f8007e62d9953eca7cf1e3c6d9aefb988eda47a/src/Util.hs#L49).
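As a toy illustration of the `hash(hash(name), hash(content))` shape (tinc uses a real digest; the FNV-1a stand-in here is only to keep the example dependency-free):

```haskell
import Data.Bits (xor)
import Data.List (foldl')

-- FNV-1a over a String; a stand-in for a proper cryptographic hash.
fnv1a :: String -> Integer
fnv1a = foldl' step 0xcbf29ce484222325
  where
    step h c = ((h `xor` fromIntegral (fromEnum c)) * 0x100000001b3) `mod` (2 ^ 64)

-- hash(hash(name), hash(content)): folding the name in means that
-- renaming a file, or concatenating/splitting files, changes the result.
fileFingerprint :: FilePath -> String -> Integer
fileFingerprint name content =
  fnv1a (show (fnv1a name) ++ show (fnv1a content))
```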
> I have a feeling that due to the slowness of ghc-pkg's registering capabilities we'd want to keep the snapshot database concept.
Not sure about `ghc-pkg register`, but when I run `ghc-pkg recache` I don't have the feeling that it's particularly slow.
Isolated, per-project databases would certainly be neat (if that is what "local databases" means).
> I didn't have the feeling that this was particularly slow.
Numbers: registering 92 packages takes 195ms on my system.
Registering all transitive dependencies for stack takes 285ms (177 packages).
> If a local dependency is specified through extra-deps then why not cache it? Wouldn't that be easier, both in terms of code and documentation?
To clarify: the goal of this proposal is to make extra-deps cacheable. What we _don't_ want to allow is a local filepath (as opposed to say a package from Hackage or a Git repo) be cacheable. It's possible, by taking a hash of the entire directory contents, but it's rife with difficulties (like making sure we find all of the files that actually impact the build) and will likely end up with large amounts of garbage in the cache.
It could be that the sdist approach you mention would work, but how would Stack know when it needs to rerun the sdist command?
Your comments on recache are interesting, I wonder if we could use them for optimizing both registering _and_ unregistering. The latter is particularly slow right now.
I'm a Haskell beginner, but if possible, would like to contribute to implementing this solution. @snoyberg mentioned there's mentorship and I think I would be able to contribute with some help. This looks like a really neat feature!
I'm in the same boat as @DebugSteven - Haskell beginner that thinks they can contribute w/ some help. Since @DebugSteven "got here first", I'll assume they will work on the code but happy to contribute if I can or if plans change.
As I see it, there are going to be (at least) four pieces of work here worth doing:
Step (4) is the easiest to parallelize: once (1) is done, multiple people (myself included) can start using the new branch and report on issues. (2) is probably close to trivial. (3) is something I'm volunteering to do regularly, and encourage others to do as well.
So that leaves (1). I'd recommend only one person at a time work on it, and if that's @DebugSteven, cool. If it looks like you're stuck, or won't have time to continue, please ping this issue so others can jump in if desired. All that said, how the hell do we approach this issue? ;)
I've pushed a massively incomplete commit to the 3860-better-cache branch (d76790547c275ade52318b7003acc90c1965247b), which shows the parts of the code base that I think need to be touched. Essentially, here's what's going on:
- In the ConstructPlan module, we determine which packages are going to be used, and of those, which are already installed and which need to be built.
- For already installed packages, check whether they live under ~/.stack or not. If they _don't_, we should rebuild them, since we don't want a package in some other directory being depended on and then getting moved/deleted.
- Extend Stack.Build.Execute with functions that perform the new caching. This should _basically_ come down to installing into a new directory under ~/.stack (which hasn't been set up yet, your call on exactly what the directory structure looks like).

It's a bit hairy getting all of this right, but the prior art of the precompiled cache should be good guidance on moving forward here. Also, there's very little "smart" Haskell code here, this is basically a big ol' imperative mess :)
Does this give enough info to get started on? And feel free to ask questions at any point.
Yeah I think that's enough information to get started. I'll get started on this today and probably have some questions posted here later today.
How does Stack determine right now which packages need to be recompiled? The way I understand it is __some__ packages are stored in the cache. Before Stack recompiles any package it checks the cache to see if it’s already been compiled. Anything that has been changed (different version of a package, new dependency, etc.) in the project is recompiled (and the immutable dependencies are added to the cache).
Build.hs has configure options for how to build projects. What are the configure options and what, if anything, needs to be added for immutable packages? My guess is nothing. If we determine it's immutable we don't need to build it, which happens in ConstructPlan, and those packages are added to the cache.
It seems like the majority of the changes should be made in ConstructPlan.hs, specifically addFinal, addDep, installPackage, installPackageGivenDeps, and addPackageDeps (I think). Based on my digging around it seems like 1 change would happen in Build.hs and a couple of changes in the Packages.hs file, but maybe not for this particular issue. I want to make sure I stay in the scope of this problem.
What is the package index in "look up in the package index and see if there's a recommendation available" and what does that mean in terms of adding code to accomplish seeing if there’s a recommendation? The first part of the question might lead me to answer the second part.
Great questions. One from me in return: when you say Build.hs, what's the full module name you're referring to?
Alright, some background. A _package database_ is a concept from GHC itself. It contains information on installed libraries. Every project in Stack has three package databases it cares about: the global database (shipped with GHC), the snapshot database, and the local (project-specific) database.
When we get to ConstructPlan, Stack has already determined two pieces of information:
stack.yaml configuration. This will include snapshot packages and local packages.

In ConstructPlan, Stack determines that a package needs to be built if both of the following are true:
stack build foo, and foo depends on bar, then bar is depended on. If bar depends on baz, then baz is depended on too.

ConstructPlan figures out all of this today, and determines which _tasks_ exist. The task includes information like the configure options, which are the flags passed in to the ./Setup.hs configure command, and indicate things like extra GHC options, flags, whether profiling is enabled, etc.
Today, with the precompiled cache: in the Stack.Build.Execute, when processing a Task, Stack will first check if a package with the same set of configure options (and everything else uniquely identifying an installed package) has already been built in another snapshot. If so: it will copy the registration from the other snapshot into the current snapshot, bypassing the need to recompile.
The goal here would be that, when reaching this phase, Stack will instead:
~/.stack/..., otherwise users may accidentally delete files we need.

I think your intuition is correct, and most changes will occur in ConstructPlan. I _do_ recommend staying in scope to avoid driving yourself crazy, but if you see related things you'd like to take a crack at changing, I definitely will welcome other PRs: the ConstructPlan module is well known as a mess :)
I hope this helps, let me know if I can further clarify.
I just had a call with @DebugSteven to go over things. One takeaway we had is that, instead of focusing on the code in ConstructPlan, going straight to the relevant code in Stack.Build.Execute may make more sense. That code already tracks whether it should use a cache to install, and currently uses snapshot-based logic. Instead, moving over to paying attention to whether the package is immutable or not may be the majority of the work for this issue.
One other thing that popped up was ConfigureOpts. Let's rewind a bit: the way the cache works today, and needs to continue working in the future, is to only use a cached build if all of its settings are the same. This means that it's _not_ sufficient that mtl-2.2.1 (as an example) has been built before. We need to ensure:
mtl-2.2.1

This is handled today by the signature of the precompiled cache functions, which as an example look like:
```haskell
readPrecompiledCache :: forall env. HasEnvConfig env
                     => PackageLocationIndex FilePath -- ^ target package
                     -> ConfigureOpts
                     -> Set GhcPkgId -- ^ dependencies
                     -> RIO env (Maybe PrecompiledCache)
```
Note that the three arguments represent the three things mentioned above.
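As a sketch of how those three inputs could collapse into a single lookup key (hypothetical `mkCacheKey`; Stack's real cache keys are structured differently):

```haskell
import Data.List (intercalate, sort)

newtype CacheKey = CacheKey String
  deriving (Eq, Show)

-- Combine package identity, configure options, and dependency ids into
-- one key. Sorting makes the key order-independent, so reordering flags
-- or dependencies does not spuriously invalidate the cache.
mkCacheKey :: String   -- ^ rendered package location, e.g. "mtl-2.2.1@hackage"
           -> [String] -- ^ cache-relevant configure options
           -> [String] -- ^ rendered GhcPkgIds of dependencies
           -> CacheKey
mkCacheKey loc opts deps =
  CacheKey (intercalate "\n" (loc : sort opts ++ sort deps))
```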
One thing that I think may need revisiting is the definition of ConfigureOpts itself. Today, it looks like this:
```haskell
data ConfigureOpts = ConfigureOpts
  { coDirs :: ![String]
  -- ^ Options related to various paths. We separate these out since they do
  -- not have an impact on the contents of the compiled binary for checking
  -- if we can use an existing precompiled cache.
  , coNoDirs :: ![String]
  }
```
We separate out configure options which refer to specific file paths, since they shouldn't invalidate the cache. For example, if a package was installed into an LTS 8.12 specific directory, that shouldn't preclude it from being used for LTS 8.13.
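The coDirs/coNoDirs split amounts to partitioning the flag list on whether a flag names an installation path. A toy version (the real classification in Stack covers more flags than the four shown here):

```haskell
import Data.List (isPrefixOf, partition)

-- Split configure options into path-dependent flags (which should not
-- invalidate the cache) and everything else (which should).
splitConfigureOpts :: [String] -> ([String], [String])
splitConfigureOpts = partition isDirOpt
  where
    -- real ./Setup.hs configure flags that embed installation paths
    isDirOpt opt = any (`isPrefixOf` opt)
      ["--libdir", "--bindir", "--datadir", "--docdir"]
```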
With the new approach being discussed here, we may have a better method. Instead of using paths which refer to the snapshot, we can use paths which install into cache-specific directories. This will avoid the need for some awkward configure-option splits, and provide an easy way to ensure that all packages that end up in the cache are fully installed inside ~/.stack.
@DebugSteven If this idea is still hazy, don't worry about it too much. We should be able to get things working by just attacking the Execute module.
@snoyberg @DebugSteven Glad you all are syncing up;
I've been looking around and decided I'd try my hand at incorporating the information about the immutable directory into PathInfo. This would eventually allow us to do something along the lines of:
stack --path --immutable-cache-dir => ...
stack cache-status => [some cache related information, such as the subset of dependencies currently cached]
Trying to figure out integration with the lenses by way of adding one or more of the polymorphic env constraints:
HasRunner env
HasCabalLoader env
HasEnvConfig env
The current head: https://github.com/elrikdante/stack/commit/e1f4fd04d41d04dc2d79fc9bd2e93566b6454879
also please ping me next time you all are doing a call - glad to help!
@snoyberg
Thanks for the descriptions; thorough and helpful for jogging into the right context.
> Note that the three arguments represent the three things mentioned above.
```haskell
readPrecompiledCache :: forall env. HasEnvConfig env
                     => PackageLocationIndex FilePath -- ^ target package
                     -> ConfigureOpts
                     -> Set GhcPkgId -- ^ dependencies
                     -> RIO env (Maybe PrecompiledCache)
```
via
`./src/Stack/Build/ConstructPlan.hs:368: packageIsImmutable :: PackageLocationIndex FilePath -> AllImmutable`
We could have a situation like this (toplevel or within the body doesn't matter; I don't need to change signatures to feel useful =)
```haskell
readPrecompiledCache :: forall env. HasEnvConfig env
                     => AllImmutable
                     -> ConfigureOpts
                     -> Set GhcPkgId -- ^ dependencies
                     -> RIO env (Maybe PrecompiledCache)
```
Just asking a clarifying question:
Is it possible to get to a point where we're better than nix and where stack can reuse a cache entry that is out of date? I.e. when nix can't find anything in the cache, it will rebuild that part.
However, bazel will rebuild only the individual files that have changed.
A midpoint between the two is if stack would use the object files from an out-of-date build, look at timestamps, and only rebuild those files that need to be rebuilt (`ghc --make`-like).
A lot of the compilation in a normal devel cycle happens at the leaves (the `packages:` part). Optimizing this is also part of the caching problem IMHO.
If I understand the question correctly, you're talking about bypassing the need to fully recompile local packages in a project when dependencies change. Unfortunately, with the current build system as it is, we have very little control over that. We essentially just shell out to ./Setup.hs build and hope for the best. There are cases where it will be able to reuse existing object files, and other cases where it will unnecessarily rebuild them. Some of this is due to functionality in Cabal (like changes to macro files), others due to functionality in GHC (like Template Haskell recompilation detection).
I _think_ we're overall better set up for this than Nix, because we at least have a chance of reusing previously built object files, but I have to admit a large level of ignorance in how the Nix build process works, and what options there are for modifying it.
The nix process is quite transparent to me - they basically start an isolated workspace with all dependencies available, run some command (similar to ./Setup.hs build) and hope for the best.
stack is not that transparent to me. It could create an isolated workspace that includes a copy of the previous build, including setting timestamps so a traditional make-based flow would do the right thing, and hope that ./Setup.hs build works like make and would not rebuild everything.
Does stack do this, or should it?
This is probably getting off topic for this issue, so I'd rather not go into too much detail here (the topic at hand is already complex enough). The basic answer is: for packages which are located on disk already, there is no isolated workspace created, everything is built in the local directory. For packages which come from Hackage, it will unpack into a temporary directory. For Git repos, it will clone into an isolated location. So in theory, object files from Git repos and local packages can be reused, but not for Hackage packages.
I just came across this and noticed pier is not mentioned. It seems to be a great start at a more nix-like caching approach: http://hackage.haskell.org/package/pier (there's a useful 15m intro).
Wow, pier looks really cool at least based on the readme. I'm really glad someone is exploring that design space.
Closing in favor of #3922.