Pinging @mgsloan, @borsboom, and @sol. This is a discussion issue related to the current precompiled cache, nix-style builds, and implicit snapshots.
__Goal__ Maximize the amount of binary caching Stack is able to do.
Two approaches considered previously:

__Approach 1__ Implicit snapshots. Today, adding an extra-dep package at the base of a large dependency hierarchy forces everything that depends on it out of the snapshot. We could instead derive an implicit snapshot from the extra-deps in the stack.yaml file. This will promote many more packages in the configuration from local to snapshot packages. Besides implementation complexity, the biggest downside is disk space usage: each time you tweak extra-deps and rebuild, you'll get binary builds in ~/.stack which last forever.

__Approach 2__ Nix-style builds/package databases. GHC 8.0 (IIRC) added the ability for package databases to allow more than one copy of a package/version combo in a database. cabal new-build uses this to get greater sharing of binary builds than Stack. I'm not familiar with some of the details of how this is implemented (such as how this would work with a Git-based package installation), but at least three downsides are:

- we can no longer trivially use runghc like we can with Stack, since a single database is no longer a consistent view

Now I'm going to offer a new approach, approach 3. We have both _immutable_ and _mutable_ package sources. Local file paths are mutable. We first state that anything mutable, and anything that depends on a mutable package source, cannot be cached. (We _could_ get into arguments about caching based on the file hashes, but I think it's overall not worth it.) That leaves us with immutable package sources, where the package contents come from Hackage, some archive, or a Git/Mercurial repository.
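The "mutable taints everything that depends on it" rule can be sketched as a small predicate. This is an illustration only; `Source`, `cacheable`, and the assumption of an acyclic dependency graph are mine, not Stack's actual types:

```haskell
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Hypothetical package sources; local file paths are the mutable case.
data Source = FromHackage | FromGit | FromArchive | LocalFilePath
  deriving (Eq, Show)

immutableSource :: Source -> Bool
immutableSource LocalFilePath = False
immutableSource _             = True

-- A package is cacheable only if its own source is immutable and every
-- (transitive) dependency is cacheable too. Assumes an acyclic graph.
cacheable :: Map String (Source, [String]) -> String -> Bool
cacheable graph name =
  case Map.lookup name graph of
    Nothing          -> False -- unknown package: be conservative
    Just (src, deps) -> immutableSource src && all (cacheable graph) deps
```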
We no longer care if these packages are in snapshots or local databases, that detail is totally irrelevant. Every time we build one of these packages, we install its library and other data files into a directory under ~/.stack, let's say ~/.stack/immutable, with a directory structure that uses hashes to fully encapsulate all of the settings for this package (package source itself, dependencies, flags, etc). In addition, we register the package into its own package database inside that new directory.
We keep snapshot database and local database logic the same. But now, instead of rebuilding packages in local databases, or having special logic for the precompiled cache, we have a simple algorithm we follow every time we build an immutable package:
1. Check whether this exact build (source, flags, dependencies) already exists under ~/.stack/immutable.
2. If not, build it and install it under ~/.stack/immutable.
3. Copy the registration from the ~/.stack/immutable package database to the appropriate snapshot or local database.

There's also some bikeshedding around whether we would even need a snapshot database after this, or if we can simply always use local databases. I'm not too terribly invested in that, but I have a feeling that due to the slowness of ghc-pkg's registering capabilities we'd want to keep the snapshot database concept.
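The per-package algorithm might look like the following pure decision core; the `~/.stack/immutable/<hash>` layout and all names here are assumptions drawn from this discussion, not a final design:

```haskell
-- What to do for one immutable package, given whether a build with the
-- same settings hash already exists under the immutable cache.
data CacheAction
  = UseCached FilePath         -- copy registration from this directory
  | BuildThenRegister FilePath -- build here first, then copy registration
  deriving (Eq, Show)

planImmutable :: FilePath -- ^ stack root, e.g. "~/.stack"
              -> String   -- ^ hash of package source, flags, dependencies
              -> Bool     -- ^ does the cache directory already exist?
              -> CacheAction
planImmutable root settingsHash alreadyBuilt
  | alreadyBuilt = UseCached dir
  | otherwise    = BuildThenRegister dir
  where
    dir = root ++ "/immutable/" ++ settingsHash
```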
The motivation for this came up when I was planning how to rewrite the Stackage build process to maximize sharing by switching over to Stack as the build tool, and realized this change would greatly help the Stackage build process.
Without thinking much about the details, this makes sense to me. I do think it's worth discussing:
> We first state that anything mutable, and anything that depends on a mutable package source, cannot be cached.
A common complaint is that switching between profiling enabled and disabled causes too many recompiles. With this proposal, at least hackage/git extra-deps won't have to be recompiled, but if your project has a lot of local packages there would still be a lot of recompilation. Some kind of solution to that would be great, but not sure if it should be part of this proposal or not.
I think the profiling issue is just a bug in Stack right now. While switching from no profiling to profiling should cause a recompile, the other way around shouldn't. That should reduce at least some of the friction in going back and forth with profiling builds. We could in principle have the same kind of logic at this cache layer.
Seems good to me! This seems very similar to how I was thinking implicit snapshots would work. Is the main difference that initially they get installed to their own package DB? This way they aren't stored under the path of a particular snapshot. If so, then I grok this and am :+1: on it.
I guess you could look at it that way, sure. Sounds like this should move ahead, no one's raising any concerns. If I get a chance to get started, I'll comment here. If someone else wants to take a stab at it instead, I'll be very happy to advise :).
This has really nice implications for intero's binary. 👍
I think overall this makes sense.
> Local file paths are mutable. We first state that anything mutable, and anything that depends on a mutable package source, cannot be cached.
Hmm, tinc caches local dependencies too. As I understand it, stack differentiates between project packages (packages) and dependencies (extra-deps). If a local dependency is specified through extra-deps, then why not cache it? Wouldn't that be easier, both in terms of code and documentation?
For completeness, here is how tinc calculates hashes for local dependencies:
- Run `cabal sdist` (this is to exclude uninteresting files and increase cache hits).
- For each file, compute `hash(hash(name), hash(content))` (the name is included to detect renaming and file concatenation or splitting) (https://github.com/sol/tinc/blob/8f8007e62d9953eca7cf1e3c6d9aefb988eda47a/src/Util.hs#L49).
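As a toy illustration of the `hash(hash(name), hash(content))` shape (tinc uses a real digest; the FNV-1a stand-in here is only to keep the example dependency-free):

```haskell
import Data.Bits (xor)
import Data.List (foldl')

-- FNV-1a over a String; a stand-in for a proper cryptographic hash.
fnv1a :: String -> Integer
fnv1a = foldl' step 0xcbf29ce484222325
  where
    step h c = ((h `xor` fromIntegral (fromEnum c)) * 0x100000001b3) `mod` (2 ^ 64)

-- hash(hash(name), hash(content)): folding the name in means that
-- renaming a file, or concatenating/splitting files, changes the result.
fileFingerprint :: FilePath -> String -> Integer
fileFingerprint name content =
  fnv1a (show (fnv1a name) ++ show (fnv1a content))
```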
> I have a feeling that due to the slowness of ghc-pkg's registering capabilities we'd want to keep the snapshot database concept.
Not sure about `ghc-pkg register`, but when I run `ghc-pkg recache` I don't have the feeling that it's particularly slow.
Isolated, per-project databases would certainly be neat (if that is what "local databases" means).
> I didn't have the feeling that this was particularly slow.
Numbers: registering 92 packages takes 195ms on my system.
Registering all transitive dependencies for stack takes 285ms (177 packages).
> If a local dependency is specified through extra-deps then why not cache it? Wouldn't that be easier, both in terms of code and documentation?
To clarify: the goal of this proposal is to make extra-deps cacheable. What we _don't_ want to allow is a local filepath (as opposed to say a package from Hackage or a Git repo) be cacheable. It's possible, by taking a hash of the entire directory contents, but it's rife with difficulties (like making sure we find all of the files that actually impact the build) and will likely end up with large amounts of garbage in the cache.
It could be that the sdist approach you mention would work, but how would Stack know when it needs to rerun the sdist command?
Your comments on recache are interesting, I wonder if we could use them for optimizing both registering _and_ unregistering. The latter is particularly slow right now.
I'm a Haskell beginner, but if possible, would like to contribute to implementing this solution. @snoyberg mentioned there's mentorship and I think I would be able to contribute with some help. This looks like a really neat feature!
I'm in the same boat as @DebugSteven - Haskell beginner that thinks they can contribute w/ some help. Since @DebugSteven "got here first", I'll assume they will work on the code but happy to contribute if I can or if plans change.
As I see it, there are going to be (at least) four pieces of work here worth doing:
Step (4) is the easiest to parallelize: once (1) is done, multiple people (myself included) can start using the new branch and report on issues. (2) is probably close to trivial. (3) is something I'm volunteering to do regularly, and encourage others to do as well.
So that leaves (1). I'd recommend only one person at a time work on it, and if that's @DebugSteven, cool. If it looks like you're stuck, or won't have time to continue, please ping this issue so others can jump in if desired. All that said, how the hell do we approach this issue? ;)
I've pushed a massively incomplete commit to the 3860-better-cache branch (d76790547c275ade52318b7003acc90c1965247b), which shows the parts of the code base that I think need to be touched. Essentially, here's what's going on:
- In the ConstructPlan module, we determine which packages are going to be used, and of those, which are already installed and which need to be built.
- For already installed packages, check whether they live under ~/.stack or not. If they _don't_, we should rebuild them, since we don't want a package in some other directory being depended on and then getting moved/deleted.
- Extend Stack.Build.Execute with functions that perform the new caching. This should _basically_ come down to installing into a new directory under ~/.stack (which hasn't been set up yet, your call on exactly what the directory structure looks like).

It's a bit hairy getting all of this right, but the prior art of the precompiled cache should be good guidance on moving forward here. Also, there's very little "smart" Haskell code here, this is basically a big ol' imperative mess :)
Does this give enough info to get started on? And feel free to ask questions at any point.
Yeah I think that's enough information to get started. I'll get started on this today and probably have some questions posted here later today.
How does Stack determine right now which packages need to be recompiled? The way I understand it is __some__ packages are stored in the cache. Before Stack recompiles any package it checks the cache to see if it’s already been compiled. Anything that has been changed (different version of a package, new dependency, etc.) in the project is recompiled (and the immutable dependencies are added to the cache).
Build.hs has configure options for how to build projects. What are the configure options and what, if anything, needs to be added for immutable packages? My guess is nothing. If we determine it's immutable we don't need to build it, which happens in ConstructPlan, and those packages are added to the cache.
It seems like the majority of the changes should be made in ConstructPlan.hs, specifically addFinal, addDep, installPackage, installPackageGivenDeps, and addPackageDeps (I think). Based on my digging around it seems like 1 change would happen in Build.hs and a couple of changes in the Packages.hs file, but maybe not for this particular issue. I want to make sure I stay in the scope of this problem.
What is the package index in "look up in the package index and see if there's a recommendation available" and what does that mean in terms of adding code to accomplish seeing if there’s a recommendation? The first part of the question might lead me to answer the second part.
Great questions. One from me in return: when you say Build.hs, what's the full module name you're referring to?
Alright, some background. A _package database_ is a concept from GHC itself. It contains information on installed libraries. Every project in Stack has three package databases it cares about: the global database (shipped with GHC), the snapshot database, and the local (project-specific) database.
When we get to ConstructPlan, Stack has already determined two pieces of information:
stack.yaml configuration. This will include snapshot packages and local packages.

In ConstructPlan, Stack determines that a package needs to be built if both of the following are true:
stack build foo, and foo depends on bar, then bar is depended on. If bar depends on baz, then baz is depended on too.

ConstructPlan figures out all of this today, and determines which _tasks_ exist. The task includes information like the configure options, which are the flags passed in to the ./Setup.hs configure command, and indicate things like extra GHC options, flags, whether profiling is enabled, etc.
Today, with the precompiled cache: in the Stack.Build.Execute, when processing a Task, Stack will first check if a package with the same set of configure options (and everything else uniquely identifying an installed package) has already been built in another snapshot. If so: it will copy the registration from the other snapshot into the current snapshot, bypassing the need to recompile.
The goal here would be that, when reaching this phase, Stack will instead:
~/.stack/..., otherwise users may accidentally delete files we need.

I think your intuition is correct, and most changes will occur in ConstructPlan. I _do_ recommend staying in scope to avoid driving yourself crazy, but if you see related things you'd like to take a crack at changing, I definitely will welcome other PRs: the ConstructPlan module is well known as a mess :)
I hope this helps, let me know if I can further clarify.
I just had a call with @DebugSteven to go over things. One takeaway we had is that, instead of focusing on the code in ConstructPlan, going straight to the relevant code in Stack.Build.Execute may make more sense. That code already tracks whether it should use a cache to install, and currently uses snapshot-based logic. Instead, moving over to paying attention to whether the package is immutable or not may be the majority of the work for this issue.
One other thing that popped up was ConfigureOpts. Let's rewind a bit: the way the cache works today, and needs to continue working in the future, is to only use a cached build if all of its settings are the same. This means that it's _not_ sufficient that mtl-2.2.1 (as an example) has been built before. We need to ensure:
mtl-2.2.1

This is handled today by the signature of the precompiled cache functions, which as an example look like:
```haskell
readPrecompiledCache :: forall env. HasEnvConfig env
                     => PackageLocationIndex FilePath -- ^ target package
                     -> ConfigureOpts
                     -> Set GhcPkgId -- ^ dependencies
                     -> RIO env (Maybe PrecompiledCache)
```
Note that the three arguments represent the three things mentioned above.
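As a sketch of how those three inputs could collapse into a single lookup key (hypothetical `mkCacheKey`; Stack's real cache keys are structured differently):

```haskell
import Data.List (intercalate, sort)

newtype CacheKey = CacheKey String
  deriving (Eq, Show)

-- Combine package identity, configure options, and dependency ids into
-- one key. Sorting makes the key order-independent, so reordering flags
-- or dependencies does not spuriously invalidate the cache.
mkCacheKey :: String   -- ^ rendered package location, e.g. "mtl-2.2.1@hackage"
           -> [String] -- ^ cache-relevant configure options
           -> [String] -- ^ rendered GhcPkgIds of dependencies
           -> CacheKey
mkCacheKey loc opts deps =
  CacheKey (intercalate "\n" (loc : sort opts ++ sort deps))
```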
One thing that I think may need revisiting is the definition of ConfigureOpts itself. Today, it looks like this:
```haskell
data ConfigureOpts = ConfigureOpts
  { coDirs :: ![String]
  -- ^ Options related to various paths. We separate these out since they do
  -- not have an impact on the contents of the compiled binary for checking
  -- if we can use an existing precompiled cache.
  , coNoDirs :: ![String]
  }
```
We separate out configure options which refer to specific file paths, since they shouldn't invalidate the cache. For example, if a package was installed into an LTS 8.12 specific directory, that shouldn't preclude it from being used for LTS 8.13.
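The coDirs/coNoDirs split amounts to partitioning the flag list on whether a flag names an installation path. A toy version (the real classification in Stack covers more flags than the four shown here):

```haskell
import Data.List (isPrefixOf, partition)

-- Split configure options into path-dependent flags (which should not
-- invalidate the cache) and everything else (which should).
splitConfigureOpts :: [String] -> ([String], [String])
splitConfigureOpts = partition isDirOpt
  where
    -- real ./Setup.hs configure flags that embed installation paths
    isDirOpt opt = any (`isPrefixOf` opt)
      ["--libdir", "--bindir", "--datadir", "--docdir"]
```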
With the new approach being discussed here, we may have a better method. Instead of using paths which refer to the snapshot, we can use paths which install into cache-specific directories. This will avoid the need for some awkward configure-option splits, and provide an easy way to ensure that all packages that end up in the cache are fully installed inside ~/.stack.
@DebugSteven If this idea is still hazy, don't worry about it too much. We should be able to get things working by just attacking the Execute module.
@snoyberg @DebugSteven Glad you all are syncing up;
I've been looking around and decided I'd try my hand at incorporating the information about the immutable directory into PathInfo. This would eventually allow us to do something along the lines of:
stack --path --immutable-cache-dir => ...
stack cache-status => [some cache related information, such as the subset of dependencies currently cached]
Trying to figure out integration with the lenses by way of adding one or more of the polymorphic env constraints:
HasRunner env
HasCabalLoader env
HasEnvConfig env
The current head: https://github.com/elrikdante/stack/commit/e1f4fd04d41d04dc2d79fc9bd2e93566b6454879
also please ping me next time you all are doing a call - glad to help!
@snoyberg
Thanks for the descriptions; thorough and helpful for jogging into the right context.
> Note that the three arguments represent the three things mentioned above.
```haskell
readPrecompiledCache :: forall env. HasEnvConfig env
                     => PackageLocationIndex FilePath -- ^ target package
                     -> ConfigureOpts
                     -> Set GhcPkgId -- ^ dependencies
                     -> RIO env (Maybe PrecompiledCache)
```
via
`./src/Stack/Build/ConstructPlan.hs:368: packageIsImmutable :: PackageLocationIndex FilePath -> AllImmutable`
We could have a situation like this (toplevel or within the body doesn't matter; I don't need to change signatures to feel useful =)
```haskell
readPrecompiledCache :: forall env. HasEnvConfig env
                     => AllImmutable
                     -> ConfigureOpts
                     -> Set GhcPkgId -- ^ dependencies
                     -> RIO env (Maybe PrecompiledCache)
```
Just asking a clarifying question:
Is it possible to get to a point where we're better than nix and where stack can reuse a cache entry that is out of date? I.e. when nix can't find anything in the cache, it will rebuild that part.
However, bazel will rebuild only the individual files that have changed.
A midpoint between the two is if stack would use the object files from an out-of-date build, look at timestamps, and only rebuild those files that need to be rebuilt (`ghc --make`-like).
A lot of the compilation in a normal devel cycle happens at the leaves (the `packages:` part). Optimizing this is also part of the caching problem IMHO.
If I understand the question correctly, you're talking about bypassing the need to fully recompile local packages in a project when dependencies change. Unfortunately, with the current build system as it is, we have very little control over that. We essentially just shell out to ./Setup.hs build and hope for the best. There are cases where it will be able to reuse existing object files, and other cases where it will unnecessarily rebuild them. Some of this is due to functionality in Cabal (like changes to macro files), others due to functionality in GHC (like Template Haskell recompilation detection).
I _think_ we're overall better set up for this than Nix, because we at least have a chance of reusing previously built object files, but I have to admit a large level of ignorance in how the Nix build process works, and what options there are for modifying it.
The nix process is quite transparent to me - they basically start an isolated workspace with all dependencies available, run some command (similar to ./Setup.hs build) and hope for the best.
stack is not that transparent to me. It could create an isolated workspace that includes a copy of the previous build, including setting timestamps so a traditional make-based flow would do the right thing, and hope that ./Setup.hs build works like make and would not rebuild everything.
Does stack do this, or should it?
This is probably getting off topic for this issue, so I'd rather not go into too much detail here (the topic at hand is already complex enough). The basic answer is: for packages which are located on disk already, there is no isolated workspace created, everything is built in the local directory. For packages which come from Hackage, it will unpack into a temporary directory. For Git repos, it will clone into an isolated location. So in theory, object files from Git repos and local packages can be reused, but not for Hackage packages.
I just came across this and noticed pier is not mentioned. It seems to be a great start at a more nix-like caching approach: http://hackage.haskell.org/package/pier (there's a useful 15m intro).
Wow, pier looks really cool at least based on the readme. I'm really glad someone is exploring that design space.
Closing in favor of #3922.