Dhall-haskell: Proposal: Semi-semantic caching

Created on 12 Jul 2019 · 21Comments · Source: dhall-lang/dhall-haskell

Introduction

I propose ‘semi-semantic’ hashing as a solution to caching of non-frozen imports. The semi-semantic approach to caching is motivated by the fact that parsing is computationally trivial compared to typechecking and normalisation. See #888 for the detailed reasoning.

I use semi-semantic to mean that any two semantically different (i.e. differing in their normal forms) inputs are mapped to distinct hashes, while semantically equivalent inputs that differ syntactically may result in different hashes. We could compute the semi-semantic hash of a dhall import/file/expression as follows:

Parse the input;
compute the semantic hashes of all imports referenced in the AST, i.e. the hashes of their normal forms;
compute the syntactic hash of the input (for example hashing the unparsed input, or alternatively by hashing the parsed AST);
concatenate the syntactic hash of the input with the semantic hashes of its imports and hash the result.

Update

The current implementation #1154 substitutes any imports by their normal form and hashes the resulting expression to obtain a semi-semantic hash. This is easier to implement and, since we store normal forms anyway, only incurs a constant-factor cost.

Discussion

A cache indexed by semi-semantic hashes has the following properties:

For a given input we can quickly find out if it is in the cache: we only need to parse the input – we don't need to typecheck or normalise it!
The cache stays consistent, that is, we don't need to ‘invalidate’ old cache entries if their dependencies change!

Comparison to alternative approaches:

Naive syntactic hashing is problematic because two syntactically equivalent dhall expressions may have different semantics depending on the context (e.g. if they contain relative imports).
We can fix syntactic hashing by ‘resolving’ the absolute location of all imports before hashing. However, in this case we still need to worry about invalidating cached results if their imports change.
Local caching as implemented in #1073. Here we simply dump all loaded imports after each invocation of dhall. The problem here is again keeping the cache consistent; additionally, the user experience is suboptimal since for example changing to a different directory causes dhall to start from scratch with an empty cache. Also, we leave cache files everywhere.

The main disadvantage of a ‘semi-semantic’ approach is that it

incures a slight performance penalty when evaluating an expression with a clean cache since every import needs to be separately encoded and written to disk (currently we only ever encode and write ‘frozen’ imports to disk)
is slightly slower than a plain syntactic hashing approach since all transitive imports (except for frozen imports loaded from the semantic cache) have to be loaded as well. Using a syntactic cache we would only have to check their timestamps (or perhaps syntactic hashes) to see if they changed.

Preliminary benchmarks

cpk: Cloned from https://github.com/vmchale/cpkg. I changed any references to the dhall prelude to point to a local copy. I tested dhall hash <<< pkgs/pkg-set.dhall
- ‘main branch’: 14s with clean cache, same with populated cache.
- ‘cache dump’ syntactic: 14s with clean cache, ca. 3.6s with populated cache.
- ‘semi-semantic’ 15.5-16s with clean cache, ca. 4.6s with populated cache.
dhall-nethack: Cloned from https://github.com/dhall-lang/dhall-nethack. I set the environment variable ‘DHALL-PRELUDE’ to point at a local copy of the prelude.
- ‘main branch’: 0.36s with clean cache, same with populated cache.
- ‘cache dump’ syntactic: 0.36s with clean cache, 0.18s with populated cache.
- ‘semi-semantic’ 0.42s with clean cache, 0.25s with populated cache.

Merge Plan

This is assuming public opinion turns out in favour of this proposal.

Change Dhall.Import from the current ‘imperative’ approach to a recursive one. This is needed because in order to be able to query the semi-semantic cache we already need to know the values of all transitive imports; currently, imports are first loaded in their unevaluated parsed form, and their imports loaded in a next step.

This will be quite a big change (cf. my prototype over at #1095). At the same time this should not have any performance impact.
Implement semi-semantic caching. This is straightforward assuming Step 1.
Discuss ‘unsafe’ caching of remote imports as a possible next step.

Source

EggBaconAndSpam

Most helpful comment

The PR implementing the proposal (#1154) has been merged! It will be interesting to see whether this makes a noticable difference on real-life dhall code.

Note that I didn't implement any kind of "unsafe" caching, like caching of remote imports. That is left for the future (I'm open to ideas!).

EggBaconAndSpam on 26 Jul 2019

🎉2

All 21 comments

incures a slight performance penalty when evaluating an expression with a clean cache since every import needs to be separately encoded and written to disk (currently we only ever encode and write ‘frozen’ imports to disk)

I think this is a fine tradeoff. My preference is that we optimize for the speed of interpretation when the cache is full rather than empty.

I'm also really surprised that this leads to a 4x improvement for the cpkg example. In particular, if I take the cpkg repository and freeze everything using semantic integrity checks I still don't get that large of an improvement. (Not that I'm going to complain about a 4x improvement :slightly_smiling_face: )

Gabriel439 on 13 Jul 2019

The secret here is that I'm benchmarking the hash subcommand! Running dhall <<< "~/tmp/cpkg/pkgs/pkg-set.dhall" > /dev/zero also pretty-prints the result, which dominates the runtime; here I get 13.4s vs 23s.

As an interesting side-note: when I load the file using --file directly, I get a 1.5s slowdown with semi-semantic caching. I wouldn't expect a speedup, because here the big expression is still typechecked and normalised (and not loaded from cache), but a slowdown is slightly unexpected...

EggBaconAndSpam on 13 Jul 2019

I'm still skeptical about caching normal forms ("still" referring to my previous Discourse post on this topic which was accidentally nuked). Even if we can skip rechecking cached normal forms, I expect that the performance gains will be negative in larger projects.

One issue is sharing loss and code bloat. As an analogue, consider normalizing every definition in a Haskell program. If we have something like \x -> let y = f x in replicate 100 y, then the normal form of this expression is a function which recomputes f x 100 times, while the original code computes it once. When GHC performs inlining and specialization, it transforms towards normal forms, but often it also transforms away from normal forms, by CSE or let-floating, in order to increase sharing. Source code written by humans is as a rule of thumb acceptably good, which is why -O0 and bytecode interpretation for Haskell works reasonably well, without any code transformation. In comparison, normal forms of non-trivial programs are almost always pathological.

I also find it plausible that for larger projects, just parsing normal cached expressions can be more expensive than checking source expressions.

AndrasKovacs on 13 Jul 2019

Should we wait until we can benchmark with the new NbE before going forward with this? (Because it involves big code changes and we might realise it's actually a net performance loss with the new checking)

f-f on 13 Jul 2019

To be clear, caching non-normal expressions would always be a performance improvement! The issue, in short, is that normalization degrades code quality compared to source expressions, with respect to evaluation.

AndrasKovacs on 13 Jul 2019

@EggBaconAndSpam: Ah! I forgot that your branch was caching the ./pkg/pkg-set.dhall import. I think that's the root of the large performance increase, because when I freeze the ./pkg/pkg-set.dhall import then I also get a speedup, although not as large (13.9 s → 6.2 s)

Gabriel439 on 13 Jul 2019

Also, this is a little out of scope, but I wanted to note that there is a lot of low-hanging fruit for improving decoding speed (regardless of what we cache). For example, the current decoder actually decodes into a Term and then converts that to an Expr, but I imagine it would be much faster if we decoded straight from ByteString to Expr.

For example, the current decoder is decoding at a rate of ~1 MB/s:

$ dhall hash <<< './dhall/cpkg-prelude.dhall' 
sha256:f6c3202a7c91b84004811f37f5f39f19d761db4f99a8d176379c9e9a8634c036

$ du -hs ~/.cache/dhall/f6c3202a7c91b84004811f37f5f39f19d761db4f99a8d176379c9e9a8634c036 
472K    /home/gabriel/.cache/dhall/f6c3202a7c91b84004811f37f5f39f19d761db4f99a8d176379c9e9a8634c036

$ time dhall decode < ~/.cache/dhall/f6c3202a7c91b84004811f37f5f39f19d761db4f99a8d176379c9e9a8634c036 > /dev/null

real    0m0.424s
user    0m0.414s
sys     0m0.010s

... which seems to me much slower than it should be. I'm pretty sure that it would be faster if we skipped the Term intermediate and the decoding speed should really be in the range of at least 10-100 MB/s (which is what we would get from using JSON).

Gabriel439 on 13 Jul 2019

👍1

I'm still skeptical about caching normal forms ("still" referring to my previous Discourse post on this topic which was accidentally nuked). Even if we can skip rechecking cached normal forms, I expect that the performance gains will be negative in larger projects.

Good point!

To be clear, caching non-normal expressions would always be a performance improvement! The issue, in short, is that normalization degrades code quality compared to source expressions, with respect to evaluation.

I think this is what we should be going for in the end, i.e. to use caching to avoid typechecking known-good code again and again.

For the time being however, since normalising imports is already the current behaviour, implementing my proposal "naively" should not lead to any further performance degradation. Switching to a less naive approach to import resolution would then be a separate problem for later :)

EggBaconAndSpam on 15 Jul 2019

Should we wait until we can benchmark with the new NbE before going forward with this? (Because it involves big code changes and we might realise it's actually a net performance loss with the new checking)

I think restructuring Dhall.Import will be necessary in the long run anyway.

I will work on a few PRs to implement the necessary changes in a hopefully less invasive way than my prototype code did :)

EggBaconAndSpam on 15 Jul 2019

I think @AndrasKovacs makes some really good points, @Gabriel439 what are your thoughts on them?

Profpatsch on 16 Jul 2019

@Profpatsch @AndrasKovacs: I think the main issue is that we currently use semantic integrity checks for two separate reasons which are starting to get at odds with each other:

Securing imports
Caching imports

For the former use case (securing imports), I still believe we want to hash the αβ-normalized expression so that it's insensitive to refactoring. People will get annoyed if they have to re-freeze their imports for non-semantic changes to upstream expressions because every time they re-freeze an import they need to re-audit it.

However, for the latter use case (caching dependencies) it seems to me like this implicit semi-semantic caching is superior because it doesn't require freezing imports at all to take advantage of it and it also doesn't require manually invaliding reverse dependencies of an import that changes.

The main question is whether or not to upstream semi-semantic caching into the standard or not. @EggBaconAndSpam recently made the case that we don't really need to (since it's a transparent optimization that doesn't change observable behavior or syntax), but we could still document the trick so that other implementations could follow suit if necessary. I expect we will do something similar for the normalization-by-evaluation and elaboration speed improvements that @AndrasKovacs is working on: just document the performance enhancement even if we don't need to change the standard.

Gabriel439 on 16 Jul 2019

@Gabriel439 my point is that caching normal forms makes things slower, and it is conducive neither to secure imports, nor performance. Caching processed files is a commonly (universally?) implemented feature, but storing normal forms is something that every dependently typed compiler avoids like the plague.

AndrasKovacs on 16 Jul 2019

@AndrasKovacs: I mostly agree. The part where I disagree is that I believe normalizing before hashing greatly improves the user experience (both for expression publishers and subscribers).

For example, there have been numerous cases of recent changes to the standard that would have broken integrity checks for end users had expressions not been normalized before hashing. Also, even without changes to the standard it's important for us to preserve the ability to refactor upstream expressions without requiring them to be re-frozen and re-audited.

Think of it from the perspective of a Dhall library author: if semantic integrity checks depend on irrelevant details like whether or not a sub-expression is factored into a let binding or an imported file then all of these irrelevant details become part of the author's API contract and they can never refactor code once they have published it without complaints from their end users. Very few people would want to publish or maintain Dhall code if they have to commit to support that level of backwards compatibility. The Haskell analog of that would your library's internal modules becoming part of your public API.

The part where I do agree is that we don't really need to αβ-normalize things for caching purposes. I think semi-semantic caching that @EggBaconAndSpam is proposing here (which is the same as what you propose) is basically superior in every way to the old caching mechanism and his draft pull request prefers the semi-semantic cache over the semantic cache.

Gabriel439 on 16 Jul 2019

Think of it from the perspective of a Dhall library author: if semantic integrity checks depend on irrelevant details like whether or not a sub-expression is factored into a let binding or an imported file then all of these irrelevant details become part of the author's API contract and they can never refactor code once they have published it without complaints from their end users.

We actually just stumbled in a nice demonstration of this: we had a big refactoring in https://github.com/purescript/package-sets/pull/399 and the hash didn't change. I used this to verify that we weren't actually changing anything (consumers were not affected much, as usually everyone points at tags, but I've seen someone point at master)

f-f on 16 Jul 2019

@Gabriel439 it was not clear to me that @EggBaconAndSpam's implementation indeed does not cache normal forms. If it does not, as you say, then I don't really have complaints.

AndrasKovacs on 16 Jul 2019

👍1

@AndrasKovacs: Yeah, you can think of it as a merkle tree of non-normalized Dhall ASTs where the edges of the tree are the parent-child import relations.

Gabriel439 on 16 Jul 2019

@AndrasKovacs To be fair the proposal did explicitely mention 'semantic hashes', and also the term 'semi-semantic' is very much inspired by the fact that imports were to be represented by their semantic hashes. What exactly we end up caching is independent of the proposed scheme.

If instead of using the semantic hashes of imports we were to use 'semi-semantic' ones we would end up with similar properties, while avoiding the whole normalisation business. Though I guess in that case calling the hashes 'semi-semantic' would be a bit misleading, as they would be very much more syntactic in nature. What about calling them 'recursively-syntactic' or 'Merkle-style' hashes intead? :)

EggBaconAndSpam on 17 Jul 2019

Though I guess in that case calling the hashes 'semi-semantic' would be a bit misleading, as they would be very much more syntactic in nature. What about calling them 'recursively-syntactic' or 'Merkle-style' hashes intead? :)

Non-normalized?

Profpatsch on 17 Jul 2019

The PR implementing the proposal (#1154) has been merged! It will be interesting to see whether this makes a noticable difference on real-life dhall code.

Note that I didn't implement any kind of "unsafe" caching, like caching of remote imports. That is left for the future (I'm open to ideas!).

EggBaconAndSpam on 26 Jul 2019

🎉2

@EggBaconAndSpam: Great work! Getting caching for free, even for non-frozen imports, will greatly improve the user experience

Gabriel439 on 27 Jul 2019

Also, this fixes #888, which proposed basically exactly what you implemented :slightly_smiling_face:

Gabriel439 on 27 Jul 2019

Was this page helpful?

0 / 5 - 0 ratings