I propose ‘semi-semantic’ hashing as a solution to caching of non-frozen imports. The semi-semantic approach to caching is motivated by the fact that parsing is computationally trivial compared to typechecking and normalisation. See #888 for the detailed reasoning.
I use semi-semantic to mean that any two semantically different (i.e. differing in their normal forms) inputs are mapped to distinct hashes, while semantically equivalent inputs that differ syntactically may result in different hashes. We could compute the semi-semantic hash of a dhall import/file/expression as follows:
The current implementation #1154 substitutes any imports by their normal form and hashes the resulting expression to obtain a semi-semantic hash. This is easier to implement and, since we store normal forms anyway, only incurs a constant-factor cost.
A cache indexed by semi-semantic hashes has the following properties:
Comparison to alternative approaches:
The main disadvantage of a ‘semi-semantic’ approach is that it
cpk: Cloned from https://github.com/vmchale/cpkg. I changed any references to the dhall prelude to point to a local copy. I tested dhall hash <<< pkgs/pkg-set.dhall
dhall-nethack: Cloned from https://github.com/dhall-lang/dhall-nethack. I set the environment variable ‘DHALL-PRELUDE’ to point at a local copy of the prelude.
This is assuming public opinion turns out in favour of this proposal.
Change Dhall.Import from the current ‘imperative’ approach to a recursive one. This is needed because in order to be able to query the semi-semantic cache we already need to know the values of all transitive imports; currently, imports are first loaded in their unevaluated parsed form, and their imports loaded in a next step.
This will be quite a big change (cf. my prototype over at #1095). At the same time this should not have any performance impact.
Implement semi-semantic caching. This is straightforward assuming Step 1.
Discuss ‘unsafe’ caching of remote imports as a possible next step.
incures a slight performance penalty when evaluating an expression with a clean cache since every import needs to be separately encoded and written to disk (currently we only ever encode and write ‘frozen’ imports to disk)
I think this is a fine tradeoff. My preference is that we optimize for the speed of interpretation when the cache is full rather than empty.
I'm also really surprised that this leads to a 4x improvement for the cpkg example. In particular, if I take the cpkg repository and freeze everything using semantic integrity checks I still don't get that large of an improvement. (Not that I'm going to complain about a 4x improvement :slightly_smiling_face: )
The secret here is that I'm benchmarking the hash subcommand! Running dhall <<< "~/tmp/cpkg/pkgs/pkg-set.dhall" > /dev/zero also pretty-prints the result, which dominates the runtime; here I get 13.4s vs 23s.
As an interesting side-note: when I load the file using --file directly, I get a 1.5s slowdown with semi-semantic caching. I wouldn't expect a speedup, because here the big expression is still typechecked and normalised (and not loaded from cache), but a slowdown is slightly unexpected...
I'm still skeptical about caching normal forms ("still" referring to my previous Discourse post on this topic which was accidentally nuked). Even if we can skip rechecking cached normal forms, I expect that the performance gains will be negative in larger projects.
One issue is sharing loss and code bloat. As an analogue, consider normalizing every definition in a Haskell program. If we have something like \x -> let y = f x in replicate 100 y, then the normal form of this expression is a function which recomputes f x 100 times, while the original code computes it once. When GHC performs inlining and specialization, it transforms towards normal forms, but often it also transforms away from normal forms, by CSE or let-floating, in order to increase sharing. Source code written by humans is as a rule of thumb acceptably good, which is why -O0 and bytecode interpretation for Haskell works reasonably well, without any code transformation. In comparison, normal forms of non-trivial programs are almost always pathological.
I also find it plausible that for larger projects, just parsing normal cached expressions can be more expensive than checking source expressions.
Should we wait until we can benchmark with the new NbE before going forward with this? (Because it involves big code changes and we might realise it's actually a net performance loss with the new checking)
To be clear, caching non-normal expressions would always be a performance improvement! The issue, in short, is that normalization degrades code quality compared to source expressions, with respect to evaluation.
@EggBaconAndSpam: Ah! I forgot that your branch was caching the ./pkg/pkg-set.dhall import. I think that's the root of the large performance increase, because when I freeze the ./pkg/pkg-set.dhall import then I also get a speedup, although not as large (13.9 s → 6.2 s)
Also, this is a little out of scope, but I wanted to note that there is a lot of low-hanging fruit for improving decoding speed (regardless of what we cache). For example, the current decoder actually decodes into a Term and then converts that to an Expr, but I imagine it would be much faster if we decoded straight from ByteString to Expr.
For example, the current decoder is decoding at a rate of ~1 MB/s:
$ dhall hash <<< './dhall/cpkg-prelude.dhall'
sha256:f6c3202a7c91b84004811f37f5f39f19d761db4f99a8d176379c9e9a8634c036
$ du -hs ~/.cache/dhall/f6c3202a7c91b84004811f37f5f39f19d761db4f99a8d176379c9e9a8634c036
472K /home/gabriel/.cache/dhall/f6c3202a7c91b84004811f37f5f39f19d761db4f99a8d176379c9e9a8634c036
$ time dhall decode < ~/.cache/dhall/f6c3202a7c91b84004811f37f5f39f19d761db4f99a8d176379c9e9a8634c036 > /dev/null
real 0m0.424s
user 0m0.414s
sys 0m0.010s
... which seems to me much slower than it should be. I'm pretty sure that it would be faster if we skipped the Term intermediate and the decoding speed should really be in the range of at least 10-100 MB/s (which is what we would get from using JSON).
I'm still skeptical about caching normal forms ("still" referring to my previous Discourse post on this topic which was accidentally nuked). Even if we can skip rechecking cached normal forms, I expect that the performance gains will be negative in larger projects.
Good point!
To be clear, caching non-normal expressions would always be a performance improvement! The issue, in short, is that normalization degrades code quality compared to source expressions, with respect to evaluation.
I think this is what we should be going for in the end, i.e. to use caching to avoid typechecking known-good code again and again.
For the time being however, since normalising imports is already the current behaviour, implementing my proposal "naively" should not lead to any further performance degradation. Switching to a less naive approach to import resolution would then be a separate problem for later :)
Should we wait until we can benchmark with the new NbE before going forward with this? (Because it involves big code changes and we might realise it's actually a net performance loss with the new checking)
I think restructuring Dhall.Import will be necessary in the long run anyway.
I will work on a few PRs to implement the necessary changes in a hopefully less invasive way than my prototype code did :)
I think @AndrasKovacs makes some really good points, @Gabriel439 what are your thoughts on them?
@Profpatsch @AndrasKovacs: I think the main issue is that we currently use semantic integrity checks for two separate reasons which are starting to get at odds with each other:
For the former use case (securing imports), I still believe we want to hash the αβ-normalized expression so that it's insensitive to refactoring. People will get annoyed if they have to re-freeze their imports for non-semantic changes to upstream expressions because every time they re-freeze an import they need to re-audit it.
However, for the latter use case (caching dependencies) it seems to me like this implicit semi-semantic caching is superior because it doesn't require freezing imports at all to take advantage of it and it also doesn't require manually invaliding reverse dependencies of an import that changes.
The main question is whether or not to upstream semi-semantic caching into the standard or not. @EggBaconAndSpam recently made the case that we don't really need to (since it's a transparent optimization that doesn't change observable behavior or syntax), but we could still document the trick so that other implementations could follow suit if necessary. I expect we will do something similar for the normalization-by-evaluation and elaboration speed improvements that @AndrasKovacs is working on: just document the performance enhancement even if we don't need to change the standard.
@Gabriel439 my point is that caching normal forms makes things slower, and it is conducive neither to secure imports, nor performance. Caching processed files is a commonly (universally?) implemented feature, but storing normal forms is something that every dependently typed compiler avoids like the plague.
@AndrasKovacs: I mostly agree. The part where I disagree is that I believe normalizing before hashing greatly improves the user experience (both for expression publishers and subscribers).
For example, there have been numerous cases of recent changes to the standard that would have broken integrity checks for end users had expressions not been normalized before hashing. Also, even without changes to the standard it's important for us to preserve the ability to refactor upstream expressions without requiring them to be re-frozen and re-audited.
Think of it from the perspective of a Dhall library author: if semantic integrity checks depend on irrelevant details like whether or not a sub-expression is factored into a let binding or an imported file then all of these irrelevant details become part of the author's API contract and they can never refactor code once they have published it without complaints from their end users. Very few people would want to publish or maintain Dhall code if they have to commit to support that level of backwards compatibility. The Haskell analog of that would your library's internal modules becoming part of your public API.
The part where I do agree is that we don't really need to αβ-normalize things for caching purposes. I think semi-semantic caching that @EggBaconAndSpam is proposing here (which is the same as what you propose) is basically superior in every way to the old caching mechanism and his draft pull request prefers the semi-semantic cache over the semantic cache.
Think of it from the perspective of a Dhall library author: if semantic integrity checks depend on irrelevant details like whether or not a sub-expression is factored into a
letbinding or an imported file then all of these irrelevant details become part of the author's API contract and they can never refactor code once they have published it without complaints from their end users.
We actually just stumbled in a nice demonstration of this: we had a big refactoring in https://github.com/purescript/package-sets/pull/399 and the hash didn't change. I used this to verify that we weren't actually changing anything (consumers were not affected much, as usually everyone points at tags, but I've seen someone point at master)
@Gabriel439 it was not clear to me that @EggBaconAndSpam's implementation indeed does not cache normal forms. If it does not, as you say, then I don't really have complaints.
@AndrasKovacs: Yeah, you can think of it as a merkle tree of non-normalized Dhall ASTs where the edges of the tree are the parent-child import relations.
@AndrasKovacs To be fair the proposal did explicitely mention 'semantic hashes', and also the term 'semi-semantic' is very much inspired by the fact that imports were to be represented by their semantic hashes. What exactly we end up caching is independent of the proposed scheme.
If instead of using the semantic hashes of imports we were to use 'semi-semantic' ones we would end up with similar properties, while avoiding the whole normalisation business. Though I guess in that case calling the hashes 'semi-semantic' would be a bit misleading, as they would be very much more syntactic in nature. What about calling them 'recursively-syntactic' or 'Merkle-style' hashes intead? :)
Though I guess in that case calling the hashes 'semi-semantic' would be a bit misleading, as they would be very much more syntactic in nature. What about calling them 'recursively-syntactic' or 'Merkle-style' hashes intead? :)
Non-normalized?
The PR implementing the proposal (#1154) has been merged! It will be interesting to see whether this makes a noticable difference on real-life dhall code.
Note that I didn't implement any kind of "unsafe" caching, like caching of remote imports. That is left for the future (I'm open to ideas!).
@EggBaconAndSpam: Great work! Getting caching for free, even for non-frozen imports, will greatly improve the user experience
Also, this fixes #888, which proposed basically exactly what you implemented :slightly_smiling_face:
Most helpful comment
The PR implementing the proposal (#1154) has been merged! It will be interesting to see whether this makes a noticable difference on real-life dhall code.
Note that I didn't implement any kind of "unsafe" caching, like caching of remote imports. That is left for the future (I'm open to ideas!).