Dhall-haskell: Caching decreases readability

Created on 7 Aug 2019  ·  18Comments  ·  Source: dhall-lang/dhall-haskell

≻ rm -rf ~/.cache/dhall*
≻ echo '( https://raw.githubusercontent.com/dhall-lang/dhall-lang/b08bfa062e2373d1cafaadfb75661e30f4caa91e/Prelude/package.dhall ).List.map' | dhall-head
  λ(a : Type)
→ λ(b : Type)
→ λ(f : a → b)
→ λ(xs : List a)
→ List/fold
    a
    xs
    (List b)
    (λ(x : a) → λ(`as` : List b) → [ f x ] # `as`)
    ([] : List b)
≻ echo '( https://raw.githubusercontent.com/dhall-lang/dhall-lang/b08bfa062e2373d1cafaadfb75661e30f4caa91e/Prelude/package.dhall ).List.map' | dhall-head
  λ(_ : Type)
→ λ(_ : Type)
→ λ(_ : _@1 → _@1)
→ λ(_ : List _@2)
→ List/fold
    _@3
    _
    (List _@2)
    (λ(_ : _@3) → λ(_ : List _@3) → [ _@3 _@1 ] # _)
    ([] : List _@2)

Apparently cached imports are alpha-normalized. That makes the code very hard to read. I would always prefer to see the non-alpha-normalized code.

caching

All 18 comments

Hmmm, I thought I fixed this in https://github.com/dhall-lang/dhall-haskell/pull/1162. Maybe I missed a spot

This is because Prelude/package.dhall has its imports annotated with semantic hashes, which means they are always alpha-normalized. What we could do is to only alpha-normalize when computing the semantic hashes, but then save the non-alpha-normalized form in the semantic cache.

Oh, I missed that this was protected by a semantic integrity check.

I actually think the right behavior here is to return the α-normalized import and save the α-normalized import in the cache.

There was a recent discussion about this in an unrelated thread here: https://github.com/dhall-lang/dhall-lang/issues/690#issuecomment-518067869

So there's a tradeoff between semantic integrity checking and readability of imports? That's a bit sad.

Could the integrity checking be changed not to require α-normalization?

@sjakobi: I think if we do that then the absence of α-normalization should be specified somewhere in the integrity check, perhaps like this:

./example/import.dhall sha256+named:…

I'm not sure that's a good user experience or design, but you get the idea.

The key principle is that an integrity check should be referentially transparent, meaning that the same integrity check always resolves to the exact same Dhall expression, regardless of the state of your local cache or prior imports that you've made.

You could imagine that we could do something similar for the semi-semantic caching that @EggBaconAndSpam introduced, where the integrity check could specify that it would not αβ-normalize the result before computing the hash or storing the import (meaning that it would be more prone to hash disruption, but perhaps more efficient or more user-friendly). Encoding those amendments in the integrity check itself helps ensure that the integrity check remains referentially transparent.

The key principle is that an integrity check should be referentially transparent, meaning that the same integrity check always resolves to the exact same Dhall expression, regardless of the state of your local cache or prior imports that you've made.

Couldn't we relax this to only requiring it being the same up to judgemental equality?

@EggBaconAndSpam: The first issue is that lookups from the semantic cache become less efficient if you do that. The standard requires re-verifying any expression retrieved from the cache, and currently that only requires hashing the retrieved bytes and verifying that they match the integrity check, but if we loosen the restriction to judgmental equality then we'd have to first type-check the expression, normalize it, and then re-encode it to verify that it matches the integrity check.

The second issue is that at some point we might like to begin sharing caches (including a global cache, similar to Unison), which a malicious user could "poison" by caching an inefficient expression that is judgmentally equivalent to a desired and commonly used expression. Keeping caches strictly content addressable protects against these sorts of attacks. This is why I think if we do add support for non-normal forms then the hash should be a hash of exactly the retrieved contents (i.e. the non-normal form).

The third issue is that you would get inconsistent results from the cache if you import two separate but judgmentally equivalent imports, depending on the order in which they were imported. For example, I could see this happening if a user is editing an import protected by an integrity check and wondering why they're still getting old names from the cached expression even after re-freezing things.

How about caching β-normalized expressions instead of αβ-normalized ones? Caches would get slightly bigger, and there would be more cache misses because

\(a : Bool) -> a

and

\(x : Bool) -> x

wouldn't have the same hash. That seems acceptable IMHO. Are there more downsides?

Would it be possible for renamed variables to remember their original name in the alpha normalization? If so, it seems like it should be possible to render the original name instead of the alpha normalized one.

Yeah, if this is only a readability issue, we should address that instead of compromising on hash stability.

The problem I see with readability here is that the “same name count backwards” representation is hard to decipher for humans. So we should add an option to dhall to convert it back to (somewhat descriptive) variable names instead, and probably make that the default.

It can also use some heuristics, naming \(_: Type) -> \(_: Type) as \(t0: Type) -> \(t1: Type) ->, _: List t0 as xs : List t0 and so forth (kind of like a hungarian notation).

The problem I see with readability here is that the “same name count backwards” representation is hard to decipher for humans. So we should add an option to dhall to convert it back to (somewhat descriptive) variable names instead, and probably make that the default.

I think that would be a nice feature to have in general. But in this specific case, how is heuristically guessing names better than just remembering their original names? Note that hash-stability isn't an issue as we can still compute the hash from the alpha-normal form.

@Gabriel439's point of course still stands:

The third issue is that you would get inconsistent results from the cache if you import two separate but judgmentally equivalent imports, depending on the order in which they were imported. For example, I could see this happening if a user is editing an import protected by an integrity check and wondering why they're still getting old names from the cached expression even after re-freezing things.

The renaming would only happen for displaying it to the user, it would still be alpha-normalized in the cache. I assume the algorithm would be fast enough to do it on the fly, but I could be mistaken.

I still think the straightforward solution is to give the user the ability to opt into the level of detail that they prefer. In other words, the integrity check allows them to specify if they want the hash to depend on the αβ-normal form, the β-normal form, or the non-normalized code. Then the cache still remains perfectly content-addressable and the user can tune what balance they want to strike between hash stability and readability.

I still think the straightforward solution is to give the user the ability to opt into the level of detail that they prefer. In other words, the integrity check allows them to specify if they want the hash to depend on the αβ-normal form, the β-normal form, or the non-normalized code.

I wonder in which cases users would prefer the αβ-normal or non-normal caching over β-normal caches.

Would you expect αβ-normal caching to have significant performance benefits over β-normal caching?

Regarding non-normal caching: Doesn't the standard currently demand that imports are normalized anyway? Would we get rid of that requirement? Otherwise, what's the point of caching non-normalized expressions?

I also wonder what libraries would do. A library user who prefers β-normalized imports would be stuck with αβ-normalized expressions, if the library uses αβ-normal caching internally. Should libraries offer multiple versions for each cache mode?

I'm not a fan of letting the user control this. Also I don't think people would be very surprised to find stale names when they edit an integrity-protected expression, as long as they are aware there's a cache somewhere.

I feel like we stumble on the problem of readability of normal forms very regularly. I was thinking: could we formalize a sort of quasi-normal form, that's not too far away from normal form (so that poisoning with an equivalent quasi-normal form can't change performance too much), but that keeps some key readability properties ? The two main properties I have in mind would be keeping names and keeping type aliases. It feels doable to me, and that could be a huge improvement in human-friendliness

See my comment here about possibly not α-normalizing imports at all (either for caching or hashing them): https://github.com/dhall-lang/dhall-lang/issues/733#issuecomment-529230102

However, to answer your question about efficiency: α-normal imports are slightly more efficient because we have more compact encodings for them.

This discussion probably belongs in either the dhall-lang or the Discourse mailing list, though.

I still think the straightforward solution is to give the user the ability to opt into the level of detail that they prefer. In other words, the integrity check allows them to specify if they want the hash to depend on the αβ-normal form, the β-normal form, or the non-normalized code. Then the cache still remains perfectly content-addressable and the user can tune what balance they want to strike between hash stability and readability.

I think specifying this in the hash is the wrong place, because the hash is placed there by the author but here's we're talking about the preferences of the reader. In many cases, of course, there is no human reader and so full alpha-beta is probably easiest. For cases like the dhall command line tool where the output is partly meant for human consumption, having control given to the user at that time about if they want names preserved could be an interesting feature for that specific case, but should work no matter how the source is authored.

@singpolyma: One thing to consider is that there are actually two use cases for hashing:

  • Protecting against malicious imports (where the hash is typically added by the user)
  • Internal caching (where the hash is typically added by the author; e.g. the Prelude)

For internal caching we already do a weird dance where import something like this:

./example sha256:… ? ./example

... but what if we just added a alternative type of integrity check optimized for the caching use case? In other words, something like:

./example cache:…

Then we could have it behave differently (like disable α-normalization for the case use case, but preserve α-normalization for the protection use case).

Was this page helpful?
0 / 5 - 0 ratings

Related issues

thebritican picture thebritican  ·  17Comments

ocharles picture ocharles  ·  20Comments

sjakobi picture sjakobi  ·  18Comments

FintanH picture FintanH  ·  19Comments

AndrasKovacs picture AndrasKovacs  ·  29Comments