Dhall-haskell: performance advice for dhall-kubernetes based package

Created on 26 Jun 2020  ·  17Comments  ·  Source: dhall-lang/dhall-haskell

Forked from https://functionalprogramming.slack.com/archives/C82N1L0MB/p1592337474256100

My team is investigating Dhall to see if it'd be a good way to ship + customize our complex Kubernetes application on our internal clusters and for our customers to use. sourcegraph/deploy-sourcegraph-dhall-archived is my PoC implementation of our Kubernetes configuration. I like what I see so far, but I'm wondering how I can improve the render times that I'm seeing.

I've posted benchmark's in sourcegraph/deploy-sourcegraph-dhall-archived's README: https://github.com/sourcegraph/deploy-sourcegraph-dhall-archived/blob/a3f32062b10e08435c6ad73085a5df8c4c30c00d/README.md#benchmarks

Going off of https://discourse.dhall-lang.org/t/figuring-out-performance-bottlenecks/251/5, I'd like to call out that my normal form is apparently ~300x the size of the Prelude.

I know that dhall-kubernetes has a lot of types, but is this size expected?


Note: I've vendored https://github.com/dhall-lang/dhall-kubernetes for sourcegraph/deploy-sourcegraph-dhall-archived following @Gabriel439 's advice here: https://functionalprogramming.slack.com/archives/C82N1L0MB/p1592340512256900?thread_ts=1592337474.256100&cid=C82N1L0MB

Most helpful comment

I wanted to give an update on the approach I I believe will ultimately address this problem.

I'm primarily focusing on reducing the size of cached expressions because I believe that improving decoding performance is a dead end. My reasoning is that even if we could improve decoding performance by 10x (and I don't think we can any time soon) that would still not be the right order of magnitude for where Dhall needs to be in order to build out to the Dhall analog of Helm.

I believe the most promising approach to reducing the size of cached expressions is to preserve let bindings for types in normal forms and inferred types instead of inlining them. In other words, an expression like:

let Age = Natural

in  λ(x : Age) → x + 0

... would normalize to:

let Age = Natural

in  λ(x : Age) → x

... and have an inferred type of:

let Age = Natural

in  ∀(x : Age) → Age

... instead of the current behavior, which is to normalize to:

λ(x : Natural) → x

... and have an inferred type of:

∀(x : Natural) → Natural

There are a few reasons I prefer this solution:

  • I believe it would lead to a dramatic space reduction, since every type would now only appear once in normal forms

    ... assuming users let-bind them, which they are highly likely to do.

    This would lead to the biggest space savings for code using union literals but even other simpler code would see significant compression (since the types would be no larger than those the user authored by hand, and human-generated code is small).

  • It would improve the readability of normal forms and inferred types

    In fact, something similar has been requested at least once before in https://github.com/dhall-lang/dhall-to-cabal/issues/22 (cc: @ocharles)

I'm still thinking through the implications and details of this change and this would require proposing a change to the language standard which might not necessarily be accepted. However, I believe the improved readability of normal forms will persuade other implementations to agree to the change.

If this change were completed I'm fairly confident it would fix the performance issues that people are reporting when creating both Kubernetes-related utilities and lists of Kubernetes resources. However, I would verify this performance improvement first before proposing the change.

All 17 comments

@ggilmore: I'm still investigating, but I'll report what I know so far and various avenues that I'm exploring.

First, make sure that you are using a newer version of dhall / dhall-to-yaml since they contain caching performance improvements.

For reference, here is the timing I get when I dhall freeze --all --inplace ./pipeline.dhall on my laptop:

$ cat pipeline.dhall 
let Sourcegraph =
      ./package.dhall sha256:e47072cfa23584530258b5c950cbb7921486ff2fe36a49e8828eb256fca31ef7

let ToList = Sourcegraph.ToList

let c = Sourcegraph.Configuration::{=}

in  ToList c
$ time dhall-to-yaml --file ./pipeline.dhall
…
real    0m10.717s
user    0m10.043s
sys 0m0.659s

For future reference, the only expression that you need to freeze for benchmarking purposes is the top-most one since no transitive expressions are resolved in the event of a cache hit.

I suspect (but have not confirmed) that most of the time spent is on decoding the cached expression, which is 50 MB:

$ du -hs ~/.cache/dhall/1220e47072cfa23584530258b5c950cbb7921486ff2fe36a49e8828eb256fca31ef7 
 50M    /Users/gabriel/.cache/dhall/1220e47072cfa23584530258b5c950cbb7921486ff2fe36a49e8828eb256fca31ef7

... which means a decoding speed of ~4 MB/s.

There are two main avenues that I'm exploring:

  • More compact cached expressions (so that there's less to decode)
  • Faster decoding speed (since ~4 MB/s seems slow to me)

To answer your two initial questions:

  • You can inspect the normal form by replacing dhall-to-yaml with the dhall command, but there aren't tools for identifying common patterns, yet

  • From a quick glance at the normal form the main reason the encoded form is large appears to be due to using typesUnion

    One of the ways I'm exploring to reduce the size of the cached expression is to see if we can encode things more compactly by specifying the union type only once

Hey @Gabriel439 thanks for looking into this! A few responses:

  • I am using dhall 1.33.1 for the benchmark https://github.com/sourcegraph/deploy-sourcegraph-dhall-archived/blob/285b1515ade18682d1fac33f8e70bb32100dd67c/.tool-versions#L1.
  • 🤦 I realize now that I was freezing everything besides pipeline.dhall for the benchmark. Here are my benchmark numbers after freezing that file:

    ~/dev/go/src/github.com/sourcegraph/ds-dhall master
    ❯ ./scripts/post-order-freeze.sh
    ...
    ~/dev/go/src/github.com/sourcegraph/ds-dhall master* 
    ❯ dhall freeze --all --inplace ./pipeline.dhall
    
    ~/dev/go/src/github.com/sourcegraph/ds-dhall master* 28s
    ❯ bench  'dhall-to-yaml --file pipeline.dhall'
    benchmarking dhall-to-yaml --file pipeline.dhall
    time                 8.996 s    (8.203 s .. NaN s)
                         0.995 R²   (0.991 R² .. 1.000 R²)
    mean                 8.468 s    (8.285 s .. 8.781 s)
    std dev              299.8 ms   (45.39 ms .. 371.0 ms)
    variance introduced by outliers: 19% (moderately inflated)
    
    • I still find it surprising that there is a ~ 14 second speedup after I freeze the pipeline.dhall file alone. Can you explain why this is the case?
  • > One of the ways I'm exploring to reduce the size of the cached expression is to see if we can encode things more compactly by specifying the union type only once

    • Ah, I thought it'd make more sense for each component to construct its own list of resources so that logic is encapsulated. Using this massive union as the list type slows things down though?

@ggilmore I am so happy not to be an isolated soul on this planet ;-)

I have been so confused around the topic of performance/freeze that I have opened this discourse discussion.

IIRC at the time I did realized it was pointless to freeze everything. It has been quite a surprise mainly because all the generated dhall-packages repositories that I know about (dhall-kubernetes, dhall-openshift, dhall-argocd ) seems to be sending the wrong message on that regard.

At the end of the day when you need to integrate these 3 different generated packages, you do realized the current state is not a straightforward journey (see https://github.com/EarnestResearch/dhall-packages/issues/62). While using a fork as a workaround this kind of code will be on your way.

To contribute in a positive way, I guess a generic api generator is the way to go but you might not have the time/opportunities to do so.

@Gabriel439 Thank you so much for looking at this.

I plan to document more about freezing idioms and when to use them in the Dhall manual but I need to finish fixing a few warts in the import UX that I ran into while documenting things. I can give a quick summary here, though:

  • The only purpose of freezing transitive dependencies is if you expect users to import them directly instead of the top-level dependency

    For example, the Prelude assumes that users might import any file, so it freezes all of them as a precaution.

  • Most users don't need --cache (yet)

    The --cache flag was created back when there were more frequent backwards-incompatible changes to our hashing algorithm. Specifically, the original intent behind the --cache flag was to preserve the caching features of integrity checks without the security features so that older interpreters wouldn't reject imports due to a hash mismatch.

    The long-term solution is a feature I plan to propose to add language support for marking an integrity check as optimized for caching purposes rather than security purposes. When I do add such a feature then the --cache flag will transparently begin to use that feature and gain some new features (like not renaming variables to _).

A semi-advanced example

The following code takes ~2.5-3 seconds to evaluate with dhall --file. This snippet omits dhall-k8s imports and the actual data behind bindings like secretVolumeMounts (but I promise it's only two or three small records).

This function gets imported and used in a few other modules, massively increasing the time for those modules to evaluate.

let Container/withSecrets
    : kubernetes.Container.Type -> kubernetes.Container.Type
    = \(container : kubernetes.Container.Type) ->
            container
        //  { volumeMounts = Some
                ( merge
                    { Some =
                        \(volumeMounts : List kubernetes.VolumeMount.Type) ->
                          volumeMounts # secretVolumeMounts
                    , None = secretVolumeMounts
                    }
                    container.volumeMounts
                )
            }

let PodSpec/withSecrets
    : kubernetes.PodSpec.Type -> kubernetes.PodSpec.Type
    = \(podSpec : kubernetes.PodSpec.Type) ->
            podSpec
        //  { volumes = Some
                ( merge
                    { Some =
                        \(volumes : List kubernetes.Volume.Type) ->
                          volumes # secretVolumes
                    , None = secretVolumes
                    }
                    podSpec.volumes
                )
            , initContainers = Some
                ( merge
                    { Some =
                        \(containers : List kubernetes.Container.Type) ->
                          containers # [ secretsContainer ]
                    , None = [ secretsContainer ]
                    }
                    podSpec.initContainers
                )
            , containers =
                Prelude.List.map
                  kubernetes.Container.Type
                  kubernetes.Container.Type
                  Container/withSecrets
                  podSpec.containers
            }

let DeploymentSpec/withSecrets
    : kubernetes.DeploymentSpec.Type -> kubernetes.DeploymentSpec.Type
    = \(spec : kubernetes.DeploymentSpec.Type) ->
        spec
        with template.spec = Some
            ( merge
                { Some = PodSpec/withSecrets
                , None =
                    PodSpec/withSecrets
                      kubernetes.PodSpec::{
                      , containers = [] : List kubernetes.Container.Type
                      }
                }
                spec.template.spec
            )

let withSecrets
    : kubernetes.Deployment.Type -> kubernetes.Deployment.Type
    = \(deployment : kubernetes.Deployment.Type) ->
            deployment
        //  { spec = Some
                ( merge
                    { Some = DeploymentSpec/withSecrets
                    , None =
                        DeploymentSpec/withSecrets
                          kubernetes.DeploymentSpec::{
                          , selector = kubernetes.LabelSelector::{=}
                          , template = kubernetes.PodTemplateSpec::{
                            , metadata = kubernetes.ObjectMeta::{
                              , name = "deployment-spec-with-secrets"
                              }
                            }
                          }
                    }
                    deployment.spec
                )
            }

in { withSecrets }

We have about 4 or 5 similarly complex functions that do deeply nested appends in this module. Including those bring the module's time to evaluate up to ~5 seconds. It does drop to 3 seconds when the dhall-k8s import is frozen, though.

Running dhall --file on another module that effectively calls each of these functions once takes around 30 seconds w/o local import freezes, 22 seconds with local import freezes.

Since dhall to-directory-tree would require an explicit mapping to Prelude.JSON types, generating k8s manifests has been the one spot we're not yet using it. In the meantime, we adopted dhall-render to skip that bit of type coercion. The bash script that would call dhall-to-yaml 11 times (for 11 different configuration environments) would take several minutes to complete, but the dhall-render solution about 20 seconds. We believe this is due to these factors:

  1. using GNU parallel resulted in too many context switches for dhall processes (probably due to lock contention when decoding large and cached k8s files?)
  2. not using parallel meant each dhall-to-yaml process had to pay the full cost of decoding k8s-related code in serial

Summary

  1. import dhall-k8s and running dhall-to-yaml multiple times is very fast
  2. let a = dhall-k8s.SomeResource::{} with x = y, import a, and then running dhall-to-yaml multiple times started getting very slow. 5 seconds -> 1 minute after introducing one function like the snippet above. Add more functions to a module that works with something common like dhall-k8s.Deployment and we saw 1 minute grow to 5.
  3. If dhall to-directory-tree could be used here, we'd probably see performance like dhall-render (20 seconds).

Here is another measure that might help:

~ → time dhall hash <<< https://raw.githubusercontent.com/PierreR/dhall-packages/master/openshift/package.dhall
sha256:5c0bd30ac3c1d0914770754fe82256720a04471f38a26bbc6b63bd7ebd9eea94

real    1m42.402s
user    1m34.308s
sys 0m1.727s

Not sure why it takes almost 2min to get this hash

I appreciate the additional examples, but they're no longer necessary. At this point I have enough information to reproduce the problem. The thing that will help move this issue forward is one of the two approaches I outlined previously:

  • Finding ways to improve decoder performance
  • Finding ways to reduce the size of the cached expressions

@Gabriel439 Thanks for your input. Just a little question. Is dhall hash also affected by the decoder performance or is it just the size of the cached expressions ? It is not quite clear (for me) why this command would decode anything.

@PierreR: Any dhall command that interprets a cached dhall expression will decode an expression from the cache. Essentially, the cache is a content-addressable store of binary-encoded Dhall expressions, so look-up from the cache entails binary decoding.

I wanted to give an update on the approach I I believe will ultimately address this problem.

I'm primarily focusing on reducing the size of cached expressions because I believe that improving decoding performance is a dead end. My reasoning is that even if we could improve decoding performance by 10x (and I don't think we can any time soon) that would still not be the right order of magnitude for where Dhall needs to be in order to build out to the Dhall analog of Helm.

I believe the most promising approach to reducing the size of cached expressions is to preserve let bindings for types in normal forms and inferred types instead of inlining them. In other words, an expression like:

let Age = Natural

in  λ(x : Age) → x + 0

... would normalize to:

let Age = Natural

in  λ(x : Age) → x

... and have an inferred type of:

let Age = Natural

in  ∀(x : Age) → Age

... instead of the current behavior, which is to normalize to:

λ(x : Natural) → x

... and have an inferred type of:

∀(x : Natural) → Natural

There are a few reasons I prefer this solution:

  • I believe it would lead to a dramatic space reduction, since every type would now only appear once in normal forms

    ... assuming users let-bind them, which they are highly likely to do.

    This would lead to the biggest space savings for code using union literals but even other simpler code would see significant compression (since the types would be no larger than those the user authored by hand, and human-generated code is small).

  • It would improve the readability of normal forms and inferred types

    In fact, something similar has been requested at least once before in https://github.com/dhall-lang/dhall-to-cabal/issues/22 (cc: @ocharles)

I'm still thinking through the implications and details of this change and this would require proposing a change to the language standard which might not necessarily be accepted. However, I believe the improved readability of normal forms will persuade other implementations to agree to the change.

If this change were completed I'm fairly confident it would fix the performance issues that people are reporting when creating both Kubernetes-related utilities and lists of Kubernetes resources. However, I would verify this performance improvement first before proposing the change.

Just another update on this. The original tack of preserving let-bound names in types proved to be more difficult to standardize than I anticipated, mainly because:

  • it would be disruptive to existing implementations
  • it adds a non-trivial amount of complexity to the standard at a time where we're trying to keep it simple

I haven't given up on the first approach entirely, but I'm now exploring a second less invasive approach.

The next approach I'm taking is to try adding a new type of integrity check (say, cache:sha256:…) which hashes the import that has been retrieved and parsed but not further interpreted. In other words, it encodes, hashes, and caches the expression exactly the way the user wrote the expression (including unresolved imports), before the expression has been transitively resolved, type-checked, and normalized. The only restriction would be that all transitive imports of the expression would need to be protected by integrity checks of their own (so that the final result of interpreting the expression is still immutable). In other words, this would be effectively storing a Merkle tree in the Dhall binary cache (just like how the Nix store works).

This would give a performance improvement for the same reason as the first approach I suggested: human-authored code tends to be small, so there won't be as much for the interpreter to decode.

I made some progress on debugging the performance problems with sourcegraph-dhall and it turns out that the issue might not be what I thought it was. I originally thought that the problem was storing β-normalized expressions in our cache, but that was only partly true. When I performed an experiment to store and cached non-normalized expressions that did lead to a speed-up (~2x faster), but there was still a larger bottleneck that the optimization revealed.

The problem appears to be that the same expression (i.e. the entirety dhall-kubernetes in this case) appears multiple times in the transitive closure and cache, even when the closure is not β-normalized. To see why, imagine that we have the following chain of imports:

-- ./largeImport.dhall

{-  Imagine that this file represents some large expression,
    like dhall-kubernetes or some other large package
-}
…
-- ./A.dhall
let largeImport = ./largeImport.dhall

in  largeImport.foo
-- ./B.dhall
let largeImport = ./largeImport.dhall

in  largeImport.bar
-- ./C.dhall
let A = ./A.dhall

let B = ./B.dhall

in  { A, B }

If we attempt to cache ./C.dhall then the largeImport expression will show up twice in the fully-resolved expression, even if we do not β-normalize anything. This duplication is the reason why we end up with abnormally large cache sizes.

If my diagnosis is correct, then the solution is to actually change how we cache things (at least for the Haskell implementation's semi-semantic cache, and possibly also for the standard semantic cache, too), so that we permit caching unresolved expressions, so long as the imports are protected by integrity checks. Moreover, we can add integrity checks on the fly to the cached representation if their transitive dependencies are "pure".

I'll use the above example to make things more concrete. Suppose that I want to cache ./C.dhall. In order to do that I'd need to first cache all of its transitive dependencies in depth-first order, so the algorithm would be:

  • Hash and cache ./largeImport.dhall

    … assume for this example that largeImport.dhall has no transitive imports of its own and that its hash is sha256:fff…fff

  • Hash and cache ./A.dhall

    Specifically, cache this expression:

    let largeImport = ./largeImport.dhall sha256:fff…fff
    
    in  largeImport.foo
    

    … and assume that its hash is sha256:aaa…aaa

  • Hash and cache ./B.dhall

    Specifically, cache this expression:

    let largeImport = ./largeImport.dhall sha256:fff…fff
    
    in  largeImport.bar
    
  • Hash and cache ./C.dhall

    Specifically, cache this expression:

    let A = ./A.dhall sha256:aaa…aaa
    
    let B = ./B.dhall sha256:bbb…bbb
    
    in  { A, B }
    

Then when we need to resolve ./C.dhall we would fetch it from the cache and interpret it like a normal Dhall expression (including import resolution). Storing the unresolved representation for each cached file would ensure that we only store large expressions at most once within the cache, which will make things much more compact, which will in turn greatly accelerate decoding speed (because there will be far less to decode).

@Gabriel439 is there another implementation of Dhall (Rust or Go ?) that scales better (to be used with dhall-kubernetes (dhall-to-yaml) ?

@PierreR: I'm not sure. A different implementation might get a constant factor improvement from a faster CBOR decoding logic, but other than that the problem I'm describing would affect all implementations because it's an issue with the language standard that needs to be fixed.

@Gabriel439 is there any trick that we might apply to the cache itself ?

We have been transitioning toward something more "monorepo" like. Unfortunately a full rebuild is now taking up to 30min.

Here are the workarounds we have been applied so far:

  • regenerate dhall-kubernetes (with dhall-openapi) to include only the k8s resources we need
  • using shake to rebuild only what is required
  • freezing our top-level package.dhall at the top of every file (instead of using a general import)

Is there something else we could try before the issue gets resolved ?

@PierreR: Not that I know of. The main issue (which is the one I'm trying to fix) is that dhall interpreters do not permit cached expressions to contain unresolved imports, so even if you were to do cache surgery to factor things in the way I described the interpreters would reject the cached expression.

@PierreR: Sorry, I missed your other suggestions when responding. There definitely is the option of trimming dhall-kubernetes down to only the resources you need. This can be done by wrapping the dhall-kubernetes import in a file that only projects out what you need, like this:

-- ./package-small.dhall
let kubernetes =
      https://raw.githubusercontent.com/dhall-lang/dhall-kubernetes/master/package.dhall sha256:d541487f153cee9890ebe4145bae8899e91cd81e2f4a5b65b06dfc325fb1ae7e

in
  kubernetes.{ DaemonSet, Deployment, … }

… and then if you freeze and import ./package-small.dhall it should make everything across the board smaller.

However, when creating ./package-small.dhall, take care NOT to include Resource, since that's where most of the size comes from. In fact, anything you can do to eliminate the use of the Resource union will greatly improve performance since that's what's bloating the cache.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

german1608 picture german1608  ·  18Comments

Michael-Kateregga picture Michael-Kateregga  ·  26Comments

sjakobi picture sjakobi  ·  18Comments

quasicomputational picture quasicomputational  ·  19Comments

EggBaconAndSpam picture EggBaconAndSpam  ·  21Comments