Graphql-yoga: Integrate graphql-crunch

Created on 3 May 2018  ·  25Comments  ·  Source: dotansimha/graphql-yoga

Just came across this tweet and could not resist from sharing:

screen shot 2018-05-03 at 07 29 45

I haven't tested graphql-crunch myself, but it appears to be a great tool doing a similar job to graphql-deduplicator, which is a part of graphql-yoga already.

Adding graphql-crunch to Apollo Server and Apollo Client is very similar to how this is done for graphql-deduplicator, according to the README. It can also become graphql-yoga's built-in feature that gets triggered by a flag and thus save gigabytes of bandwidth worldwide 🌍

WDYT of graphql-crunch? I'll be happy to submit a PR that adds it if there is interest 🙂

Most helpful comment

Hi @kachkaev — thanks for your interest. Allow me to shed a little more light on graphql-crunch.

We actually considered using graphql-deduplicator before building graphql-crunch. However, the way that we use our API immediately uncovered a major problem with graphql-deduplicator. (I mean no disrespect in this explanation, since the existence of graphql-deduplicator inspired us to pursue this problem.)

graphql-deduplicator depends upon the identity-mapping/cache normalization behavior built into the Apollo InMemoryCache. (Relay has similar behavior, though I'm less familiar with it.) It assumes that objects with the same (__typename, id) pair are identical, and that only one needs to be included in the response. Thus, it prunes the extra copies, leaving only the (__typename, id) fields in the duplicates, which the cache then automatically links back to the fully-hydrated, canonical version.

We had two problems with this behavior:

  1. It takes a dependency upon configurable behavior within Apollo's InMemoryCache, which is fundamentally fragile, and doesn't generalize to other spec-compliant cache implementations, like apollo-cache-hermes (which we use in our product).

  2. It doesn't work properly for all cases, even when using InMemoryCache. graphql-deduplicator will only preserve the first object it encounters in the response payload for a given (__typename, id) pair, which means that if the _same_ (__typename, id) pair occurs multiple times in your payload but has a _different_ set of fields, you will _lose_ the fields that were not present in the first copy encountered. (If my explanation isn't clear, let me know and I'll concoct a repro case.)

To fix graphql-deduplicator would've required a complete rewrite, so we built graphql-crunch.

graphql-crunch is a completely lossless, very simple JSON compression algorithm. In fact, there's nothing about it that's specific to GraphQL — you could just as easily use it to compress and inflate the responses in a REST API response. This means that whatever you roundtrip through graphql-crunch's crunch and uncrunch methods will be deeply-equal.

We figured that maintaining this invariant would be the safest way to provide a reliable tool, which still accomplishes the same level of compression.

In the tweet thread from above, some people raised concerns about how we benchmarked response sizes (uncompressed vs gzip), and whether we benchmarked the uncrunch time. The short answer is: yes, we benchmarked with gzip enabled, and we also included the uncrunch time in our timing benchmarks. graphql-crunch still came out ahead, especially for our app, which has a lot of nested, duplicate 'user account' objects throughout most of its payloads.

The graphql-crunch README shows that even with gzip compression enabled, you can still save a huge amount of bandwidth on certain responses. Take a look at the table at the end of that section.

The uncrunch algorithm is a linear time algorithm that uses the same reference for duplicated objects. It saves quite a bit of time and garbage by _not_ copying the duplicated objects — a behavior which is totally fine for all GraphQL client implementations, since never mutate the response (and, in fact, typically go on to Object.freeze it in development mode).

We focused our benchmarking on React Native clients on iOS and Android. React Native runs on JavaScriptCore, which can be surprisingly slow for JSON.parse. It's much faster to parse a payload that is 75% smaller and then run the uncrunch algorithm than it is to just JSON.parse the full, original payload. (And this is to say nothing of the network savings.) I don't have the specific numbers at hand for the parsing benchmarks, but we'll try to get a blog post together that includes all of this information.

(In fact, we even microbenchmarked the performance of pre-allocating a fixed size array for this line of code. I tweeted out the benchmark here.)

@stevekrenzel and I think what you've built with graphql-yoga is great. The GraphQL tooling ecosystem can feel fairly intimidating to a newcomer, and we're glad to see 'batteries-included' bundles like this. I hope this info helps increase your confidence about including our library in your toolkit.

All 25 comments

Hi @kachkaev — thanks for your interest. Allow me to shed a little more light on graphql-crunch.

We actually considered using graphql-deduplicator before building graphql-crunch. However, the way that we use our API immediately uncovered a major problem with graphql-deduplicator. (I mean no disrespect in this explanation, since the existence of graphql-deduplicator inspired us to pursue this problem.)

graphql-deduplicator depends upon the identity-mapping/cache normalization behavior built into the Apollo InMemoryCache. (Relay has similar behavior, though I'm less familiar with it.) It assumes that objects with the same (__typename, id) pair are identical, and that only one needs to be included in the response. Thus, it prunes the extra copies, leaving only the (__typename, id) fields in the duplicates, which the cache then automatically links back to the fully-hydrated, canonical version.

We had two problems with this behavior:

  1. It takes a dependency upon configurable behavior within Apollo's InMemoryCache, which is fundamentally fragile, and doesn't generalize to other spec-compliant cache implementations, like apollo-cache-hermes (which we use in our product).

  2. It doesn't work properly for all cases, even when using InMemoryCache. graphql-deduplicator will only preserve the first object it encounters in the response payload for a given (__typename, id) pair, which means that if the _same_ (__typename, id) pair occurs multiple times in your payload but has a _different_ set of fields, you will _lose_ the fields that were not present in the first copy encountered. (If my explanation isn't clear, let me know and I'll concoct a repro case.)

To fix graphql-deduplicator would've required a complete rewrite, so we built graphql-crunch.

graphql-crunch is a completely lossless, very simple JSON compression algorithm. In fact, there's nothing about it that's specific to GraphQL — you could just as easily use it to compress and inflate the responses in a REST API response. This means that whatever you roundtrip through graphql-crunch's crunch and uncrunch methods will be deeply-equal.

We figured that maintaining this invariant would be the safest way to provide a reliable tool, which still accomplishes the same level of compression.

In the tweet thread from above, some people raised concerns about how we benchmarked response sizes (uncompressed vs gzip), and whether we benchmarked the uncrunch time. The short answer is: yes, we benchmarked with gzip enabled, and we also included the uncrunch time in our timing benchmarks. graphql-crunch still came out ahead, especially for our app, which has a lot of nested, duplicate 'user account' objects throughout most of its payloads.

The graphql-crunch README shows that even with gzip compression enabled, you can still save a huge amount of bandwidth on certain responses. Take a look at the table at the end of that section.

The uncrunch algorithm is a linear time algorithm that uses the same reference for duplicated objects. It saves quite a bit of time and garbage by _not_ copying the duplicated objects — a behavior which is totally fine for all GraphQL client implementations, since never mutate the response (and, in fact, typically go on to Object.freeze it in development mode).

We focused our benchmarking on React Native clients on iOS and Android. React Native runs on JavaScriptCore, which can be surprisingly slow for JSON.parse. It's much faster to parse a payload that is 75% smaller and then run the uncrunch algorithm than it is to just JSON.parse the full, original payload. (And this is to say nothing of the network savings.) I don't have the specific numbers at hand for the parsing benchmarks, but we'll try to get a blog post together that includes all of this information.

(In fact, we even microbenchmarked the performance of pre-allocating a fixed size array for this line of code. I tweeted out the benchmark here.)

@stevekrenzel and I think what you've built with graphql-yoga is great. The GraphQL tooling ecosystem can feel fairly intimidating to a newcomer, and we're glad to see 'batteries-included' bundles like this. I hope this info helps increase your confidence about including our library in your toolkit.

Thanks for your comment @jamesreggio. I’m totally with you about the performance gains and other benefits of your library and would love it to be a part of graphql-yoga. The only concern I have about graphql-crunch is that its name is not crunch :–D Just joking, I totally understand that this is justified by the problem you had originally ;–)

I read your source code and was shocked with how small, simple and smart the library is!

Interesting, @kachkaev the implementation of this would be exactly equal to the one I did for graphql-deduplicator (but using a different header), types for graphql-crunch will also be required. @jamesreggio would you like to have types in your package or in definitelyTyped ?

It’d be great if TypeScript types we stored directly in graphql-crunch to avoid a need to maintain @types/graphql-crunch, which is more likely to go out of sync. Doing this is just about adding one really small index.d.ts file to the package – that’s no big harm or effort.

@lfades WDYT of supporting ?deduplicate and ?crunch GET parameters in addition to HTTP headers? That looks like a more accessible way for simple testing scenarios. That’s what READMEs for both libraries recommend if I remember correctly (on mobile phone now).

We can definitely add type definitions. You can track that here: https://github.com/banterfm/graphql-crunch/issues/2

And yes, I definitely recommend the headers and query-param approach.

I used a header cause I found it more reasonable after reading this issue: https://github.com/gajus/graphql-deduplicator/issues/5, I don't have any real preference there.

@jamesreggio I would like to add the types if you don't mind

I believe that having both GET and headers will be a useful thing. This will work for various scenarios and won’t hurt anyone :–)

This sounds very interesting! Could you summarize the pros & cons for replacing graphql-deduplicator with graphql-crunch for Yoga?

@schickling I don't think there is a need to choose between the two, at least for backwards-compatibility reasons. Both compression libraries are pretty small and can be painlessly included in parallel like so:

  1. If a request has ?crunch or X-GraphQL-Crunch, use graphql-crunch.
  2. If a request has ?deduplicate or X-GraphQL-Deduplicate, use graphql-deduplicate.
  3. If both, throw an exception.
  4. If none, return the payload as is.

I'd wait a week or two for a reply in https://github.com/banterfm/graphql-crunch/issues/1 and can submit a PR to graphql-yoga afterwards. It'd be great to see if the guys are willing to compress keys before enabling graphql-crunch to avoid breaking changes or a need for something like ?crunch-with-keys in Yoga.

Maybe bundling the two capabilities is a bad idea from the get go, think about express.js and how elegant and thin it is.

These customizations should be part of the examples, the core should only provide hooks to implement them.

@schickling — I explained the differences between graphql-crunch and graphql-deduplicator in my comment here. It's primarily a matter of correctness: graphql-crunch is a simple, lossless JSON compressor, whereas graphql-deduplicator can result in data loss due to its naive first-object-wins approach to deduplication.

@orefalo — the whole idea around graphql-yoga is to be a 'batteries included' framework that doesn't require npm installing a dozen packages to get started. If you want to avoid the very minor bloat of including both of these libraries, you can obtain total control by 'ejecting'.

@kachkaev — may I suggest dropping the X- prefix from the headers. The IETF deprecated that practice a couple years ago.

@jamesreggio I don't have a strong opinion here, I'm just suggesting X-GraphQL-Crunch because X-GraphQL-Deduplicate is already included. I think I'll go for ?crunch in my projects, but I'm happy to add support for other query strings or headers if that's convenient to someone else. Checking against a couple of options is super-cheap anyway and I also don't see how this can be abused¹.

As of now, I still guess it's a right thing to support X-GraphQL-Crunch, just for the symmetry with X-GraphQL-Duplicate that we can't remove.


¹ Of course, if we don't count that someone would want to DDOS a server a bit more efficiently by turning compression on 😅 In any case, we'll have a feature flag in options to disable graphql-crunch, similar to deduplicator: false.

Just thinking forward, what about adding capabilities around

  • query complexity
  • additional scalar types
  • validation directives
  • formatting directives
  • mocking

full stack vs lean frameworks - won't argue, its a matter of opinions. ;-/

@orefalo feel free to suggest any of those as a PR! I agree with @jamesreggio that graphql-yoga is a framework with all batteries included – the more commonly used features the better.

Have a look at create-react-app that has inspired this project: they started small about two years ago, but now they support nearly any perk that a front-end developer would ever need! Despite all that, CRA is the best choice for writing _Hello world_ in React and it's pretty fast. graphql-yoga serves a similar purpose, it's just focused on ‘GraphQL server’ rather than ‘React client’ at its core.

@jamesreggio I didn't know anything about that deprecation, maybe we should deprecate that option here and only use a query param 🤔.

@kachkaev I added some tests for graphql-deduplicator in this PR. The tests for graphql-crunch will be very much the same 💃

@lfades we can deprecate X-GraphQL-Deduplicate by simply hiding it from the docs, but we still have to support it at least until [email protected] to avoid breaking changes. If ?crunch is the only trigger for graphql-crunch middleware, I'm totally happy with this. What I'd like though is to see some ’symmetry’ between crunch and deduplicator when it comes to docs and graphql-yoga options.

Gotcha — I didn't realize that X-GraphQL-Deduplicate is already supported. With that in mind, I think symmetry is a greater virtue (because it reduces the amount of developer surprise), and would be fine with X-GraphQL-Crunch.

My colleague is working on the per-key crunching which will be a breaking change (hopefully the one and only), so I definitely recommend waiting until it's merged. Should be very soon.

@jamesreggio speaking of breaking changes, WDYT of renaming graphql-crunch into something like data-crunch (or whatever else available) to emphasise that it's not GraphQL-specific and works for any JSON? Seeing GraphQL in the package name gave me a slightly misleading initial impression about what your module does, but digging a bit deeper was a real opening!

@kachkaev — hah. If we did that, we'd spin out a different repo and have graphql-crunch depend upon it. There's a lot of value in having the name simultaneously communicate value prop (compression) _and_ a concrete use case (for GraphQL).

In fact, I'm sure there are plenty of other lossless JSON compression libraries out there that none of us have considered because we wouldn't want to do the legwork to verify that they produce consistently superior results within our domain of concern (GraphQL payloads).

Hey @jamesreggio! Could you share your plans on refactoring graphql-crunch? Are you still planning this? I mean https://github.com/banterfm/graphql-crunch/issues/1

@jamesreggio - not to discount anything you have done (because it's very cool) but is there a reason that you did not choose something like Apache Avro? In the big data world this is already a standard protocol for this: https://en.wikipedia.org/wiki/Apache_Avro

Here is a nodejs implementation of the spec: https://github.com/mtth/avsc

It does need a schema but we should be able to get that from our info object in graphql. This might be a really good fit or at least worth investigating. I have been watching this thread and its progress and just wanted to put in my 2 cents :)

Hi @abelovic — yes, we didn't use other serialization solutions because they introduce a dramatic complexity overhead that we didn't want or need. Our goal was to create a drop-in, zero-configuration, application-level compression library that is narrowly tailored to the domain.

The library that we built satisfies these requirements. It's trivial to adopt, its source takes fewer than three minutes to read, and it benchmarks extremely well (above and beyond the gains from gzipping) for GraphQL responses that have significant subtree overlap.

Its implementation even has a beneficial side-effect wherein equivalent subtrees are referentially equal at deserialization time, which certain GraphQL clients (e.g., apollo-cache-hermes) are able to utilize to avoid repeating work.

I'm sure it's _possible_ to map GraphQL schemas to interface definitions in a number of binary exchange formats, but I'm not convinced you'll see meaningful gains beyond what graphql-crunch + gzipping provides. (In fact, I'm not convinced that Avro would perform well with massive duplication. I believe it would still include repetitive entries, but just in a more compact form.) Furthermore, you'd have to consider the impacts of these formats on forward compatibility (i.e., if you add a field to a type, will older apps — which don't query for nor know about said field — still be able to deserialize the response)?

@kachkaev, we haven't had time to focus on deduplicating keys. Short spans of consecutive bytes (for example, JSON object keys) are exactly what gzip and other Huffman-encoded algorithms excel at compressing, so I suspect the marginal gains would be minimal. @stevekrenzel can chime in, though, because I think he implemented a strawman and saw some benchmark regressions.

Due to inactivity of this issue we have marked it stale. It will be closed if no further activity occurs.

Hey :wave:, It seems like this issue has been inactive for some time. In need for maintaining clear overview of the issues concerning the latest version of graphql-yoga we'll close it.
Feel free to reopen it at any time if you believe we should futher discuss its content. :slightly_smiling_face:

Was this page helpful?
0 / 5 - 0 ratings