Graphql-js: Over fetching data on the server

Created on 10 Dec 2016 · 9 comments · Source: graphql/graphql-js

@leebyron I have an app where user can open a tab "comments", to load comments of a specific post.
My query looks like this:

viewer {
  posts(id: 1) {
    id
    comments(first: 10) {
      edges {
        node {
          text
          name
          ...
        }
      }
    }
  }
}

The post field resolver fetches all of the data about the post, then passes it to PostType, where the comments field's resolver loads the comments.
This is an issue because the server has to fetch all of the data about the post just to get the comments. The user might scroll down multiple times, so this is a bottleneck for the database.

I could:

  1. Create a separate query, loadComments, but this adds complexity, since the number of these kinds of "loadMore" queries will grow. Also, later I might want to fetch some information about the post, which brings us back to the original issue.
  2. Load data from PostType field resolvers, but this means too many queries would be sent to the database. This also makes batching & caching with dataloader hard.
  3. Cache data in the post field resolver with dataloader. This way, when the client asks for more comments, we can ask the cache for info about the post, reducing database calls. This also allows using the cache in idFetcher() (graphql-relay-js) when using Relay for things like forceFetch().
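The third option above can be sketched as a per-request memoized loader. The snippet below uses a plain Map as a minimal stand-in for facebook/dataloader (which adds batching on top of this caching), and fetchPostFromDb is a hypothetical data-access function:

```javascript
// Per-request memoization of post loads: repeated "load more" queries
// only hit the database once per post.
function createPostLoader(fetchPostFromDb) {
  const cache = new Map(); // post id -> Promise of post data
  return function load(id) {
    if (!cache.has(id)) {
      cache.set(id, fetchPostFromDb(id));
    }
    return cache.get(id);
  };
}

// Build one loader per incoming GraphQL request (e.g. on the context),
// so every resolver that needs the post shares the cached promise.
let dbCalls = 0;
const loadPost = createPostLoader(async (id) => {
  dbCalls += 1; // counts real "database" hits
  return { id, title: 'Hello', commentIds: [1, 2, 3] };
});

async function demo() {
  const a = await loadPost(1); // hits the database
  const b = await loadPost(1); // served from the cache
  console.log(dbCalls, a === b); // → 1 true
}
demo();
```

Scoping the cache to a single request sidesteps invalidation: the cache simply dies with the request.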

I wonder what's the best approach, and how you solved this at Facebook?

Thanks

question

Most helpful comment

@shakaIsReal

It seems the crux of your question boils down to fetching fields by an id, and then only requesting the id field on that object. A simpler version:

{
  someThing(id: 123) {
    id
  }
}

One possible way to execute this would be to resolve the someThing(id: 123) field, and in doing so go run a database command or network request or whatever to load in all of the data related to our "SomeThing". Then we follow to the id field, where we return 123. This seems like it could be more work than necessary if our id argument is guaranteed to be the same value as the id returned.

Instead we could resolve the someThing(id: 123) field to a kind of unresolved object that doesn't immediately load data, but promises to do so before resolving any field on the SomeThing. Then we could have an optimization where, if you specifically ask for id, it can just return the same id that was originally requested. If we asked for any other field, it would first go load the data before returning it. Now we're doing less work!
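A minimal sketch of this deferred-load idea, assuming a hypothetical fetchThingFromDb data-access function (this illustrates the pattern described above, not anything built into graphql-js):

```javascript
// someThing(id) resolves to a thin handle that knows its id but defers
// loading the full record until a field other than `id` is requested.
// fetchThingFromDb is a hypothetical data-access function.
function makeThingHandle(id, fetchThingFromDb) {
  let loaded = null; // Promise for the full record, created on demand
  return {
    id, // the id resolver can return this without touching storage
    load() {
      if (!loaded) loaded = fetchThingFromDb(id);
      return loaded;
    },
  };
}

// Field resolvers for SomeThing under this scheme:
const SomeThingResolvers = {
  id: (handle) => handle.id, // no database work at all
  name: (handle) => handle.load().then((t) => t.name), // triggers the load
};
```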

However there are some critical problems with this approach. First: what if id: 123 doesn't exist in the underlying data? Returning { "id": 123 } would be the wrong answer, we would have expected null!

Similarly if we queried { someThing(id: 123) { joinUsingID { edges { value } } } } as similar to a SQL JOIN, we wouldn't expect back: { "someThing": { "joinUsingID": { "edges": [] }}}, we would expect back: { "someThing": null }.

Another problem arises if we want to do access control on data available on our API, which Facebook does on every object (though we call it "privacy rules" instead of access control, same thing). In this case perhaps id: 123 does exist in the underlying data, but the current viewer isn't allowed to view it. In this case we also want to get back null, and if we asked for a "join" style connection, we want to ensure our access control rules are handled correctly such that if you can't query the "post" then you also can't query the "comments" via it. Almost always, this access control requires some information about the post.
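To make these caveats concrete, a resolver that honors them might look like the following sketch, where fetchPostFromDb and canView are hypothetical stand-ins for the data layer and the privacy rules:

```javascript
// The resolver must load enough data to run the privacy check, and it
// returns null both when the id is missing from the underlying data and
// when the viewer fails the check - never a stub built from the argument.
async function resolvePost(id, viewer, fetchPostFromDb, canView) {
  const post = await fetchPostFromDb(id);
  if (!post) return null; // id doesn't exist in the underlying data
  if (!canView(viewer, post)) return null; // privacy rule fails: also null
  return post;
}
```

Because the "comments" connection hangs off the resolved post, returning null here also correctly hides the comments.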

Because of these caveats, Facebook doesn't take some of the approaches you mentioned to reduce requests to the database.

Here are some things we do at Facebook:

Caching

First and foremost, we cache as much as possible. All along as I've been saying "load the data" - at Facebook that's usually shorthand for "check the cache first, otherwise load from the database". We invalidate our caches at write time so they can be long-lived. We use tech like memcached (but redis can also be a good option) to save the results of raw database queries.

While memcached and redis are much faster than SQL, they still often require going between services or physical machines, so another layer of cache that you might consider is an in-memory cache in your API service's runtime. This can be trickier to implement, though: you'll need a good cache policy, and you need to ensure you don't leak sensitive data between requests.
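A minimal sketch of cache-first loading with write-time invalidation; the Map stands in for memcached or redis, and names like loadPostCached are hypothetical:

```javascript
// "Check the cache first, otherwise load from the database"; invalidate
// at write time so cached entries can be long-lived.
const cache = new Map(); // stand-in for memcached/redis

async function loadPostCached(id, fetchFromDb) {
  const key = `post:${id}`;
  if (cache.has(key)) return cache.get(key); // fast path
  const post = await fetchFromDb(id);
  cache.set(key, post);
  return post;
}

async function updatePost(id, changes, writeToDb) {
  await writeToDb(id, changes);
  cache.delete(`post:${id}`); // invalidate at write time
}
```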

When using caching, subsequent loads of { post(id: 1) { id, ... } } will just be a fast cache lookup (though we still run the access control rules every time). When paginating through lists of things, this layer of caching makes loading the post much cheaper.

Partial Fetches

The initial place to start with fetching data from storage is to load all the data you might need for an object. Maybe that's a SELECT * from SQL, for example. For smaller types, that's probably just fine, but for types with lots of data that might be overkill. Users are often a good example of a type that can end up with a lot of fields of data.

One technique we've used before is to only fetch the critical data about a type when first encountering it (typically at least the data required to do access control). Then any fields requested for data loaded in that initial critical load can be returned directly, but if a field is requested that didn't come with the initial data, then a subsequent load is made. For example, since it's not that common to load a Facebook User's birthday, we don't load that information every time we encounter a User.
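The critical-load-then-subsequent-load pattern might be sketched like this, with hypothetical column names and a made-up db.select helper:

```javascript
// Load only the critical columns up front (enough for access control),
// and lazily fetch the rest only if some requested field needs it.
function makeUser(id, db) {
  let critical = null; // small eager load: { id, isDeleted, privacy }
  let full = null;     // everything else, loaded only on demand
  return {
    loadCritical() {
      if (!critical) critical = db.select(id, ['id', 'isDeleted', 'privacy']);
      return critical;
    },
    async field(name) {
      const crit = await this.loadCritical();
      if (name in crit) return crit[name]; // came with the initial load
      if (!full) full = db.select(id, ['*']); // subsequent load
      return (await full)[name];
    },
  };
}
```

Asking only for cheap fields costs one small query; a rarely-used field like birthday costs one extra query only when actually requested.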


Hopefully these are useful techniques that can help you optimize your server. For your particular use case, partial fetches might be helpful - you could load only the critical information to know if a Post hasn't been deleted and the viewer can still access it and to access its comments, which could be much faster than loading an entire Post if loading Posts is a bottleneck. Then in general I always suggest caching for a production system. If loading a Post is a bottleneck for your database, then a layer of cache in front of your database can help alleviate that.

Also I'll point out that these techniques are actually more general than GraphQL and were in use at Facebook before we developed GraphQL. They just happen to be part of what allows us to serve a billion mobile devices from one API.

All 9 comments

I'm not leebyron and don't claim to work at FB, but IMO the 3rd approach is the right one, and I believe this is what Facebook chose, since they invented the dataloader approach.

Related to this, I am wondering what the general concept in GraphQL is to prevent unnecessary resolves.

e.g. in the above example, assume the number of users and comments is in the millions, and I want to look for any user that recently commented on a specific post or whose comment includes a specific keyword.

In SQL this would be done via an INNER JOIN. In document databases (e.g. MongoDB) with the table contents flattened, it would similarly be one query (of course, flattening might not always be possible).

In GraphQL, as I understand it now, it would query all users and, for each user, all matching comments. Thus most resulting pages would likely be mostly irrelevant, depending on the specificity of the actual filter.

The way to solve this would be to first query and filter the comments and walk upwards towards the user, which the developer might not know ahead of time, depending on how "free" the user interface is in allowing queries.

Can anybody point me toward how these scenarios are handled or addressed in GraphQL?

@matthiasg take a look at https://github.com/facebook/dataloader

This seems related to https://github.com/facebook/relay/issues/1457 - in general a field resolver should fetch as little information as possible. So instead of loading everything about a post in the post field resolver, the loading of each field would be deferred to that field's resolver. In the example, rather than loading comments in the post resolver, just load the post ID and then load the comments in the comments resolver.
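The resolver shape suggested in the comment above could look roughly like this sketch; loadCommentsForPost on the context is a hypothetical helper:

```javascript
// Each resolver fetches as little as possible: the posts resolver returns
// only the id, and the comments resolver loads comments itself.
const resolvers = {
  Viewer: {
    posts: (_viewer, args) => ({ id: args.id }), // no post data loaded here
  },
  Post: {
    id: (post) => post.id,
    comments: (post, args, ctx) =>
      ctx.loadCommentsForPost(post.id, args.first), // deferred to this field
  },
};
```

If a client asks only for comments, the full post record is never fetched.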

@josephsavona Thanks. I made a new issue https://github.com/facebook/dataloader/issues/64

@k15a I saw dataloader, but it doesn't solve the issue unless we also transform the incoming queries.

The basic problem is: given two or more related, potentially large tables, find all items in X that have a related Y where Y.field > 2, for example. A classic case of a join, of course, meaning the database can efficiently figure out that it should first scan (or index-scan) through Y, filtering for field > 2, and then join it with X. If X is the primary item in a GraphQL query (because that's what the UI is interested in), the GraphQL resolver by default doesn't allow for reversing that query (it would need to know not only X -> Y but also Y -> X). Otherwise the system might scan through potentially millions of X before it finds some with matching Y.

I think I will have to hand-roll that. All other solutions try to solve it by looking at the entire GraphQL query from the top (AST) and generating SQL from that, but even transforming the incoming query would allow more efficient data access on non-SQL stores as well. At least that's my theory; I will have to test it :)

I had hoped this rather typical case for us was already addressed directly or by some convention; that's why I brought it up.

@matthiasg

e.g. in the above example, assume the number of users and comments is in the millions, and I want to look for any user that recently commented on a specific post or whose comment includes a specific keyword.

In GraphQL, as I understand it now, it would query all users and, for each user, all matching comments. Thus most resulting pages would likely be mostly irrelevant, depending on the specificity of the actual filter.

GraphQL itself doesn't query or filter anything - it's only responsible for calling your resolver functions in the correct order as they're found in the query. Any more sophisticated behavior is up to you. This is an inherent tradeoff between expressiveness and portability. For example, compared to SQL, which has expressive keywords for WHERE, GROUP BY, ORDER BY, JOIN, etc., GraphQL just has fields and arguments. However, SQL can only work on SQL-aware storage, whereas GraphQL works in arbitrary execution environments that may or may not be backed by storage at all.

The specific example you made here, looking for a user that recently commented with a specific keyword, highlights a difference between two types of database requests: Query and Search. Query requests are about requesting information about a specific item of data, as well as relationships from that item. Search is about narrowing down an entire collection of data to items which match some criteria. Really different storage backends end up being used for each as there are inherent tradeoffs between the two.

GraphQL can handle both kinds of requests; however, its field syntax and algorithm of calling resolvers is really designed for Query-style requests, while Search requests are often modeled as root fields.

For your example, I imagine a query like:

{
  findRecentComment(filter: { keyword: "graphql" }) {
    post { id }
    comment {
      id
      actor { name }
      message
    }
  }
}

@leebyron I did come to that same conclusion. And I do think it is valid to have GraphQL be non-specific with respect to deep searches. We are going to go ahead with GraphQL for traversal scenarios with limited search optimization and do our usual materialized view route or handwritten SQL for more complex scenarios.

thanks for the feedback.

