Gatsby: Future of GraphQL sources in Gatsby

Created on 19 Jul 2019 · 17 Comments · Source: gatsbyjs/gatsby

Last year we introduced gatsby-source-graphql. GraphQL APIs are growing, and it was odd that Gatsby didn't support them automatically. However, time has shown that the initial design wasn't right. This issue attempts to summarize what we can do in the future to support GraphQL APIs better.

How Gatsby source plugins work

Gatsby has a node store, which is Gatsby's cache for all Node (that is, id-referenceable) objects. Source plugins fill the store ahead of time. Gatsby creates a GraphQL schema based on those node types and schema customization data. Gatsby pages use GraphQL queries to get this data and render.

The important things to notice are:

  1. Data is retrieved ahead of time
  2. Gatsby creates the final GraphQL API

Gatsby relies heavily on having data in the node store ahead of time. This provides fast builds, the ability to track changes, and the ability to run transforms on the data for additional processing (such as sharp). In addition, Gatsby sites rely on all data being available at build time, because they often base their page creation on it.
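As a rough illustration of "data is retrieved ahead of time", here is a minimal sketch of a source plugin's sourcing step. The `createNode`, `createNodeId` and `createContentDigest` helpers mirror Gatsby's Node API; `fetchPosts`, the `BlogPost` type and the stubbed store are assumptions made only so the sketch runs on its own.

```javascript
// Stand-in for a remote API call (hypothetical data).
async function fetchPosts() {
  return [
    { slug: "hello-world", title: "Hello world" },
    { slug: "second-post", title: "Second post" },
  ];
}

// Fetches everything up front and turns each record into a Gatsby node.
async function sourceNodes({ actions, createNodeId, createContentDigest }) {
  const posts = await fetchPosts();
  for (const post of posts) {
    actions.createNode({
      ...post,
      id: createNodeId(`BlogPost-${post.slug}`),
      parent: null,
      children: [],
      internal: {
        type: "BlogPost", // hypothetical node type
        contentDigest: createContentDigest(post),
      },
    });
  }
}

// In gatsby-node.js this would be `exports.sourceNodes = sourceNodes`.
// Minimal stubs stand in for Gatsby here.
const nodeStore = new Map();
const stubApi = {
  actions: { createNode: (node) => nodeStore.set(node.id, node) },
  createNodeId: (input) => `id:${input}`,
  createContentDigest: (value) => JSON.stringify(value),
};
const sourcing = sourceNodes(stubApi);
sourcing.then(() => console.log(`${nodeStore.size} nodes in the store`));
```

Once sourcing has finished, every page query runs against the store rather than against the remote API.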

Tradeoffs when designing a source plugin

Because Gatsby and the source API both use GraphQL, there is a tradeoff between a schema that looks like Gatsby's schema and one that stays close to the source API. Familiarity goes both ways: being familiar with the Gatsby schema would mean being familiar with all schemas. However, people only familiar with the source schema might get confused by a Gatsby schema that isn't at all like what they use internally.

We have lots of plugins that don't use GraphQL. In my opinion, having a consistent Gatsby schema is a priority.

How do we currently use GraphQL APIs?

Most source plugins for GraphQL APIs fetch all data by defining GraphQL queries, the same way they would call any other API. Queries are defined by the source plugin, and the GraphQL schema on the Gatsby site is mostly inferred from the results. This feels wasteful: we make use of neither the introspectable schema on the source side nor the control we have over the resulting schema on the Gatsby side.

Proposed solution

  1. We still fetch data ahead of time. We need control over what we fetch.
  2. We shouldn't ask plugin creators to write queries by hand. Instead, queries can be generated from the source API's introspection.
  3. We try not to deviate too much from the source node structure, but we always choose similarity to Gatsby over similarity to the third-party API.

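For example, a plugin implementation could look something like this:

// Node types to source and the queries that fetch them
// (selection sets omitted for brevity).
const NODE_TYPES = ['Blog', 'Comment', 'User']
const QUERIES = [
  `query allUsers { allUsers }`,
  `query allBlog { allBlog }`,
  `query allComment { allComment }`,
]

// sourceFromGraphQLAPI is the proposed helper; it does not exist yet.
exports.sourceNodes = () => {
  sourceFromGraphQLAPI({
    url: 'http://www.example.com',
    queries: QUERIES,
    nodeTypes: NODE_TYPES,
  })
}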
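Point 2 of the proposal (generating queries instead of writing them by hand) could be sketched roughly as follows. This assumes the remote schema exposes an `all<Type>` root field per node type and that introspection has already been reduced to a map from type name to scalar field names; both the map and the `buildListQueries` helper are hypothetical.

```javascript
// Hypothetical, simplified introspection result: type name -> scalar fields.
const SCALAR_FIELDS = {
  User: ["id", "name"],
  Blog: ["id", "title", "publishedAt"],
  Comment: ["id", "body"],
};

// Builds one list query per node type.
// Assumes an `all<Type>` root field exists for every type.
function buildListQueries(nodeTypes, scalarFields) {
  return nodeTypes.map((type) => {
    const fields = scalarFields[type];
    if (!fields || fields.length === 0) {
      throw new Error(`No known fields for type ${type}`);
    }
    return `query all${type} { all${type} { ${fields.join(" ")} } }`;
  });
}

const queries = buildListQueries(["Blog", "Comment", "User"], SCALAR_FIELDS);
console.log(queries.join("\n"));
```

The plugin author then only lists node types; the field selection follows whatever the remote schema reports.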

All 17 comments

@freiksenet This is interesting.

First, let me make sure I'm understanding this properly.

At the config level, the user would write queries to the origin API that make use of the origin API's GraphQL Schema directly.

The results of the origin queries would be added to the Gatsby Cache.

Then, when the user wants to build pages and write pageQueries, they will write their pageQueries using the Gatsby Schema, and _not_ the origin Schema.

There might be some similarities between the Gatsby Schema and the origin Schema, but they _will not_ be the same, and users cannot use queries interchangeably between the origin and Gatsby.

Is that all accurate?

Assuming my understanding is accurate, below are some thoughts/concerns:

Can't couple queries to components

With this approach, the queries (for data from the origin API) can't be coupled with the components that need the data. From a codebase scaling perspective, I personally think one of the best features of GraphQL is being able to couple queries/fragments with the components that need them.

This approach breaks that experience, because at the page level the user is querying from the Gatsby cache, and if they haven't already adjusted their query in the config, the Gatsby Schema might not have the field they want to query for. So the pages are now tightly coupled with some config setting and that could get difficult to navigate over time as the codebase grows.

It seems like it could lead to some headaches in the developer experience.

How would we handle connections, filters, sorting, etc?

Right now, external GraphQL APIs handle things like sorting, filtering, etc in many different ways.

If the config queries hydrate the Gatsby cache, and the page queries are used to query the cache, but _not_ the origin, how can users create pages based on filtered parameters?

For example, if we queried _all_ posts from a WordPress blog, but needed to create a date-based archive of posts (such as https://www.denverpost.com/2019/07/19/), how would one do that?

In WPGraphQL directly, a query to build the above page, consisting of posts on July 19, 2019 would look like the following:

query GET_POSTS_BY_DATE {
  posts(where: {dateQuery: {month: 7, day: 19, year: 2019}}) {
    nodes {
      id
      title
      date
      author {
        name
      }
      featuredImage {
        sourceUrl(size: THUMBNAIL)
      }
    }
  }
}

If the Gatsby cache already had copies of all of the posts (and authors, and media), then how would the same page be built using something like a Gatsby pageQuery?

Same thing goes with filtering by pretty much anything. Like in WordPress we can filter posts by a specific tag or category or author, etc.

Or in GitHub, repositories can be filtered by stargazer count, etc.

How do you envision users being able to build pages that are specific in the nodes they need, instead of _all_ nodes of a given Type?


Looking forward to chatting more.

Perhaps as a dogfooding experiment we could convert something like this existing WPGraphQL example site to the new API, work out any kinks, and get a good idea of how external APIs can play nice with this new approach 🤔: https://github.com/wp-graphql/gatsby-wpgraphql-blog-example

> At the config level, the user would write queries to the origin API that make use of the origin API's GraphQL Schema directly.

This would happen in the _source_ plugin not in a user's config.

> If the Gatsby cache already had copies of all of the posts (and authors, and media), then how would the same page be built using something like a Gatsby pageQuery?

We'd use the same filtering/sorting mechanisms that Gatsby already provides. Gatsby would still drive the GraphQL API. Ensuring similarity across source plugins is more important than making the experience in Gatsby as close as possible to e.g. WPGraphQL. See his point 3: "We try not to deviate too much from the source node structure, but we always choose similarity to Gatsby over similarity to the third-party API."
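To make that concrete, here is a hedged sketch of how the date archive from the earlier question might look on the Gatsby side. The `allWpPost` field and its sub-fields are hypothetical names; `gte` and `lt` are Gatsby's comparison filter operators. The small in-memory filter only mirrors what such a query would select and assumes dates are stored as ISO 8601 UTC strings.

```javascript
// Page query sketch against the Gatsby schema (type and field names are assumptions).
const ARCHIVE_QUERY = `
  query PostsByDate($from: Date!, $to: Date!) {
    allWpPost(filter: { date: { gte: $from, lt: $to } }) {
      nodes { id title date }
    }
  }
`;

// Computes the query variables for a single calendar day (UTC).
function dayRange(year, month, day) {
  const from = new Date(Date.UTC(year, month - 1, day));
  const to = new Date(Date.UTC(year, month - 1, day + 1));
  return { from: from.toISOString(), to: to.toISOString() };
}

// In-memory equivalent of the filter above, run against already-sourced nodes.
// ISO 8601 UTC strings of the same shape compare correctly as plain strings.
function filterByDate(nodes, { from, to }) {
  return nodes.filter((node) => node.date >= from && node.date < to);
}

const cachedPosts = [
  { id: "a", title: "On the day", date: "2019-07-19T08:30:00.000Z" },
  { id: "b", title: "Day before", date: "2019-07-18T23:59:59.000Z" },
  { id: "c", title: "Day after", date: "2019-07-20T00:00:00.000Z" },
];
const range = dayRange(2019, 7, 19);
const archive = filterByDate(cachedPosts, range);
console.log(archive.map((post) => post.id));
```

In `createPages`, one such range per archive day would be passed to the page as query variables through the page context.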

@freiksenet how will this work with e.g. CMSs which can have arbitrary types? Perhaps the source plugin author programmatically describes how to do a bulk query of the remote source? There'll also be cases where the remote source has a bulk sync API that's independent of GraphQL. That's how I've been thinking we scale WP/Drupal (and others): Gatsby would ideally call a sync API for each type, which would very efficiently pull in all data from the source.

> At the config level, the user would write queries to the origin API that make use of the origin API's GraphQL Schema directly.

> This would happen in the source plugin not in a user's config.

@KyleAMathews it would have to happen in the user's config, because no two WordPress sites will have the same WPGraphQL Schema. As you asked: how will this work with e.g. CMSs which can have arbitrary types? WPGraphQL does have arbitrary Types, based on how each WordPress site is set up.

For example a WordPress site running WPGraphQL + WooCommerce has a lot more Types for things like Products, Orders, etc.

A WP site running WPGraphQL for Advanced Custom Fields will have a much different schema than a site that's not using Advanced Custom fields.

> because no 2 WordPress sites will have the same WPGraphQL Schema

Right but the source plugin could do the work of translating from how they think about data to how Gatsby thinks about data. It should be possible to make it automatic I think in the vast majority of cases.

> Right but the source plugin could do the work of translating from how they think about data to how Gatsby thinks about data. It should be possible to make it automatic I think in the vast majority of cases.

🤔 I'm having a hard time visualizing this. But open to exploring for sure.

Do you have an example of how this is done elsewhere? Where a GraphQL API with arbitrary Types and fields can be programmatically queried without knowing about the Types and fields in advance?


At the moment, one of the only ways I can think to make this kind of thing programmatic would require additional work on the WPGraphQL side (which isn't necessarily a bad thing, and I know a guy that can help 😆), but not all external GraphQL APIs will be able to make changes like I can to WPGraphQL.

For example, if I have all types that support Relay Style Connections implement a Connection Interface, then we can have the source plugin query the WPGraphQL Schema and detect all fields that return a RelayConnection, then could at least programmatically query all nodes of those connections.
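A minimal sketch of that detection step, assuming the introspection result has the standard `__schema` shape (`queryType`, `types`, `fields`, `interfaces`, `ofType`) and that connection types implement an interface named `Connection`. The interface name and the sample type names are assumptions, not something WPGraphQL is confirmed to ship.

```javascript
// Unwraps NON_NULL / LIST wrappers to reach the named type.
function namedType(typeRef) {
  let current = typeRef;
  while (current.ofType) current = current.ofType;
  return current.name;
}

// Returns the root query fields whose return type implements the connection interface.
function findConnectionFields(schema, connectionInterface = "Connection") {
  const typesByName = new Map(schema.types.map((t) => [t.name, t]));
  const root = typesByName.get(schema.queryType.name);
  return root.fields
    .filter((field) => {
      const target = typesByName.get(namedType(field.type));
      return (
        target !== undefined &&
        (target.interfaces || []).some((i) => i.name === connectionInterface)
      );
    })
    .map((field) => field.name);
}

// Hand-written stand-in for `__schema` from an introspection query.
const schema = {
  queryType: { name: "RootQuery" },
  types: [
    {
      name: "RootQuery",
      interfaces: [],
      fields: [
        { name: "posts", type: { kind: "OBJECT", name: "PostConnection", ofType: null } },
        {
          name: "users",
          type: {
            kind: "NON_NULL",
            name: null,
            ofType: { kind: "OBJECT", name: "UserConnection", ofType: null },
          },
        },
        { name: "generalSettings", type: { kind: "OBJECT", name: "GeneralSettings", ofType: null } },
      ],
    },
    { name: "PostConnection", interfaces: [{ name: "Connection" }], fields: [] },
    { name: "UserConnection", interfaces: [{ name: "Connection" }], fields: [] },
    { name: "GeneralSettings", interfaces: [], fields: [] },
  ],
};

const connectionFields = findConnectionFields(schema);
console.log(connectionFields);
```

Fields that expose nodes in some other way would not be found by this check, which is the limitation raised in the next paragraphs.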

But, because WPGraphQL is highly extendable, users could add fields to their WPGraphQL Schema in very unpredictable ways.

If a user registered their own fields to the WPGraphQL Schema which provides access to nodes in a different way than core WPGraphQL Connections, it could be _very_ difficult (maybe impossible) to query the nodes from WPGraphQL programmatically and get them into the Gatsby cache without direct user config.

I suppose in those cases, the user could specify additional queries and config.

I think at a minimum, users will likely need to specify fragments of the Types they want to query at some config level, because we can't just wildcard query for _everything_. Even if we could wildcard query for all fields, for example by using Schema introspection to dynamically build queries for _all_ fields, that might not be a good thing to do for performance and _possibly_ security.

WordPress is often used for _much more_ than _just_ the front-end of a site. So having a consumer like Gatsby query _all possible fields_ could potentially expose data to the Gatsby cache that isn't intended to be exposed outside of WordPress, and at minimum would increase build times by querying data from the WordPress server that the user doesn't have any need for.

One of the wins in performance for WPGraphQL vs. WP REST API is that you specify what you want with WPGraphQL and that significantly reduces processing on the WP server.

> @freiksenet how will this work with e.g. CMSs which can have arbitrary types? Perhaps the source plugin author programmatically describes how to do a bulk query of the remote source? There'll also be cases where the remote source has a bulk sync API that's independent of GraphQL. That's how I've been thinking we scale WP/Drupal (and others): Gatsby would ideally call a sync API for each type, which would very efficiently pull in all data from the source.

Depends on the plugin: a source plugin can access introspection directly and decide based on that. If an API has arbitrary types, then it has a generic way to access those types, which can be either deduced or inferred from some kind of metadata.

> With this approach, the queries (for data from the origin API) can't be coupled with the components that need the data. From a codebase scaling perspective, I personally think one of the best features of GraphQL is being able to couple queries/fragments with the components that need them.

They can be coupled; it will be Gatsby GraphQL queries/fragments that are coupled, not WPGraphQL ones.

> How would we handle connections, filters, sorting, etc?

As we have full control over Gatsby filters/sorting, we'll use that to establish the relationship between pages and nodes in the cache.

Source plugins and their config are the source of truth for what kind of data is retrieved, so if a user needs more or less data, the source plugin or its config is where they'll change things.

> Do you have an example of how this is done elsewhere? Where a GraphQL API with arbitrary Types and fields can be programmatically queried without knowing about the Types and fields in advance?

We know about types and fields in advance because we have introspection. I envision that we will construct a spanning tree of the query that can fetch all the data for a particular type, while stopping at Node or Connection boundaries. This way it will be a spanning tree that only fetches data for a particular node and whichever connections it has. Relationships are derived by querying for an id on connections/nodes. Naturally, the exact logic of how this is determined is left for a source plugin to define.
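The spanning-tree idea could look roughly like the sketch below. It works on a hypothetical, heavily simplified schema (type -> field -> field type name), ignores lists and connections, and treats a fixed set of types as nodes; none of these names come from Gatsby itself.

```javascript
// Hypothetical, simplified schema description.
const SCHEMA = {
  Post: { id: "ID", title: "String", seo: "Seo", author: "User" },
  Seo: { description: "String", keywords: "String" },
  User: { id: "ID", name: "String", posts: "Post" },
};
const NODE_TYPES = new Set(["Post", "User"]);

// Builds the selection set for one type. Nested plain objects are expanded;
// nested node types are cut off at `{ id }` so they can be linked later.
function selectionFor(typeName, schema, nodeTypes, path = []) {
  if (path.length > 0 && nodeTypes.has(typeName)) return "{ id }";
  const fields = [];
  for (const [name, fieldType] of Object.entries(schema[typeName])) {
    if (!schema[fieldType]) {
      fields.push(name); // scalar leaf
    } else if (nodeTypes.has(fieldType) || ![...path, typeName].includes(fieldType)) {
      // Skips non-node types already on the current path to avoid infinite recursion.
      const nested = selectionFor(fieldType, schema, nodeTypes, [...path, typeName]);
      fields.push(`${name} ${nested}`);
    }
  }
  return `{ ${fields.join(" ")} }`;
}

const postSelection = selectionFor("Post", SCHEMA, NODE_TYPES);
console.log(postSelection);
```

Each node type gets its own query, and the `{ id }` stubs are what a plugin would later resolve into links between nodes in the store.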

@freiksenet , @jasonbahl made the following comment too that still isn't very clear to me ...

> if we use Schema introspection to dynamically build queries for all fields, that might not be a good thing to do for performance and possibly security

In your reply you stated...

> Source plugins and their config are the source of truth for what kind of data is retrieved, so if a user needs more or less data, the source plugin or its config is where they'll change things.

Am I reading correctly that for the sake of performance the source plugin config would restrict what gets added to cache, but not necessarily address the security concern? Unless you restrict sensitive nodes from the source via config (or allow them but use client only routes with an auth layer).

[PRO] I personally am all for preloading because it allows us to abstract away our source by creating node interfaces. Something I am having a hard time with with gatsby-source-graphql and had to do with Apollo Client instead.

[CON?] Now for the bit I believe @jasonbahl is concerned about (and I have some reservations over): with the current implementation we get a new type we can query. Using that type we can pull exactly the nodes we want from our source API, using the source API's graph tree. That is incredibly handy for page queries and allows us to build Gatsby Themes using WordPress and WPGraphQL. There is of course the concern that we would lose this ability. But by abstracting away the source into interfaces, we could build Gatsby Themes that are firstly WPGraphQL friendly, but that can also be used for other source plugins.

[CON] Abstracting the source away takes us back to the concern that WPGraphQL is so dynamic, based on a site's needs and the additional WordPress plugins that extend the tree, that it would be extremely difficult to satisfy every possible tree that could be generated. Unless we can bring WPGraphQL fragments into the Gatsby tree somehow.

I guess what we really would love to see is some kind of POC to relieve some concerns... but personally, I feel this proposal is more pro than con.

Disclaimer: I don't know enough about WPGraphQL internals to validate if this approach is good or bad for said project. However, I do manage GraphQL APIs that are quite dynamic in their nature and would need to understand if a generic approach is feasible, or if writing a specific source plugin is my answer.

tl;dr --- This is exciting and I'm looking forward to helping test some different approaches.

> Am I reading correctly that for the sake of performance the source plugin config would restrict what gets added to cache, but not necessarily address the security concern? Unless you restrict sensitive nodes from the source via config (or allow them but use client only routes with an auth layer).

What I meant is that it would be either the source plugin's or its config's job to determine what data is sensitive and whether to include it. Currently Gatsby source plugins tend to grab all the data that is available, so this isn't a concern we have thought about extensively so far. Also, security greatly depends on how access control is defined for the particular source.

> By abstracting away the source into interfaces we could build Gatsby Themes that are firstly WPGraphQL friendly, but that can also be used for other source plugins.
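One way a plugin's config could take on that job is a per-type allow-list, sketched below. The option shape (`fields`) and the `selectAllowedFields` helper are hypothetical; the point is only that anything not listed never reaches the Gatsby cache.

```javascript
// Hypothetical plugin options: per-type allow-lists of fields to fetch.
const pluginOptions = {
  fields: {
    User: ["id", "name"], // email and passwordHash are deliberately left out
    Post: ["id", "title", "date"],
  },
};

// Intersects the fields reported by introspection with the configured allow-list.
function selectAllowedFields(typeName, availableFields, options) {
  const allowed = options.fields[typeName];
  if (!allowed) return []; // types that are not listed are not fetched at all
  return availableFields.filter((field) => allowed.includes(field));
}

const userFields = selectAllowedFields(
  "User",
  ["id", "name", "email", "passwordHash"],
  pluginOptions
);
console.log(userFields);
```

An allow-list errs on the side of fetching too little; a deny-list would be the opposite tradeoff and would pick up any new fields automatically.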

Exactly. The whole point of using Gatsby cache system is that we can build more stuff by relying on our knowledge of that cache and of Gatsby GraphQL. Compatibility to the original source is always secondary, because there is one Gatsby, but hundreds of different sources. Gatsby consistency > consistency with the source.

@freiksenet I think I understand the proposal better now. At first this sounded like a gatsby-source-graphql rewrite, but this is more. This is sourcing GraphQL as part of core in sourceNodes which is a lot more flexible. It means that rather than even pondering about source plugins, I could source from GraphQL endpoints directly in my themes.

🕺🏻🕺🏻🕺🏻 I like this!!! 🕺🏻🕺🏻🕺🏻

Just a few thoughts from a person who mostly does WordPress and wants to improve how it works with Gatsby.
Having a universal schema sounds awesome. It's a great improvement, I like this!
Implementing a universal schema source plugin for all the various data/fields/posts/plugins that WordPress supports even if feasible would take time. Understandably, there will be some use cases where the universal approach won't cut it. It could also be inefficient. For example with media files. WordPress sites tend to accumulate tons of unused images over their lifetime and pulling all these unneeded images and transforming them would take ages.

Can we have it both ways? There are use cases for querying APIs directly with gatsby-source-graphql - prototyping, small projects / personal sites, fast onboarding and so on. GraphQL API provides great developer experience - it's easy to understand, get only what you want, nice coupling with components.

However, there are certain issues with Gatsby queries being rather static (and too eager with parsing), e.g. you can't pass variables to staticQueries.

If this could be improved it would help with building WordPress source themes that could be used more widely. At the moment it looks like Gatsby/WordPress integration needs tweaking and some custom coding most of the time. It won't just work out of the box (for an end-user).

As a user of both gatsby-source-wordpress and gatsby-source-graphql (wp-graphql) I would personally be happy to have features from both worlds. Particularly the unified api of all "traditional" gatsby source plugins makes gatsby great and makes querying and filtering intuitive across all different sources and I think that implementing an automatic or semi-automatic transformer from external graphql schema to gatsby's schema is possible (should be already more straightforward with schemas that implement relay).

However what I like about sourcing from graphql sources and would have lost this way is that nodes can be "live reloaded" whenever the data is queried. I have kinda taken this feature so far, that I'm running gatsby develop as live api service and parsing the gatsby's graphql queries from the sources and transforming them to apollo's gql to be able to reuse them live with apollo (usable for live previews or just refreshing stale data without waiting for rebuild/sourcing ). However I was wondering if gatsby node couldn't have some sort of mechanism to mark itself as a live (refreshable) and then you could be able to choose between getting the data live from external service or gatsby's cache when querying (maybe through arguments). This would turn gatsby into even more powerful api service and you wouldn't have to fetch all the external graphql data when bootstrapping. I also think this could kinda be doable automagically from introspecting the schema from an external graphql sources and also be possible to backport to traditional nodes (with the help of the source plugin authors). I think this would greatly enhance developer experience and also would have opened the door for many new features including live previews without the need to rebuild/resource.

For anyone interested in updates on this: we've built an alpha version of the new toolkit for GraphQL sourcing. Alpha version is available at https://github.com/vladar/gatsby-graphql-toolkit (but it will be eventually merged into the monorepo)

It works quite similarly to how @pristas-peter describes above:

> I think that implementing an automatic or semi-automatic transformer from external graphql schema to gatsby's schema is possible (should be already more straightforward with schemas that implement relay).

> However I was wondering if gatsby node couldn't have some sort of mechanism to mark itself as a live (refreshable) and then you could be able to choose between getting the data live from external service or gatsby's cache when querying (maybe through arguments)

Gatsby allows you to refresh individual nodes. So you can set up a subscription or get a delta of changes since the previous build to update individual nodes. Related doc: https://github.com/vladar/gatsby-graphql-toolkit#sourcing-changes-delta
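A hedged sketch of applying such a delta to the node store follows. The event shape (`type`, `id`, `data`) is hypothetical; `createNode`, `deleteNode` and `getNode` mirror Gatsby's node helpers and are stubbed here, and the exact argument shape of `deleteNode` differs between Gatsby versions.

```javascript
// Applies a list of change events (a "delta") to the node store.
function applyDelta(events, { createNode, deleteNode, getNode }) {
  for (const event of events) {
    if (event.type === "DELETE") {
      const existing = getNode(event.id);
      if (existing) deleteNode(existing);
    } else {
      // CREATE and UPDATE both (re)create the node with fresh data.
      createNode({ id: event.id, ...event.data });
    }
  }
}

// Stubbed store standing in for Gatsby's node store.
const store = new Map([
  ["post-1", { id: "post-1", title: "Old title" }],
  ["post-2", { id: "post-2", title: "To be removed" }],
]);
const helpers = {
  createNode: (node) => store.set(node.id, node),
  deleteNode: (node) => store.delete(node.id),
  getNode: (id) => store.get(id),
};

applyDelta(
  [
    { type: "UPDATE", id: "post-1", data: { title: "New title" } },
    { type: "DELETE", id: "post-2" },
    { type: "CREATE", id: "post-3", data: { title: "Brand new" } },
  ],
  helpers
);
console.log([...store.keys()]);
```

Only the changed nodes are touched, so the rest of the store can be reused from the previous build.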

Super cool work!!!

I think we can close this now that the GraphQL toolkit is published: https://github.com/gatsbyjs/gatsby-graphql-toolkit
