Gatsby: [Gatsby Develop] Building a performant preview server (>10k nodes with dependent pages)

Created on 14 Aug 2019 · 8 Comments · Source: gatsbyjs/gatsby

This initial description of the issue is already dated; follow the progress in the replies.


Summary

Here comes the essence of this question/issue:

In Gatsby's develop server with the refresh webhook endpoint enabled, we create all pages during the bootstrap phase to reflect the current state of our external store/CMS. How can we add/update/remove single pages without relying on the createPages lifecycle? createPages iterates over all pages first and later runs all page queries, ignoring the fact that almost nothing changed. Even an empty webhook/refresh call can cause a rebuild time of many minutes.

Could createPagesStatefully (to create all pages once during bootstrap) together with onCreateNode (to call createPage, for example, on any subsequent node update) be a viable approach?

What follows are some explanations around the topic, some details and of course an example project to showcase the problem. Everything we learned and researched comes directly from the excellent docs and source files. 90% of the experiment is based on the awesome work of @DSchau who created almost everything in Gatsby's e2e suite 👌

That's how the experiment/demo project behaves
You can see that all pages are recreated whenever something is received by the webhook. Here comes the full story:

Relevant information

When talking about large-scale environments, we mean Gatsby installations with a node count beyond 50.000 - 100.000 and the same number of derived pages. Imagine 50.000 news articles served by some headless CMS for this summary. Building with Gatsby in such an environment works for most people as timing is acceptable, and the upcoming incremental build should help in cases where a single new node should only generate one additional page, for example. It gets pretty interesting in Gatsby's develop mode with the page-hot-reloader and the webhook with payload functionality being activated (ENABLE_GATSBY_REFRESH_ENDPOINT).

In such an environment, how can node updates be handled efficiently in Gatsby's develop server after the bootstrap? The bootstrap phase itself will take some time to build and reflect the current state of all data, which is fine. But how can we prevent Gatsby from (re)creating all pages and running all page queries when only a single node is updated/added/deleted, depending on the webhook payload?

The product Gatsby Preview seems to help some people with that challenge, but it's unfortunately not an option when the client's infrastructure is located in a closed network. Hence our current challenge is to serve a custom preview based on Gatsby's develop server.

To be prepared for the technical challenges, we dug through many package sources including the core of Gatsby itself and we read all of the excellent documentation about the Gatsby Internals. Kudos for that awesome summary! Things are still not 100% clear to us but the mental image is already starting to build up.

We already achieved a working prototype by naively pinging the __refresh endpoint with no payload, with a handful of nodes and generated pages being processed. This was a really nice experience, but when we scaled things up it went south. We tried to build 5.000 pages and it already took many minutes (~20min to rebuild all pages after a single node update). There are no images involved; it's the page processing. I'll spare you the details of that installation and created an isolated experiment instead.

Example Project/Experiment

The experiment is based on @DSchau's work on webhook/fake-source in Gatsby's e2e suite.

Here is our example project:
https://github.com/satellytes/gatsby-large-scale-preview-experiment

Run it and trigger some different webhook calls from a second session. Check the README for all of our thoughts when we created the experiment. It overlaps somewhat with this issue description but might help clarify things.

INITIAL_NODES_TO_CREATE=1000 yarn develop

yarn webhook:full-sync
yarn webhook:new-item
yarn webhook:webhook-empty

When running the example we create a set of initial nodes (INITIAL_NODES_TO_CREATE) by calling our new method api.hugeInitialSync only once in sourceNodes. The existing api.sync method is modified to accept a parameter updateAllNodes: true/false, which causes all nodes to be touched by incrementing a field named updated.

When the refresh endpoint is hit we can now decide among those scenarios:

  1. add new items (through the webhook, already present in the e2e project)
  2. touch all nodes and create a new node from inside (triggered by a new flag touchAll in the webhook payload)
  3. do nothing
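
To make the three scenarios concrete, here is a minimal sketch of how a source plugin could tell them apart from the webhook body that Gatsby hands to sourceNodes. The payload shape (an items array and the touchAll flag) is an assumption based on the description above, not necessarily what the experiment's code uses.

```javascript
// Hypothetical payload shape: { items: [...], touchAll: true|false }.
// Field names are assumptions for illustration only.
function classifyRefresh(webhookBody) {
  const body = webhookBody || {};
  if (body.touchAll === true) {
    return "touch-all"; // scenario 2: touch every node, add one new node
  }
  if (Array.isArray(body.items) && body.items.length > 0) {
    return "add-items"; // scenario 1: new items arrive through the webhook
  }
  return "noop"; // scenario 3: empty refresh call
}
```

Inside sourceNodes, the result could decide whether to call the sync method at all, so that an empty refresh does not touch any node.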

The problem

Every time you post to the webhook, every single page is recreated, because we tell Gatsby to do so in createPages, which is called by the api runner if any page is dirty. It doesn't matter if the payload is empty or filled.

  const { data } = await graphql(`
    {
      allFakeData {
        nodes {
          title
          fields {
            slug
          }
        }
      }
    }
  `)

  data.allFakeData.nodes.forEach((node, index) => {
    createPage({})
    //...

See our file gatsby-node.js for full sources.
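
For readers without the repository at hand, the following is a sketch of how the truncated snippet above might look as a complete createPages function. The template path and the page context are placeholders and not taken from the experiment; in gatsby-node.js the function would be assigned to exports.createPages.

```javascript
// Placeholder template path (assumption, not from the experiment).
const PAGE_TEMPLATE = "/absolute/path/to/src/templates/fake-data.js";

async function createPages({ graphql, actions }) {
  const { createPage } = actions;
  const { data, errors } = await graphql(`
    {
      allFakeData {
        nodes {
          title
          fields {
            slug
          }
        }
      }
    }
  `);
  if (errors) {
    throw new Error("allFakeData query failed");
  }

  // One createPage call per node on every run, whether the node changed or not.
  data.allFakeData.nodes.forEach((node) => {
    createPage({
      path: node.fields.slug,
      component: PAGE_TEMPLATE,
      context: { slug: node.fields.slug },
    });
  });
}
```

The loop is the part that scales with the total node count: nothing in it knows which node the webhook actually changed.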

The createPages lifecycle is the idiomatic approach. It works at build time and, for most people (including us with a few pages), also at develop time with the hot reload functionality.

As said, with the preview mode activated we trigger an update (with or without a payload) which creates every page, and in addition all page queries are run again (this happens later in the lifecycle and also costs quite some time). As a result, any small update blocks the development server for minutes, depending on your machine and node count. We are unsure how to prevent Gatsby from doing so in the experiment with the fake api source.

Goals:

  • Create all pages of the current state of your API data during bootstrap
  • During the following lifetime of the server, wait for data changes and update the affected nodes accordingly
  • Whenever a node is added/updated: create that page
  • Whenever a node is deleted: delete the page
  • Prevent Gatsby from running all page queries for that node type as I know that only a single node instance changed.
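
Expressed as code, the goals amount to computing a small diff between two snapshots of the data instead of walking all nodes. The sketch below is independent of Gatsby; the version field is a hypothetical change marker (for example the updated counter or a content digest), and the returned lists would feed createPage and deletePage calls.

```javascript
// previous/current: Map of node id -> { slug, version }.
// `version` is a hypothetical change marker, not a Gatsby field.
function planPageChanges(previous, current) {
  const plan = { create: [], remove: [] };
  for (const [id, node] of current) {
    const before = previous.get(id);
    if (!before || before.version !== node.version) {
      plan.create.push(node.slug); // new or updated node: (re)create its page
    }
    if (before && before.slug !== node.slug) {
      plan.remove.push(before.slug); // slug changed: the old page is obsolete
    }
  }
  for (const [id, node] of previous) {
    if (!current.has(id)) {
      plan.remove.push(node.slug); // node deleted: delete its page
    }
  }
  return plan;
}
```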

Here some approaches:

  • Remove createPages and call createPage in onCreateNode instead as it's available through the boundActionCreators.
  • Do we need to leverage the createPagesStatefully lifecycle hook? We tried that and indeed the pages are not re-created upon refresh as intended; however, all the page queries are re-evaluated nevertheless.
  • Do we have to manage page-node dependencies manually with createPageDependency to prevent an updated node from triggering an update for all nodes of the same type?
  • Should we access the internal emitter and use some low-level data? Maybe the store?
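
The first approach could be sketched roughly as follows. This only illustrates the idea; whether pages created this way survive the next refresh is exactly the open question of this issue. The node type name FakeData is inferred from the allFakeData query, the template path is a placeholder, and the handler assumes the slug field is already present on the node when it runs. In gatsby-node.js the function would be assigned to exports.onCreateNode.

```javascript
// Placeholder template path (assumption).
const NODE_PAGE_TEMPLATE = "/absolute/path/to/src/templates/fake-data.js";

function onCreateNode({ node, actions }) {
  // Only react to the node type we derive pages from (assumed name).
  if (node.internal.type !== "FakeData") {
    return;
  }
  // Assumes the slug field has already been attached to the node.
  if (!node.fields || !node.fields.slug) {
    return;
  }
  actions.createPage({
    path: node.fields.slug,
    component: NODE_PAGE_TEMPLATE,
    context: { slug: node.fields.slug },
  });
}
```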

It would be awesome if we could get a little discussion running around this topic, as this might be of interest for other people working with many pages + gatsby develop server/preview.

Source Insights

We have checked many parts of Gatsby's sources; here are some interesting files we dug through:

  • page-hot-reloader.js
    You can see that nodes added/deleted set the pagesDirty flag which causes all pages to be created once the api runner has settled.
  • develop.js
    That's where the refresh/preview mode (ENABLE_GATSBY_REFRESH_ENDPOINT) is activated. We can see that sourceNodes is triggered.
  • utils/source-nodes.js
    We can see how the api runner is activated. That's when we technically understood why the webhook causes the lifecycle createPages to be called.
  • gatsby-source-graphql/src/gatsby-node.js
    We found createPageDependency in the wild (inside a plugin/source) only in the gatsby-source-graphql plugin.

  • What's happening in the internal redux store can be seen here: packages/gatsby/src/redux. It did not help much. We looked up the things happening around page dependencies.

I'm sorry for the length of the topic. I wanted to provide as much information as possible. I also joined the Discord channel but I think the topic is worth discussing in this question issue.

Thanks for reading and I appreciate any input on this topic.

stale? question or discussion

All 8 comments

We made some progress.

  • createPagesStatefully is the way to go, at least for the initial bootstrap to generate most of the pages. They are indeed stateful because we are not going to change them usually.

  • When a node is updated those pages still update, and only the pages connected with the nodes. I don't know what was wrong in my example.

  • For any new node we can't create a page for them in onCreateNode as those pages are the default/dynamic/non-stateful ones.

  • The hot reloading mechanism kills every page that isn't touched, see https://github.com/gatsbyjs/gatsby/blob/0260f88a43123cfc3b17124c4aba5e11aebc28ea/packages/gatsby/src/bootstrap/page-hot-reloader.js#L43-L52

  • So we are basically not supposed to use createPage outside the createPages lifecycle, as those pages are deleted in the next lifecycle round.
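
The deletion rule described in the last two points can be modelled as a small filter. This is a simplified sketch of the behavior as we understand it, not Gatsby's actual source; the field names updatedAt and pluginCreatorId are assumptions about the page objects.

```javascript
// Simplified model (not Gatsby's source): after a createPages run, every
// non-stateful page that was not re-created during that run is deleted.
function findPagesToDelete(pages, runStartedAt, statefulPluginIds) {
  return pages.filter(
    (page) =>
      !statefulPluginIds.includes(page.pluginCreatorId) &&
      page.updatedAt < runStartedAt
  );
}
```

Under this model, a page created once from onCreateNode carries an old timestamp at the next refresh and is removed unless createPages creates it again, while pages owned by a stateful plugin are left alone.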

Idea: For every new node mark them as new so we can query them (instead of all nodes that already have a page). That way we can have a smaller set of pages we have to query and rebuild.
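
A rough sketch of that idea: set a node field on every node of the type and let createPages query only the nodes where it is true. The field name isNew, the type name FakeData and the helper names are assumptions; createNodeField is the regular Gatsby action, and finishBootstrap would be called from a hook such as onPostBootstrap.

```javascript
// Tracks whether the initial sync is over (helper names are hypothetical).
const bootstrapState = { finished: false };

function finishBootstrap() {
  bootstrapState.finished = true;
}

// Would be wired up as exports.onCreateNode in gatsby-node.js.
function markNewNode({ node, actions }) {
  if (node.internal.type !== "FakeData") {
    return;
  }
  // Set the field on every node so that it always exists in the schema.
  actions.createNodeField({ node, name: "isNew", value: bootstrapState.finished });
}

// Query for createPages that only returns nodes created after bootstrap.
const NEW_NODES_QUERY = `
  {
    allFakeData(filter: { fields: { isNew: { eq: true } } }) {
      nodes {
        fields {
          slug
        }
      }
    }
  }
`;
```

Writing false for the initial nodes rather than omitting the field is a deliberate choice here: a filter on a field that no node has would not be part of the inferred schema.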

Well let's make this issue useful for other souls searching for a preview. I will add useful links to articles but mostly source files in Gatsby in this post:

  • Create a First Class Source Plugin for Gatsby Preview (official article)
    A nice big-picture overview of how to author such a source plugin. The example is not using the refresh endpoint; instead, a long-living web socket connection is created.
  • gatsby-source-prismic-graphql
    A totally different approach by Prismic with their dynamic preview page. That's a little bit too much fiddling with GraphQL instead of trying to work together with the Gatsby framework. But it works and it's a possible fallback if anything else fails.
  • gatsby-source-graphql
    We rely on GraphQL as our endpoint and of course we are using the official plugin that does the schema stitching. We don't know yet if this is fine or if we have to create our own fork of it, as we are currently under the assumption (after some tests) that with that plugin all nodes of a single type get somehow touched. Maybe through the page dependency on the node type here:
    https://github.com/gatsbyjs/gatsby/blob/aed2414e5740774ffbcd9ecb29f8901e6188a1c9/packages/gatsby-source-graphql/src/gatsby-node.js#L91-L97
  • When working with the GraphQL schema stitching I think we might have to use resolvers (something like this https://www.gatsbyjs.org/docs/schema-sift#processednodetype-resolve-function and this https://www.gatsbyjs.org/docs/page-node-dependencies/) to get our hands on the nodes. But that feeling is only blurry at the moment.

I will try to continue/edit this list.

Can't believe this, the initial example setup is wrong:

activity.setStatus(
   `Creating ${index + 1} of ${totalPages} total pages`
);

The activity timer drastically slows down the example and gives the illusion createPages is running slow. This doesn't mean that we don't have real performance problems in our build but our whole isolation of the problem and the debugging is based on false facts.

You can mimic this behaviour by dropping this in your createPages:

// `reporter` is passed to createPages alongside `graphql` and `actions`
const activity = reporter.activityTimer(`create pages`);
activity.start();

for (let i = 0; i < 1000; i++) {
  activity.setStatus(
    `[DUMMY] Creating ${i + 1} of ${1000} total pages`
  );
}
activity.end();

This takes 7 seconds on my machine just to run the for loop. I found the activity timer as it's being used by the page queries info spinner. The main difference: graphql queries are reported asynchronously while I'm using a synchronous for loop. It might be worth raising this as an issue for the reporter/activity functionality.
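
One way to keep progress output without paying for a status update on every iteration is to report only every N-th page. This is a sketch of a possible workaround, not something the thread settled on; the interval of 100 is arbitrary and the helper only relies on the activity object having a setStatus method.

```javascript
// Forward a status update only every `everyN` pages and for the last page.
function reportProgress(activity, done, total, everyN = 100) {
  if (done % everyN === 0 || done === total) {
    activity.setStatus(`Creating ${done} of ${total} total pages`);
  }
}
```

Called as reportProgress(activity, index + 1, totalPages) inside the loop, this reduces 1000 setStatus calls to 10.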

Hiya!

This issue has gone quiet. Spooky quiet. 👻

We get a lot of issues, so we currently close issues after 30 days of inactivity. It's been at least 20 days since the last update here.

If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!

As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing!

Thanks for being a part of the Gatsby community! 💪💜

Let's close this until we have a more specific problem to talk about.

@georgiee would love your insights if you've got a functioning solution -- paying for the Gatsby Preview currently. Encountering issues re: support responsiveness and evaluating building our own solution.

@nadinagerlach Apologies for the issues with support responsiveness. I've gone ahead and responded to all your tickets and taken care of the issue as well! 🙂

Hello @nadinagerlach,
we currently focus on getting the actual page implementations done and postponed the work on the preview server. We have had some agile spikes to explore possibilities.

Some things we considered:

  • Continuously check other gatsby source plugins (especially the ones that support Gatsby Preview) to gain some technical insights.
  • Continue to explore the stateful pages approach. The idea in a nutshell: Create all pages statefully (kind of a baseline). Whenever a CMS change comes in, derive a dynamic page to benefit from the Gatsby update magic. That way we have a huge bucket of static pages that are not changing and a bucket of pages that changed in the recent past. The latter bucket needs to be emptied from time to time to maintain performance. This could be a restart during the night.
  • Our data comes from an external GraphQL server and we include it through schema stitching. This also means our data never lands in the internal redux store that is used by Gatsby to build the data graph for the internal GraphQL endpoint. Any created page is still sitting in that store but we are unsure how this relates/behaves in the overall preview (develop) and build workflow. We are collecting some knowledge on this too.

The last time I personally worked on our preview server problem was Summer 2019. A lot of things happened since then and maybe some more resources on building a preview server appeared? There is an excellent documentation section about all the internals of Gatsby called Gatsby Internals. Reading that together with the Gatsby source code helped a lot, but it would help a lot to have more guidance for building your own preview server as it's such a crucial part of a Gatsby installation beyond a specific size.

I hope you have a better experience and I would be happy to hear about your preview experiences 🙏
