Gatsby: Performance issues building a large number of pages (~16k) on Gatsby V2

Created on 16 Aug 2018 · 58 comments · Source: gatsbyjs/gatsby

Summary

I'm building a website that contains lots of pages (~16k), with GraphQL requests on each page. I've done some benchmarks, and the build at the moment (~12k pages) takes ~25 minutes.

Relevant information

The website fetches data from different sources (Contentful and JSON files that are added to GraphQL). That data is then used on every page, with its own GraphQL query on each page.

```
success source and transform nodes — 7.582 s
success building schema — 1.363 s
success createPages — 0.818 s
success createPagesStatefully — 4.484 s
success onPreExtractQueries — 0.008 s
success update schema — 0.765 s
success extract queries from components — 0.244 s
success run graphql queries — 1021.016 s — 11582/11582 11.34 queries/second
success write out page data — 0.105 s
success write out redirect data — 0.001 s
success onPostBootstrap — 0.262 s

info bootstrap finished - 1049.157 s

success Building production JavaScript and CSS bundles — 47.377 s
success Building static HTML for pages — 443.926 s — 11582/11582 26.59 pages/second
info Done building in 1545.152 sec
✨ Done in 1552.26s.
```

A possible optimization could be to remove most of the queries from each page and do one bigger query in gatsby-node.js? But we would still have a build time for the static pages of around 450 seconds.

File contents (if changed)

gatsby-node.js: I'm fetching all the pages in queries and then looping through that array to create every page.
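For reference, a minimal sketch of that createPages pattern (a sketch only: the allPagesJson type, its fields, and the template path are placeholders, not the actual site's code). Each created page still runs its own template query later, during the run graphql queries step:

```js
// gatsby-node.js: query the list of pages once, then create one page per node.
const path = require('path')

exports.createPages = async ({ graphql, actions }) => {
  const { createPage } = actions

  // One query for the full list of pages.
  const result = await graphql(`
    {
      allPagesJson {
        nodes {
          id
          slug
        }
      }
    }
  `)

  result.data.allPagesJson.nodes.forEach(node => {
    createPage({
      path: node.slug,
      component: path.resolve('./src/templates/page.js'),
      context: { id: node.id }, // becomes $id in the template's own query
    })
  })
}
```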

Labels: stale?, question or discussion

Most helpful comment

@sheerun and what's the way?

All 58 comments

What do the pages and their queries look like? My initial guess is that you're using gatsby-image a ton and generating a bunch of blur-up effects?

I'm not using gatsby-image at the moment, but the pages consist of text and images. The images are loaded via a URL string into another lazy-loading lib.

Is there any good/easy way to monitor the GraphQL queries, for example to understand which queries take the most time and how often they run?

Can you paste some queries and the data they return?

There are ways to trace graphql queries. We haven't got this working just yet but hope to extend our earlier work here https://next.gatsbyjs.org/docs/performance-tracing/

Also doing normal perf analysis could turn up some problems. Follow this guide and do a performance analysis in chrome dev tools while graphql queries are running https://next.gatsbyjs.org/docs/debugging-the-build-process/#chrome-devtools-for-node

Do the graphql queries hit an internal memory store of the data that you generate in an earlier step? I'm curious about this issue as well, 200k pages run this step at ~30 queries per second, which makes it take about 2 hours.

@chuntley these aren't normal query speeds, which is why we're trying to debug what's going on in the site. gatsbyjs.org, for example, does ~300 queries/second.

Query running is single-threaded as well at the moment, which we'll make multi-threaded in the future.

@chuntley or are you saying you have another site w/ 200k pages?

Here's an example of what the queries look like. There are ~6 similar queries run sequentially, one after another, with a similar structure (this query lives in the component that is passed to _createPage()_):

```
query_1(id: {eq: $id}) {
  ...[fragment_name_1]
}

query_2(id: {eq: $id}) {
  ...[fragment_name_2]
}

...

fragment [fragment_name_X] on Query {
  attribute
  nestedAttribute {
    id
  }
  ...
}
```

Some of the fragments have around 20-30 attributes. The majority of the attributes are strings. Thanks for the links, I will take a look at them and see if I can dig deeper into what is taking so long.

@chuntley not what I've seen at least.

By the time my build process gets to this step, memory usage is around 2 GB. Is there a chance that Node performance decreases once you go past the original 1.5 GB limit?

@chuntley it can. Best thing to do is do a performance analysis as I mentioned earlier as you can then see which functions are using the most time.

Something interesting to note is that performance usually starts high, but as the run goes along (noticeably around 10-20k documents processed), it begins to slow down. For example (using the same data set with a limit set):

10k pages: 300 per second, start to finish
200k pages: Starts at 70 per second, finishes at 12 per second

I'm having a similar issue, but with a relatively small number of files. I have around ~2500 markdown files, and I get FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory during the run graphql queries — 691/2527 7.01 queries/second step. The queries for most of the pages simply get the HTML for the markdown. It simply kills it at around 1.5 GB of memory.

Edit: sorry, I was using rc.0; after upgrading to rc.15 the problem is no longer present.

Just a general note for people posting/reading here: performance is complex. There's a _ton_ of things that can affect your site's build performance, from Gatsby's code itself, to plugins you're using, React components you're using, JS libs you're using, and of course your own code. So if you run into performance problems, it's most useful if you can reproduce the same problems with one of our benchmark sites, or by making some small changes to them that you share.

The only way we can make improvements is if we can see the same problems on our own machine.

@KyleAMathews I tried to cut down all the other stuff and can still reproduce the issue. I have created a repo at https://github.com/eLod/gatsby-bench; it produces the error FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory during the extract queries from components step on my machine, for both gatsby develop and gatsby build (it is killed around 350-400/782 for me). I tried to check with Zipkin, but I only see the traces until the query extraction starts, and I don't see memory usage either. The content is a C library documentation generated with doxygen & moxygen.

@m-allanson @pieh @DSchau @rase- and I met to investigate this issue this morning.

@pieh has been talking to @eLod about his site and his comment https://github.com/gatsbyjs/gatsby/issues/7373#issuecomment-420288443

We tried removing the cache.sets in transformer-remark as he suggested, and also saw that this solved the rapidly growing memory usage seen with larger markdown sites.

This seems mostly due to avoiding copying objects in memory (to the cache).

We also write out the cache at an extremely fast clip (every 250ms), which uses CPU/memory to stringify the data and gets more problematic as the cache grows. Removing that sped up query running quite a bit. https://github.com/gatsbyjs/gatsby/blob/33b4c76732906e710d69cc860bc0ea141c5be327/packages/gatsby/src/utils/cache.js#L69

@KyleAMathews I've continued investigating our data model a bit for speed improvements. Will get back to you if I find anything interesting.

Regarding your benchmark test, I have some questions about speed. What I can see now is that gatsby develop and gatsby build have a huge difference in queries/second. Have you seen anything similar on your machine?

The run speed is around 40-60 queries/second in gatsby develop and ~400 queries/second when I run gatsby build. Is this normal? From what I can remember (in the Gatsby v2 beta), I saw more even numbers between develop and build (but that might be me dreaming?).

We were having similar issues with 1700 pages. We were able to increase performance by removing a GraphQL query from a template and passing that data through pageContext via a source plugin. This allowed it to run once instead of 1700 times. Beware though: passing too much data can cause memory issues. An obvious mistake, but I hope this helps someone. A sketch of the idea follows below.
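A minimal sketch of that approach (assumed names: the allArticle type and its fields are placeholders for whatever the source plugin provides). The template exports no query of its own and instead renders props.pageContext.article:

```js
// gatsby-node.js: run the expensive query once, push results into context.
const path = require('path')

exports.createPages = async ({ graphql, actions }) => {
  const { createPage } = actions

  const result = await graphql(`
    {
      allArticle {
        nodes {
          id
          slug
          title
          body
        }
      }
    }
  `)

  result.data.allArticle.nodes.forEach(article => {
    createPage({
      path: article.slug,
      component: path.resolve('./src/templates/article.js'),
      // No query in the template, so "run graphql queries" has
      // nothing left to do for these pages.
      context: { article },
    })
  })
}
```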

Yes, I agree with this (I have 5 languages); it's much better to prefetch everything in gatsby-node and store the results in context.
I also think it's better to separate the view and the controller.
I use a filesystem cache coupled with queries in gatsby-node to avoid refetching pages that haven't been updated.

gatsby-node is a bit complicated for beginners, but it's such a nice feature of Gatsby.
It would be nice if Gatsby gave more examples to teach that.

Some thoughts:
Unfortunately Gatsby doesn't provide a way to manually fetch a GraphQL query without having to use a StaticQuery.
I would like to fetch a GraphQL query before a component renders, as in componentWillMount..?
Is there some way to do that? (Of course we are not talking about a template or a page's context here.)

I'm currently building a blog with articles containing lots of images and getting between 2-4 queries per second on build (1 query = 1 article). It is my understanding that gatsby-image requires creating thumbnails of images on disk at build time. Is it possible that creating these images makes queries slower?

Can attest to doing the expensive stuff in one shot in gatsby-node.js. I'm using Gatsby with Hasura to build about 40k pages (currently unpublished). Getting all the data in one shot takes about 25s, whereas god only knows what the 40k queries run serially would take. I didn't bother to figure it out, as just 1000 pages took many minutes.

@eads fun project! :-D

And woah... that's a big time diff :-(

@KyleAMathews I am running into some memory issues, but I'll open a separate issue.

But definitely, one 30s query in gatsby-node.js works a lot better than 40k 0.5s queries in this case and I highly recommend it as an approach.
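A sketch of that one-shot pattern against a remote GraphQL endpoint (everything here is invented for illustration: the endpoint URL, query, and field names; it also assumes node-fetch is installed):

```js
// gatsby-node.js: one network round-trip for all page data, then one createPage per row.
const path = require('path')
const fetch = require('node-fetch')

exports.createPages = async ({ actions }) => {
  const { createPage } = actions

  // A single request instead of one remote query per page.
  const res = await fetch('https://example-hasura.example.com/v1/graphql', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: `{ items { id slug title body } }` }),
  })
  const { data } = await res.json()

  data.items.forEach(item => {
    createPage({
      path: `/items/${item.slug}`,
      component: path.resolve('./src/templates/item.js'),
      context: { item }, // the template renders pageContext.item, no page query
    })
  })
}
```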

Oh, I should probably note that your situation is a bit new, @eads, in that you're using gatsby-source-graphql with a remote API, which means every call has network latency. Currently we hard-code things so we only run 4 queries at a time. With remote APIs, we should run far more queries concurrently to speed things up.

Moving the conversation over here from Not the Gatsby Gazette 2018-11-28 - good shout @pieh

I've created a PR adding in some CPU control in html-renderer-queue.js (multi-core builds) which includes some tweaks we've made to improve our larger site builds.

Our main site has ~25k nodes, most of which have a combination of static data (that runs through Gatsby's static build) and dynamic data on React app load. We've managed to reduce build times from ~10 mins down to ~6 mins using these CPU controls - specifically by using logical_cores instead of physical_cores. Rather biased tests, as they use our app, but encouraging results.

Courting thoughts from people in this issue who are involved with large site builds...

Many Gatsby examples select the data relevant to a single page with a GraphQL query like so:

```
query($id: String!) {
  # Select the post which equals this id.
  postsJson(id: { eq: $id }) {
    ...PostDetail_details
  }
}
```

What is the performance of such a query? Does it matter how many records there are? If selecting a single record by id is not O(1), this could potentially make the whole build an O(n^2) operation.

We've got a similar problem. In our case there is a CMS delivering different language versions of a website (currently only English and German). In our gatsby-node.js we get all the entries for posts, pages and so on, and use createPage() to generate them with siteId and language passed in as context, so that template queries can identify the language to use for content queried in the template.

Because all the navigation and global information we use in the layout needs to be translated into the language the page is created in, we quickly landed at around 450 queries that take about four minutes to run. That's because we have six templates and each template fetches, for _every_ page/post/whatever, all the global and navigation information for that language. Of course, I'm aware that this, already in sheer theory, cannot scale well.

Here's an example of two template queries:

_Query for templates/home.js:_

```
export const query = graphql`
  query TemplateHome($id: [Int]!, $siteId: Int!) {
    craft {
      # Globals
      globals(siteId: $siteId) {
        ...GlobalNavigationSocialQuery
        ...GlobalCtaSignUpQuery
        ...GlobalCookieInfoQuery
        ...GlobalSeoQuery
      }

      # NavigationHeader
      navigationHeaderItems: entries(
        section: navigationHeader
        siteId: $siteId
      ) {
        ...NavigationHeaderQuery
      }

      # NavigationFooter
      navigationFooterItems: entries(
        section: navigationFooter
        siteId: $siteId
      ) {
        ...NavigationFooterQuery
      }

      # NavigationMain
      navigationMainItems: entries(section: navMain, siteId: $siteId) {
        [some_fields]
      }

      # Single: home
      entry(id: $id, siteId: $siteId) {
        ... on Craft_Home {
          [some_fields]
        }
      }
    }
  }
`;
```

_Query for templates/pages.js:_

```
export const query = graphql`
  query TemplatePages($id: [Int]!, $siteId: Int!) {
    craft {
      # Globals
      globals(siteId: $siteId) {
        ...GlobalNavigationSocialQuery
        ...GlobalCtaSignUpQuery
        ...GlobalFooterSectionQuery
        ...GlobalCookieInfoQuery
        ...GlobalSeoQuery
      }

      # NavigationHeader
      navigationHeaderItems: entries(
        section: navigationHeader
        siteId: $siteId
      ) {
        ...NavigationHeaderQuery
      }

      # NavigationFooter
      navigationFooterItems: entries(
        section: navigationFooter
        siteId: $siteId
      ) {
        ...NavigationFooterQuery
      }

      # Page
      entry(id: $id, siteId: $siteId) {
        ... on Craft_Pages {
          [some_fields]
        }
      }
    }
  }
`;
```

So I would love to know if putting things like navigation and global information into the context when using createPage(), like @JordanDDisch explained they did, is a valid way of doing translated pages. And if so, can anyone provide an example of how they would approach gathering that information before createPage() and using it there? Maybe @simonjoom has some pointers too?

Before the second language came along we used StaticQueries and all was well with the world.

Since this is not only performance related, feel free to point me in the right direction regarding the right place to discuss it. I'm out of pages to read in the docs and out of words to google for.

Thanks ✌️

@seamofreality Can't help you much on translations. We don't do them (thank god). But our graphql queries step now takes 10s, after moving those pesky queries out of page templates!

@JordanDDisch Thanks for your quick reply! You mentioned that you pushed data through pageContext; I was wondering if you have an example of how you did that.
We just had static queries for navigations and global information that stay the same across pages, but as I explained, with different languages we were missing the language context in those queries and had to move them.

@seamofreality @JordanDDisch

We have a similar setup and moved all (most of them, and we will move the rest soon) queries to gatsby-node.js, passing the data through pageContext. Translations should not be an issue; it all depends where you want to do the queries and the structure around them. We have around ~35 different translations on our site, with around 600 pages per locale. A sketch of the approach follows below.
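To make that concrete, a sketch of one way to structure it (all type, field, and locale names here are hypothetical, not @stoltzrobin's actual code): run the shared globals/navigation query once per locale, cache the results, and spread them into the context of every page in that locale.

```js
// gatsby-node.js: fetch shared, translated data once per locale,
// then reuse it in the context of every page in that locale.
const path = require('path')

exports.createPages = async ({ graphql, actions }) => {
  const { createPage } = actions
  const locales = ['en', 'de'] // hypothetical locale list

  // One globals query per locale instead of one per page.
  const globalsByLocale = {}
  for (const locale of locales) {
    const result = await graphql(
      `
        query Globals($locale: String!) {
          globals(locale: { eq: $locale }) {
            navigationMain
            navigationFooter
            seoDefaults
          }
        }
      `,
      { locale }
    )
    globalsByLocale[locale] = result.data.globals
  }

  const pages = await graphql(`
    {
      allPage {
        nodes {
          id
          slug
          locale
        }
      }
    }
  `)

  pages.data.allPage.nodes.forEach(page => {
    createPage({
      path: `/${page.locale}${page.slug}`,
      component: path.resolve('./src/templates/page.js'),
      // The template query only fetches the page entry by $id;
      // navigation/globals arrive pre-fetched through context.
      context: { id: page.id, globals: globalsByLocale[page.locale] },
    })
  })
}
```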

@seamofreality https://gist.github.com/JordanDDisch/5fa7f3972b9a4ff91cb7469c01eea1a6/fe91cce8bd34019b1d6e26bd3d2e7af93c34de5e that's how we pass stuff through page context. Not sure if that helps.

@stoltzrobin have you run into memory issues from moving most of your graphql queries to gatsby-node.js?

@stoltzrobin I've also moved stuff to sourceNodes but found the authoring experience suffers. You add a new Markdown doc that uses mapping to get linked to another and it requires stopping gatsby develop and rm -rf .cache, many times a day.

I need to find an API similar to sourceNodes but which runs on each filesystem change.

Hiya!

This issue has gone quiet. Spooky quiet. 👻

We get a lot of issues, so we currently close issues after 30 days of inactivity. It’s been at least 20 days since the last update here.

If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!

Thanks for being a part of the Gatsby community! 💪💜

Would it be possible to change the development server to run page-exported queries on demand, rather than running all queries upfront?

It’s necessary for a production build, but it probably makes some people wait a long time before being able to develop, or it makes them work around the issue by passing data via pageContext.

Edit: Answered in my spectrum question.

> @seamofreality https://gist.github.com/JordanDDisch/5fa7f3972b9a4ff91cb7469c01eea1a6/fe91cce8bd34019b1d6e26bd3d2e7af93c34de5e that's how we pass stuff through page context. Not sure if that helps.

> @stoltzrobin have you run into memory issues from moving most of your graphql queries to gatsby-node.js?

Sorry for the late response, but yes, we have had some memory issues (not sure though if they were due to moving the queries), but we solved them by adding a filter to Contentful #12939 (to minimize the number of nodes created).

> @stoltzrobin I've also moved stuff to sourceNodes but found the authoring experience suffers. You add a new Markdown doc that uses mapping to get linked to another and it requires stopping gatsby develop and rm -rf .cache, many times a day.
>
> I need to find an API similar to sourceNodes but which runs on each filesystem change.

Sorry for the late response; yes, we have had these issues as well. But we moved away from running a dev server for authors: we always build the application instead, and try to use the internal cache as well as possible to lower build times. This has worked well for us.

In a CI/CD environment, would you recommend caching the .cache and public folders?

> In a CI/CD environment, would you recommend caching the .cache and public folders?

We save our .cache and public folders on S3 and retrieve them when building the site again. This lowered our build time by quite a lot.

I'm evaluating Gatsby for a React website with 2.5 million pages right now, which would really benefit from Gatsby's SEO/perf advantages... unfortunately, the build times look untenable. It'd be neat if gatsby build could run concurrently across multiple servers, or, like, on something with 96 CPUs.

Edit: maybe it'd be possible to make the 10k most frequented pages static and have the others stay dynamic?

@ashtonsix I'm having big trouble with ~1500 pages. I would highly recommend you write your own very simple code for building that number of pages.

In my experience, the images take the most time. If you can, turn that off in the config and just load images from absolute URLs.

What would y'all recommend I do when trying to source and transform over 160k (160,000+) nodes using gatsby-source-mysql? MySQL just times out when I do a select query of the entire database. If I put a limit on it, it works fine, but I need the entire database for this app.

You'd probably want to add paging support to gatsby-source-mysql then, so it doesn't try to query everything at once.
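For illustration, a sketch of what config-level paging could look like (hedged: the queries option shape matches the snippet later in this thread, but the connection details, table name, and batch sizes are invented):

```js
// gatsby-config.js: split one huge SELECT into fixed-size batches
// so no single query has to scan the whole table at once.
const BATCH_SIZE = 10000 // hypothetical
const TOTAL_ROWS = 160000 // hypothetical

const queries = []
for (let offset = 0; offset < TOTAL_ROWS; offset += BATCH_SIZE) {
  queries.push({
    statement: `SELECT * FROM clips ORDER BY id LIMIT ${BATCH_SIZE} OFFSET ${offset};`,
    idFieldName: 'id',
    // Caveat: if the plugin derives a node type from each name,
    // per-batch names will produce per-batch node types.
    name: `clipsBatch${offset / BATCH_SIZE}`,
  })
}

module.exports = {
  plugins: [
    {
      resolve: 'gatsby-source-mysql',
      options: {
        connectionDetails: {
          host: 'localhost',
          user: 'db-user',
          password: 'db-password',
          database: 'clips-db',
        },
        queries,
      },
    },
  ],
}
```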

@KyleAMathews I ended up paginating the queries by month, but it still times out with this error...

```
⠋ building schema
\node_modules\yoga-layout-prebuilt\yoga-layout\build\Release\nbind.js:53
        throw ex;
        ^

Error: Quit inactivity timeout
    at Quit.<anonymous> (\node_modules\mysql\lib\protocol\Protocol.js:160:17)
    at Quit.emit (events.js:198:13)
    at Quit._onTimeout (\node_modules\mysql\lib\protocol\sequences\Sequence.js:124:8)
    at Timer._onTimeout (\node_modules\mysql\lib\protocol\Timer.js:32:23)
    at ontimeout (timers.js:436:11)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
    at Timer.processTimers (timers.js:223:10)
```

Here's the relevant code snippet for the custom query paginator I wrote.

```
// Assumes moment is installed; monthsSinceLaunch is computed elsewhere.
const moment = require('moment')

let queries = []
let currentMonth = 1
for (let i = 0; i < monthsSinceLaunch + 1; i++) {
  let month = moment().subtract(currentMonth, 'months')
  let monthStr = month.format('YYYY-MM-')
  queries.push({
    statement: `SELECT * FROM clips WHERE created_at \
    BETWEEN cast('${monthStr + '01'}' as DATE) \
    AND cast('${monthStr + '31'}' as DATE);`,
    idFieldName: 'id',
    name: `${month.format('MM') + month.format('MMM') + month.format('YYYY')}Clips`
  })
  currentMonth++
}
```

Example Output

```
{ statement:
     'SELECT * FROM clips WHERE created_at     BETWEEN cast(\'2019-03-01\' as DATE)     AND cast(\'2019-03-31\' as DATE);',
    idFieldName: 'id',
    name: '03Mar2019Clips' }
```

In case anyone is interested in optimizing build times and isn't familiar with pageContext, I wrote an article explaining it: https://nickdrane.com/optimizing-gatsby-build-times-for-large-websites-using-pagecontext

@nadrane Nice article but your caveat at the bottom was the killer for me. Gatsby's change tracking is broken by this speedup and you wind up having to delete your cache on any small change, otherwise you won't see changes.

@pauleveritt Are you sure about that? I thought that although the hot-reloading stops working, cache-busting works fine. It's my understanding that Gatsby has a file watcher configured to look for changes to gatsby-node.js, and when it changes, the dev environment rebuilds.

@nadrane You're right: if the edit is to gatsby-node.js itself, it rebuilds. If the edit is to some markdown file that affects a query done once in gatsby-node.js, it's an open question.

As an example, let's say you have a site with authors and a GraphQL query in gatsby-node.js that fetches the collection of authors once, then passes it into the context of each page. The page then gets the current author and displays the title.

A change to an author's title won't result in each page displaying that value getting updated.

@pauleveritt Yeah that's a good question. I'm not even sure how Gatsby handles this without the optimization. I have to assume that Gatsby Source Filesystem is setting up file watchers for us.

Regardless, I'm curious to learn whether you think this is a practical concern. The reason I say that is that I'd imagine you'd only want to use this optimization in performance-critical scenarios, and I'd suspect that any query against your filesystem is going to be fast already. In my experience, the place where this strategy is most valuable is when each query crosses the network, introducing network latency into each request.

Hiya!

This issue has gone quiet. Spooky quiet. 👻

We get a lot of issues, so we currently close issues after 30 days of inactivity. It’s been at least 20 days since the last update here.

If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!

As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing!

Thanks for being a part of the Gatsby community! 💪💜

+1

Hey again!

It’s been 30 days since anything happened on this issue, so our friendly neighborhood robot (that’s me!) is going to close it.

Please keep in mind that I’m only a robot, so if I’ve closed this issue in error, I’m HUMAN_EMOTION_SORRY. Please feel free to reopen this issue or create a new one if you need anything else.

As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing!

Thanks again for being part of the Gatsby community!

It seems Gatsby found a way to solve this issue

@sheerun and what's the way?

We've been having out-of-memory issues in our CI environment. The error occurs during the Building static HTML for pages stage. After some debugging, we noticed that the number of jest workers was 18, even though we were already setting the number of workers to 1 using the GATSBY_CPU_COUNT environment variable.

We did notice that the environment variable is now being completely ignored: true is being passed as an argument, hence the environment variable being disregarded: https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/src/utils/worker/pool.js#L6
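To illustrate the mechanism being described (a simplified paraphrase of the reported behavior, not the actual gatsby-core-utils source; the function and flag names are approximations):

```js
// Simplified paraphrase; not the real implementation.
const os = require('os')

function cpuCoreCount(ignoreEnvVar = false) {
  // When the caller hard-codes true here, GATSBY_CPU_COUNT is never consulted,
  // which matches the symptom above (18 workers despite GATSBY_CPU_COUNT=1).
  if (!ignoreEnvVar && process.env.GATSBY_CPU_COUNT) {
    return Number(process.env.GATSBY_CPU_COUNT)
  }
  return os.cpus().length
}

// pool.js effectively called cpuCoreCount(true), so the env var had no effect.
```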

By manually editing the file and setting it to true, all our memory issues went away and the speed of the HTML pages build increased 10X. Before it was 30 pages/second, now it's almost 300 pages/second.

Hope this helps.

We had the same problem described in the previous comment (all credit to @leonfs for debugging it); our builds fail with out of memory errors on a CI env which reports 18 cores; forcing the number of reported cores to 1 (by overwriting node_modules/gatsby-core-utils/dist/cpu-core-count.js) fixes the problem.

Gatsby really needs to provide some way to properly control this.

@leonfs @juliangoacher that issue happened to be fixed yesterday; is this still a problem with that fix?

Any chance I could build your site and check for additional perf bottlenecks on your config and our (Gatsby) build pipeline?
