Gatsby: [Request] Real-world Gatsby sites (50k+ pages)

Created on 14 Nov 2019  ·  60 Comments  ·  Source: gatsbyjs/gatsby

Hello kind Gatsby user 👋

My name is Peter, and I’m a Gatsby employee focused on performance and scalability.

We at Gatsby are always looking for ways to speed up builds of your Gatsby applications: Gatsby should not only scale to hundreds of thousands of pages (or more!), but the build itself should be as lightning quick as the resulting Gatsby application.

To best support this endeavor, we need _your_ help! We have benchmarks in the repo, but they tend to be quite contrived and not necessarily indicative of real-world usage of Gatsby.

Specifically, we're looking for sites that:

  • Have 50k pages (or more!)
  • Are using Gatsby v2 (ideally the latest version of Gatsby)
  • Can be relatively easily set up (e.g. no or minimally complicated build process, just gatsby build would be ideal!)
  • Secrets can be shared with us privately (see my e-mail below)

For this first batch, we'll be using these real-world applications to identify low-hanging fruit as it relates to performance so we can make the Gatsby build process ever faster and ever more scalable.

Does this sound like you? Please share a link to your application's source code below _or_ e-mail any necessary details to peter at gatsbyjs dot com. We appreciate you 💜

Thanks! Onwards and upwards 📈🚀

help wanted performance

Most helpful comment

For anyone still tracking this. I haven't forgotten! It just takes time. I'm still interested in large example sites. This helps me to uncover problems that smaller sites don't exhibit. Generally this is "big oh" stuff, but that's just the low hanging fruit.

Solving these problems helps us to improve the build pipeline in general. This in turn helps you.

It's not just us that it helps, though, because it might also help you in particular! Here are some examples of the direct impact of this effort so far (I might edit these into the top post for visibility):

  • "govbook", by @eads

    • Changing slug to id improved build time dramatically

    • Fixed a problem in #19691 with generating matchPath paths, which surfaced a big oh problem (newly added regexes were visited in subsequent loops, which blew up at scale)

    • Fixed a problem in #19866 with the reporter not being debounced causing severe delays for no good reason

Not reported here but related to scaling up images:

  • Large amount of images causing segfaults in sharp (image lib)

Type inference bottlenecks:

  • Reported by @disintegrator in #20197

    • Dropped 5 minutes of build time by adding a graphql type schema

Many page site with no external deps:

  • "ifsc", reported by @tsriram in #20338

    • 150k pages based on a local csv, no images

    • dropped build time from >4.5 hours to 3 minutes

    • The run queries step was running at about 10 queries / second (should be ~1k q/s for simple queries)

    • First I changed the filter to index by id instead of slug, which improved the speed a little bit.

    • Fixed a minor bug in that heuristic, bumped the speed to 70 queries / second

    • Implemented a circumvention heuristic for id in #20609 meaning a heavy filtering part did not need to run. This dropped the 4.5 hours down to 5 minutes

    • Then took this approach and went for a generic approach for simple "flat" filters, meaning that this now also works on slug or any other property as long as the filter is a single property path leading up to eq. Merged in #20729 it meant the ifsc site was able to run at full speed without modifications



      • A graphql JIT shaved off another 2 minutes, implemented in #20477 by @vladar



    • This fix should lead to a generic massive improvement on many sites

    • But not every site. Govbook, for example, only runs the query once, so no improvement there.
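The "flat" filter idea from the ifsc notes can be illustrated roughly as follows. This is a minimal sketch and not Gatsby's actual implementation: `parseFlatEqFilter`, `getValue`, and `createLookup` are hypothetical names, and the sketch assumes a flat filter is a single property path that ends in `eq` with a string, number, or boolean value. The point is that such a filter can be answered from a lazily built index instead of scanning every node for every query.

```javascript
// Hypothetical sketch, NOT Gatsby's real code.
// A "flat" filter: one property path ending in `eq` with a primitive value,
// e.g. { slug: { eq: "about" } } or { frontmatter: { slug: { eq: "about" } } }.

// Returns { path, value } for a flat filter, or null if the filter is not flat.
function parseFlatEqFilter(filter) {
  const path = [];
  let current = filter;
  while (current !== null && typeof current === "object") {
    const keys = Object.keys(current);
    if (keys.length !== 1) return null; // several properties: not flat
    const key = keys[0];
    if (key === "eq") {
      const value = current.eq;
      const isPrimitive = ["string", "number", "boolean"].includes(typeof value);
      return isPrimitive && path.length > 0 ? { path, value } : null;
    }
    path.push(key);
    current = current[key];
  }
  return null; // the path never reached an `eq` operator
}

// Reads a nested property, returning undefined if any step is missing.
function getValue(node, path) {
  return path.reduce((obj, key) => (obj == null ? undefined : obj[key]), node);
}

// Builds one index per property path on first use, then answers lookups from it.
function createLookup(nodes) {
  const indexes = new Map(); // "frontmatter.slug" -> Map(value -> first matching node)
  return function findOne(filter) {
    const flat = parseFlatEqFilter(filter);
    if (flat === null) return undefined; // a real system would fall back to the slow path
    const key = flat.path.join(".");
    if (!indexes.has(key)) {
      const index = new Map();
      for (const node of nodes) {
        const value = getValue(node, flat.path);
        if (value !== undefined && !index.has(value)) index.set(value, node);
      }
      indexes.set(key, index);
    }
    return indexes.get(key).get(flat.value);
  };
}
```

With 150k pages each running one such query, the index is built once per property path and every later query is a single map lookup, which is why a site that runs the query only once (like govbook) sees no benefit.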

With that we can currently build a site with 200k+ pages in like 5 minutes with Gatsby. Images do bog this down, as is inherent to images. And you can always do things that throw you off the happy perf path (like passing a lot of data through context without using type schemas).

So, please, keep showing me your sites at scale. I can't promise you I'll have time for it immediately but I can promise you that I'll take a look once I have the time. And who knows what it might fix.

Please chime in if you've noticed your scaling site perf has improved (or regressed?).

If build times remain a problem, have you tried our new Builds service? :)

Things on my shortlist in no particular order;

  • v8 serialization problems (#17233)
  • graphql results sorting
  • benchmarking service to track regressions
  • triage new large sites, find the next bottleneck
  • improve reliability of displayed/reported timing data
  • what about loki

Thanks for sharing everyone :100:

All 60 comments

Hi there! Yes I'd very much like to help. My biggest Gatsby site has ~24k pages but I figure it's still a pretty decent one to take a look at. You can see it live at https://govbook.chicagoreporter.com/en/ and the code is open source at https://github.com/thechicagoreporter/govbook

It runs on Gatsby v2. No secrets required, and we track the source data in the repo for now, so you shouldn't even need to pull down fresh data. It does depend on having SQLite installed; I don't think anything else is required other than the standard Gatsby dependencies.

Also tagging in @artfulaction who is contracted to work on the project through at least the end of the year.

I've had a helluva time getting it to build on Netlify and AWS Amplify, so that's been a persistent issue. Thus far to locally develop the site I just limit the size of the query manually which isn't ideal either.

The database that drives it is about 8k rows, and there's a row per page. How is it over 24k pages then? Because there's a Spanish version, an English version, and the redirect page. Any language we add will add another 8k pages to the build, and we're very much hoping to get a few more languages (especially Mandarin and Polish) into the site in the next year or so.

Longer term, we hope to bake out VCF files for each of the contacts in the database, so that will add tens of thousands of additional files as well and should represent an interesting use case for the Gatsby toolchain.

Thanks for doing this :clap:

@eads this is a great example of what I'm looking for. Thank you :)

Hi,

I don't quite fit your requirements, but I do have a Gatsby build that takes upwards of 40 minutes on my local machine, crashes Zeit, _and_ makes Netlify choke. Both also choke when I try to upload the resulting static package of 18,000+ files.

It's been fun :D

Right now Gatsby's cloud service comes the closest to working.

Here's the repo and specific branch: https://github.com/Swizec/swizec-blog/tree/many-posts

It's about 1400 pages in all, but the image and embed processing kills everything. Even have to increase Node's heap size to make the build survive.

You can't see it live anywhere because I'm trying to avoid having to setup my own VPS+CDN and such. That was one of my original motivations for moving to Gatsby in the first place – hoping for an easy way to host+setup with modern tools.

@Swizec thanks! Your site may not have as many pages but those images sure keep the cores spinning. Not sure how much we can improve on that since image processing is simply expensive. However, we do see some problems with the social plugin, and some room for improvement on fetching external resources. Thank you :)

cc @brod-ie @ashtonsix You've mentioned in other issues that you have mega big sites. Any chance we could get a slice of that for benchmarking? It could ultimately help your site as well.

Hi @pvdz, Gatsby is really nice and I would like to provide an example here.

It has around 48K pages with 200K+ records and growing.

With the queries optimized, it builds under 5 minutes with just gatsby build on a 9700K with 32GB RAM.

success open and validate gatsby-configs - 0.161s
success load plugins - 2.368s
success onPreInit - 0.008s
success delete html and css files from previous builds - 0.007s
success initialize cache - 0.006s
success copy gatsby files - 0.038s
success onPreBootstrap - 0.824s
success loading DatoCMS content - 4.214s
success source and transform nodes - 4.593s
warn On types with the `@dontInfer` directive, or with the `infer` extension set to `false`, automatically adding fields for children types
is deprecated.
In Gatsby v3, only children fields explicitly set with the `childOf` extension will be added.
success building schema - 0.486s
success createPages - 39.880s
success createPagesStatefully - 0.052s
success onPreExtractQueries - 0.005s
success update schema - 22.360s
success extract queries from components - 0.395s
success write out requires - 0.075s
success write out redirect data - 0.001s
success Build manifest and related icons - 0.072s
success onPostBootstrap - 0.105s
⠀
info bootstrap finished - 75.080 s
⠀
success Building production JavaScript and CSS bundles - 11.084s
success Rewriting compilation hashes - 0.002s
success run queries - 72.590s - 46529/46529 640.98/s
success Building static HTML for pages - 87.670s - 46526/46526 530.70/s
info Done building in 246.738346453 sec

Everything went smoothly until this morning, when we upgraded to 2.18+ and the build speed dropped dramatically when running batched graphql in gatsby-node.js. We're still investigating the reason. (That's how I came across this request.)

Thank you all for the great Gatsby!

Hi @pvdz and @sidharthachatterjee,

Coming here from https://github.com/gatsbyjs/gatsby/issues/19718. Glad to help on improving Gatsby performance!

Our case scenario:

  • 10,000 JSON source files which turn into 10,000 HTML pages and 10,000 AMP pages. Thus, builds generate 20,000 pages
  • Build time: ~2 minutes on a m5.2xlarge EC2 instance. We are quite happy with these times.
  • We don't use GraphQL nor gatsby-plugin-sharp. We have a really complex set of JSON source files without a standardized schema, so, on a loop, we read all JSON files on gatsby-node.js, and call createPage for each one, passing the parsed JSON data on page context. All data needed to create the page is in those JSON files.
  • On the same loop, we call createPage for AMP pages, passing a different component as a template.
  • Gatsby version 2.17.10
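The setup described above (one loop, two `createPage` calls per JSON file, parsed data passed through page context) might look roughly like this. It is only a sketch: the template names, the `slug` field, and the `/amp/` path prefix are assumptions for illustration, not details from the actual site. `createPage` with `path`, `component`, and `context` is the real Gatsby action.

```javascript
// Hypothetical template identifiers; a real gatsby-node.js would pass
// absolute paths to component files (e.g. via path.resolve).
const HTML_TEMPLATE = "src/templates/article.js";
const AMP_TEMPLATE = "src/templates/article.amp.js";

// Turns parsed JSON records into argument objects for Gatsby's createPage action.
// Each record is assumed to carry a `slug` field (an assumption for this sketch).
function buildPageDefinitions(records) {
  const pages = [];
  for (const record of records) {
    // Regular HTML page.
    pages.push({ path: `/${record.slug}/`, component: HTML_TEMPLATE, context: record });
    // AMP variant of the same content, rendered with a different template.
    pages.push({ path: `/amp/${record.slug}/`, component: AMP_TEMPLATE, context: record });
  }
  return pages;
}

// Usage inside gatsby-node.js:
// exports.createPages = ({ actions }) => {
//   const records = loadJsonRecords(); // hypothetical loader that reads and parses the JSON files
//   buildPageDefinitions(records).forEach(page => actions.createPage(page));
// };
```

Because the whole record goes into `context`, every page carries its full data set through the build, which is convenient but is also the "lot of data through context" pattern mentioned elsewhere in this thread as a possible cost.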

This is the build log we get:

success open and validate gatsby-configs - 0.786s
success load plugins - 0.798s
success onPreInit - 0.059s
success delete html and css files from previous builds - 0.010s
success initialize cache - 0.008s
success copy gatsby files - 0.019s
success onPreBootstrap - 0.003s
success source and transform nodes - 2.530s
success building schema - 1.080s
success createPages - 20.736s
success createPagesStatefully - 0.053s
success onPreExtractQueries - 0.001s
success update schema - 0.025s
success extract queries from components - 0.316s
success write out requires - 0.072s
success write out redirect data - 0.002s
success Build manifest and related icons - 0.027s
success onPostBootstrap - 0.059s
info bootstrap finished - 28.844 s

success Building production JavaScript and CSS bundles - 7.128s
success Rewriting compilation hashes - 0.002s
success run queries - 34.002s - 19563/19563 575.36/s
success Building static HTML for pages - 37.959s - 19555/19555 515.15/s

info Done building in 105.479798822 sec

We would like to improve the following:

  • createPages step: ~20 seconds
  • "Building production JavaScript and CSS bundles" step: we would like a build flag to disable generating bundles for specific paths, like AMP.
  • "run queries" step: we don't completely understand this step. We think Gatsby is saving "page-data.json" files to disk on this step, is that right? We would like to disable creating "page-data.json" files for specific paths, like AMP.

All in all, we would like some kind of flag to make Gatsby work as a "fully static site generator". I mean:

  • Without generating page-data.json
  • Without building production JavaScript and CSS bundles
  • Without adding JS assets to page bottom (and their corresponding preload links on head)

I know this could sound stupid: "why turn Gatsby into a traditional static site generator like Hugo or Jekyll?". Well, apart from solving our scaling issues with AMP, I can't imagine working without React components, even if they are only used to generate static HTML without any further JS interactivity. Hugo and Jekyll are fine, but React's simplicity and working with components are key for us (and for a lot of people, I think).

I can't publicly share any further detail here, but I'll reach you by email with more details.

Thanks!

I had a huge problem with scalability with Gatsby earlier.

Issue: https://github.com/gatsbyjs/gatsby/issues/17233

I had to switch to Next.js because of this. Happy to see that Gatsby team is prioritizing scalability 👍

@rjyo the regression came with the shadowing feature that landed a few days ago. We're looking into the regression and how to best mitigate it. I don't suppose I could build your site myself for benchmarking purposes? :) Thanks for the feedback!

@asilgag we kind of need the page-data.json per page, if nothing else, for later parallelization. Each page becomes an individual job and that way we would be able to spread the load over multiple cores, something we can't do just yet. We should be able to improve the situation though. And if you don't save page-data.json to disk you'd have to retain it in memory, which certainly does not scale for most people (although some can certainly just throw money at it). I will take your suggestions into consideration when contemplating next steps in scaling perf and get back to you on them. Thank you!

@pvdz I just upgraded to 2.18.4 and the performance regression is gone! createPage took 20s more than in 2.17.x builds, and updateSchema's time went down from 20s+ to less than 1s, i.e. the sum is quite steady.

Thanks for your information!

@ganapativs Sorry to hear that! I am definitely interested in your case and will be looking into it, regardless. Thanks for the test case :)

@pvdz After running on 2.18.4 with dozens of hourly builds on CI, around 50% of the builds failed on createPages

...
error "gatsby-node.js" threw an error while running the createPages lifecycle:
Cannot read property 'rocket' of null
  TypeError: Cannot read property 'rocket' of null
...

where rocket should be returned from the GraphQL request. note: There are a bunch of queries running in createPages, and most had already finished without any problem.

Rerunning the job again has about a 50% success rate.

Hope you guys can find the problem. Please contact me directly if there is any debug info I can provide.

Thanks!

@rjyo that doesn't sound good. Can you open a new issue (if not already done so) for this? And try it on 2.18.5 ? This may contain a fix that could have already fixed your problem.

@pvdz Thanks! I just tried 2.18.5 and the first attempt went well. The build time is quite similar to that of 2.17.x: createPage takes less time, and updateSchema is back to what it took before.

I'll let it run for some more and let you know the results.

Thanks again!

Glad to hear that :) I'm working on keeping better tabs on scaling performance regressions. Please do feel free to ping when you see something regress unexpectedly. That goes for anyone.

@eads good news! If you weren't using the CI=true flag yet, you're going to get an even better build time :D If you are using it already, well, good :) I'm changing the logger which drops the govbook build time from 210s to 140s for me locally :D ( https://github.com/gatsbyjs/gatsby/pull/19866 )

For anyone else; This PR affects the progress bar so if you were testing large sites with default settings, you should get a perf win as well.

Note that if you're building in a ci then setting CI=true is a good idea. It'll reduce log spam. After the aforementioned PR gets merged it won't matter much anymore in terms of Gatsby perf.
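To illustrate why an unthrottled progress reporter can hurt at scale, here is a minimal sketch of limiting how often progress is rendered. It is not the code from the PR above: `createThrottledReporter` is a hypothetical helper, and it uses a simple time-based throttle (render at most once per interval), which is one of several ways to get this effect.

```javascript
// Hypothetical helper, not taken from Gatsby.
// Wraps a render function so it runs at most once per `intervalMs`.
// `now` is injectable so the behaviour can be checked without real timers.
function createThrottledReporter(render, intervalMs, now = Date.now) {
  let lastRender = -Infinity;
  let latest = null;
  return {
    // Called once per finished unit of work (e.g. per page or per query).
    tick(done, total) {
      latest = { done, total };
      const t = now();
      if (t - lastRender >= intervalMs) {
        lastRender = t;
        render(latest);
      }
    },
    // Always render the final state so the last update is not lost.
    finish() {
      if (latest !== null) render(latest);
    },
  };
}
```

With tens of thousands of pages, calling `tick` per page stays cheap because the expensive terminal write inside `render` happens only a few times per second rather than once per page.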

@rjyo Can you share the running site url, I am really curious about the website...

@prashant1k99 that sounds like #5002 :)

@pvdz Look at #9083 there are also 2 users with large pages:

  • 200k @Tawfiqh
  • 160k @dcworldwide

I have a Gatsby site that's currently not live as I'm still trying to work out if Gatsby is gonna work out because I have 200k+ rows in a MySQL database and each row would be a single page.

Is this a site you would want to use? It's relatively simple. It is a Twitch.tv clip aggregator that just embeds an iframe on each page along with a comment system.

@crock Yeah absolutely! Can you post the build durations (for each step) you're currently getting?

We don't have a large number of pages but we do have a large number of nodes in our graph which is killing our build performance at the stage where Gatsby is building the GraphQL schema. I've described the problem in greater depth here: https://github.com/gatsbyjs/gatsby/issues/20197

We were able to triage @disintegrator 's problem down to "unnecessary" inference and creating a type schema for the context dropped the biggest build step (type inference) from 5 minutes down to 11 seconds. See that issue for more details.

This is something we probably want to try and automate (detect, warn, auto-create schema, win)
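As a rough illustration of what "adding a type schema" means in practice: Gatsby's `createSchemaCustomization` API and its `createTypes` action let a site declare node types up front, and the `@dontInfer` directive tells Gatsby not to infer fields for that type. The `Contact` type and its fields below are hypothetical, chosen only for the sketch; the real types depend on the site's data.

```javascript
// Type definitions in GraphQL SDL. `Contact` and its fields are hypothetical.
const typeDefs = `
  type Contact implements Node @dontInfer {
    name: String!
    phone: String
    updatedAt: Date @dateformat
  }
`;

// Registers the explicit types so Gatsby can skip inference for them.
function createSchemaCustomization({ actions }) {
  actions.createTypes(typeDefs);
}

// In gatsby-node.js this function would be exported under the same name:
// exports.createSchemaCustomization = createSchemaCustomization;
```

The trade-off is that fields not listed in the type definition are no longer queryable, so the schema has to be kept in sync with the data by hand.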

@pvdz Look at #5002 there are also 1 user with large pages:

@pvdz I'm still following! Many thanks for all your work on this. I'm excited to try out the improvements in the coming weeks and will report back. Maybe my site will finally build!

@eads it's not building? I've been using that site to benchmark certain things for a while. It should build with little problems. There's plenty of room for improvement, like adding a graphql schema and not passing the entire data structure through the context, using static queries, etc. But as it is, it should run fine. And with my latest fix you wouldn't even need to switch to filter by id (it would still be slightly faster, but that delta is minor now).

Hey @pvdz,

I have a fairly small (<100 pages) site where gatsby build (both locally and in CI) fails on Node 10.16.3 with:

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

Repo is here and failing build is here.

Any hints to what may cause the failure?

@moroshko that sounds interesting. 100 pages shouldn't trigger that, although it might depending on what's in there. Have you tried expanding the available memory? (node --max_old_space_size=2000 node_modules/.bin/gatsby build) I'll try to have a look at it this week.

@pvdz Increasing the available memory helped the build to pass locally. But, I couldn't find how to increase the memory in CI (GitHub actions).

@moroshko your problem is related to webpack. In particular minification and in very particular sourcemapping. For example, if you run gatsby build --no-uglify it passes just fine and the build completes in ~60s (40s is for webpack, that's the "building production JS and CSS bundles"-step).

If I look at the public folder afterwards I see a bunch of 2.5mb JS files. Those are definitely the source of this problem. The question now is why. The big file names seem to match the components in the siteMetadata in gatsby-config.js. My guess is webpack is somehow creating chunks which include "all the things", for whatever reason. Can you see if you can solve it that way? Please circle back to us if you think this is a problem within Gatsby (open a new ticket so we can track it properly).

-rw-rw-r--  1 2731406 Feb 13 20:07 component---src-pages-components-button-index-js-771ab1f296f72e990cbb.js
-rw-rw-r--  1 3200061 Feb 13 20:07 component---src-pages-components-button-index-js-771ab1f296f72e990cbb.js.map
-rw-rw-r--  1    6292 Feb 13 20:07 component---src-pages-components-button-resources-mdx-bb7e80fcbb5680269f34.js
-rw-rw-r--  1    3435 Feb 13 20:07 component---src-pages-components-button-resources-mdx-bb7e80fcbb5680269f34.js.map
-rw-rw-r--  1    6267 Feb 13 20:07 component---src-pages-components-button-usage-mdx-dff9888c164b4bd2d48b.js
-rw-rw-r--  1    3398 Feb 13 20:07 component---src-pages-components-button-usage-mdx-dff9888c164b4bd2d48b.js.map
-rw-rw-r--  1 2731772 Feb 13 20:07 component---src-pages-components-checkbox-index-js-2b6ff181a939e025c46c.js
-rw-rw-r--  1 3200643 Feb 13 20:07 component---src-pages-components-checkbox-index-js-2b6ff181a939e025c46c.js.map
-rw-rw-r--  1    6299 Feb 13 20:07 component---src-pages-components-checkbox-resources-mdx-a4404e129d9bcd1ac5a8.js
-rw-rw-r--  1    3443 Feb 13 20:07 component---src-pages-components-checkbox-resources-mdx-a4404e129d9bcd1ac5a8.js.map
-rw-rw-r--  1    6274 Feb 13 20:07 component---src-pages-components-checkbox-usage-mdx-73fb39d8b1900a45b135.js
-rw-rw-r--  1    3406 Feb 13 20:07 component---src-pages-components-checkbox-usage-mdx-73fb39d8b1900a45b135.js.map
-rw-rw-r--  1 2732963 Feb 13 20:07 component---src-pages-components-container-index-js-a75b142fb6ebe61f7e32.js
-rw-rw-r--  1 3198189 Feb 13 20:07 component---src-pages-components-container-index-js-a75b142fb6ebe61f7e32.js.map
-rw-rw-r--  1    6302 Feb 13 20:07 component---src-pages-components-container-resources-mdx-b4610919ab8ddbc64b94.js
-rw-rw-r--  1    3447 Feb 13 20:07 component---src-pages-components-container-resources-mdx-b4610919ab8ddbc64b94.js.map
-rw-rw-r--  1    6277 Feb 13 20:07 component---src-pages-components-container-usage-mdx-a830382f44eead740053.js
-rw-rw-r--  1    3410 Feb 13 20:07 component---src-pages-components-container-usage-mdx-a830382f44eead740053.js.map
-rw-rw-r--  1 2731169 Feb 13 20:07 component---src-pages-components-date-picker-index-js-fd3831a07f5212003dd9.js
-rw-rw-r--  1 3199955 Feb 13 20:07 component---src-pages-components-date-picker-index-js-fd3831a07f5212003dd9.js.map
etc

@pvdz Thanks for looking into this!

Can you see if you can solve it that way?

I'm not sure what you are suggesting here. My understanding is that Gatsby uses webpack internally and it's up to Gatsby to use webpack in the most optimized way.

Hey @pvdz, thank you so much for helping with these cases!!

I have a site I'd love to get some help with. It's only about 2k pages, and right now I'm only using 100 test pages, but it still takes ~2min build time (locally), sometimes more (on Gatsby Cloud, idk). Here's the repo. The data source is at api.satirev.org, which is a DigitalOcean droplet with the Directus api set up. It's the tiniest size they have, could that have something to do with the slow speed? UPDATE: I tested this with a local ddev setup and had the same result. So I don't think it's network problems.
See below:
[Screenshot of build output]

I feel like at this size, I should be able to build pretty lickety splickety on re-runs, but for some reason, running queries and js/css bundling seem to be going real slow.

UPDATE: changed my page queries to eq: $id instead of the $dataID and it sped things up a bit! Though apparently that isn't supposed to be the case anymore.

@jacobrienstra the breakdown seems to be; 30s webpack (flat cost), 90s images (they are expensive, period), and 30s queries. The rest is change.

Though apparently that isn't supposed to be the case anymore.

You're kind of right so I checked the history of your repo:

query FullArticle($dataId: Int!) { dataArticle(dataId: { eq: $dataId }) {

Perhaps the Int part is failing to hit the heuristic. I don't know how that gets translated down the line, but the heuristic requires a plain string, number, or boolean as the eq type (these JS types may or may not map directly to graphql types though). Anyways, good to hear switching to ID helped. What is the query time now? I wouldn't expect it to make a huge impact on 100 pages.

30s for webpack is still kind of long. Can you try without minification (gatsby build --no-uglify)? If that makes a huge difference then webpack is creating huge JS files (you can see them in ./public afterwards) which may be a source of problems as well.

The images are trickier. Some preprocessing could help, like if they're huge files you can shrink them one time to the target size, that way sharp doesn't have to needlessly process megabytes of imagery every time. In the end, images are expensive. It is what it is.

@jacobrienstra I'll look into the filter problem. It seems that Int! should just be a number by the time it reaches the heuristic, so that should not be a reason to ignore it. One other reason I can think of is a miss, like when dataId doesn't exist or something. Anyways, I'll add it to my todo-list. (Would be great if you could turn this into a fully local repro!) Cheers

Thank you @pvdz !!!
Yeah I'm not sure about the dataId thing. There shouldn't be any misses, the dataId is the id given to them in the database, so it should always be there. Perhaps there's something wonky about graphql typing, idk.

--no-uglify doesn't seem to make a difference, I remember trying that. I am going to try to reduce the number of dependencies and do as much myself as I can, but it does seem a long time to build js. The only big dep I have is material-ui

On a fresh build I got
27s/32s/76s for webpack, queries, and images respectively.
On subsequent builds I got
24s/24s for webpack and queries. Which is better!

I don't actually mind the longer image build time, because they persist in the cache and I don't have to do them every time. The goal is to get it so that when a user publishes a new article and triggers a build, it'll be up on the site as quickly as possible. I think since all the images are cached it'll be fine.
And I do plan to do preprocessing, I'm migrating everything from an old Drupal site so I can do it then, and I think any new images are preprocessed by Directus at least down to a certain size.

Any other ideas as to how to reduce the queries runtime? I wonder if it's something to do with the source plugin, perhaps it's not batching things it could be? It seems slow for 100ish queries, at least compared to the benchmarks I ran.

Oh! And yes I did move it to its own repo lol idk why it was still in that messy one https://github.com/jacobrienstra/satirev.org-gatsby

Hi @pvdz,

I manage a couple of news websites with over >500k contents and looking to use Gatsby as staticizer. Right now we're using a custom solution that staticize a single content via API. This has the advantage that a single page can be online in less than a second, a critical requirement for news websites. On the other end, changing templates requires a very slow republish of each content.

Gatsby could be a great solution, but rebuilding 500k pages at each new content or template change is not an option. Even the solution of incremental data changes, as reported on the site https://www.gatsbyjs.org/docs/page-build-optimizations-for-incremental-data-changes/, is not valid as it requires a query on all of the contents to check which one has changed. On news websites you could have even multiple changes in one second, such query would require a massive data exchange.

If Gatsby had a way to generate single pages via an API, I think it could be used for news websites, and I could also help with the testing.

Thanks a lot!

Hi @pvdz - we are building a large enterprise site on Gatsby and are experiencing incredibly long build times. We know that some of the time is due to a big chunk of content, but we are desperate to find the root cause and get it fixed.
Do you have the time and are you up for this challenge? :) And what would you need from our side to initially be able to understand our setup and do an analysis?

Any help is highly appreciated! :)

@giupas @mikaelmoller Hey, thanks for your messages. Sorry for taking so long to respond, it's been a little weird the past two weeks and some github notifications slipped through.

@giupas this is more a question for Cloud or Builds. Somebody will reach out in private about this, I think we can make this work! :)

@mikaelmoller I can triage it. The first thing I need is the build output, so I can see which parts take the most time. Then an example of the gatsby-node and a template, to see what kind of queries you're running and how you're passing on data. What kind of site is it? Markdown, mdx, something else? Have you tried the usual suspects? Things like adding a GraphQL schema to prevent type inference, putting as little data in the context as possible, precomputing images, etc.? Best would be if I can just look at, or even locally build, the site.

[email protected] contains https://github.com/gatsbyjs/gatsby/pull/22574 which should improve performance for sites with many nodes that use queries containing multiple eq filters.

Previously, this optimization was only applied to queries with a single eq filter. I'm in the process of also adding support for other operators.

I have a MySQL database with 57k records, but I only manage to create 22k pages. I tried giving Gatsby more memory, but the result is always the same. Do you think MySQL has a limit on the number of rows it returns?

@pvdz I have a site with far fewer pages using gatsby-source-contentful, but it already crashes on Zeit/Vercel. Maybe you want to take a look? #23463

@gerardoboss I've had issues with gatsby-source-mysql in the past when dealing with very large datasets. It times out after a while. The best option is to write a custom source plugin and break up the SQL queries into smaller ones.
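A minimal sketch of that batching idea, assuming a hypothetical runQuery(limit, offset) function that wraps whatever MySQL client the custom source plugin uses and resolves to an array of rows. Neither the helper name nor the batch size comes from gatsby-source-mysql; they are placeholders.

```javascript
// Fetch all rows in fixed-size batches instead of one huge query.
// runQuery is a hypothetical function: (limit, offset) => Promise<Array>.
async function fetchInBatches(runQuery, batchSize = 5000) {
  const rows = [];
  let offset = 0;
  for (;;) {
    const batch = await runQuery(batchSize, offset);
    rows.push(...batch);
    // A short (or empty) batch means the table is exhausted.
    if (batch.length < batchSize) {
      break;
    }
    offset += batchSize;
  }
  return rows;
}
```

In a source plugin, the sourceNodes hook could call this and create one node per returned row. The query behind runQuery would need a stable ORDER BY so that LIMIT/OFFSET batches neither overlap nor skip rows.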

@crock thank you so much. I'm thinking of maybe going with a CSV. I tried breaking it into queries of 10k records, but the result is exactly the same. I don't know where to look for the problem: there is no error or log that tells me what went wrong, or whether it was an error or a timeout.

So I will try a CSV. I probably need to convert it to JSON or something, I need to check.

Thank you so much. I'll update if I am able to do it.

@gerardoboss have you tried to give the nodejs process more memory? You can do something like node --max_old_space_size=4000 node_modules/.bin/gatsby build to bump the memory available to nodejs, which you'll need to do for larger sites. How much you need really depends on your setup and is different for every site. Generally for 50k-page sites I'd expect 2gb to 4gb to be enough. If you have a public repo I can check out, I can take a look.
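One way to check that the flag actually took effect is to print the heap limit that V8 reports. This is a small sketch using Node's built-in v8 module; the file name check-heap.js is only an example, and the reported value is roughly, not exactly, the configured old space size.

```javascript
// check-heap.js (example file name)
const v8 = require("v8");

// heap_size_limit is reported in bytes; convert to megabytes.
const heapLimitMb = Math.round(
  v8.getHeapStatistics().heap_size_limit / 1024 / 1024
);
console.log(`V8 heap limit: ${heapLimitMb} MB`);
```

Running it as node --max_old_space_size=4000 check-heap.js should print a limit noticeably higher than running it without the flag.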

@xmflsct I see you were able to resolve it, great! :) Fwiw, the contentful plugin adds a lot of internal nodes (the core unit of information inside Gatsby) which is resulting in scaling problems. I've seen sites with 15k pages rack up over a million internal nodes because it was creating a node for each piece of text in Contentful. I have no concrete way forward here, but that's been my observation.

@pvdz it worked flawlessly! Thanks a lot!

Should a note about max_old_space_size be added to the troubleshooting page https://www.gatsbyjs.org/docs/troubleshooting-common-errors/?

Going to close this issue. Thanks to everyone who participated. Your contributions have had a great impact on the perf of Gatsby :D

Feel free to keep posting large sites (public repo, something I can build locally). The ones so far serve as excellent benchmarks.

At this point my definition of a large site is one with 100k to 1m pages. Although it's more accurate to speak in terms of internal node count, which is around 1 million. You can see the node counts by running gatsby build --verbose. The node counts will be printed during bootstrap. (Page nodes are printed separately, shortly after.) A site with 1 million nodes builds in roughly 20 to 60 minutes, depending on sourcing, plugins, and type of website.

So a large site will have a million+ nodes internally and I'm still working on raising that ceiling :)

Be well. Reach out if you need help.

@pvdz I need help with my build times. My site is slow to build; it only has about 1200 pages, but it does contain around 12k images. Can you please help me?

@daiky00 have you tried Gatsby Cloud btw? It speeds up processing large numbers of images a lot by parallelization across cloud functions and better caching between builds

@KyleAMathews I am using Netlify with their parallel image processing plugin with Google Cloud, but the build is slow. Can you help me improve it or take a look at it? The repo is private though, so I will need to give you access. I am also using their incremental build solution, which is great, but my problem is with the first build.

@daiky00 you'd need to ask Netlify for help then as the constraint on build speed would be their plugin. You should also try Gatsby Cloud to compare the experience.

@KyleAMathews I would have tried it if it weren't so expensive; $99 a month is too much. And the build is not slow because of Netlify: when I build locally it takes the same time. I want to speed up the build time, can you help me?

Yeah happy to take a look

@KyleAMathews I will give you access to repo and you tell me what you see wrong

ok @KyleAMathews I invited you

@KyleAMathews also use the search-page branch for latest code

@daiky00 got it running locally and built it twice — the first run took 8:12 & the second run took 1:26 as the image generation was cached the second run. Most of the 1:26 is now spent in refetching the data & creating pages. What kind of build speeds are you seeing?

I totally hear you that $99/month is too much — we're actually launching a much cheaper price plan soon that gives you inc builds (which is even faster than the 1:26) — email me @ [email protected] if you'd like early access.

Cool :)

@KyleAMathews not right now I need to deploy this site ASAP to production. But I will contact you when I do 😃👍 and thanks
