Gatsby: Uneven .md vs. .mdx build times

Created on 24 Mar 2020  路  9Comments  路  Source: gatsbyjs/gatsby

Description

We run a documentation website that has hit a performance bottleneck at around 1000 pages. This led us to test the difference between gatsby-transformer-remark and gatsby-plugin-mdx to compare .md and .mdx build times.

We realize that these are not the same plugin, but our expectations were that the build times of each would be closer in-line with one another (for the exact same files).

We used the following repo to benchmark results: https://github.com/johnatspreadstreet/gatsby-md-vs-mdx

Here were the results of our test using the auto generated files:

| | | |
|----------------------------|-------|---------|
| Source and Transform Nodes | | |
| # of Pages | md | mdx |
| 100 | 0.17s | 3.12s |
| 1000 | 0.90s | 23.05s |
| 8000 | 5.53s | 192.80s |

Steps to reproduce

GitHub repo: https://github.com/johnatspreadstreet/gatsby-md-vs-mdx

For markdown files:

  1. Make sure line 54 of md.generate.js is set to create .md
  2. Make sure gatsby-transformer-remark is used (and not gatsby-plugin-mdx) in the gatsby-config.js file
  3. Run npm run bench or yarn run bench

For mdx files:

  1. Make sure line 54 of md.generate.js is set to create .mdx
  2. Make sure gatsby-plugin-mdx is used (and not gatsby-transformer-remark) in the gatsby-config.js file
  3. Run npm run bench or yarn run bench

Expected result

Results of the gatsby build process for .md and .mdx files should be within reasonable distance of one another.

Actual result

Build times of .mdx were between 18 and 34 times longer for the source and transform nodes step vs. .md files.

Environment

System:
OS: Windows 10
CPU: (16) x64 Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
Binaries:
Yarn: 1.18.0 - C:\Program Files (x86)\Yarn\bin\yarn.CMD
npm: 6.9.0 - C:\Program Files\nodejs\npm.CMD
Languages:
Python: 2.7.15 - /c/Users/JYoun/.windows-build-tools/python27/python
Browsers:
Edge: 44.18362.449.0
npmPackages:
gatsby: ^2.19.5 => 2.19.45
gatsby-image: ^2.2.39 => 2.2.44
gatsby-plugin-benchmark-reporting: * => 0.0.13
gatsby-plugin-feed: ^2.3.26 => 2.3.29
gatsby-plugin-google-analytics: ^2.1.34 => 2.1.38
gatsby-plugin-manifest: ^2.2.38 => 2.2.48
gatsby-plugin-mdx: ^1.0.83 => 1.0.83
gatsby-plugin-offline: ^3.0.32 => 3.0.41
gatsby-plugin-react-helmet: ^3.1.21 => 3.1.24
gatsby-plugin-sharp: ^2.4.0 => 2.4.13
gatsby-plugin-typography: ^2.3.21 => 2.3.25
gatsby-remark-copy-linked-files: ^2.1.36 => 2.1.40
gatsby-remark-images: ^3.1.42 => 3.1.50
gatsby-remark-prismjs: ^3.3.30 => 3.3.36
gatsby-remark-responsive-iframe: ^2.2.31 => 2.2.34
gatsby-remark-smartypants: ^2.1.20 => 2.1.23
gatsby-source-filesystem: ^2.1.46 => 2.1.56
gatsby-transformer-remark: ^2.6.48 => 2.6.59
gatsby-transformer-sharp: ^2.3.13 => 2.3.19

MDX scaling builds bug

Most helpful comment

@johno Thanks for the updates here. Here is the second round of testing after pulling down the new packages (benchmark repo: https://github.com/johnatspreadstreet/gatsby-md-vs-mdx):

| | | | |
|----------------------------|-----------------|------------------|------|
| Source and Transform Nodes | | | |
| # of Pages | mdx (pre-patch) | mdx (post-patch) | dif |
| 100 | 3.12 | 3.7 | 0.58 |
| 1000 | 23.05 | 24.3 | 1.25 |
| 8000 | 192.8 | 183.4 | -9.4 |

Looks like ~5% better performance in the upper bucket, with pretty much even performance in the lower buckets.

If you guys need me to test anything additional, or hunches, happy to do so.

All 9 comments

Hi. Thanks for flagging it. Markdown is a concern of mine in general but this mdx hockey sticking is pretty bad. We'll have to take a look at it sooner or later.

@pvdz Thank you so much for hopping on. I am happy to assist in any way that I can, as this interaction is pretty critical for us moving forward.

We currently have ~1,000 pages and another ~9,000 queued up for when we can fix the build time issue.

@johno is investigating :purple_heart:

@pvdz Great, thank you so much.

@johno Let me know any way I can assist.

@johno Thanks for the updates here. Here is the second round of testing after pulling down the new packages (benchmark repo: https://github.com/johnatspreadstreet/gatsby-md-vs-mdx):

| | | | |
|----------------------------|-----------------|------------------|------|
| Source and Transform Nodes | | | |
| # of Pages | mdx (pre-patch) | mdx (post-patch) | dif |
| 100 | 3.12 | 3.7 | 0.58 |
| 1000 | 23.05 | 24.3 | 1.25 |
| 8000 | 192.8 | 183.4 | -9.4 |

Looks like ~5% better performance in the upper bucket, with pretty much even performance in the lower buckets.

If you guys need me to test anything additional, or hunches, happy to do so.

@johnatspreadstreet If you want, you can take the code that the patch changed and follow it. It's more than likely that something is looping over those nodes, like a .map or .forEach or even just a regular for. I suspect something will be obvious in that area quite quickly. Othewise we'll get to it soon enough.

Here's where I'm at so far: yesterday I ran benchmarks and could confirm I'm seeing the same growth in time during sourcing/transforming nodes. Each time I double the number of MDX pages I see ~2x increase in time in "source and transform nodes". It appears to be continually growing, too. 馃彃

Files | Total build | Source and transform nodes |
------ | ---------- | ----------------------------- |
500 | 30s | 13s
1k | 51s | 23s (+1.7x)
2k | 90s | 47s (+2x)
4k | 180s | 100s (+2x)
8k | 440s | 230s (+2.3x)
16k | 1100s | 530s (+2.3x)

After some digging I found a usage of a manual node filter for type using getNodes rather than getNodesByType. This, unsurprisingly, caused a lot of unnecessary node traversal at scale. That fix (in #22555) saw a ~30% reduction in build times for 16k pages (using my benchmark site so YMMV).

My benchmark results after the change

Files | Total build | Source and transform nodes |
------ | ---------- | ----------------------------- |
8k | 350s | 150s
16k | 750s | 400s

I'll keep digging into this as I get time and will report back any additional findings and performance improvements as we get them PRed in. I suspect there's a lot more low hanging fruit.


It's also important to note that MDX will always be quite a bit slower than MD since it's doing a lot more under the covers, but it definitely shouldn't be 馃搱馃檧

@johno That is very good context, thank you much.

Is there any documentation you can point me towards that actually covers the process of converting mdx -> what is needed to display in React?

This might provide additional context: for large mdx posts, I get the following error:

 ERROR 
[BABEL] Note: The code generator has deoptimised the styling of undefined as it exceeds the max of 500KB.

warn Query takes too long:
...

This leads to a significant performance hit - the kind mentioned by the OP. I verified this on the official gatsby-starter-blog starter (after modifying it to use MDX rather than md).

Was this page helpful?
5 / 5 - 1 ratings