Gatsby: App bundle size inflated by matchPath data on 25k page site

Created on 24 Feb 2020  路  16Comments  路  Source: gatsbyjs/gatsby

Summary

We have recently noticed our app.js bundle has shot up in size, and after closer inspection it looks like the majority of the code is data relating to matchPaths.

{
   "path":"/example/path",
   "matchPath":"/example/*"
}

Relevant information

  • We have a total of 17 matchPaths.
  • We create around 25k pages using createPage.
  • When looking at our app.js, it seems every page is included in the matchPath array.
  • May be related to https://github.com/gatsbyjs/gatsby/pull/17412

I think I may have tracked down the code that does this to getMatchPaths in gatsby/src/bootstrap/requires-writer.js. When I debug, matchPathPages has a length of 25237, and then I think this matchPath data is written to match-paths.json. Would this have an effect on the app.js bundle size?

Environment (if relevant)

System:
    OS: macOS High Sierra 10.13.6
    CPU: (12) x64 Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
    Shell: 3.2.57 - /bin/bash
  Binaries:
    Node: 10.16.3 - /usr/local/bin/node
    Yarn: 1.17.3 - /usr/local/bin/yarn
    npm: 6.9.0 - /usr/local/bin/npm
  Languages:
    Python: 2.7.16 - /usr/local/bin/python
  Browsers:
    Chrome: 80.0.3987.116
    Firefox: 72.0.2
    Safari: 13.0.5
webpacbabel feature or enhancement

Most helpful comment

Ok, so this seems like combination of things here. Introduction of virtual modules (at least the way they were implemented) cause match-paths to have different handling for chunk splitting (as it virtually lands in node_modules, there are some node_modules rules for chunking). Will be working on adjusting that this change to get back to behaviour it had before introduction of virtual modules

All 16 comments

Hiya!

This issue has gone quiet. Spooky quiet. 馃懟

We get a lot of issues, so we currently close issues after 30 days of inactivity. It鈥檚 been at least 20 days since the last update here.
If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!
As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing!

Thanks for being a part of the Gatsby community! 馃挭馃挏

Hey again!

It鈥檚 been 30 days since anything happened on this issue, so our friendly neighborhood robot (that鈥檚 me!) is going to close it.
Please keep in mind that I鈥檓 only a robot, so if I鈥檝e closed this issue in error, I鈥檓 HUMAN_EMOTION_SORRY. Please feel free to reopen this issue or create a new one if you need anything else.
As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing!

Thanks again for being part of the Gatsby community! 馃挭馃挏

Hey sorry this never got responded to. Could you create a small reproduction of the problem? That'd be the best next step towards investigating the bug.

Sorry for the late reply on this @KyleAMathews

I have created a codesandbox with my approach here.

The app.js file is here.

As you can see there are unneccessary matchPath entries as you get towards the bottom of the file. I only have four matchPaths set up in gatsby-node.js, but there is an entry for every 'static' page.

Hi @wardpeet, sorry for mentioning you directly but I think the bit of code which is causing me an issue here is from one of your PRs - https://github.com/gatsbyjs/gatsby/pull/17412.

If I comment out the following code, the matchPath array has the correct amount of elements:

  // Pages can live in matchPaths, to keep them working without doing another network request
  // we save them in matchPath. We use `@reach/router` path ranking to score paths/matchPaths
  // and sort them so more specific paths are before less specific paths.
  // More info in https://github.com/gatsbyjs/gatsby/issues/16097
  // small speedup: don't bother traversing when no matchPaths found.
  if (matchPathPages.length) {
    const newMatches = []
    pages.forEach((page, index) => {
      const isInsideMatchPath = !!matchPathPages.find(
        pageWithMatchPath =>
          !page.matchPath && match(pageWithMatchPath.matchPath, page.path)
      )

      if (isInsideMatchPath) {
        newMatches.push(
          createMatchPathEntry(
            {
              ...page,
              matchPath: page.path,
            },
            index
          )
        )
      }
    })
    // Add afterwards because the new matches are not relevant for the existing search
    matchPathPages.push(...newMatches)
  }

I understand that code is there for a reason as it was part of the fix for https://github.com/gatsbyjs/gatsby/issues/16097, but I just want to get some understanding of whether the solution should be on my end or yours.

Our app has a requirement to have static pages which can have client-only sub-routes, which is why we have pages created through createPage but then also a matchPath set up for them. You can see an example of our approach on the codesandbox in my comment above.

Correct me if I'm wrong - I think that code checks all pages to see if their URL matches any of the matchPaths that are set up, and if there is a match adds them to the matchPath array which is in turn added to the app.js bundle. This is why we are seeing such an inflated bundle size, because we have ~25k pages and there is an matchPath entry for all of these.

Any help would be greatly appreciated.

@KyleAMathews @wardpeet - as far as I can tell, we should be able to trust that Gatsby has generated all the required assets for statically generated pages (their index.html, page-data.json, etc) so the fix (https://github.com/gatsbyjs/gatsby/pull/17412) for https://github.com/gatsbyjs/gatsby/issues/16097 shouldn't need to include static pages in matchPaths (which it seems it does). Therefore, could matchPaths be solely for client-side dynamic only routes?

Let me sketch the problem here a bit. When gatsby loads on page, the next navigation will be a client side navigation. So, no page refresh. We try to load a page-data.json file for each URL. If it doesn't exist it will return a 404. When people use matchpaths, this fails and we need to know what the rootPath is. That's why matchpaths exists.

If we kept doing a page-data fetch that could result in many 404s which wouldn't be ideal. We had this logic for a while but a lot of complains happened so that's why we reverted to this pattern.

We're probably able to improve the algorithm a bit.

Hi @wardpeet, also suffering from this issue.

I would like to put a few select client only routes on the index (e.g. /pay), but also have plenty of static pages (e.g. /shiny-new-toy, /buy-buy-buy). Gatsby understands those static pages exist and can be routed to, but the number of the static pages has brought back the "page manifest problem":

[
{
    "path": "/buy-buy-buy",
    "matchPath": "/buy-buy-buy"
 }, //x 10's of thousands of these static page mappings
 {
    "path": "/",
    "matchPath": "/*"
 }
]

With a finite amount of client only pages, could the problem not be avoided by permitting a simple mapping alongside the globs that are currently supported, i.e.

//gatsby-node.js

exports.onCreatePage = async ({ page, actions }) => {
  const { createPage } = actions
  if (page.path === "/") {
    page.matchPaths = ["/pay", "myotherclientsideroute"]
    createPage(page)
  }
}

Which then produces

[
 {
    "path": "/",
    "matchPaths": ["/pay", "myotherclientsideroute"]
 }
]

And then changing find-path to check accordingly as well. What do you think?

I also bumped into this problem having 25k+ pages. As @ConorLindsay94 suggested commenting out the mentioned block of code solved it to me too.

Update [2020-10-06]: No, this move has fixed my issues only partially. In the end, I rollbacked to 2. 15.22 (Sep. 2019 version) to make localizations work without having huge match-paths.json.

@KyleAMathews @wardpeet is this something that you would be happy having opensource contributions for? Or is it something you're already working on? If we were to contribute, are there any specific areas to look out for, or has our (@ConorLindsay94) initial work highlighted the main area?

@pieh @wardpeet we have noticed that as a result of changes in https://github.com/gatsbyjs/gatsby/pull/25057 that the matchPaths are no longer included in app-[hash].js files. This has a positive side-effect for this issue. We haven't done thorough testing yet of our client-only nested routers, which we will do shortly, but I wanted to ask if this side-effect (which is positive so far) was intended?

@pieh @wardpeet we have noticed that as a result of changes in #25057 that the matchPaths are no longer included in app-[hash].js files. This has a positive side-effect for this issue. We haven't done thorough testing yet of our client-only nested routers, which we will do shortly, but I wanted to ask if this side-effect (which is positive so far) was intended?

Hmmm, this PR shouldn't have such impact - what this PR should have done is just change from using .cache/match-paths.json to $virtual/match-paths.json, but the size of match-paths shouldn't be impacted at all. This might be regression TBH, will drill a bit into it

Using default it seems like match-paths.json are still bundled into app chunk for me (at least using default start) - and this is expected (as I mentioned in comment above)

I wonder - this might be result of changes to webpack chunking setup maybe that we done some time ago? If that would be the case then match-paths might not end up in app chunk, it might just end up somewhere else (especially in case when this array is a large one).

Other option is that there is regression, but it's conditional - there might be some specific setup that cause this - removing match-paths from bundle would/could cause issues with handling of client side routes. I guess in some cases the tradeoff could be worth it to decrease bundle size at the cost of some runtime problems? (i.e. what potentially could happen is that navigating to a path handled by matchPath but not a path might cause browser reload -> loss of state etc, but the site/app generally could still "work" in general)

@pieh apologies, it was a false positive (true negative for this issue, I suppose). I have run various tests and it does seem that webpack is now putting matchPaths array elsewhere. Thank you for your help though.

I've run tests using different examples sites, slowly increasing the number of dynamic pages, using something like:

exports.createPages = ({ actions }) => {
  const { createPage } = actions
  for (let i = 0; i < 2000; i++) {
    createPage({
      path: `/${i}`,
      matchPath: `/${i}/*`,
      component: require.resolve(`./src/pages/index.js`),
    })
  }
}

When the array gets to a certain size, webpack is creating an independent chunk for the matchPaths, called something like public/188075ab-f28a602826f6270b17df.js (hash changes each build of course). I guess this would be a filesize related decision, where webpack is deciding to chunk.

It seems the change in webpack behavior is between 2.23.20 and 2.23.21, which is where webpack-virtual-modules is introduced, so could be related to those changes, or could be elsewhere, I'd have to dig further - but I don't think that is needed at this point. This issue and the original problem still exists for larger sites with mixed static and dynamic routes, that require matchPath registrations.

In fact, now having multiple large .js files on first load might in fact make the problem exponentially worse, for site load performance and for Google vitals scores. Before we had just a super large app-[hash].js file - which was problem enough, now we have a still-quite-large app-[hash].js file and the addition of a super large [matchPaths].js file.

Ok, so this seems like combination of things here. Introduction of virtual modules (at least the way they were implemented) cause match-paths to have different handling for chunk splitting (as it virtually lands in node_modules, there are some node_modules rules for chunking). Will be working on adjusting that this change to get back to behaviour it had before introduction of virtual modules

https://github.com/gatsbyjs/gatsby/pull/25720 is PR that should fix the chunk splitting change (regression?)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

3CordGuy picture 3CordGuy  路  3Comments

theduke picture theduke  路  3Comments

timbrandin picture timbrandin  路  3Comments

hobochild picture hobochild  路  3Comments

Oppenheimer1 picture Oppenheimer1  路  3Comments