I'd like to add support for a plugin option to skip creating parent-child link in gatsby-transformer-json. Currently gatsby-transformer-json creates a link between the parent node (mostly a file node created by gatsby-source-filesystem, but this could be any node created by any source plugin). This is done by calling createParentChildLink after creating the JSON node.
I have about 145k JSON nodes (split across multiple files). When I run gatsby build, "source and transform nodes" step takes about 180-200s on average (on my laptop, ~130s on CI). If I choose to skip the createParentChildLink call in gatsby-transformer-json, this step gets completed in ~20s. This ~8-10x performance improvement in "source and transform nodes" step is really huge and helps in reducing the overall build time.
From my understanding, this parent-child link is required mainly for these two use cases:
gatsby-source-filesystem was deleted or modified, Gatsby needs to delete/modify the respective JSON nodes.Is there any other use case for maintaining this parent-child relationship?
While these are valid use cases, this createParentChildLink call (and the subsequent action to store this in redux store) seems to be an overhead when the data is huge.
Since I know for sure that my JSON source files are not going to change during develop/build process, the price I pay for parent-child link looks very expensive.
So I would like to add an option to gatsby-transformer-json to skip this call to create parent-child link. While I have tested this only with gatsby-transformer-json, this is likely true for all tranformer plugins.
Default behaviour of the plugin won't change with this option, only when someone (like me) doesn't want to have parent-child link, they can choose to skip this step to save some build time.
An example of how this will look in gatsby-config:
{
resolve: `gatsby-transformer-json`,
options: {
skipParentChildLink: true
}
}
And it's probably the case with many people where the source files change during the develop phase but not during build. In such cases, users can have config like this:
{
resolve: `gatsby-transformer-json`,
options: {
skipParentChildLink: process.env.gatsby_executing_command === "build"
}
}
Does this make sense? Or am I overlooking something which could be broken by this behaviour?
I have made these changes locally and tested against my site. If we're okay with this, I'm happy to submit PR(s) with these changes.
Cc: @pvdz - thought you'd be interested :)
One caveat I can think of is that the option is rather obscure and that it might be difficult to explain when you should and shouldn't use it. Or what you should expect to no longer work if you do use it.
However, a 10x improvement across the board is definitely interesting. cc @vladar and @freiksenet for insights. Maybe there's a way to handle this situation better such that we get the perf improvements by default?
I think your description of createParentChildLink is pretty accurate.
As for the option idea... In general, we try to avoid options like this. But we should probably investigate why it takes so long. This action doesn't do anything special - it basically adds an id of the child node to parents' children array (via redux action).
It shouldn't be much slower than the base createNode call (which also runs through redux). So if your createNode calls are completed in 20 seconds, I would expect createParentChildLink shouldn't take much more than this.
As always, it would be great to have a reproduction at hand. Then we can probably optimize this case for everyone.
@vladar sure, you can check https://github.com/tsriram/ifsc for reproduction.
I'm just guessing here - I think action creator itself is fine, but we listen to ADD_CHILD_NODE_TO_PARENT_NODE in multiple places to invalidate some caches ( https://github.com/gatsbyjs/gatsby/search?q=ADD_CHILD_NODE_TO_PARENT_NODE&unscoped_q=ADD_CHILD_NODE_TO_PARENT_NODE ). This might be the case of starting to listen to it too early in one (or more) of those places(?)
@pieh Agree, most likely it happens because of some of the reducers. But even those seem to be lightweight in comparison to CREATE_NODE action (at least from the first look).
Oh, you mean graphql-runner and query. Yeah, highly likely
Ok, I take back my initial assumption - it's defenitely createParentChildLink here - and more precisely this _.uniq call:
https://github.com/gatsbyjs/gatsby/blob/6979816b4f71f4fa9bc79824b0fc6883271e0dda/packages/gatsby/src/redux/actions/public.js#L1030
That's because we create many, ... many JSON children for each File node in this case, we spend ridiculous amount of time there:
zoomed out profile for single onCreateNode with single File node that will create huge amount of children:

Bit more zoom in to see individual children:

We actually spend ~12% of build time doing this _.uniq (that's total build time, not just sourceNodes - should have limit profiling to sourceNodes heh) :

Quick win without changing data structure could be to just append children there and make single uniq once all transformers do work maybe? (but that's assuming that people don't call call createParentChildrenLink in weird places)
We actually spend ~12% of build time doing this _.uniq
:cry:
One idea is to simplify it even further;
uniq is only called to make sure the new child is deduped. So instead doing an if !.includes before pushing would suffice.Set instead. I think we should do this regardless but I'm worried the usages are going to be quite heinous to find.As an intermediate step; we can try to change it to [...new Set(parent.children)] for the same effect, see if that perf is better.
I'm worried about changing structure from Array to Set - users/plugins operate on pure node data in some cases so some plugins might rely on children being array. (that was my primary reason for mentioning "not changing data structure")
I'm more than fine with !includes check before adding item to children or just changing _.uniq to something else
Quick checks on 100k nodes;
[...new Set()] hack had similar timingsinlcudes hack dropped it to 20sWhile I'm very unhappy about doing a repeated O((n^2)/2) unguided string search through 100k+ elements (--> .includes), I suspect that's still just a subset of the work that _.uniq is doing so that feels like a good first step forward.
This seems to hold for 150k nodes; from 45s to 30s. I'll put up a PR.
One question related to this; couldn't we simply check the parent of the current child and prevent this whole step if they are already the same?
Ok, so #22126 will tackle this partially. That's already great, thanks @tsriram ! But as you showed there's still room for improvement. Ideally this check would be O(1). Assuming the child's parent state is sound, we would only need to check that state to determine whether we need to update the parent state. And in that case, we should prevent the whole createParentChildLink call in the first place.
@tsriram The above PR shaves 30s of your total runtime, about 9%. Do you feel this PR will still add much to it after that?
@pvdz awesome! This has brought down the "source and transform nodes" to ~40s. Yeah, ideally I'd like to avoid the link altogether, but if you all think that's not the best way, I think this is the best we can get for now :)
While I wouldn't mind another 50% cut, I think this is acceptable without becoming too specific.
If you have further ideas please share them. I'm super happy we were able to merge this fix. Thanks a lot! :D