Gatsby: Question: does `gatsby-source-filesystem` not cache downloaded remote files between builds?

Created on 9 Mar 2018  ยท  14Comments  ยท  Source: gatsbyjs/gatsby

I'm still digging into why my site with gatsby-source-wordpress is taking so long to build. (#4293)

For context: I have a site with about 1,000 posts and about 2,000 (large) images.
My builds are currently taking 20-30 minutes to finish. If I skip downloading images, it takes 2.

Looking at create-remote-file-node it looks like the cache is only used to remember headers that were received during the current build, I guess to prevent downloading the same image twice during a single build in case the image is referenced multiple times.

From what I can tell, Gatsby will still download every remote file on every build.

Would it not be possible to store the whole image to prevent redownloading and reprocessing with gatsby-transformer-sharp?

Most helpful comment

I'm not sure if it's worth time/effort to do this just for file downloader - it depends how do you see permanent cache implemented (one you mentioned here - https://github.com/gatsbyjs/gatsby/pull/3977#issuecomment-365113609 ).

I could do initial part of just splitting global key-value cache into separate caches dedicated for each plugin ( this one where headers are stored - https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/src/utils/cache.js ) - mainly this would be needed to not redownload files after cache invalidation (if we don't change gatsby-source-filesystem)

In next iterations we can implement helper function for plugins to get their dedicated file cache path and store nodes and plugin status in dedicated partial state jsons.

Ultimately I would see it something like this:

.cache
โ”œโ”€โ”€ plugins
โ”‚   โ””โ”€โ”€ gatsby-source-somename
โ”‚       โ””โ”€โ”€ files ( we can provide helper function to get path to that )
โ”‚           โ””โ”€โ”€ cached-file.jpg
โ”‚       โ”œโ”€โ”€ db.json ( dedicated key-value cache - this is what I would start on )
โ”‚       โ””โ”€โ”€ partial-redux-state.json ( nodes and plugin status - both version and what plugin will set using `setPluginStatus` function )
โ””โ”€โ”€ global-redux-state.json ( componentDataDependencies )

All 14 comments

Also, it appears that gatsby-source-wordpress is downloading the entire WordPress media library on each build, which for me is 3.5GB and quite a lot of time. Is this necessary?

Hey @awesomebob, would you like making PR to change that in gatsby-source-filesystem? gatsby-transformer-sharp will use images it produced in previous build if they are available, so it's mostly about downloading images

@pieh I would be willing to do that.

It looks like I just need to check if filename already exists with existsSync from fs-extra, which is what gatsby-transformer-sharp is already using.

Upon further reflection, I'm still going to run into problems even once #4493 is merged in.
Gatsby clears the .cache whenever a plugin is updated, and then gatsby-source-wordpress will try to download the entire media library again, sequentially, and that is taking longer than the 1 hour I have to work with in GitLab CI.

Should I try to parallelize that download, or what else can I do? I currently have 3.5GB in the WordPress media library, and that's going to keep growing.

Yeah โ€” files aren't redownloaded โ€” it's the clearing of the cache that's the problem.

Perhaps we should make the automatic cache clearing less aggressive? Just remove the redux-state.json data?

Ah you're right, it's not redownloading files on a 304.

Clearing the cache less often might help, but at some point I'm going to have to redownload the entire WordPress Media Library again. It seems like I might have shot myself in the foot for adding 3.5GB of media there.

I might need to look into pre-processing the images first instead of during the Gatsby build if it's going to take more than an hour for my large site.

That's why I was so confused as I thought this was already working but reading code didn't make it obvious.

I'd be up for exploring less aggressive cache invalidation as this might be quick win for cases like this.
Should we also delete: (?)

  • cache/db.json persistent key-value store (for cache object passed to plugins)
  • json directory (results of queries)
  • layouts directory (layout components wrapped with data from the query)
  • api-runner-browser.js and api-runner-ssr.js (api runners with requires to plugins that use those apis)
  • async-requires.js and sync-requires.js (used to let webpack handle bundling/splitting of components and results of queries)
  • pages.json (mapping to chunks with page/layout components and json files with results of queries)

I think that's all the things gatsby dynamically generates on itself (part from redux-state.json). My bet would be to delete it all when gatsby invlidates cache as it will get regenerated anyway.

Yeah, cache/db.json too. The others are written over on every bootstrap so not caches.

Yeah, this would be an easy quick win. You want to take this @pieh?

@KyleAMathews Yeah, I will try to prepare PR for tomorrow.

@awesomebob Downloading is already parallelized ( https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-wordpress/src/normalize.js#L382-L413 ). Maybe running so many concurrent requests is actually hurting more than it helps? Maybe it would be worth exploring some kind of async queue for that with sane limit of concurrent operations?

On the second thought - less aggressive cache invalidating with just deleting cache/db.json and redux-state.json will still cause redownloads anyway as content for If-None-Match header is stored in cache/db.json.

This would have to be more robust and probably would need separate cache instances for all plugins and deleting just caches for plugins that has changed. Thoughts?

Oh hmmm yeah. Perhaps the remote file downloader can just use its own cache for now.

I'm not sure if it's worth time/effort to do this just for file downloader - it depends how do you see permanent cache implemented (one you mentioned here - https://github.com/gatsbyjs/gatsby/pull/3977#issuecomment-365113609 ).

I could do initial part of just splitting global key-value cache into separate caches dedicated for each plugin ( this one where headers are stored - https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/src/utils/cache.js ) - mainly this would be needed to not redownload files after cache invalidation (if we don't change gatsby-source-filesystem)

In next iterations we can implement helper function for plugins to get their dedicated file cache path and store nodes and plugin status in dedicated partial state jsons.

Ultimately I would see it something like this:

.cache
โ”œโ”€โ”€ plugins
โ”‚   โ””โ”€โ”€ gatsby-source-somename
โ”‚       โ””โ”€โ”€ files ( we can provide helper function to get path to that )
โ”‚           โ””โ”€โ”€ cached-file.jpg
โ”‚       โ”œโ”€โ”€ db.json ( dedicated key-value cache - this is what I would start on )
โ”‚       โ””โ”€โ”€ partial-redux-state.json ( nodes and plugin status - both version and what plugin will set using `setPluginStatus` function )
โ””โ”€โ”€ global-redux-state.json ( componentDataDependencies )

Invalidating the cache by plugin version would be a great start to make the cache permanent/immutable. Instead of separate caches though, how about we just prepend cache keys with the plugin version which would accomplish the same thing & avoid having dozens of caches.

Due to the high volume of issues, we're closing out older ones without recent activity. Please open a new issue if you need help!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ghost picture ghost  ยท  3Comments

hobochild picture hobochild  ยท  3Comments

magicly picture magicly  ยท  3Comments

theduke picture theduke  ยท  3Comments

dustinhorton picture dustinhorton  ยท  3Comments