Gatsby: [gatsby-source-wordpress] Large WordPress site causing extremely slow build time (stuck at 'source and transform nodes')

Created on 22 Jul 2018  ·  156 Comments  ·  Source: gatsbyjs/gatsby

Description

gatsby develop hangs on source and transform nodes after querying a large WordPress installation (~9000 posts, ~35 pages).

Are there any guidelines on what's too big for Gatsby to handle in this regard?

Environment

  System:
    OS: macOS High Sierra 10.13.6
    CPU: x64 Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
    Shell: 3.2.57 - /bin/bash
  Binaries:
    Node: 8.10.0 - ~/n/bin/node
    Yarn: 1.5.1 - ~/n/bin/yarn
    npm: 5.6.0 - ~/n/bin/npm
  Browsers:
    Chrome: 67.0.3396.99
    Safari: 11.1.2
  npmPackages:
    gatsby: ^1.9.273 => 1.9.273
    gatsby-image: ^1.0.54 => 1.0.54
    gatsby-link: ^1.6.45 => 1.6.45
    gatsby-plugin-google-analytics: ^1.0.27 => 1.0.31
    gatsby-plugin-postcss-sass: ^1.0.22 => 1.0.22
    gatsby-plugin-react-helmet: ^2.0.10 => 2.0.11
    gatsby-plugin-react-next: ^1.0.11 => 1.0.11
    gatsby-plugin-resolve-src: 1.1.3 => 1.1.3
    gatsby-plugin-sharp: ^1.6.48 => 1.6.48
    gatsby-plugin-svgr: ^1.0.1 => 1.0.1
    gatsby-source-filesystem: ^1.5.39 => 1.5.39
    gatsby-source-wordpress: ^2.0.93 => 2.0.93
    gatsby-transformer-sharp: ^1.6.27 => 1.6.27
  npmGlobalPackages:
    gatsby-cli: 1.1.58

edit: Just want to reiterate: this is not something easily fixable by deleting .cache/, node_modules/, etc. If that resolves your problem, you weren't experiencing this issue.

Most helpful comment

Guys, I managed to fix this by running createRemoteFileNode requests in serial instead of parallel.
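
A minimal sketch of what "serial instead of parallel" means here. The `task` callback is a hypothetical stand-in for whatever function wraps a single createRemoteFileNode call; the Gatsby-specific arguments are deliberately left out, so this only illustrates the control flow and is not the commenter's actual code.

```javascript
// Runs `task` for each item one at a time, waiting for each call to settle
// before starting the next one. `task` is a hypothetical wrapper around a
// single createRemoteFileNode call.
async function runSerially(items, task) {
  const results = [];
  for (const item of items) {
    // Awaiting inside the loop is what makes the requests serial.
    results.push(await task(item));
  }
  return results;
}
```

By contrast, `Promise.all(items.map(task))` starts every request immediately, which is the pattern that can overwhelm a small server.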

Yeah, the issue really comes down to createRemoteFileNode using a concurrency of 200, which is too much for most WP servers. I have my images on CloudFront and was hitting some rate limits there.

I tried fixing the issue with a branched version of the source plugin for a while, but the issue really isn't in gatsby-source-wordpress; it is in gatsby-source-filesystem. Ideally, consumers of the createRemoteFileNode function would be able to pass in a concurrency value there. Then plugins could expose the concurrency option in their configs. I would still like to do a PR to address this issue!

The solution I have been using is just a simple script to modify the code inside node_modules. Really quite fragile and not ideal but it is a simple hack to modify the concurrency directly. Uses shelljs so it is supposed to work for windows users as well (haven't tried).

#!/usr/bin/env node
const path = require('path');
const shell = require('shelljs');

const FILE_PATH = path.resolve(
  __dirname,
  // add path to your root dir here,
  'node_modules',
  'gatsby-source-filesystem/create-remote-file-node.js'
);

shell.sed('-i', 'concurrent: 200', 'concurrent: 20', FILE_PATH);
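
For readers who would rather not patch node_modules, the idea of a configurable concurrency limit can be sketched in user-land. This is an illustration under assumptions, not gatsby-source-filesystem's implementation: `mapWithConcurrency` and `worker` are hypothetical names, and `worker` stands in for whatever function downloads one file.

```javascript
// Maps `items` through an async `worker`, keeping at most `limit` calls in
// flight at any time. Results are returned in input order.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  // Never start more runners than there are items, and always allow at least one.
  const runnerCount = Math.min(Math.max(1, limit), items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

With `limit` set to 1 this degenerates to serial execution; a value such as 20 mirrors the number the script above writes into the patched file. If any worker rejects, the returned promise rejects as well, so callers that want to continue past failures need to catch inside `worker`.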

All 156 comments

Can you prepare a reproduction repo? The number of posts shouldn't be a problem (at least at this step). v1 might run into memory problems, but that would be in a later build step and shouldn't cause it to get stuck here.

Was curious whether it was an issue with Local by Flywheel, and I was able to build the site when serving WordPress via MAMP Pro.

But, I'm not even building post pages yet (am building the pages), and the execution time for that problematic step is 636.41s (just shy of 11 minutes).

const path = require('path')

exports.createPages = ({ boundActionCreators, graphql }) => {
  const { createPage } = boundActionCreators

  const postTemplate = path.resolve('./src/templates/Post/Post.js')

  // Return the promise so Gatsby waits for the query before moving on.
  return graphql(
    `
      {
        allWordpressPost {
          edges {
            node {
              id
              slug
            }
          }
        }
      }
    `
  )
    .then((result) => {
      console.log('posts')
      // const { data, errors } = result

      // if (errors) console.log(errors)

      // if (!data) return

      //data.allWordpressPost.edges.forEach(({ node }) => {
      //  const { id, slug } = node

      //  createPage({
      //    component: postTemplate,
      //    context: {
      //      id,
      //    },
      //    path: slug,
      //  })
      //})
    })
}

edit: just enabled createPage for posts, and execution of that step rose to 14 minutes. Brutal, but also interesting that it's only 3 minutes longer for ~9000 more items. It has currently been sitting on ⠁ run graphql queries for a long time.

edit: that ran for 419.470 s, or 7 minutes.

@pieh Whoops, posted that before I saw you'd just replied. I can try to get this site up remotely tomorrow.

And meant to include, this last line is where it hangs via Local, and takes forever via MAMP.

$ gatsby develop
success delete html and css files from previous builds — 0.017 s
success open and validate gatsby-config — 0.226 s
info One or more of your plugins have changed since the last time you ran Gatsby. As
a precaution, we're deleting your site's cache to ensure there's not any stale
data
success copy gatsby files — 0.013 s
success onPreBootstrap — 0.159 s
⠁ source and transform nodes -> wordpress__acf_posts fetched : 100
⠁ source and transform nodes -> wordpress__acf_pages fetched : 34
⠂ source and transform nodes -> wordpress__acf_media fetched : 100
⠈ source and transform nodes -> wordpress__acf_categories fetched : 13
⢀ source and transform nodes -> wordpress__acf_tags fetched : 0
⠄ source and transform nodes -> wordpress__acf_users fetched : 11
⢀ source and transform nodes -> wordpress__POST fetched : 9092
⢀ source and transform nodes -> wordpress__PAGE fetched : 34
⠐ source and transform nodes -> wordpress__wp_media fetched : 7483
⡀ source and transform nodes -> wordpress__wp_types fetched : 1
⠁ source and transform nodes -> wordpress__wp_statuses fetched : 1
⢀ source and transform nodes -> wordpress__wp_taxonomies fetched : 1
⠄ source and transform nodes -> wordpress__CATEGORY fetched : 14
⠈ source and transform nodes -> wordpress__TAG fetched : 19
⠐ source and transform nodes -> wordpress__wp_users fetched : 11
⡀ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "You are not currently logged in."
⠈ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "Sorry, you are not allowed to do that."
⡀ source and transform nodesThe server response was "404 Not Found"
Inner exception message : "No route was found matching the URL and request method"
success source and transform nodes — 636.410 s

@pieh Haven't confirmed this will successfully build (now with the WordPress remote, it's taking hours), but it certainly reveals the issue: https://github.com/dustinhorton/gatsby-issue

Should be able to just clone that and build.

Just ran twice for over 10 hours without the site finishing building. Please let me know what else I can provide to help with debugging.

Could you try upgrading to v2? We've made a ton of speed improvements to different gatsby subsystems which should dramatically speed up large sites like this.

@KyleAMathews I'll give that a shot tonight—thanks.

@KyleAMathews v2 version @ https://github.com/dustinhorton/gatsby-v2-issue. Been building for about 50 minutes at this point.

Killing it now. Site still hasn't built.

Another thing you can try is to enable tracing https://next.gatsbyjs.org/docs/performance-tracing/

We haven't added tracing support yet to gatsby-source-wordpress but the tracing reports might help you figure out where it's stalling.

If anyone else is interested in looking into this, a great PR would be to add tracing support to gatsby-source-wordpress. Lemme know if you're interested!

Going to need to bail out on this unfortunately, as I need to spend all time I have porting over to a traditional theme—kind of crushed to not be able to use Gatsby. Everything else feels so backwards.

Sorry we haven't had a chance to look into this :-( Sprinting right now to get v2 out.

Is there a chance you could leave the WP site running? It definitely seems like there's a bug here that should be fixed.

I tweeted out asking for help so hopefully someone will jump on this soon :-)

https://twitter.com/gatsbyjs/status/1027079401287102465

Wow, that's rad—thanks so much. Site isn't going anywhere for the time being (and I'll migrate a copy and update repro repo if it needs to).

@dustinhorton for what it's worth I've also noticed issues building a larger (~1,000 post) project on Local by Flywheel compared to our production environment with a CDN in front of it.

REST responses for Gatsby are 10-20x longer from Local than from production, so the site takes forever to build. I haven't spent time debugging the issue in Local yet, but it's on my to-do list :)

@KyleAMathews I could take a look at adding tracing to source-wordpress.

@Khristophor that'd be great!

@dustinhorton I'm seeing 404's for the images on your sample site (https://dustinhorton.com/gatsby-wp/wp-content/uploads/2018/07/IMG_9906.jpg, for example) that might be inflating the build time. Any chance you could look in to the paths for those?

The WP_MEDIA requests run fairly quickly with results so figured I was in the clear, but I can take a look at that later this week if you think it may be the case.


That's true, but part of the source and transform step is to download all the media items it finds in the REST response:
https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-wordpress/src/normalize.js#L434

Getting 404's on 7504 images might be causing some problems ;)

Believe I've cleaned up all the 404s. Will try to build tonight. Thanks all.

Seemingly no change:

~/Sites/gatsby-issue-v2 (master)
→yarn build
yarn run v1.5.1
$ gatsby build
success open and validate gatsby-config — 0.009 s
success load plugins — 0.277 s
success onPreInit — 0.257 s
success delete html and css files from previous builds — 0.008 s
success initialize cache — 0.245 s
success copy gatsby files — 0.079 s
success onPreBootstrap — 0.001 s
⠁
=START PLUGIN=====================================

Site URL: http://dustinhorton.com/gatsby-wp
Site hosted on Wordpress.com: false
Using ACF: true
Using Auth: undefined undefined
Verbose output: true

Mama Route URL: http://dustinhorton.com/gatsby-wp/wp-json

⠁ source and transform nodesRoute discovered : /
Invalid route.
Route discovered : /oembed/1.0
Invalid route.
Route discovered : /oembed/1.0/embed
Invalid route.
Route discovered : /oembed/1.0/proxy
Invalid route.
Route discovered : /yoast/v1
Valid route found. Will try to fetch.
Route discovered : /yoast/v1/configurator
Valid route found. Will try to fetch.
Route discovered : /yoast/v1/reindex_posts
Valid route found. Will try to fetch.
Route discovered : /yoast/v1/ryte
Valid route found. Will try to fetch.
Route discovered : /yoast/v1/indexables/(?P<object_type>.*)/(?P<object_id>\d+)
Invalid route.
Route discovered : /yoast/v1/statistics
Valid route found. Will try to fetch.
Route discovered : /acf/v3
Invalid route.
Route discovered : /acf/v3/posts/(?P<id>[\d]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/posts
Valid route found. Will try to fetch.
Route discovered : /acf/v3/pages/(?P<id>[\d]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/pages
Valid route found. Will try to fetch.
Route discovered : /acf/v3/media/(?P<id>[\d]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/media
Valid route found. Will try to fetch.
Route discovered : /acf/v3/categories/(?P<id>[\d]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/categories
Valid route found. Will try to fetch.
Route discovered : /acf/v3/tags/(?P<id>[\d]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/tags
Valid route found. Will try to fetch.
Route discovered : /acf/v3/comments/(?P<id>[\d]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/comments
Valid route found. Will try to fetch.
Route discovered : /acf/v3/options/(?P<id>[\w\-\_]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/users/(?P<id>[\d]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/users
Valid route found. Will try to fetch.
Route discovered : /wp/v2
Invalid route.
Route discovered : /wp/v2/posts
Valid route found. Will try to fetch.
Route discovered : /wp/v2/posts/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/posts/(?P<parent>[\d]+)/revisions
Invalid route.
Route discovered : /wp/v2/posts/(?P<parent>[\d]+)/revisions/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/pages
Valid route found. Will try to fetch.
Route discovered : /wp/v2/pages/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/pages/(?P<parent>[\d]+)/revisions
Invalid route.
Route discovered : /wp/v2/pages/(?P<parent>[\d]+)/revisions/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/media
Valid route found. Will try to fetch.
Route discovered : /wp/v2/media/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/types
Valid route found. Will try to fetch.
Route discovered : /wp/v2/types/(?P<type>[\w-]+)
Invalid route.
Route discovered : /wp/v2/statuses
Valid route found. Will try to fetch.
Route discovered : /wp/v2/statuses/(?P<status>[\w-]+)
Invalid route.
Route discovered : /wp/v2/taxonomies
Valid route found. Will try to fetch.
Route discovered : /wp/v2/taxonomies/(?P<taxonomy>[\w-]+)
Invalid route.
Route discovered : /wp/v2/categories
Valid route found. Will try to fetch.
Route discovered : /wp/v2/categories/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/tags
Valid route found. Will try to fetch.
Route discovered : /wp/v2/tags/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/users
Valid route found. Will try to fetch.
Route discovered : /wp/v2/users/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/users/me
Valid route found. Will try to fetch.
Route discovered : /wp/v2/comments
Valid route found. Will try to fetch.
Route discovered : /wp/v2/comments/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/settings
Valid route found. Will try to fetch.
Added ACF Options route.

Fetching the JSON data from 25 valid API Routes...

=== [ Fetching wordpress__yoast_v1 ] === https://dustinhorton.com/gatsby-wp/wp-json/yoast/v1
⠈ source and transform nodes -> wordpress__yoast_v1 fetched : 1
Fetching the wordpress__yoast_v1 took: 936.166ms

=== [ Fetching wordpress__yoast_configurator ] === https://dustinhorton.com/gatsby-wp/wp-json/yoast/v1/configurator
⢀ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "Sorry, you are not allowed to do that."
Fetching the wordpress__yoast_configurator took: 846.014ms

=== [ Fetching wordpress__yoast_reindex_posts ] === https://dustinhorton.com/gatsby-wp/wp-json/yoast/v1/reindex_posts
⢀ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "Sorry, you are not allowed to do that."
Fetching the wordpress__yoast_reindex_posts took: 1010.589ms

=== [ Fetching wordpress__yoast_ryte ] === https://dustinhorton.com/gatsby-wp/wp-json/yoast/v1/ryte
⠠ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "Sorry, you are not allowed to do that."
Fetching the wordpress__yoast_ryte took: 1022.977ms

=== [ Fetching wordpress__yoast_statistics ] === https://dustinhorton.com/gatsby-wp/wp-json/yoast/v1/statistics
⠄ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "Sorry, you are not allowed to do that."
Fetching the wordpress__yoast_statistics took: 820.827ms

=== [ Fetching wordpress__acf_posts ] === https://dustinhorton.com/gatsby-wp/wp-json/acf/v3/posts
⠈ source and transform nodes -> wordpress__acf_posts fetched : 100
Fetching the wordpress__acf_posts took: 6352.670ms

=== [ Fetching wordpress__acf_pages ] === https://dustinhorton.com/gatsby-wp/wp-json/acf/v3/pages
⡀ source and transform nodes -> wordpress__acf_pages fetched : 34
Fetching the wordpress__acf_pages took: 2760.048ms

=== [ Fetching wordpress__acf_media ] === https://dustinhorton.com/gatsby-wp/wp-json/acf/v3/media
⠈ source and transform nodes -> wordpress__acf_media fetched : 100
Fetching the wordpress__acf_media took: 4273.250ms

=== [ Fetching wordpress__acf_categories ] === https://dustinhorton.com/gatsby-wp/wp-json/acf/v3/categories
⠁ source and transform nodes -> wordpress__acf_categories fetched : 13
Fetching the wordpress__acf_categories took: 1029.029ms

=== [ Fetching wordpress__acf_tags ] === https://dustinhorton.com/gatsby-wp/wp-json/acf/v3/tags
⠈ source and transform nodes -> wordpress__acf_tags fetched : 0
Fetching the wordpress__acf_tags took: 941.066ms

=== [ Fetching wordpress__acf_comments ] === https://dustinhorton.com/gatsby-wp/wp-json/acf/v3/comments
⢀ source and transform nodes -> wordpress__acf_comments fetched : 9
Fetching the wordpress__acf_comments took: 2868.036ms

=== [ Fetching wordpress__acf_users ] === https://dustinhorton.com/gatsby-wp/wp-json/acf/v3/users
⠠ source and transform nodes -> wordpress__acf_users fetched : 11
Fetching the wordpress__acf_users took: 2049.181ms

=== [ Fetching wordpress__POST ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/posts
⠁ source and transform nodes
Total entities : 9094
Pages to be requested : 91
⠁ source and transform nodes -> wordpress__POST fetched : 9094
Fetching the wordpress__POST took: 152767.807ms

=== [ Fetching wordpress__PAGE ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/pages
⢀ source and transform nodes -> wordpress__PAGE fetched : 34
Fetching the wordpress__PAGE took: 2194.895ms

=== [ Fetching wordpress__wp_media ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/media
⢀ source and transform nodes
Total entities : 7504
Pages to be requested : 76
⢀ source and transform nodes -> wordpress__wp_media fetched : 7485
Fetching the wordpress__wp_media took: 132029.996ms

=== [ Fetching wordpress__wp_types ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/types
⢀ source and transform nodes -> wordpress__wp_types fetched : 1
Fetching the wordpress__wp_types took: 956.603ms

=== [ Fetching wordpress__wp_statuses ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/statuses
⢀ source and transform nodes -> wordpress__wp_statuses fetched : 1
Fetching the wordpress__wp_statuses took: 1017.845ms

=== [ Fetching wordpress__wp_taxonomies ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/taxonomies
⠠ source and transform nodes -> wordpress__wp_taxonomies fetched : 1
Fetching the wordpress__wp_taxonomies took: 1029.885ms

=== [ Fetching wordpress__CATEGORY ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/categories
⢀ source and transform nodes -> wordpress__CATEGORY fetched : 14
Fetching the wordpress__CATEGORY took: 943.710ms

=== [ Fetching wordpress__TAG ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/tags
⠠ source and transform nodes -> wordpress__TAG fetched : 19
Fetching the wordpress__TAG took: 1104.454ms

=== [ Fetching wordpress__wp_users ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/users
⡀ source and transform nodes -> wordpress__wp_users fetched : 11
Fetching the wordpress__wp_users took: 1325.604ms

=== [ Fetching wordpress__wp_me ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/users/me
⠂ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "You are not currently logged in."
Fetching the wordpress__wp_me took: 926.146ms

=== [ Fetching wordpress__wp_comments ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/comments
⠂ source and transform nodes
Total entities : 9410
Pages to be requested : 95
⡀ source and transform nodes -> wordpress__wp_comments fetched : 9397
Fetching the wordpress__wp_comments took: 85370.673ms

=== [ Fetching wordpress__wp_settings ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/settings
⠁ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "Sorry, you are not allowed to do that."
Fetching the wordpress__wp_settings took: 808.396ms

=== [ Fetching wordpress__acf_options ] === http://dustinhorton.com/gatsby-wp/wp-json/acf/v2/options
⠂ source and transform nodesThe server response was "404 Not Found"
Inner exception message : "No route was found matching the URL and request method"
Fetching the wordpress__acf_options took: 1059.276ms

=END PLUGIN=====================================: 412457.896ms
⠁ source and transform nodes

And it's been sitting there for about 8 hours.
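
As a sanity check on the verbose output above, the "Pages to be requested" figures are consistent with 100 entities per REST request. The `perPage` value of 100 is an assumption inferred from the log, not read from the plugin source. Note that the fetch phase itself finished in about 412 s, so the multi-hour stall happens after the REST data has arrived.

```javascript
// Number of paginated REST requests needed to fetch `total` entities.
// perPage = 100 is an assumption inferred from the log output.
function pagesToRequest(total, perPage = 100) {
  return Math.ceil(total / perPage);
}

console.log(pagesToRequest(9094)); // 91 (wordpress__POST)
console.log(pagesToRequest(7504)); // 76 (wordpress__wp_media)
console.log(pagesToRequest(9410)); // 95 (wordpress__wp_comments)
```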

@dustinhorton what kind of hosting are you using? I think it's just killing your production box with the amount of requests. I believe I got it to finish (after quite some time, not eight hours) setting concurrent connections to something low, like 1 or 2.

It's a decent VPS on Linode. I can get settings tweaked on it if that'd help. But the issue happens locally too.

https://github.com/gatsbyjs/gatsby/blob/46290c2b0e7894fca036bdcc658a5d1936c4221f/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L133-L159 This sometimes doesn't work correctly when we pull a larger number of files: the network request gets resolved, but the file write stream never finishes (or errors out). I think it would be great to add some kind of timeout after responseStream finishes to wait for fsWriteStream to finish. If it doesn't finish in time, we would destroy all resources and try to write the file again (possibly with a few retries), and actually error out when that still fails.
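
The timeout-and-retry idea described above could look roughly like the following. This is a sketch under assumptions, not a patch for gatsby-source-filesystem: `withTimeout`, `withRetries`, and `writeStreamWithTimeout` are hypothetical helpers, the timeout and retry counts are arbitrary, and it assumes a Node.js version that provides `stream.pipeline` and `Promise.prototype.finally`.

```javascript
const fs = require('fs');
const { pipeline } = require('stream');

// Rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Calls `attempt` up to `retries + 1` times and rethrows the last error.
async function withRetries(attempt, retries = 2) {
  let lastError;
  for (let i = 0; i <= retries; i++) {
    try {
      return await attempt(i);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Pipes a fresh source stream into `destPath` and fails if the write has not
// finished within `ms`. `createSource` is a factory so each retry gets a new stream.
function writeStreamWithTimeout(createSource, destPath, ms) {
  const source = createSource();
  const dest = fs.createWriteStream(destPath);
  const done = new Promise((resolve, reject) => {
    pipeline(source, dest, (err) => (err ? reject(err) : resolve(destPath)));
  });
  return withTimeout(done, ms, `write to ${destPath}`).catch((err) => {
    // Destroy both ends so a stuck stream does not keep handles open.
    source.destroy();
    dest.destroy();
    throw err;
  });
}
```

A caller would combine the two, for example `withRetries(() => writeStreamWithTimeout(makeStream, target, 30000), 2)`, where `makeStream` and `target` are placeholders. Because the destination is reopened with the default write flag on each attempt, a partially written file from a failed attempt is truncated rather than appended to.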

@pieh can you please send updated code for this file?

/packages/gatsby-source-filesystem/src/create-remote-file-node.js

@aman-developer there is no fix for this yet, otherwise it would be published. The problem with this issue is that there is no reliable way to reproduce it, so any fixes are guesses. In some cases (which might be hardware and/or OS specific) the filesystem writeStream doesn't finish and gets stuck without throwing errors, so any fix here is really an attempt to work around the fs package, hardware, or OS not being reliable :/

Have you had issues reproing with my repo & site? It's consistent for me.

I use createRemoteFileNode to fetch remote images and I experience this same problem: download gets stuck at around 680/780ish.

In createRemoteFileNode, there is a listener for the downloadProgress event, which was added in https://github.com/sindresorhus/got/releases/tag/v8.0.0, but gatsby-source-filesystem uses got 7.1.0.

I tried upgrading got to the latest version 9.2.2 and could now successfully download all images.

Add this to package.json (the resolutions field is a Yarn feature):

  "resolutions": {
    "got": "^9.2.2"
  }

Also, there seem to be some critical fixes in got after 7.1.0, like stream errors not being correctly forwarded, etc. (https://github.com/sindresorhus/got/releases/tag/v8.0.1)

I tried updating got but still sometimes get stuck; it's worth doing anyway. Just note that the downloadProgress handling will need either disabling or some nicer output, because the terminal/console gets spammed with progress when using it.

I was able to run gatsby develop after ~25 minutes, but I had to reduce concurrency in create-remote-file-node.js from 200 to 20. After putting logs in the empty catch in processRemoteNode, I did see some 22 TimeoutErrors (those files were redownloaded when executing gatsby develop again).

Not sure if it's because of got, but maybe we can experiment with other HTTP clients...

...
success source and transform nodes — 1407.531 s
success building schema — 3.315 s
success createPages — 0.571 s
success createPagesStatefully — 2.797 s
success onPreExtractQueries — 0.012 s
success update schema — 3.268 s
warning There are conflicting field types in your data. GraphQL schema will omit those fields.
wordpress__wp_media.media_details.width:
 - type: number
   value: 916
 - type: string
   value: '224'
wordpress__wp_media.media_details.height:
 - type: number
   value: 916
 - type: string
   value: '225'
wordpress__wp_media.media_details.sizes.thumbnail.width:
 - type: number
   value: 150
 - type: string
   value: '150'
wordpress__wp_media.media_details.sizes.thumbnail.height:
 - type: number
   value: 150
 - type: string
   value: '150'
wordpress__wp_media.media_details.sizes.medium.width:
 - type: number
   value: 300
 - type: string
   value: '300'
wordpress__wp_media.media_details.sizes.medium.height:
 - type: number
   value: 300
 - type: string
   value: '200'
wordpress__wp_media.media_details.sizes.large.width:
 - type: number
   value: 768
 - type: string
   value: '1024'
wordpress__wp_media.media_details.sizes.large.height:
 - type: number
   value: 1024
 - type: string
   value: '682'
wordpress__wp_media.media_details.image_meta.aperture:
 - type: number
   value: 2.2
 - type: string
   value: '0'
wordpress__wp_media.media_details.image_meta.created_timestamp:
 - type: boolean
   value: false
 - type: number
   value: 1433226914
 - type: string
   value: '0'
wordpress__wp_media.media_details.image_meta.focal_length:
 - type: number
   value: 0
 - type: string
   value: '0'
wordpress__wp_media.media_details.image_meta.iso:
 - type: number
   value: 0
 - type: string
   value: '0'
wordpress__wp_media.media_details.image_meta.shutter_speed:
 - type: number
   value: 0
 - type: string
   value: '0'
wordpress__wp_media.media_details.image_meta.orientation:
 - type: number
   value: 1
 - type: string
   value: '1'
warning Using the global `graphql` tag is deprecated, and will not be supported in v3.
Import it instead like:  import { graphql } from 'gatsby' in file:
/Users/tandingan.wlb/Projects/gatsby/gatsby-issue/src/templates/Post/Post.js
success extract queries from components — 0.120 s
success run graphql queries — 223.335 s — 9121/9121 40.84 queries/second
success write out page data — 0.119 s
success write out redirect data — 0.001 s
success onPostBootstrap — 0.027 s

info bootstrap finished - 1643.854 s
{ TimeoutError: Timeout awaiting 'request' for 30000ms
    at Immediate.timeoutHandler [as _onImmediate] (/Users/tandingan.wlb/Projects/gatsby/gatsby-issue/node_modules/got/source/timed-out.js:39:25)
    at runCallback (timers.js:694:11)
    at tryOnImmediate (timers.js:664:5)
    at processImmediate (timers.js:646:5)
  name: 'TimeoutError',
  code: 'ETIMEDOUT',
  host: 'dustinhorton.com',
  hostname: 'dustinhorton.com',
  method: 'GET',
  path: '/gatsby-wp/wp-content/uploads/2015/05/20150302_061259.jpg',
  socketPath: undefined,
  protocol: 'https:',
  url:
   'https://dustinhorton.com/gatsby-wp/wp-content/uploads/2015/05/20150302_061259.jpg',
  event: 'request' }

I'm getting the same errors with prismic

I upgraded to "got": "^9.2.2" and now it's working, hooray!

Definitely need to take a look at upgrading our got version. This is an intermittent problem, so it might be a coincidence that it worked. @RobinHerzog please let us know if you run into similar problems with the upgraded version of got.

Updating got significantly reduced build time for my repro repo, but still consistently took nearly an hour last I tried.

@dustinhorton what portion of the build was pulling images (or source and transform data, as we don't show explicitly how long downloading files takes)?

I have 150MB of images with a 1GB internet connection. Now it's working. It takes 30 seconds to download and continue building.

I'm also having this issue consistently. Upgrading got did not solve this for me. Any success with adding additional tracing to source-wordpress so we can try and debug what the problem is?

Tried changing concurrentRequests and perPage, as well as upgrading got to the latest version, but none worked. Right now I can fetch categories, posts, pages and tags, but when I include users or media, right after =END PLUGIN===, the plugin returns with an error: TypeError: Cannot read property 'id' of undefined.

If I include all routes and blacklist the ones I don't have access to, I get =END PLUGIN=== but it never finishes... This goes for tons of websites I tested, so I figure it might be my system somehow. If anyone wants to test this, here's the config:

    {
      resolve: 'gatsby-source-wordpress',
      options: {
        // Other URLs I tried:
        // https://clubedovalor.com.br
        // http://rivainvestimentos.com.br
        // http://queroinvestiragora.com/
        // https://www.clubedospoupadores.com/
        baseUrl: "aprenda.guiainvest.com.br",
        protocol: "https",
        hostingWPCOM: false,
        useACF: false,
        concurrentRequests: 10,
        perPage: 50,
        // Going with the excluded routes path
        // excludedRoutes: [
        //   '/*/*/plugins',
        //   '/rock-convert/**',
        //   '/yoast/**',
        //   '/wp-super-cache/**',
        //   '/*/*/users/me',
        //   '/*/*/settings',
        // ],
        verboseOutput: true,
        includedRoutes: [
          "/*/*/categories",
          "/*/*/posts",
          "/*/*/pages",
          "/*/*/tags",
          // You can toggle between media and users (or both)
          // All 3 scenarios will fail with the `'id' of undefined`
          // problem
          // "/*/*/media",
          "/*/*/users",
        ],
      },
    },

PS: One URL that I _did_ manage to fetch was https://wesbos.com/

HAPPY UPDATE: I managed to make it work (_for smaller sites_) with includedRoutes, even with users and/or media by including taxonomies in the query. Now I don't get the 'id' of undefined error :D

@pieh I believe the users and media types are dependent upon taxonomies, so maybe taxonomies should be included by default whenever the config contains either of these types? Let me know if I can help with further troubleshooting! As a closing note, this taxonomies bug seems unrelated to the infinite build process. With sites larger than ~500 media files, I still can't finish the build process!

UPDATE Number 2: So, I've managed to make it work for queroinvestiragora.com, which has 600 media files but only 70 posts. It takes roughly 15 seconds after =END PLUGIN===, but it works. However, www.clubedospoupadores.com has 702 media files and 336 posts, and it won't compile.

PS: My config in these experiments is:

    {
      resolve: 'gatsby-source-wordpress',
      options: {
        baseUrl: "queroinvestiragora.com",
        protocol: "http",
        hostingWPCOM: false,
        useACF: false,
        concurrentRequests: 10, // I've also tried removing it and going with the default, it's the same result
        verboseOutput: true,
        includedRoutes: [
          "/*/*/categories",
          "/*/*/posts",
          "/*/*/pages",
          "/*/*/tags",
          "/*/*/media",
          "/*/*/users",
          "/*/*/taxonomies",
        ],
      },
    },
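
To make the includedRoutes patterns above more concrete, here is a simplified illustration of how glob-style patterns can select routes from the discovered list. This is an assumption-laden sketch: `globToRegExp` and `filterRoutes` are hypothetical helpers, and the plugin's own matcher may behave differently in edge cases.

```javascript
// Converts a simplified glob to a RegExp: `**` matches anything (including `/`),
// `*` matches within a single path segment. Illustration only; this does not
// reproduce the plugin's actual matching logic.
function globToRegExp(glob) {
  const pattern = glob
    .split('**')
    .map((part) =>
      part
        .split('*')
        .map((literal) => literal.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
        .join('[^/]*')
    )
    .join('.*');
  return new RegExp(`^${pattern}$`);
}

// Keeps only the routes that match at least one of the included patterns.
function filterRoutes(routes, includedRoutes) {
  const matchers = includedRoutes.map(globToRegExp);
  return routes.filter((route) => matchers.some((re) => re.test(route)));
}

const discovered = [
  '/wp/v2/posts',
  '/wp/v2/media',
  '/wp/v2/users/me',
  '/yoast/v1/statistics',
  '/acf/v3/posts',
];
console.log(filterRoutes(discovered, ['/*/*/posts', '/*/*/media']));
// [ '/wp/v2/posts', '/wp/v2/media', '/acf/v3/posts' ]
```

Under these simplified rules, a pattern such as /*/*/posts also picks up /acf/v3/posts, which is worth keeping in mind when ACF routes are exposed.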

Hello,

I managed to add tracing using the steps outlined here https://www.gatsbyjs.org/docs/performance-tracing/. Unfortunately it did not provide much info as it simply told me that indeed source and transform nodes is taking quite long.

I have, however, done some of my own debugging on the issue after seeing some non-deterministic behavior involving images. When running either the develop or build script, I would get a case where not all of the images would be downloaded and the localFile nodes would not complete. After digging into the code, I have determined that there seems to be an issue here:

https://github.com/gatsbyjs/gatsby/blob/ad142af473fc8dc8555a5cf23a0dfca42fcbbe90/packages/gatsby-source-wordpress/src/normalize.js#L483-L506

For me, createRemoteFileNode was failing due to server timeout errors and defaulting to returning null. I had to add some logging to createRemoteFileNode as well to determine this and see the actual server responses. Since these nodes don't complete and don't have IDs, they don't get registered in the cache. The tmp files are deleted, leaving the gatsby-source-filesystem cache incomplete. For whatever reason (I haven't looked that far yet), running the build script again then deleted the gatsby-source-filesystem cache, probably because the script detects that it is invalid or incomplete. For me, this process created a loop that caused errors on future builds, as the cache never completes.
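
One way to make failures like these visible, instead of letting them collapse into null, is to record each failed download explicitly. The sketch below is hypothetical and is not the code from the rewritten plugin: `downloadAll` is an invented helper, and `download` stands in for a function that wraps createRemoteFileNode for one URL and throws on failure.

```javascript
// Downloads each URL in turn and separates successes from failures, so that
// timeouts and 404s are reported rather than silently becoming `null`.
// `download` is a hypothetical wrapper around createRemoteFileNode.
async function downloadAll(urls, download) {
  const nodes = [];
  const failures = [];
  for (const url of urls) {
    try {
      nodes.push(await download(url));
    } catch (err) {
      // Keep enough detail to log, retry, or fail the build later.
      failures.push({ url, message: err.message });
    }
  }
  return { nodes, failures };
}
```

After the run, the caller can log `failures` and decide whether an incomplete set of files should abort the build or just be retried on the next run.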

I'm working on a fix that seems to alleviate some of the issues at least regarding large amounts of images. When the develop or build script is successful in downloading all of the images the first time, it subsequently is not deleted and then the build process happens quite rapidly as the images are properly cached by gatsby-source-filesystem! My build went from 15 minutes down to 1 minute.

I'm not sure whether this is related to builds that have large amounts of posts. My issue was directly related to downloading 1.6 GB of image data.

This is my first time working with source plugins for gatsby so if anyone has any thoughts or advice regarding this I would appreciate it! I should be able to post my repo later today I am working on getting it to use my local version of gatsby-source-filesystem without complications.

Hello,

Following up on my comment from a few days ago. Here is my repo.

https://github.com/njmyers/byalejandradesign.com.git

I am using a monorepo in this project so here are some steps if you want to run the repository locally.

  1. Ensure you have the latest version of Yarn 1.12.3
  2. Clone the plugin branch git clone https://github.com/njmyers/byalejandradesign.com.git -b wordpress-plugin
  3. Run yarn && yarn bootstrap
  4. Navigate to the gatsby folder so you can look just at that folder cd packages/web
  5. Run yarn develop or yarn build-web. It should complete successfully the first time and subsequent runs of the same command will result in much quicker builds! Source and transform nodes takes 222s for me, whereas it was taking 3 times that earlier and/or not completing.
  6. If you want to see what is actually happening during source and transform you can look in your file browser at /packages/web/.cache/gatsby-source-filesystem you will see that the files are being created there.

I rewrote the downloadMediaFiles function completely. You can see that file at this link https://github.com/njmyers/byalejandradesign.com/blob/wordpress-plugin/packages/gatsby-source-wordpress/src/download-media-files.js

It is probably more verbose than it needs to be, but I had to do this in order to figure out everything that is happening. The functionality that I changed is adding a promise rejection when createRemoteFileNode returns null. I then use a function downloadRunner to throttle the requests so that they don't all hit the server at once, as well as to retry on promise rejections. I added a 200ms throttle between each createRemoteFileNode request. I'm sure this value could be tweaked, and some of this might be better suited to adding to createRemoteFileNode directly.
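For readers who don't want to dig through the fork, the idea described above can be sketched roughly like this. It is not the code from the linked file: the `downloadWithRetry` helper, the `maxRetries` and `delayMs` options, and the `download` callback (standing in for a wrapper around createRemoteFileNode) are all assumptions made for illustration.

```javascript
// Hypothetical sketch: run downloads one at a time, pause between them,
// treat a null result as a failure, and retry a limited number of times.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// `download` is an assumed callback that wraps createRemoteFileNode and
// resolves to a file node, or to null when the download silently failed.
async function downloadWithRetry(download, url, { maxRetries = 3, delayMs = 200 } = {}) {
  let lastError
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const node = await download(url)
      // Turn the silent null into an explicit error so it can be retried.
      if (node === null || node === undefined) {
        throw new Error(`No file node returned for ${url}`)
      }
      return node
    } catch (error) {
      lastError = error
      // Wait before the next attempt, but not after the final one.
      if (attempt < maxRetries) await sleep(delayMs)
    }
  }
  throw lastError
}

async function downloadRunner(urls, download, options = {}) {
  const { delayMs = 200 } = options
  const nodes = []
  for (const url of urls) {
    nodes.push(await downloadWithRetry(download, url, options))
    await sleep(delayMs) // throttle: pause before starting the next request
  }
  return nodes
}
```

With this shape a file that keeps failing ends in a rejected promise instead of a missing localFile node, so the caller still needs a catch somewhere to report it.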

If anyone is curious the WP install is EC2 micro instance while the images are behind a CloudFront distribution. Personally I never had any issues with getting posts my issue was with getting images and I believe that most of the issues people are having are due to this.

Also if anyone wants to trace or debug their own site I suggest starting here...

https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L240-L244

I added logging to the catch clause and was able to determine that the image nodes were not being processed correctly as I was getting timeout errors and then returning null.

@njmyers I just did very brief look at that and I'm thinking that if this works, we should use similar approach in createRemoteFileNode directly. We are using queue there, so consumers of this function (gatsby-source-wordpress in this case) shouldn't need to worry about this. One thing that is potentially problematic is that 200ms throttle - maybe we could start without it and when we start to see problems automatically apply throttling (per hostname)

@pieh Yes, that would probably be the place to apply this logic. The throttling for me was a way to approach this and diagnose the issue, so I agree that createRemoteFileNode should be able to handle this on its own.

Particularly problematic however is the current behavior of silently failing the errors and returning null. In my opinion there should be some communication about either the failure or success of the operation. I think createRemoteFileNode could be made more robust with the following functionality.

1) Eagerly create connections
2) If there are errors from the server begin to throttle and/or retry if needed
3) Set some sane defaults for throttling/retrying
4) Create an entry point for adjusting throttling/retrying
5) Reject a promise if for some reason the node is unable to be processed.

I can also say that I played around with timeout values here https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L135-L141. Although that increased the probability of a successful response I still had to add handling in order to ensure a successful response.

Most likely the correct entry point for this logic would be here.

https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L259-L269

Where if the tasks are failing they are retried and/or failed and then finally rejected.

Also just briefly read the queue docs. I see what you are saying about the queue being able to manage this on its own.

@njmyers nice investigation work! Definitely agree that the file downloading needs to be a lot smarter!

It could be nice actually to extract out the file downloading piece to its own package that focuses on this problem of downloading and caching remote files.

There's a good chance we'll need to use the functionality in multiple places in Gatsby in the future, and it's something other folks on the internet would want to use as well.

@KyleAMathews you mean extracting createRemoteFileNode to a separate package?

No just the file downloading and caching part. createRemoteFileNode would then just call this package and get back a promise that'd resolve when the file was downloaded (or returned from the cache).
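A very rough sketch of what such a package's core could look like, to make the idea concrete. Everything here is hypothetical: `createCachedDownloader` and the `fetchFile` callback are made-up names, and the cache is only in memory, whereas a real package would presumably persist files and metadata on disk.

```javascript
// Hypothetical sketch of a small "download and cache" helper.
// `fetchFile` is an assumed callback that downloads a URL and
// resolves to the local path of the downloaded file.
function createCachedDownloader(fetchFile) {
  const cache = new Map() // url -> Promise resolving to the local path

  return function download(url) {
    // Reuse the pending or finished download for this URL if there is one.
    if (cache.has(url)) return cache.get(url)

    const promise = Promise.resolve()
      .then(() => fetchFile(url))
      .catch(error => {
        cache.delete(url) // forget failures so a later call can try again
        throw error
      })

    cache.set(url, promise)
    return promise
  }
}
```

A caller such as createRemoteFileNode would then only await `download(url)` and never deal with queues or retries itself.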

I'm having this problem with my own cockpit source plugin as well.

I see so it would really be more like extracting these portions of code to a separate package...

https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L125-L244

This seems to be the code that deals specifically with downloading and caching please correct me if I'm wrong. Happy to work on this! Just trying to figure out how it works in the greater ecosystem.

Would a PR to only fix gatsby-source-wordpress be accepted, then extract the fix afterwards? Having trouble using @njmyers forked plugin as-is, and it seems like it's a huge improvement.

@dustinhorton not sure if this helps any but I found that if you want to use a local plugin it's best to point gatsby directly to package.json file. I was having trouble getting gatsby to find my local plugin until I started specifying it explicitly.

https://github.com/njmyers/byalejandradesign.com/blob/d56b1938f6d1bc22c3cf2282bb3198e378fe3561/packages/web/gatsby-config.js#L91-L94

I'm still happy to work on this issue and even the new plugin as discussed. Just looking for a little guidance on how to integrate this as it seems like a disruptive change that could impact many other things that I am not aware of. @KyleAMathews any thoughts? I still feel as though the code here

https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L125-L244

is the part that should be extracted out into its own package. That being said, it is one of the core functions of createRemoteFileNode and I want to make sure I go about it correctly so it can be integrated back into the ecosystem properly.

@njmyers You are mostly correct with your code selection - we would also want our current queue (which ATM limits to 200 concurrent requests, which seems not great for local dev and apparently for wordpress) moved and probably changed.

@dustinhorton I think it's reasonable to use this in wordpress plugin first (mostly because it's practically done).

@pieh Great thanks for your clarification! I'll start working on a new plugin.

Regarding a temporary wordpress-source fix my only other question would be what to do here

https://github.com/njmyers/byalejandradesign.com/blob/d56b1938f6d1bc22c3cf2282bb3198e378fe3561/packages/gatsby-source-wordpress/src/download-media-files.js#L169-L173

At the moment it would still be possible to have network errors and there needs to be a catch clause for the whole downloadMediaFiles function. What is the normal behavior for passing errors to gatsby? I would be happy to add that code into the wordpress plugin to properly pass the network errors up to the correct handler. Maybe we could display an error message and a reference to this issue? Thanks for your assistance!

@njmyers Thanks—yeah I was replicating your setup as closely as possible, aside it being a monorepo (including referencing package.json). Running develop just gave errors as if there were no gatsby-source-wordpress. I'll give it another go here shortly.

More faithfully recreated your monorepo, and oddly it's just sitting at source and transform nodes, like it was with the non-forked gatsby-source-wordpress before downgrading the got dependency.

@pieh Able to answer his inquiry @ https://github.com/gatsbyjs/gatsby/issues/6654#issuecomment-442536931 ?

@dustinhorton Yes it should be sitting there for quite some time too if you have a lot of images. My fork will throw unhandled promise rejection if a remote file fails to download. That is why I would like to be able to have some mechanism to properly handle this scenario.

I think I read on another thread as well that there was talk of integrating some sort of progress manager as well since this would provide feedback about plugin status.

If you look in your OS file system under project-root/.cache/gatsby-source-filesystem you should be able to see all the images that are being downloaded. In my case it is almost 400 images now so it does take quite some time. However before using my forked version the plugin would silently fail on an error and then never progress causing the issue where source and transform would take for hours...

Do you have a repo? I would love to be able to try it on another site as so far I have only tested it in a real life situation on my site.

@njmyers That'd rule—if you don't mind, shoot me an email: [email protected], or just look out for an invite. I'll get something prepped this evening.

Updating got solved all issues for me, too.

The problem with got@9 is that it requires Node 8 (https://github.com/sindresorhus/got/releases/tag/v9.0.0), so we can't upgrade ATM :(

We should be able to upgrade at least to got@8, but I'm not sure if this will fix the issues

got@8 seems to implement RFC 7234 compliant HTTP caching, so gatsby-source-filesystem could supply its own file system cache adapter. That should at least reduce the time spent in source and transform nodes the second time around, given that the resource is cacheable.

Hiya!

This issue has gone quiet. Spooky quiet. 👻

We get a lot of issues, so we currently close issues after 30 days of inactivity. It’s been at least 20 days since the last update here.

If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!

Thanks for being a part of the Gatsby community! 💪💜

@gatsbot Still an issue.

Was asked to contribute a blog post for y'all. Can't do it as it is stuck on source and transform nodes. Saw the other issue, but I am not seeing where there is a fix for this. It is a fork of gatsbyjs, latest upstream. I only got this to run once. It is always stuck transforming nodes.

It's failing to grab screenshots from a few sites while building. I'll add the offending sites in the morning.

@twhite96 I just ran into the issue and what worked for me was removing temporary files that I still had open (from emacs), not sure if that will help you or not, but it allowed my build to move forward.

So it's looking like this is still a problem…

Facing the same problem when using gatsby-source-s3 to pull 100 photos and transform them through sharp. Anyone figured out a fix?

Somehow my problem was fixed (randomly?). These are the steps I took: I created a new S3 bucket with fewer pictures (for testing), tried building, and it built successfully very quickly. Then I decided to go back and try to pull from the original bucket, and all of a sudden it built successfully in 49s when originally it would go on for hours. I don't know how merely switching bucket links fixed the stall, but I hope this helps someone figure it out. Maybe it has to do with the cache?

Hi all. I updated the local plugin version that I was using for a site that had this issue. I think it's a better implementation as it uses 'better-queue' before 'createRemoteFileNode' and passes in the 'concurrentRequests' parameter. It's a little bit redundant as 'createRemoteFileNode' already uses a queue, but regardless this version seems to be working well with the recent gatsby upgrades and gives feedback on the progress of the files. I will try to get a PR together for this. Sorry for the delays, I know I said I would work on this earlier but I have been quite busy!

https://github.com/njmyers/byalejandradesign.com/blob/wordpress-plugin/packages/gatsby-source-wordpress/src/download-media-files.js

@njmyers

Thanks so much. Your version solved some problems that I was having. I combined that with a line or two to filter out downloading 25 GB of mp3s, and I am now set!

Definitely still an issue.
I've been trying to compile my project for the last 24 hours. From approximately 12 tries, 3 succeeded with outputs and actual WP connection. Is there any fix to this?
BTW, I've tried to use @njmyers version of the plugin (awesome job, actually!), but results were mixed. Sometimes it would complain about wordpress_parent or Date and eventually crash, but couldn't figure out what's actually going on with these errors. In other builds, different errors (but they do compile, which is interesting), which actually causes issues on GraphQL.

@lucassilvagc can you post some outputs? I’m glad people are trying and testing the branch. Let’s get it working better so we can open the PR!

@njmyers Sure!

A quick overview of what's going on:

My website currently has ~1940 image files, maybe WordPress's fault for creating multiple image files multiple times. If I use the vanilla _gatsby-source-wordpress_, the issue appears as expected. (There's a "vanilla" build I made yesterday evening in another build environment which shows the same issue we're discussing in this thread. That build works and compiles when all the image files are returned.) By using your plugin (replacing all the files inside node_modules/gatsby-source-wordpress, correct me if I'm wrong on this), _gatsby develop_ returns the following:

TypeError: Cannot read property 'wordpress_parent' of undefined

  - normalize.js:287 entities.map.e
    [amazingtec]/[gatsby-source-wordpress]/normalize.js:287:11

  - Array.map

  - normalize.js:286 Object.exports.mapElementsToParent.entities [as mapElementsToParent]
    [amazingtec]/[gatsby-source-wordpress]/normalize.js:286:12

  - gatsby-node.js:134 Object.exports.sourceNodes
    [amazingtec]/[gatsby-source-wordpress]/gatsby-node.js:134:24


warning The gatsby-source-wordpress plugin has generated no Gatsby nodes. Do you need it?
success source and transform nodes — 299.757 s
success building schema — 10.192 s

After a quick while, it outputs:

'Cannot query field "allWordpressPage" on type "Query". Did you mean "allSitePage"?',
    locations: [ [Object] ] } ]
error UNHANDLED REJECTION

  TypeError: Cannot read property 'allWordpressPage' of undefined

  - gatsby-node.js:54 graphql.then.result
    C:/Projects/amztec-gtby/amazingtec/gatsby-node.js:54:36

PS: this was a vanilla build of gatsby-source-wordpress that was _"converted"_ to yours by replacing the files, as I said above. I think the fact that it can't query all the pages is related to no nodes being generated. I also want to note that this build is the same as my vanilla one, which works when this issue doesn't appear.

I also want to note that adding routes appears to cause the same initial problem for me (I wanted to avoid pages that aren't related or that will return errors due to multiple protection layers over WordPress). I just don't know if the routes I've listed are correct, or if I'm missing something.

I'm very happy with your reply, this issue is currently being a huge setback to my project and I'm glad that you're still up on this issue. Thanks a lot!

Having the same issue with 400+ custom posts with ACF fields and 4000 images.

I updated got and was able to build in 35 minutes.

Unable to build again after I updated got

As expected, since this bug still exists in gatsby-wordpress. 35 minutes to download and process all the images keeps being a very long time considering all the factors (avg internet speed, processing power, total of files and so on).
You can try adapting @njmyers version to your specific use, it'll work like a charm on downloading every image file you have.

My site was working fine when i had a small number of images but when i started adding more this also happens.

@MWalid how can I update got? Thanks.

been trying to build all day with no success. have around 1450 images.

We haven't been able to deploy for 2 days now. Can someone help point me in the right direction as to where this is occurring in the code so I can try and find a solution?

Have you upgraded your got nested dependency of the gatsby-source-filesystem to use at least version 9.4.0?

If not, you should add:

  "resolutions": {
    "gatsby-source-filesystem/got": "9.4.0"
  }

in your Gatsby project's package.json. Then remove node_modules and your yarn.lock file and install again.

Note: This resolutions feature only works for yarn. npm has not implemented this yet.

@anagstef thanks very much for the tip! I'll try this and report back.

When running gatsby develop, is there a way to keep local cache instead of fetching remote data each time the command is launched ?

@anagstef looks to be working much better! Thanks for the tip!

The output is very verbose when building with this version of got. Do you know if there's any way to remove this?

@nratter I'm glad it worked for you!

Yes, I know that, it is very verbose and it cannot be turned off. Ruins all the useful console output.

After some investigation I have done, I think it is caused here:
https://github.com/gatsbyjs/gatsby/blob/80c7023a8bc23886939205fe52e305277294e6af/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L155

As you can see it calls a console.log with the progress of the download of each file every time the downloadProgress event emits which happens too many times per second. This was not a problem before, because the old got version does not implement the downloadProgress event.
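One possible way to keep progress feedback without flooding the console would be to throttle the logging. This is only a sketch and not what gatsby-source-filesystem does: `createThrottledLogger`, `intervalMs`, and the injectable `now` clock are hypothetical names, and it assumes the progress object carries a `percent` value between 0 and 1.

```javascript
// Hypothetical sketch: report download progress at most once per interval
// instead of on every downloadProgress event.
function createThrottledLogger(log, intervalMs = 1000, now = Date.now) {
  let lastLogged = -Infinity

  return function onProgress(progress) {
    const time = now()
    // Always report completion; otherwise skip events that arrive too soon.
    if (progress.percent < 1 && time - lastLogged < intervalMs) return
    lastLogged = time
    log(`Downloading: ${Math.round(progress.percent * 100)}%`)
  }
}

// Assumed usage with a got stream (not executed here):
//   responseStream.on('downloadProgress', createThrottledLogger(console.log))
```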

Maybe we can fix it with a PR? Looks like debugging leftover code.

I had the same issue, stuck on "source and transform nodes". After a lot of console.logs my problem ended up being time out issues with retrieving media files from wordpress. The problem wasn't the server not being able to handle it, but rather cloudflare rate limiting and throwing timeouts after about 350 requests.

I bypassed cloudflare, went straight to the vps and I'm no longer seeing "source and transform nodes", and my build finishes.

My workaround was to have a local wordpress for testing, the live site is in netlify, while deploying it did not cause any issue.

Guys, I managed to fix this by running createRemoteFileNode requests in serial instead of parallel.

Here's the function I'm using:

/**
 * Map over items array using the fn function but wait for each step to finish before moving to the next one
 */
exports.serialMap = async (items, fn) => {
  const results = []
  for (const item of items) {
    const result = await fn(item)
    results.push(result)
  }
  return results
}

and here's how I'm using it:

const imageNodes = await serialMap(node.___originalImages, imgUrl => {
  return createRemoteFileNode({
    url: imgUrl,
    parentNodeId: node.id,
    store,
    cache,
    createNode,
    createNodeId,
  })
})

After the images are downloaded, here's how my source and transform step looks

Downloading remote files [==============================] 1063/1063 156.1 secs 100%
Downloading remote files [==============================] 1064/1064 157.2 secs 100%
Downloading remote files [==============================] 1065/1065 158.4 secs 100%
Downloading remote files [==============================] 1066/1066 159.5 secs 100%
Downloading remote files [==============================] 1067/1067 160.5 secs 100%
Downloading remote files [==============================] 1068/1068 161.5 secs 100%
Downloading remote files [==============================] 1069/1069 162.6 secs 100%
Downloading remote files [==============================] 1070/1070 163.7 secs 100%
Downloading remote files [==============================] 1071/1071 164.9 secs 100%
Downloading remote files [==============================] 1072/1072 166.0 secs 100%
Downloading remote files [==============================] 1073/1073 167.5 secs 100%
Downloading remote files [==============================] 1074/1074 169.2 secs 100%
Downloading remote files [==============================] 1075/1075 171.0 secs 100%
success source and transform nodes — 175.271 s

Hope it solves your problems too.
Cheers

@ancashoria where should I put this code?

@ancashoria yes, I'm also unclear on where to place this code.

This is somewhat unrelated to the gatsby-source-wordpress plugin. I have the code above in my gatsby-node.js. The idea is that firing all those requests in parallel caused them to fail, so I wrote that helper function to fire them one after another.

I'm guessing there's a similar issue in gatsby-source-wordpress too, but I'm not that familiar with it.
Sorry I can't be of more assistance.

It seems to be related to massive images and slow internet connections. Netlify was able to build the site but my local connection was not as it is only 1MB/s download which caused it to timeout after 30s and fail on the large image.

I have 1gb fiber and no 'massive' images.

I am not transforming blog images locally after downloading them from WordPress; I simply use their URLs. It would be nice if there was a setting that disables the downloading of these images in this case.

Guys, I managed to fix this by running createRemoteFileNode requests in serial instead of parallel.

Yeah the issue is really based on the fact that createRemoteFileNode uses concurrency of 200 which is too much for most WP servers. I have my images on CloudFront and was hitting some rate limits there.

I tried fixing the issue with a branched version of the source-plugin for a while but the issue really isn't in gatsby-source-wordpress it is in gatsby-source-filesystem. Ideally consumers of the createRemoteFileNode function would be able to pass in concurrency there. Then plugins could make the concurrency option available in their configs. I still would like to do a PR to address this issue!
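Until something like that exists, a middle ground between fully serial and 200 at once can be done in user land. The sketch below is a generic helper and not part of any Gatsby package; `mapWithConcurrency` and its `limit` parameter are made-up names, and `fn` would be whatever function ends up calling createRemoteFileNode.

```javascript
// Hypothetical sketch: map over items with at most `limit` calls in flight.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length)
  let nextIndex = 0

  // Each worker repeatedly claims the next unprocessed index.
  async function worker() {
    while (nextIndex < items.length) {
      const index = nextIndex++
      results[index] = await fn(items[index], index)
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length))
  await Promise.all(Array.from({ length: workerCount }, worker))
  return results // same order as `items`
}
```

A limit of 1 behaves like the serialMap helper posted earlier in the thread; raising it trades server load for speed.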

The solution I have been using is just a simple script to modify the code inside node_modules. Really quite fragile and not ideal but it is a simple hack to modify the concurrency directly. Uses shelljs so it is supposed to work for windows users as well (haven't tried).

#!/usr/bin/env node
const path = require('path');
const shell = require('shelljs');

const FILE_PATH = path.resolve(
  __dirname,
  // add path to your root dir here,
  'node_modules',
  'gatsby-source-filesystem/create-remote-file-node.js'
);

shell.sed('-i', 'concurrent: 200', 'concurrent: 20', FILE_PATH);

I had the same issue, stuck on "source and transform nodes". After a lot of console.logs my problem ended up being time out issues with retrieving media files from wordpress. The problem wasn't the server not being able to handle it, but rather cloudflare rate limiting and throwing timeouts after about 350 requests.

I bypassed cloudflare, went straight to the vps and I'm no longer seeing "source and transform nodes", and my build finishes.

This was exactly my issue. Netlify was building very fast, in less than 2 mins, with only about 30 posts and around 500 source images. Locally it wasn't ever completing. Simply unticking the Cloudflare proxy status to be DNS only solved the issue immediately.

I also found this to be the case. I previously had one image that was causing the build to fail and dismissed Cloudflare being the issue. The issue since came back recently and as @amcc suggested not routing the A record through Cloudflare solved the issue immediately locally as well.

Just wanted to echo that this isn't only a WP source issue. I was hitting the same problem with gatsby-source-prismic; reducing the concurrency of source-filesystem with @njmyers' hack fixed it for me, so I'm guessing it was a rate limiting/overload issue.

Agree that if nothing else the concurrency of source-filesystem should be configurable.

@njmyers I am sorry to ask this, but how exactly is this fix executed. Simply run the script before the build or do I need to somehow reference the script in the build process, because I am currently asking myself how to apply this fix locally and also on for example netlify.

@alexanderwe No worries it’s a silly hack anyways. You can run it after you install node_modules. I am not 100% sure but I believe postinstall in your project package.json file would work.

For me Gatsby hangs 50% of the time on "source and transform nodes" when I use JSON with more than 500 images included. I'm using gatsby-source-custom-api.

The images are hosted on a fast, stable server.
My internet connection is also fast and stable.

"gatsby": "^2.9.4",
"gatsby-image": "^2.1.4",
"gatsby-plugin-emotion": "^4.0.7",
"gatsby-plugin-sharp": "^2.1.5",
"gatsby-plugin-typography": "^2.2.13",
"gatsby-source-custom-api": "^2.0.4",
"gatsby-transformer-remark": "^2.4.0",
"gatsby-transformer-sharp": "^2.1.21",

What can I do to debug that?

Does this problem occur only with gatsby-source-custom-api or source-wordpress? If I provide a GraphQL API instead of gatsby-source-custom-api, is there a chance this problem disappears?

It happens to me too. I've tried every fix that has been suggested and nothing seems to work. Definitely will not be using Wordpress as a backend for Gatsby again, though I hear people are having similar issues with other services as well.

@alexanderwe The proper way to fix this is to implement the patch suggested by @njmyers
Then introduce another PR to gatsby-source-wordpress and others in order to actually make this configurable from their reference in gatsby-config.js

@sebastienfi I just stumbled over this https://github.com/gatsbyjs/gatsby/issues/14819#event-2418874313 and the corresponding commit https://github.com/gatsbyjs/gatsby/commit/90aa24787b32ef9613b6becbfadab6029ec39ce9#diff-1864dd21828754bdbc63f22b895bee8e which adds an environment variable to configure the concurrency rate, which solved the issue for me. There is also an ongoing discussion about environment variables vs configuration parameters: https://github.com/gatsbyjs/gatsby/issues/14636

Have you tried setting GATSBY_CONCURRENT_DOWNLOAD to a lower number? By default it's set to 200.

Linux/mac:
GATSBY_CONCURRENT_DOWNLOAD=5 gatsby build

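Windows (cmd; setx only affects new command windows, so use set for the current one):
set "GATSBY_CONCURRENT_DOWNLOAD=5" && gatsby build

Windows (PowerShell):
$env:GATSBY_CONCURRENT_DOWNLOAD=5; gatsby build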
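If you want to check what value your build process would actually see, a small helper like the one here can be dropped into a script. It is a hypothetical sketch and not the code gatsby-source-filesystem uses internally; only the variable name and the default of 200 come from this thread.

```javascript
// Hypothetical helper: read a concurrency limit from the environment,
// falling back to a default when the variable is unset or not a positive integer.
function resolveConcurrency(env = process.env, fallback = 200) {
  const parsed = Number.parseInt(env.GATSBY_CONCURRENT_DOWNLOAD, 10)
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback
}
```

Logging `resolveConcurrency()` at the top of gatsby-node.js is a quick way to confirm the variable really reached the Node process.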

@wardpeet
i tried, nothing has changed

It definitely has something to do with source-file-system, as the logs show that the images have been successfully retrieved.... the issue is still huge... we are running late on our deadline, and we really look for a solution to this...

after setting the wordpress source plugin to debug I see this
image
always hangs up between 470-480... not usually in the same place though.

does anyone know where in the code this is executing?

Ended up getting it to work by toggling a vpn halfway through

Is anyone willing to share their repo with me and credentials so I can give this one a spin and try to find the problem?

Feel free to send me a private mail at [email protected]

my repo isn't easily recreated at this point—i have a backup of the db as it was somewhere, but in order to get the site building i had to reduce every months worth of posts into a single post, for years of content.

@wardpeet emailed you my repo Ward ([email protected]). let me know how it goes.

Our company changed the wifi and increased the bandwidth. Today I had no issue downloading the images... But I still don't understand: is it the network or the concurrency?

however all the builds on Netlify fail...

5:13:43 PM: === [ Fetching wordpress__wp_media ] === https://wildkiwi.com/wp-json/wp/v2/media
5:13:43 PM: Total entities : 1717
5:13:43 PM: Pages to be requested : 344
5:13:45 PM: The request failed with error code "undefined"

error code is undefined, so I don't really understand what's happening...

When I change the concurrent requests to 5 it works on Netlify

I was having this problem with a different plugin (https://github.com/angeloashmore/gatsby-source-prismic) and setting GATSBY_CONCURRENT_DOWNLOAD=50 did the trick.

This just happened out of the blue (one day my site would build, the next it wouldn't, with no changes), and without any kind of error message, it's a little disconcerting to be deploying websites for clients without being confident that this won't happen again.

We basically download 200 images at once but this might be problematic for some computers/internet connections. A good solution is to bake in some retry mechanisms.

I was having these problems but I managed to get the build working fine with a combination of setx GATSBY_CONCURRENT_DOWNLOAD 5; gatsby build and Smushing all the images (some of which were excessively large in dimension and filesize) using the free version of https://en-gb.wordpress.org/plugins/wp-smushit/.

Hello! I'm experiencing the same issue with a source plugin I'm creating (unrelated to WordPress) when downloading 1000+ images from an API. It almost always hangs at the end of the process.

Setting GATSBY_CONCURRENT_DOWNLOAD didn't solve it. I tried 50, 20, 5, no luck.

I get a collection of sizes from the API, and I was using the largest image, but changed it to the smallest one and doesn't fix it either.

It's hard to identify why it fails at this point, the only thing I get is source and transform nodes and then silence forever.

It would be awesome to have a debugging mechanism for this.
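In the meantime, one way to find out which file is hanging from inside your own source plugin is to wrap each download in a timeout that names the URL. This is a generic sketch; `withTimeout` is a hypothetical helper and not a Gatsby API.

```javascript
// Hypothetical sketch: reject with a descriptive error if a promise
// (for example a single createRemoteFileNode call) takes longer than `ms`.
function withTimeout(promise, ms, label) {
  let timer
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms}ms: ${label}`)),
      ms
    )
  })
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}
```

Passing the image URL as `label` turns the endless "source and transform nodes" spinner into an error that points at a specific file.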

I was experiencing the same issue in a Gatsby + WordPress integration. The build would hang forever in the onCreateNode API, where I was using createRemoteFileNode.

Solution: I updated the gatsby-source-filesystem from 2.0.4 to 2.1.8 and added GATSBY_CONCURRENT_DOWNLOAD=50 to my environment variables.

Hello 👋

I have a similar issue on my project.

Environment

  System:
    OS: macOS 10.14.6
    CPU: (4) x64 Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz
    Shell: 5.3 - /bin/zsh
  Binaries:
    Node: 10.16.0 - ~/.nvm/versions/node/v10.16.0/bin/node
    Yarn: 1.17.3 - ~/.yarn/bin/yarn
    npm: 6.9.0 - ~/.nvm/versions/node/v10.16.0/bin/npm
  Languages:
    Python: 2.7.15 - /usr/local/bin/python
  Browsers:
    Chrome: 76.0.3809.100
    Firefox: 68.0.1
    Safari: 12.1.2
  npmPackages:
    gatsby: ^2.13.42 => 2.13.42
    gatsby-cli: ^2.7.34 => 2.7.34
    gatsby-image: ^2.2.14 => 2.2.14
    gatsby-plugin-glamor: ^2.1.3 => 2.1.3
    gatsby-plugin-manifest: ^2.2.4 => 2.2.4
    gatsby-plugin-offline: ^2.2.4 => 2.2.4
    gatsby-plugin-react-helmet: ^3.1.5 => 3.1.5
    gatsby-plugin-sass: ^2.1.10 => 2.1.10
    gatsby-plugin-sharp: ^2.2.9 => 2.2.9
    gatsby-plugin-svg-sprite: ^2.0.1 => 2.0.1
    gatsby-source-filesystem: ^2.1.18 => 2.1.18
    gatsby-source-wordpress: ^3.1.12 => 3.1.12
    gatsby-transformer-sharp: ^2.2.5 => 2.2.5

I have more than 80,000 media items on my WP site. When I run npx gatsby develop I'm stuck after "END PLUGIN".

...
=== [ Fetching wordpress__TAG ] === https://[WP_REST_API]/wp-json/wp/v2/tags

Total entities : 8805
Pages to be requested : 89
 -> wordpress__TAG fetched : 8805
Fetching the wordpress__TAG took: 12408.827ms
⠀
=== [ Fetching wordpress__wp_partners ] === https://[WP_REST_API]/wp-json/wp/v2/partners
 -> wordpress__wp_partners fetched : 22
Fetching the wordpress__wp_partners took: 1268.292ms
⠀
=END PLUGIN=====================================: 377120.512ms
⠼ source and transform nodes

I tried changing the GATSBY_CONCURRENT_DOWNLOAD value but nothing changed.
Is there a way to limit the number of media items imported? For example:

{
  resolve: `gatsby-source-filesystem`,
  options: {
    name: `images`,
    path: `${__dirname}/src/images/uploads`,
    limit: 50,
  },
},

Same problem here. My self-hosted WP has 1690 media items, and I'm always stuck at the end of the Downloading remote files step; sometimes only one media item is missing...

Edit: this time build was successful with GATSBY_CONCURRENT_DOWNLOAD=5 yarn build...

@kvalium Thank you for your comment, GATSBY_CONCURRENT_DOWNLOAD=5 yarn build worked for me

I had the same problem and I managed to solve it by resizing the terminal window.

Please refer to last comments on #4666.

I also had the same issue. I resolved it with :

rm -r node_modules/ 
rm -r .cache
sudo chown -R login:login . 
fuser -k 8000/tcp 
yarn 
gatsby build
gatsby develop

Hope it can help

Looks like this is a quirky issue. Here is my experience with it:

  • ❌ I saw this issue on macOS High Sierra (using iTerm)
  • ✅ I started using GATSBY_CONCURRENT_DOWNLOAD=50 gatsby develop and the issue went away (this was the case for a couple weeks)
  • ❌ I upgraded to Mojave and upgraded my global Gatsby installation to 2.7.47 and then started seeing the issue again (using iTerm)
  • ❌ Tried changing GATSBY_CONCURRENT_DOWNLOAD to 5
  • ❌ Tried blowing away .cache and node_modules
  • ❌ Tried resizing the iTerm window while running gatsby develop (both with 50 and 5)
  • ❌ Ran GATSBY_CONCURRENT_DOWNLOAD=50 gatsby develop in "Terminal" app, not in iTerm
  • ✅ Two weeks later tried using GATSBY_CONCURRENT_DOWNLOAD=50 gatsby develop in iTerm and resized the window a couple times during the process and it worked.

Prematurely thought I had it running with that last one but then it hung. Hopefully this helps others. Still seems like this isn't quite nailed down but we're getting there slowly but surely.

Update: Today this worked for me. Not sure if it's because I resized the iTerm window at the right point in the process or because I watched it go from 93% all the way to 100% but something was different this time.

In addition to using GATSBY_CONCURRENT_DOWNLOAD=5, add the following code to your gatsby-node.js file:

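const ChildProcess = require("child_process")
const fs = require("fs-extra") // assumed: copySync comes from fs-extra, not core fs
const path = require("path")

// Internationalization
exports.onPostBuild = () => {
  // Kill any leftover jest worker processes
  ChildProcess.execSync("ps aux | grep jest | grep -v grep | awk '{print $2}' | xargs kill")
  console.log("Copying locales")
  fs.copySync(path.join(__dirname, "/src/locales"), path.join(__dirname, "/public/locales"))
}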

@bradydowling:

Not sure if it's because I resized the iTerm window at the right point

While experiencing the same issue, I resized my iTerm window and bam – it suddenly continued, as well. I don't know if this is a wild coincidence, or...

@bradydowling @davegregg Woah that is a bizarre one. Resizing my iTerm window did the trick.

@TylerBarnes Whatever this is, I'd suggest it isn't WordPress-specific. I'm using nothing related to WordPress whatsoever.

@beauhankins How about you?

@davegregg @beauhankins @bradydowling are any of you able to share a repo where this is happening? That seems really bizarre that resizing your terminal window fixes the problem.

@TylerBarnes ya here's a repo where I was seeing it. I haven't touched it in a little bit.


Side note: How do you handle a situation where you clone a Gatsby site with an older version of Gatsby than what is currently installed by the CLI?

I was running the commands within the VS Code terminal (I use bash). It was taking forever, and as suggested above I exited full screen mode and it worked.

@bradydowling thanks for sharing your repo! For using older versions of Gatsby than cli, you can make an npm script for develop and build.

{
  "scripts": {
    "develop": "gatsby develop",
    "build": "gatsby build"
  }
} 

then running npm run develop or yarn develop will use the local version in your project.

We're investigating this issue but in the meantime, anyone with the problem may be able to get around it by running CI=1 yarn build, as that should use a different reporter library behind the scenes. If you try that and it works please let us know!

@dustinhorton :

v2 version @ https://github.com/dustinhorton/gatsby-v2-issue. Been building for about 50 minutes at this point.

Fwiw. I realize that was posted about a year ago, and Gatsby has changed considerably since then. When running it on my machine (and setting the gatsby version to * in package.json) the build seems to complete in about 2000 seconds (~33 minutes).
Additionally, when upgrading the cli, there's now a progress bar, which makes a huge difference in terms of how long it "feels", since you get a more concrete feedback loop.

The sourcing step takes almost all of this time (1968 / 1975 seconds). The downloading of remote files is the most of that (1845 seconds).

This doesn't surprise me when I look at a single round trip to this server:

# Starting requestInQueue, _concurrentRequests= 10
@ requestInQueue for 75 tasks { concurrent: 10 } { id: 'url' }
@ Fetch http://dustinhorton.com/gatsby-wp/wp-json/wp/v2/media?per_page=100&page=4: 2587.339ms
@ Fetch http://dustinhorton.com/gatsby-wp/wp-json/wp/v2/media?per_page=100&page=10: 2661.584ms
@ Fetch http://dustinhorton.com/gatsby-wp/wp-json/wp/v2/media?per_page=100&page=8: 2695.937ms
@ Fetch http://dustinhorton.com/gatsby-wp/wp-json/wp/v2/media?per_page=100&page=2: 2738.339ms
@ Fetch http://dustinhorton.com/gatsby-wp/wp-json/wp/v2/media?per_page=100&page=6: 2853.199ms

Each request takes roughly 2 to 4 seconds. The 75 pages that are fetched initially while exploring take 18 seconds in total (!). I have a fast connection and I can repro that timing with a plain wget.

So the longest step will try to download about 7500 resources. Considering a single request takes 2 to 4 seconds, I'm not surprised it takes that long.

Even so, I do notice some pauses during the main download stretch of 1845 seconds. I'm not sure whether this is just the server throttling the data or not (I did set concurrency to 5).

I did try to wiggle the width of the terminal (I'm on xfce linux, fwiw) and while that occasionally coincided with progress moving forward, I'm right now convinced that's more of a coincidence than causality.

Bottom line: while I can repro the slow download and seemingly "stuck" progress, all signs currently point to that being pretty much caused by waiting on the server response. Additionally, the width of the terminal does not seem to affect this.

That said: there _is_ a possibility that the terminal output gets stuck somehow while updating the progress bar at a very particular width. While this is unlikely, it's not impossible. Hence we really need a repro that we can run ourselves (so no auth). And preferably one that does _not_ depend on a remote server, as I don't want to be hammering the server.

I'm going to update labels on this issue accordingly.

The repro posted in https://github.com/gatsbyjs/gatsby/issues/6654#issuecomment-438667221 by @njmyers does not exist anymore

The repo posted in https://github.com/gatsbyjs/gatsby/issues/6654#issuecomment-562607399 by @bradydowling requires a bunch of permissions I don't have, and seems to have similar problems with round trip time

@ Fetch http://topazandsapphire.com/wp-json/wp/v2/media?per_page=100&page=7: 25025.257ms
@ Fetch http://topazandsapphire.com/wp-json/wp/v2/media?per_page=100&page=4: 27791.269ms
@ Fetch http://topazandsapphire.com/wp-json/wp/v2/media?per_page=100&page=2: 37817.874ms
@ Fetch http://topazandsapphire.com/wp-json/wp/v2/media?per_page=100&page=5: 38056.989ms
@ Fetch http://topazandsapphire.com/wp-json/wp/v2/media?per_page=100&page=3: 38446.504ms
@ Fetch http://topazandsapphire.com/wp-json/wp/v2/media?per_page=100&page=6: 43799.842ms

This sourcing step is not really showing any progress indicator except for the spinner, with steps occasionally being logged, and it still takes a few minutes, so perhaps we can at least show some kind of progress indicator if that makes sense.

Additionally, perhaps it could help to point out the average time to fetch a resource, as that's an indication of why "Gatsby" is slow, when it's really caused by the round trip.

In this repo, even downloading 589 remote files took about 5 minutes, with the progress bar often just being stuck for no apparent reason.

After the bootstrap the build fails for me because files are missing.

@pvdz I'll have to play with this again (I gave up on it for a while) but there are certain files that throw permissions issues even when it builds successfully so I just figured those can be ignored.

But to summarize your post, are you saying that certain (download) steps just take a really long time and we should wait longer for them to complete?

@bradydowling Well, looks like it, yes. :)

FTR: I've tracked the resource gathering a bit. To shed some light on timings;

Fetch time for http://topazandsapphire.com/wp-content/uploads/2016/01/IMG_6084.jpg: 15605.630ms
Started actually fetching http://topazandsapphire.com/wp-content/uploads/2016/01/IMG_6036.jpg
Fetch time for http://topazandsapphire.com/wp-content/uploads/2016/01/IMG_6051.jpg: 6447.272ms
Started actually fetching http://topazandsapphire.com/wp-content/uploads/2016/01/IMG_6034.jpg
Fetch time for http://topazandsapphire.com/wp-content/uploads/2016/01/IMG_6045.jpg: 6944.355ms
Started actually fetching http://topazandsapphire.com/wp-content/uploads/2016/01/IMG_6029.jpg
Fetch time for http://topazandsapphire.com/wp-content/uploads/2016/01/IMG_6036.jpg: 6401.541ms
Started actually fetching http://topazandsapphire.com/wp-content/uploads/2016/01/IMG_6027.jpg

These are 6 MB files btw. I'm on a 250 Mbps connection, which can easily handle those faster than 1 MB/s, but it does not surprise me that it blows up download times. No amount of cli resizing is going to speed that up ;)

Interesting. This is just a standard WordPress personal blog hosted on EC2 so it's not like it's a gigantic install. Perhaps this is because all these requests are overloading the host. Or, I'm no WordPress expert, but perhaps there's some sort of standard WP rate limit on REST API calls that can happen? I'm also going with the assumption that this behavior isn't unique to this site.

Perhaps this is because all these requests are overloading the host.

This is my guess (or something in this ballpark). But I'm exploring a bit of our own architecture to check whether we are losing efficiency through abstractions. But considering I can mimic most of the times reported with plain wgets/curls, I doubt there's much there.

So fwiw I replaced the got.stream() bits with a dumb raw downloader:

    let r = ""
    require("http").get(url, res =>
      res
        .on("data", m => (r += m))
        .on("end", () => {
          console.timeEnd("$$ Fetch time for " + url)
          resolve(r)
        })
    )
$ Started actually fetching http://topazandsapphire.com/wp-content/uploads/2016/05/IMG_5260.jpg
$$ Fetch time for http://topazandsapphire.com/wp-content/uploads/2016/09/TRAVEL-LEISURE-2-copy.png: 1003.535ms
$ Started actually fetching http://topazandsapphire.com/wp-content/uploads/2016/05/International-Travel-Topaz-Sapphire.png
$$ Fetch time for http://topazandsapphire.com/wp-content/uploads/2016/09/IMG_4606.jpg: 3174.126ms
$ Started actually fetching http://topazandsapphire.com/wp-content/uploads/2016/05/Brunch-Topaz-Sapphire-2.png
$$ Fetch time for http://topazandsapphire.com/wp-content/uploads/2016/09/IMG_4647.jpg: 9521.157ms
$ Started actually fetching http://topazandsapphire.com/wp-content/uploads/2016/05/IMG_6978.jpg
$$ Fetch time for http://topazandsapphire.com/wp-content/uploads/2016/05/International-Travel-Topaz-Sapphire.png: 3611.910ms

So yes, I'm pretty sure the long delays (in this case at least) are caused by download. So perhaps our best bet is to improve the feedback while waiting for a download :)

Lots and lots of people say terminal windows resizing (for whatever weird reason) resolves the develop process stuck on 'source and transform nodes'.

Sadly, when using WSL this is not a solution. Stuck with 'source and transform nodes' locally in build as well as in develop. Netlify builds do work but local development has become impossible.

@Vacilando can you debug some links that are being downloaded for your site during sourcing and test manually whether they download fast? Like I mentioned above, one big problem I'm seeing is that certain wp hosts are simply super duper slow.

So if the host is slow and there's a lot of content to download, then yeah this step will take a lot of time because that's all it should be doing in this step; discover content and download it :)

If you've confirmed the content itself is downloaded in a fraction of the whole step, please circle back here. In that case a repro would be tremendously helpful :)

Perhaps in an ideal world you could pass a flag to gatsby which would cache the site asset download so this doesn't have to be done repeatedly.

Another optimal solution would be to allow a flag that sets some sort of rate limiting or throttling on the downloading of assets so it doesn't bust the host.

Any thoughts on those two ideas?

@bradydowling part of that already exists actually. You can set an env variable GATSBY_CONCURRENT_DOWNLOAD to configure the limit for concurrent requests. The next major version of gatsby-source-wordpress https://github.com/gatsbyjs/gatsby/issues/19292 will have more control over how media files are downloaded. As for the caching, downloaded files are currently cached, but when you change a gatsby-*.js file it currently wipes the cache out to prevent a stale cache from causing unexpected bugs. So that's a core issue rather than being gatsby-source-wordpress specific, but work is always being done to improve Gatsby's cache.

The Jobs API (#19831) should partially fix this caching problem.

Ya I saw the bit about GATSBY_CONCURRENT_DOWNLOAD closer to the top. From my experience, that didn't help so I guess my suggestion was toward more fine-grained control like in mb per s/m/h or something like that. Maybe I'm just saying nonsense.

@bradydowling I'm looking at adding request retries with exponential backoff as well as adding an optional setting for max requests per second for cases where that doesn't work well enough.
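To make the "max requests per second" idea concrete, here is a minimal sketch of a limiter that spaces out request starts. `createRateLimiter` is a hypothetical name; this is not the planned gatsby-source-wordpress implementation, just one simple way such a setting could behave.

```javascript
// Hypothetical helper: space out calls so that at most `maxPerSecond` start each second.
function createRateLimiter(maxPerSecond) {
  const intervalMs = 1000 / maxPerSecond
  let nextSlot = 0 // earliest timestamp (ms) at which the next call may start

  return function schedule(task) {
    const now = Date.now()
    const startAt = Math.max(now, nextSlot)
    nextSlot = startAt + intervalMs
    return new Promise((resolve, reject) => {
      // Start the task once its slot arrives, then pass through its result or error.
      setTimeout(() => Promise.resolve().then(task).then(resolve, reject), startAt - now)
    })
  }
}

// Demo: five fake requests limited to 50 starts per second (one every 20ms).
const demoSchedule = createRateLimiter(50)
const demoStart = Date.now()
for (let i = 0; i < 5; i++) {
  demoSchedule(async () => console.log(`request ${i} started at +${Date.now() - demoStart}ms`))
}
```

Unlike a concurrency cap, this limits how often requests start regardless of how long each one takes, which is closer to how many hosts express their rate limits.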

Hiya!

This issue has gone quiet. Spooky quiet. 👻

We get a lot of issues, so we currently close issues after 30 days of inactivity. It’s been at least 20 days since the last update here.
If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!
As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing!

Thanks for being a part of the Gatsby community! 💪💜

I'm going to close this now.

If you think you have a wordpress sourcing problem, please confirm that your delays are not caused by a slow wordpress server first. Then please open a _new_ issue (but feel free to point back to this issue).

The high number of comments makes it very difficult to track the discussion. So opening a new issue is more likely to result in your specific problem getting an answer.

I and others confirmed it over the past year and a half. My original issue was on a well-tuned VPS. @njmyers had a likely fix, or at least an improvement, but couldn't get any answers from maintainers about how they'd like it done.

I thought about closing it myself, but I think it needs to be out there as a warning that a moderately large WordPress site is NOT a good fit for Gatsby as of yet.

@dustinhorton I understand that. This issue is over a year and a half old, and things change rapidly. With the issue amassing this many comments, it's difficult to figure out the actual problem anymore.

Fwiw, as noted above, I checked the last reported repros and determined those, at least, were caused by slow remotes. If you have a repro with a current Gatsby release on a fast remote please let me know, even if it's perhaps already posted in this thread. Or maybe open a new issue for it (and tag me) if you want more focus on it, I'll leave that up to you :)

(_Just to be clear, we closed this issue because it's gone a bit stale with too many off-topic messages, please do not feel like we're squashing the discussion as that is not the intention and we recognise our work is not finished here!_)
