The performance of Yoast sitemaps is hurting servers on larger websites.
With a client of mine, they have 1.4 million products, and Yoast is causing table-scans to happen on every segment of data loaded. It's good that Yoast builds the sitemap in chunks, but it's bad that table scans happen for each of these chunks.
The table scans are causing sitemaps to not work on large websites, and worse, they're hurting other user queries.
I expect index-seeks and good performance regardless of how large websites get.
I expect the sitemaps to work.
The problem is two-fold:
There is a post_date != '00-00-00 00:00:00' query added to the where clause. This prevents use of a sorted index column which is required for the pagination.
There is no supporting index to avoid sorting operations.
I've posted a full analysis here:
https://www.wpintense.com/2019/02/04/performance-optimisation-for-various-xml-sitemap-plugins/
But basically:
Remove this post_date != '00-00-00 00:00:00' part of the query from this file: https://github.com/Yoast/wordpress-seo/blob/a1628d1570a8e8cca896312509ac986182836c30/inc/sitemaps/class-post-type-sitemap-provider.php
Add a supporting index to wp_posts so sorts are not required on every windowed-page request:
create index wpi_scalability_pro_sitemaps on wp_posts (post_status, post_password, post_type, post_modified)
* If relevant, which editor is affected (or editors):
- [ ] Classic Editor
- [ ] Gutenberg
- [ ] Classic Editor plugin
* Which browser is affected (or browsers):
- [ ] Chrome
- [ ] Firefox
- [ ] Safari
- [ ] Other
Continuing our discussion from your post, in the past I worked on refactoring of sitemaps module and specifically paid close attention to feedback on performance at the scale you are talking about.
For the record I am not a DB specialist, so my experience with this is combination of my own observation, extensive feedback from plugin's users, and some reaching out to people who _are_ specialists.
To reiterate your claim from the post is that main query for the sitemap post page is unoptimized and doesn't scale for large amounts of posts.
The query in question is (you chose to run a subquery part of it in your post):
SELECT l.ID, post_title, post_content, post_name, post_parent, post_author, post_modified_gmt, post_date, post_date_gmt
FROM (
SELECT wp_posts.ID
FROM wp_posts
WHERE wp_posts.post_status = 'publish' AND wp_posts.post_type = 'test' AND wp_posts.post_password = '' AND wp_posts.post_date != '0000-00-00 00:00:00'
ORDER BY wp_posts.post_modified ASC
LIMIT 100 OFFSET 0
) o
JOIN wp_posts l ON l.ID = o.ID
The main challenge of SQL for sitemaps is that MySQL doesn't scale well for OFFSET operations. The more rows you have the more rows it walks through to reach your offset. The queries in the plugin had been very specifically written and audited to account for that.
I didn't have cool 1M posts on hand, but I generated 92K which is more than sufficient to illustrate the rate of performance drop off:
So with a considerable offset performance dropped of by 50 _milliseconds_, while your post claims performance drop off of _multiple minutes_ (at larger sizes, but certainly not consistent with the pace I observe).
Further, removing AND wp_posts.post_date != '0000-00-00 00:00:00' part of query, you claim to be problematic and have major impact on performance of the query, makes _zero_ difference to the run time for me.
Do I believe that you observe poor performance at a large site? Sure, that is certainly possible.
Do I believe that this query is inherently problematic? No, not from my experience or any feedback I've gotten before.
I suspect the question to explore here is _why that specific server runs this query slowly_, not _what's wrong with the query_.
Hi again - you keep using the word 'claim'. Should I record a video for you so you can see these are not claims, they are facts?
For the removal of that part of the query - you need to remove that AND add the index. If you only add the index, you won't see the boost, if you only remove that part of the query you won't see the boost. You need to do both.
And as for the 'dropoff' - I think you'll find both those queries are just doing full table scans and hence no difference in speed.
Finally -
I suspect the question to explore here is _why that specific server runs this query slowly_, not _what's wrong with the query_.
I optimised the server. There's nothing now wrong with the server stack. There were issues with it before, but not now.
Hi again - you keep using the word 'claim'. Should I record a video for you so you can see these are not claims, they are facts?
I didn't mean this dismissively, I apologize for it coming across like that. I believe your observed results, though I disagree on conclusions so far (the need to change the query and add custom index).
For the removal of that part of the query - you need to remove that AND add the index. If you only add the index, you won't see the boost, if you only remove that part of the query you won't see the boost. You need to do both.
Gotcha. From point of view of plugin in general that would have two problems:
I optimised the server. There's nothing now wrong with the server stack.
I can't speak for your server stack, I can only say that there is extensive feedback on record from large sites on sitemaps to the contrary and it performing sufficiently well. As such I would suspect _individual_ issue with environment or otherwise, not a _systematic_ one with implementation.
Since I am no longer involved with development, I think I'll shut up now. :) Thank you for taking the time to research and report and I (always) hope there are opportunities to improve sitemap performance for general case, it was my pet module. :)
I have problem too with high CPU usage. Maybe because of sitemaps. I have aproxx 250 pages
This issue occurs for me and can be observed in large sites with several hundred thousand pages simply by visiting the sitemap_index.xml page. It is loading very slow. Its trivial to insert 1M+ records in the wp_posts table or using the WP API to replicate this. There is a workaround I found: https://markjaquith.wordpress.com/2018/01/22/how-i-fixed-yoast-seo-sitemaps-on-a-large-wordpress-site/
Hello @joelkarunungan,
I have the same issue and highly penalising regarding loading time (15 minutes per sitemap).
Have you recently tested https://markjaquith.wordpress.com/2018/01/22/how-i-fixed-yoast-seo-sitemaps-on-a-large-wordpress-site/ solution ?
I tried to implement it, however the command line is failing, executing:
cd /srv/www/example.com && /usr/local/bin/wp eval '$sm = new WPSEO_Sitemaps;$sm->build_root_map();$sm->output();' > /srv/www/example.com/wp-content/uploads/sitemap_index.xml
Throws
PHP Notice: Undefined index: HTTP_X_FORWARDED_PROTO in phar:///home/wp/vendor/wp-cli/wp-cli/php/WP_CLI/Runner.php(1169) : eval()'d code on line 72
PHP Warning: Invalid argument supplied for foreach() in /var/www/wordpress/wp-content/plugins/wordpress-seo/inc/sitemaps/class-sitemaps.php on line 388
Warning: Invalid argument supplied for foreach() in /var/www/wordpress/wp-content/plugins/wordpress-seo/inc/sitemaps/class-sitemaps.php on line 388
Am I missing something?
@redvivi first solution I ended up using is @dhilditch solution:
https://www.wpintense.com/2019/02/04/performance-optimisation-for-various-xml-sitemap-plugins/ > Yoast SEO XML Sitemaps performance (this is same as what he mentioned above in this issue thread)
I eventually dropped Yoast's sitemap and used Google XML Sitemaps as mentioned in the link as well. This was the best solution for me despite the bugs with the 3 year old plugin.
Hoping though that someone would adopt the Google XML Sitemaps plugin through WP official so the plugin gets updated and its bugs fixed since its performance really outperforms Yoast's sitemap generator.
@joelkarunungan, thanks for your feedback.
Unfortunately, @dhilditch does not change the performance issue I am facing, I guess the data I want to sitemap does not apply in my case.
It is quite disappointing that there is not a solution from Yoast.
Also got this problem with 500k+ posts. I think it's Google fetching the posts and related to XML sitemap 100%.
SELECT l.ID, post_title, post_content, post_name, post_parent, post_author, post_modified_gmt, post_date, post_date_gmt
FROM (
SELECT wp_posts.ID
FROM wp_posts
WHERE wp_posts.post_status = 'publish'
AND wp_posts.post_type = 'post'
AND wp_posts.post_password = ''
AND wp_posts.post_date != '0000-00-00 00:00:00'
ORDER BY wp_posts.post_modified ASC LIMIT 100 OFFSET 381500
)
o JOIN wp_posts l ON l.ID = o.ID
There is a
post_date != '00-00-00 00:00:00'query added to the where clause.
For reference, post_date != '0000-00-00 00:00:00' and post_author != 0 were added in 3cdcfc9.
post_author != 0 was later removed in 0845fc9.
Hello, i also had massive performance issues with the yoast sitemaps (site over 350.000 posts), adding the index and removing the post_date != '0000-00-00 00:00:00' improved performance a lot.
Anyway i just build my own static file cache.
Hello
Me too adding index in database and removing post_date improved speed a lot.
However i still have the problem for custom taxonomy. I try to find the post_date in inc/sitemaps/class-taxonomy-sitemap-provider.php but cannot find.
Any idea of how I can fix that in a similar way for custom taxonomy?
Thank you
Please inform the customer of conversation # 555596 when this conversation has been closed.
Please inform the customer of conversation # 557857 when this conversation has been closed.
Why in the world would this be prioritized as "a minor"? We have a 6 core server and are too experiencing huge CPU spikes possibly related to constant generation of sitemaps. This should be fixed and prioritized as an important task.
In general, from what I have experienced, Yoast is slow at a lot of things, including slowing our backend down 1 second per. request in our WP Admin backend and also generating sitemaps.
Slowly thinking towards going with another SEO plugin.
Please inform the customer of conversation # 566981 when this conversation has been closed.
Hi,
i have created a multi column index which helped to improve the performance a lot:
ALTER TABLEwp_postsADD INDEXpost_status_post_type_post_password_post_date_post_modified(post_status,post_type,post_password,post_date,post_modified) USING BTREE;
Time to create one post-sitemapXXX.xml sitemap went down from 3! Minutes to 10 Seconds (still much too long).
We are having the same issue with the sitemap killing the site. I see above this is marked a "minor" issue? That makes absolutely no sense. This should be a priority bug fix as it is relatively simple to address. What is the status of this bug - one that has been known since last February???
Hi @Drew-Dix , severity currently encompasses both severity and priority. Though I agree that the severity, just looking at functionality, is very high, the priority is pretty low (based on the time this issue has been open). Not many people experience this issue, as far as I can see in the comments. It looks to be very site-specific and no details as to what causes this specifically have been shared. As @Rarst stated at the end of his post:
I suspect the question to explore here is why that specific server runs this query slowly, not what's wrong with the query.
So any detailed information that anyone experiencing this issue can share, would be very welcome.
As the fix is relatively simple, I would think it would ratchet up on the priority ladder unless Yoast has made a determination to only service very small sites. If that is the case, you should note that on your plugin page for potential users of Yoast to correctly set expectations. While I am not a developer, I did previously manage a development team. A fix like this would have been out the door in days. Not months.
These indexes should be added into WP core because it affects feeds, other plugins, ...
@Drew-Dix Please comment https://core.trac.wordpress.org/ticket/15499 too. Maybe it'll help to WP adds proper indexes.
Thanks @stodorovic will review.
Also, to add - this is a chronic issue regarding performance that is progressive. Thus it impacts all Yoast users over time. As Google takes site speed into consideration regarding SEO, by using Yoast you are, ironically, undermining your SEO if your site grows. So this is an issue that is impacting all Yoast users. They just do not know it (yet).
We will rebuild the entire XML sitemaps logic once we have the Indexables table we're working on. This will make it possible to generate XML sitemaps simply by querying one table, that'll already contain the permalinks, so generation of an XML sitemap will become an utter breeze. Because we already plan to do that, this issue is no longer of importance as far as I'm concerned, and I'm thus closing. But please do re-open and / or comment if you think I'm missing something :)
For more context on the indexables stuff, it's all in this branch: https://github.com/Yoast/wordpress-seo/tree/feature/indexables-frontend
@jdevalk Thanks for the update. Do you have timing on when this update will be released?
Indexables will probably be released in January, no timing yet on when we'll rebuild XML sitemaps on top of it but I hope to do that soon(ish).
Would you mind to provide guidance on information (including diag reports) you would require to take this further ?
I assume that generating a 100k+ WordPress posts on any website would be sufficient to reproduce this issue...
@redvivi
What I do is first create a source table containing unique randomly generated content with 1M records. Just copy the wp_posts table and write a script that would churn out the unique random data.
Once that's done its just trivial to feed it into the WP posts table. Pseudocode below for my timezone GMT+8.
INSERT INTO db.destination (ID, post_author, post_date, post_date_gmt, post_content, post_title, post_excerpt, post_status, comment_status, ping_status, post_password, post_name, to_ping, pinged, post_modified, post_modified_gmt, post_content_filtered, post_parent, guid, menu_order, post_type, post_mime_type, comment_count)
SELECT
post_id AS ID,
0 AS post_author,
DATE_ADD(date_field, INTERVAL post_id SECOND) AS post_date,
DATE_ADD(DATE_ADD(date_field, INTERVAL post_id SECOND), INTERVAL - 8 HOUR) AS post_date_gmt,
post_content,
post_title,
'' AS post_excerpt,
'publish' AS post_status,
'open' AS comment_status,
'open' AS ping_status,
'' AS post_password,
post_slug AS post_name,
'' AS to_ping,
'' AS pinged,
NOW() AS post_modified,
UTC_TIMESTAMP() AS post_modified_gmt,
'' AS post_content_filtered,
0 AS post_parent,
CONCAT('http://example.com/blog/',
post_slug, '/') AS guid,
0 AS menu_order,
'post' AS post_type,
'' AS post_mime_type,
0 AS comment_count
FROM
db.source ORDER BY ID;
Hi Folks, just checking if this is the tree that contains the fix for this?
https://github.com/Yoast/wordpress-seo/tree/indexable-fron and should now be working fine in the 12.6.2 version? It seems it is still slow when querying a million rows and generating a sitemap? Just checking if we should open up a new Issue for this or just reopen this one
Indexables is the 14.0 release, you can update to 14.0.4 (current latest) to see the improvements. If this is not fixed, please report back here with details.
On my end, I have switched to
https://wordpress.org/support/plugin/bwp-google-xml-sitemaps/, which
provides a much more resilient implementation and cache priming mechanism -
though less reactive, than Yoast one for such large indexes.
Perhaps if it is not solved, it would be wise to get inspiration from this
implementation.
On Fri, 1 May 2020 at 08:58, Djennez notifications@github.com wrote:
Indexables is the 14.0 release, you can update to 14.0.4 (current latest)
to see the improvements. If this is not fixed, please report back here with
details.—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/Yoast/wordpress-seo/issues/12161#issuecomment-622275684,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AMDK7BO62ZRJAPJGH6PJARTRPJXJJANCNFSM4GUMZMQQ
.
We haven't rebuild the XML sitemaps on top of indexables yet. That'll happen soon though :)
Thanks for the feedback @Djennez @jdevalk :)
@Djennez in v14.2 always the same problem. No improvement noted unfortunately.
@jdevalk We all look forward to the generation performance improvement!
Especially on project-sitemap.xml for me (~35 seconds for 236 URLs), with divi (theme builder), php 7.4, powerful machine
@JeremieBALDY Please look #12304. It isn't Yoast SEO issue because Divi runs all shortcodes for all pages/posts/projects (which are created by Divi builder) to extract images. If you disable images in sitemap then you will see normal loading time. It'll be fixed when sitemaps get images on different way. Please report it to Divi too.
The main challenge of SQL for sitemaps is that MySQL doesn't scale well for OFFSET operations. The more rows you have the more rows it walks through to reach your offset. The queries in the plugin had been very specifically written and audited to account for that.
This is not true if you have an index which covers your WHERE statement followed by the same index covering your ORDER BY statement with the index columns in this order (WHERE columns followed by ORDER BY columns in the index).
See my article for proof where the SQL query is using OFFSET 483300 and runs in the same time regardless of which offset you're using.
If you're averse to adding indexes to help your slow queries, maybe populate a separate table under your control? Or maybe use the existing indexes on wp_posts if you can find a suitable one that covers the WHERE and ORDER BY statement.
@dhilditch a table under our own control is exactly what we already have in indexables, which we're planning to base the next iteration of our XML sitemaps on :)
Most helpful comment
Thanks @stodorovic will review.
Also, to add - this is a chronic issue regarding performance that is progressive. Thus it impacts all Yoast users over time. As Google takes site speed into consideration regarding SEO, by using Yoast you are, ironically, undermining your SEO if your site grows. So this is an issue that is impacting all Yoast users. They just do not know it (yet).