Currently (3.2.*) we use the following query to determine what the last modified date of a sitemap "page" is. Which is needed in the sitemap index.
file: class-post-type-sitemap-provider.php
SELECT post_modified_gmt
FROM (
SELECT @rownum:=@rownum+1 rownum, $wpdb->posts.post_modified_gmt
FROM (SELECT @rownum:=0) r, $wpdb->posts
USE INDEX()
WHERE post_status IN ('publish','inherit')
AND post_type = %s
ORDER BY post_modified_gmt ASC
) x
WHERE rownum %%%d=0
SQL Explain results in a lot of rows being fetched and suggests that this is a very expensive command to run.
Suggested change:
This way we only have to do a CALC_ROWS once for the total number and then have one query per paged sitemap for the latest post.
All custom SQL up for review for #2787
I already managed to get rid of it in authors, getting to post types eventually.
What priority does this issue have? We run a page with > 800.000 articles and can't generate a sitemap because this query crashes our database server while it seems to be stacked as it's not being processed fast enough. We needed to kill the PHP service as the query took > 20 sec and queued up over and over again.
See some DB information below:
mysql> SHOW INDEX FROM wp_posts FROM database;
+----------+------------+---------------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+---------------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| wp_posts | 0 | PRIMARY | 1 | ID | A | 1368414 | NULL | NULL | | BTREE | | |
| wp_posts | 1 | type_status_date | 1 | post_type | A | 18 | NULL | NULL | | BTREE | | |
| wp_posts | 1 | type_status_date | 2 | post_status | A | 18 | NULL | NULL | | BTREE | | |
| wp_posts | 1 | type_status_date | 3 | post_date | A | 1368414 | NULL | NULL | | BTREE | | |
| wp_posts | 1 | type_status_date | 4 | ID | A | 1368414 | NULL | NULL | | BTREE | | |
| wp_posts | 1 | post_parent | 1 | post_parent | A | 342103 | NULL | NULL | | BTREE | | |
| wp_posts | 1 | post_author | 1 | post_author | A | 18 | NULL | NULL | | BTREE | | |
| wp_posts | 1 | post_name | 1 | post_name | A | 1368414 | 191 | NULL | | BTREE | | |
| wp_posts | 1 | guid | 1 | guid | A | 1368414 | 191 | NULL | | BTREE | | |
| wp_posts | 1 | type_status_mime_type | 1 | post_type | A | 18 | NULL | NULL | | BTREE | | |
| wp_posts | 1 | type_status_mime_type | 2 | post_status | A | 18 | NULL | NULL | | BTREE | | |
| wp_posts | 1 | type_status_mime_type | 3 | post_mime_type | A | 18 | 10 | NULL | | BTREE | | |
| wp_posts | 1 | type_status_comment_count | 1 | post_type | A | 200 | NULL | NULL | | BTREE | | |
| wp_posts | 1 | type_status_comment_count | 2 | post_status | A | 200 | NULL | NULL | | BTREE | | |
| wp_posts | 1 | type_status_comment_count | 3 | comment_count | A | 200 | NULL | NULL | | BTREE | | |
| wp_posts | 1 | type_status_date_gmt | 1 | post_type | A | 200 | NULL | NULL | | BTREE | | |
| wp_posts | 1 | type_status_date_gmt | 2 | post_status | A | 200 | NULL | NULL | | BTREE | | |
| wp_posts | 1 | type_status_date_gmt | 3 | post_date_gmt | A | 200 | NULL | NULL | | BTREE | | |
+----------+------------+---------------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
18 rows in set (0.01 sec)
mysql> EXPLAIN SELECT post_modified_gmt FROM (SELECT @rownum:=@rownum+1 rownum, wp_posts.post_modified_gmt FROM (SELECT @rownum:=0) r, wp_posts WHERE post_status IN ('publish','inherit') AND post_type = 'post' ORDER BY post_modified_gmt ASC) x WHERE rownum %500=0;
+----+-------------+------------+--------+---------------------------------------------------------------------------------------+------------------+---------+------+---------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------------------------------------------------------------------------------+------------------+---------+------+---------+----------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 811282 | Using where |
| 2 | DERIVED | <derived3> | system | NULL | NULL | NULL | NULL | 1 | Using filesort |
| 2 | DERIVED | wp_posts | ALL | type_status_date,type_status_mime_type,type_status_comment_count,type_status_date_gmt | type_status_date | 82 | | 1368415 | Using where |
| 3 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+----+-------------+------------+--------+---------------------------------------------------------------------------------------+------------------+---------+------+---------+----------------+
4 rows in set (3.65 sec)
I reduced _a lot_ how _often_ this query gets called in #4563
As for rewriting query itself I don't see a way so far, there is some discussion on PR. _Specific_ SQL suggestions highly welcomed. :)
For the record I have explored switching to a multiple queries with offsets. While each query by itself is faster, the total sum of them quickly becomes much slower than single aggregate query.
Ok, I spent some time figuring out what happens in this query and seems that MySQL picking index for it badly might be the problem.
Could someone with large site try this tweak (preferably on top of #4563):
FROM (SELECT @rownum:=0) r, $wpdb->posts USE INDEX()
cc @spacedmonkey
@Rarst - Great! Thanks for investigation. I will test it on our page. (> 1mio posts)
@Rarst Do you have a particular index in mind? We must specify at least one to gain a positive effect, because:
It is syntactically valid to omit _index_list_ for
USE INDEX, which means “use no indexes.”
(emphasis mine) – http://dev.mysql.com/doc/refman/5.7/en/index-hints.html
Yeah, that's the thing. If it picks core's available index (type_status_date) it seems to work _slower_ than without using any index at all. In my local dev environment query speeds up from 200+ ms to about 40ms without index, roughly 5x improvement.
Please try my suggestion and let us know what difference does it make under your circumstances. :)
Results before vs. after:
USE INDEX ()> EXPLAIN SELECT post_modified_gmt FROM (SELECT @rownum:=@rownum+1 rownum, wp_posts.post_modified_gmt FROM (SELECT @rownum:=0) r, wp_posts WHERE post_status IN ('publish','inherit') AND post_type = 'post' ORDER BY post_modified_gmt ASC) x WHERE rownum % 500 = 0;
+----+-------------+------------+--------+---------------------------------------------------------------------------------------+------------------+---------+------+---------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------------------------------------------------------------------------------+------------------+---------+------+---------+----------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 812699 | Using where |
| 2 | DERIVED | <derived3> | system | NULL | NULL | NULL | NULL | 1 | Using filesort |
| 2 | DERIVED | wp_posts | ALL | type_status_date,type_status_mime_type,type_status_comment_count,type_status_date_gmt | type_status_date | 82 | | 1376595 | Using where |
| 3 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+----+-------------+------------+--------+---------------------------------------------------------------------------------------+------------------+---------+------+---------+----------------+
4 rows in set (4.76 sec)
USE INDEX ()> EXPLAIN SELECT post_modified_gmt FROM (SELECT @rownum:=@rownum+1 rownum, wp_posts.post_modified_gmt FROM (SELECT @rownum:=0) r, wp_posts WHERE post_status IN ('publish','inherit') AND post_type = 'post' ORDER BY post_modified_gmt ASC) x WHERE rownum % 500 = 0;
+----+-------------+------------+--------+---------------+------+---------+------+---------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+------+---------+------+---------+----------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 812699 | Using where |
| 2 | DERIVED | <derived3> | system | NULL | NULL | NULL | NULL | 1 | Using filesort |
| 2 | DERIVED | wp_posts | ALL | NULL | NULL | NULL | NULL | 1376595 | Using where |
| 3 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+----+-------------+------------+--------+---------------+------+---------+------+---------+----------------+
4 rows in set (2.64 sec)
This essentially means that a mere EXPLAIN causes the query builder to spend 2.6+ seconds to figure out how the data can be queried - seemingly caused by the filesort in between.
For completeness:
> SELECT post_modified_gmt FROM (SELECT @rownum:=@rownum+1 rownum, wp_posts.post_modified_gmt FROM (SELECT @rownum:=0) r, wp_posts WHERE post_status IN ('publish','inherit') AND post_type = 'post' ORDER BY post_modified_gmt ASC) x WHERE rownum % 500 = 0;
-- ...snip...
1625 rows in set (3.69 sec)
Forgot to mention that, yes, your suspicion on hinting MySQL to use no index at all does indeed improve the query's performance (though that appears to be a rather weird edge case), but unfortunately, not in a sufficient way.
I'd like to help to find a suitable fix / replacement query as soon as possible here, as we had to turn off the generation of the sitemap entirely for the meantime, which appears to beat down the search engine ranking of the affected site very quickly... so we'd appreciate every hint (even temporary hacks) that could allow us to re-enable the sitemap generation.
I already had a look at explanations for this query over in https://github.com/Yoast/wordpress-seo/pull/4563 (thanks!), but it's still not fully clear to me what exact data set the query tries to retrieve - in particular:
post_modified_gmt, but that's used to retrieve the oldest (or latest?) modification date per chunk of %d posts instead, isn't it?For each chunk of N posts (ordered by ID) of type T, list the most recent modification date of all contained posts.
Is that the intention? Already tried to decipher this info from the query as well as the surrounding code within the plugin, but didn't see a explanation.
Also, how often, or under which circumstances is it triggered? Do we always need the full list (all chunks) or do we know the affected chunk upfront most of the time?
Thanks!
Ok, from the top.
For starters this is an arcane ancient query, from the depths of history, written in another plugin, I don't know when and by I don't know whom. Git history doesn't go back that far. So my own insight isn't much here, I just have been cleaning up sitemaps code for months.
The general approach to this in plugin is to query items, sorted by modification time. That way we can usually establish last modified time for each page and the changes affect the "tail" of pages more, since recent items are more commonly edited than old items.
The problem with posts is that there can be _a lot_ of them. Million of terms or authors? Unlikely. Million of posts? Easily.
So what we do is:
So the last part is the query we discuss here.
The biggest challenge is that to build index we need modified time for _every_ sitemap page.
Which resulted in current situation of:
So are we looking for the oldest or for the most recent modification date in each chunk?
Most recent.
Can we fully describe the intended/expected results of this query in a human-readable and graspable way?
Page 1, [last modified time]
Page 2, [last modified time]
...
Page last, [last modified time]
Also, how often, or under which circumstances is it triggered? Do we always need the full list (all chunks) or do we know the affected chunk upfront most of the time?
Every time we build a sitemap index, which is whenever it's accessed add not cached.
Thanks a lot for the elaborate reply, @Rarst. That's very helpful to understand the matter.
Hm, no the ASC part should be right. Larger/later date = later page. But we have DESC in other types somewhere. Ugh, this hurts my brain.
I have asked a Q about the query at DBA SE maybe professional DB folks would have some bright ideas... :)
The discussion on the SE question got some suggestions about indexes and that InnoDB engine might perform better for this case.
No definitive solutions though so far. :(
Results of my research: http://dba.stackexchange.com/a/139564/27291
mysql> SELECT MAX(p.ID), MAX(p.post_modified)
FROM wp_posts p, (
SELECT ID
FROM wp_posts
WHERE post_type = 'post' AND ID > 51000
ORDER BY ID ASC
LIMIT 0,1000
) c
WHERE c.ID = p.ID \G
*************************** 1. row ***************************
MAX(p.ID): 52000
MAX(p.post_modified): 1997-12-31 20:27:50
1 row in set (0.01 sec)
The closest question came up to answering is creating and forcing
INDEX `type_modified_status` (`post_type`, `post_modified_gmt`, `post_status`)
index, seems there are some straight SQL bugs in play about this which degrade performance.
I am not sure if creating and forcing index is something we can forcible ship in the plugin. :\
@sun I saw your answer briefly, but you seem to have deleted it since? Could you elaborate on your findings.
Also does index tweak meaningfully improve performance in your setup?
Another answer suggested following refactor of the query. It seems to suffer from index issues too, _but_ it can benefit from natively available index! We don't have to consider creating a new one!
SELECT post_modified_gmt
FROM ( SELECT @rownum := 0 ) init
JOIN wp_posts USE INDEX(type_status_date)
WHERE post_status IN ('publish','inherit')
AND post_type = 'test'
AND (@rownum:=@rownum+1) %1000 = 0
ORDER BY post_modified_gmt ASC
Will look over on fresh head and likely PR this as replacement for what we have.
Query changed in #4563 PR
Most helpful comment
Ok, from the top.
Historical perspective
For starters this is an arcane ancient query, from the depths of history, written in another plugin, I don't know when and by I don't know whom. Git history doesn't go back that far. So my own insight isn't much here, I just have been cleaning up sitemaps code for months.
What we need in general
The general approach to this in plugin is to query items, sorted by modification time. That way we can usually establish last modified time for each page and the changes affect the "tail" of pages more, since recent items are more commonly edited than old items.
How it's implemented for posts
The problem with posts is that there can be _a lot_ of them. Million of terms or authors? Unlikely. Million of posts? Easily.
So what we do is:
So the last part is the query we discuss here.
The biggest challenge is that to build index we need modified time for _every_ sitemap page.
Which resulted in current situation of:
Q&A
Most recent.
Page 1, [last modified time]
Page 2, [last modified time]
...
Page last, [last modified time]
Every time we build a sitemap index, which is whenever it's accessed add not cached.