Wordpress-seo: Sitemap index: page last modified date performance

Created on 29 Apr 2016  ·  19Comments  ·  Source: Yoast/wordpress-seo

Currently (3.2.*) we use the following query to determine what the last modified date of a sitemap "page" is. Which is needed in the sitemap index.

file: class-post-type-sitemap-provider.php

SELECT post_modified_gmt
FROM (
    SELECT @rownum:=@rownum+1 rownum, $wpdb->posts.post_modified_gmt
    FROM (SELECT @rownum:=0) r, $wpdb->posts
    USE INDEX()
    WHERE post_status IN ('publish','inherit')
        AND post_type = %s
    ORDER BY post_modified_gmt ASC
) x
WHERE rownum %%%d=0

SQL Explain results in a lot of rows being fetched and suggests that this is a very expensive command to run.

Suggested change:

  • Get items for the specific post type
  • Get the number of pages (items / max_entries)
  • Query the post_modified_gmt with a LIMIT {page size}, 1

This way we only have to do a CALC_ROWS once for the total number and then have one query per paged sitemap for the latest post.

performance xml sitemap

Most helpful comment

Ok, from the top.

Historical perspective

For starters this is an arcane ancient query, from the depths of history, written in another plugin, I don't know when and by I don't know whom. Git history doesn't go back that far. So my own insight isn't much here, I just have been cleaning up sitemaps code for months.

What we need in general

  1. For every type of item we need to split it into sitemap pages of N size, where N is setting (which should be under 50K).
  2. For every sitemap page we need to provide a link to it in sitemap index, with last modified time for the page.
  3. _Nice to have_: we should do this in a way that minimizes how often pages are changed, i.e. if one item changes it is preferable that it changes a single page rather than all of pages.

The general approach to this in plugin is to query items, sorted by modification time. That way we can usually establish last modified time for each page and the changes affect the "tail" of pages more, since recent items are more commonly edited than old items.

How it's implemented for posts

The problem with posts is that there can be _a lot_ of them. Million of terms or authors? Unlikely. Million of posts? Easily.

So what we do is:

  1. We sort posts by modified time.
  2. Limit N, offset N times page number — gives us a block of posts for the page.
  3. Every Nth post gives us post from that block to use as last modified time.

So the last part is the query we discuss here.

The biggest challenge is that to build index we need modified time for _every_ sitemap page.

Which resulted in current situation of:

  1. Heavy SQL query
  2. No better ideas

Q&A

So are we looking for the oldest or for the most recent modification date in each chunk?

Most recent.

Can we fully describe the intended/expected results of this query in a human-readable and graspable way?

Page 1, [last modified time]
Page 2, [last modified time]
...
Page last, [last modified time]

Also, how often, or under which circumstances is it triggered? Do we always need the full list (all chunks) or do we know the affected chunk upfront most of the time?

Every time we build a sitemap index, which is whenever it's accessed add not cached.

All 19 comments

All custom SQL up for review for #2787

I already managed to get rid of it in authors, getting to post types eventually.

What priority does this issue have? We run a page with > 800.000 articles and can't generate a sitemap because this query crashes our database server while it seems to be stacked as it's not being processed fast enough. We needed to kill the PHP service as the query took > 20 sec and queued up over and over again.

See some DB information below:

mysql> SHOW INDEX FROM wp_posts FROM database;
+----------+------------+---------------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table    | Non_unique | Key_name                  | Seq_in_index | Column_name    | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+---------------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| wp_posts |          0 | PRIMARY                   |            1 | ID             | A         |     1368414 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_date          |            1 | post_type      | A         |          18 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_date          |            2 | post_status    | A         |          18 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_date          |            3 | post_date      | A         |     1368414 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_date          |            4 | ID             | A         |     1368414 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | post_parent               |            1 | post_parent    | A         |      342103 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | post_author               |            1 | post_author    | A         |          18 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | post_name                 |            1 | post_name      | A         |     1368414 |      191 | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | guid                      |            1 | guid           | A         |     1368414 |      191 | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_mime_type     |            1 | post_type      | A         |          18 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_mime_type     |            2 | post_status    | A         |          18 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_mime_type     |            3 | post_mime_type | A         |          18 |       10 | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_comment_count |            1 | post_type      | A         |         200 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_comment_count |            2 | post_status    | A         |         200 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_comment_count |            3 | comment_count  | A         |         200 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_date_gmt      |            1 | post_type      | A         |         200 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_date_gmt      |            2 | post_status    | A         |         200 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_date_gmt      |            3 | post_date_gmt  | A         |         200 |     NULL | NULL   |      | BTREE      |         |               |
+----------+------------+---------------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
18 rows in set (0.01 sec)

mysql> EXPLAIN SELECT post_modified_gmt FROM (SELECT @rownum:=@rownum+1 rownum, wp_posts.post_modified_gmt FROM (SELECT @rownum:=0) r, wp_posts WHERE post_status IN ('publish','inherit') AND post_type = 'post' ORDER BY post_modified_gmt ASC) x WHERE rownum %500=0;
+----+-------------+------------+--------+---------------------------------------------------------------------------------------+------------------+---------+------+---------+----------------+
| id | select_type | table      | type   | possible_keys                                                                         | key              | key_len | ref  | rows    | Extra          |
+----+-------------+------------+--------+---------------------------------------------------------------------------------------+------------------+---------+------+---------+----------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL                                                                                  | NULL             | NULL    | NULL |  811282 | Using where    |
|  2 | DERIVED     | <derived3> | system | NULL                                                                                  | NULL             | NULL    | NULL |       1 | Using filesort |
|  2 | DERIVED     | wp_posts   | ALL    | type_status_date,type_status_mime_type,type_status_comment_count,type_status_date_gmt | type_status_date | 82      |      | 1368415 | Using where    |
|  3 | DERIVED     | NULL       | NULL   | NULL                                                                                  | NULL             | NULL    | NULL |    NULL | No tables used |
+----+-------------+------------+--------+---------------------------------------------------------------------------------------+------------------+---------+------+---------+----------------+
4 rows in set (3.65 sec)

I reduced _a lot_ how _often_ this query gets called in #4563

As for rewriting query itself I don't see a way so far, there is some discussion on PR. _Specific_ SQL suggestions highly welcomed. :)

For the record I have explored switching to a multiple queries with offsets. While each query by itself is faster, the total sum of them quickly becomes much slower than single aggregate query.

Ok, I spent some time figuring out what happens in this query and seems that MySQL picking index for it badly might be the problem.

Could someone with large site try this tweak (preferably on top of #4563):

FROM (SELECT @rownum:=0) r, $wpdb->posts USE INDEX()

cc @spacedmonkey

@Rarst - Great! Thanks for investigation. I will test it on our page. (> 1mio posts)

@Rarst Do you have a particular index in mind? We must specify at least one to gain a positive effect, because:

It is syntactically valid to omit _index_list_ for USE INDEX, which means “use no indexes.”

(emphasis mine) – http://dev.mysql.com/doc/refman/5.7/en/index-hints.html

Yeah, that's the thing. If it picks core's available index (type_status_date) it seems to work _slower_ than without using any index at all. In my local dev environment query speeds up from 200+ ms to about 40ms without index, roughly 5x improvement.

Please try my suggestion and let us know what difference does it make under your circumstances. :)

Results before vs. after:

Before USE INDEX ()

> EXPLAIN SELECT post_modified_gmt FROM (SELECT @rownum:=@rownum+1 rownum, wp_posts.post_modified_gmt FROM (SELECT @rownum:=0) r, wp_posts WHERE post_status IN ('publish','inherit') AND post_type = 'post' ORDER BY post_modified_gmt ASC) x WHERE rownum % 500 = 0;
+----+-------------+------------+--------+---------------------------------------------------------------------------------------+------------------+---------+------+---------+----------------+
| id | select_type | table      | type   | possible_keys                                                                         | key              | key_len | ref  | rows    | Extra          |
+----+-------------+------------+--------+---------------------------------------------------------------------------------------+------------------+---------+------+---------+----------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL                                                                                  | NULL             | NULL    | NULL |  812699 | Using where    |
|  2 | DERIVED     | <derived3> | system | NULL                                                                                  | NULL             | NULL    | NULL |       1 | Using filesort |
|  2 | DERIVED     | wp_posts   | ALL    | type_status_date,type_status_mime_type,type_status_comment_count,type_status_date_gmt | type_status_date | 82      |      | 1376595 | Using where    |
|  3 | DERIVED     | NULL       | NULL   | NULL                                                                                  | NULL             | NULL    | NULL |    NULL | No tables used |
+----+-------------+------------+--------+---------------------------------------------------------------------------------------+------------------+---------+------+---------+----------------+
4 rows in set (4.76 sec)

After USE INDEX ()

> EXPLAIN SELECT post_modified_gmt FROM (SELECT @rownum:=@rownum+1 rownum, wp_posts.post_modified_gmt FROM (SELECT @rownum:=0) r, wp_posts WHERE post_status IN ('publish','inherit') AND post_type = 'post' ORDER BY post_modified_gmt ASC) x WHERE rownum % 500 = 0;
+----+-------------+------------+--------+---------------+------+---------+------+---------+----------------+
| id | select_type | table      | type   | possible_keys | key  | key_len | ref  | rows    | Extra          |
+----+-------------+------------+--------+---------------+------+---------+------+---------+----------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL          | NULL | NULL    | NULL |  812699 | Using where    |
|  2 | DERIVED     | <derived3> | system | NULL          | NULL | NULL    | NULL |       1 | Using filesort |
|  2 | DERIVED     | wp_posts   | ALL    | NULL          | NULL | NULL    | NULL | 1376595 | Using where    |
|  3 | DERIVED     | NULL       | NULL   | NULL          | NULL | NULL    | NULL |    NULL | No tables used |
+----+-------------+------------+--------+---------------+------+---------+------+---------+----------------+
4 rows in set (2.64 sec)

This essentially means that a mere EXPLAIN causes the query builder to spend 2.6+ seconds to figure out how the data can be queried - seemingly caused by the filesort in between.

For completeness:

> SELECT post_modified_gmt FROM (SELECT @rownum:=@rownum+1 rownum, wp_posts.post_modified_gmt FROM (SELECT @rownum:=0) r, wp_posts WHERE post_status IN ('publish','inherit') AND post_type = 'post' ORDER BY post_modified_gmt ASC) x WHERE rownum % 500 = 0;
-- ...snip...
1625 rows in set (3.69 sec)

Forgot to mention that, yes, your suspicion on hinting MySQL to use no index at all does indeed improve the query's performance (though that appears to be a rather weird edge case), but unfortunately, not in a sufficient way.

I'd like to help to find a suitable fix / replacement query as soon as possible here, as we had to turn off the generation of the sitemap entirely for the meantime, which appears to beat down the search engine ranking of the affected site very quickly... so we'd appreciate every hint (even temporary hacks) that could allow us to re-enable the sitemap generation.

I already had a look at explanations for this query over in https://github.com/Yoast/wordpress-seo/pull/4563 (thanks!), but it's still not fully clear to me what exact data set the query tries to retrieve - in particular:

  1. Unless I misunderstand the statements, it is _not_ actually sorted by post_modified_gmt, but that's used to retrieve the oldest (or latest?) modification date per chunk of %d posts instead, isn't it?
  2. So are we looking for the oldest or for the most recent modification date in each chunk?
  3. Can we fully describe the intended/expected results of this query in a human-readable and graspable way? Something along the lines of:

For each chunk of N posts (ordered by ID) of type T, list the most recent modification date of all contained posts.

Is that the intention? Already tried to decipher this info from the query as well as the surrounding code within the plugin, but didn't see a explanation.

Also, how often, or under which circumstances is it triggered? Do we always need the full list (all chunks) or do we know the affected chunk upfront most of the time?

  1. Can I find any earlier versions of this query somewhere that did not use subqueries for chunking, in order to test-run/benchmark them against our current database and to better understand the intention?

Thanks!

Ok, from the top.

Historical perspective

For starters this is an arcane ancient query, from the depths of history, written in another plugin, I don't know when and by I don't know whom. Git history doesn't go back that far. So my own insight isn't much here, I just have been cleaning up sitemaps code for months.

What we need in general

  1. For every type of item we need to split it into sitemap pages of N size, where N is setting (which should be under 50K).
  2. For every sitemap page we need to provide a link to it in sitemap index, with last modified time for the page.
  3. _Nice to have_: we should do this in a way that minimizes how often pages are changed, i.e. if one item changes it is preferable that it changes a single page rather than all of pages.

The general approach to this in plugin is to query items, sorted by modification time. That way we can usually establish last modified time for each page and the changes affect the "tail" of pages more, since recent items are more commonly edited than old items.

How it's implemented for posts

The problem with posts is that there can be _a lot_ of them. Million of terms or authors? Unlikely. Million of posts? Easily.

So what we do is:

  1. We sort posts by modified time.
  2. Limit N, offset N times page number — gives us a block of posts for the page.
  3. Every Nth post gives us post from that block to use as last modified time.

So the last part is the query we discuss here.

The biggest challenge is that to build index we need modified time for _every_ sitemap page.

Which resulted in current situation of:

  1. Heavy SQL query
  2. No better ideas

Q&A

So are we looking for the oldest or for the most recent modification date in each chunk?

Most recent.

Can we fully describe the intended/expected results of this query in a human-readable and graspable way?

Page 1, [last modified time]
Page 2, [last modified time]
...
Page last, [last modified time]

Also, how often, or under which circumstances is it triggered? Do we always need the full list (all chunks) or do we know the affected chunk upfront most of the time?

Every time we build a sitemap index, which is whenever it's accessed add not cached.

Thanks a lot for the elaborate reply, @Rarst. That's very helpful to understand the matter.

Hm, no the ASC part should be right. Larger/later date = later page. But we have DESC in other types somewhere. Ugh, this hurts my brain.

I have asked a Q about the query at DBA SE maybe professional DB folks would have some bright ideas... :)

The discussion on the SE question got some suggestions about indexes and that InnoDB engine might perform better for this case.

No definitive solutions though so far. :(

Results of my research: http://dba.stackexchange.com/a/139564/27291

mysql> SELECT MAX(p.ID), MAX(p.post_modified)
       FROM wp_posts p, (
         SELECT ID
         FROM wp_posts
         WHERE post_type = 'post' AND ID > 51000
         ORDER BY ID ASC
         LIMIT 0,1000
       ) c
       WHERE c.ID = p.ID \G
*************************** 1. row ***************************
           MAX(p.ID): 52000
MAX(p.post_modified): 1997-12-31 20:27:50
1 row in set (0.01 sec)

The closest question came up to answering is creating and forcing

INDEX `type_modified_status` (`post_type`, `post_modified_gmt`, `post_status`)

index, seems there are some straight SQL bugs in play about this which degrade performance.

I am not sure if creating and forcing index is something we can forcible ship in the plugin. :\

@sun I saw your answer briefly, but you seem to have deleted it since? Could you elaborate on your findings.

Also does index tweak meaningfully improve performance in your setup?

Another answer suggested following refactor of the query. It seems to suffer from index issues too, _but_ it can benefit from natively available index! We don't have to consider creating a new one!

SELECT  post_modified_gmt
    FROM  ( SELECT @rownum := 0 ) init 
    JOIN  wp_posts USE INDEX(type_status_date)
    WHERE  post_status IN ('publish','inherit')
      AND  post_type = 'test'
      AND  (@rownum:=@rownum+1) %1000 = 0
    ORDER BY  post_modified_gmt ASC

Will look over on fresh head and likely PR this as replacement for what we have.

Query changed in #4563 PR

Was this page helpful?
0 / 5 - 0 ratings