Hello. Our database contains over 10 million rows. When the query executes with an offset of around 9 million and a limit of 10, the SQL is very slow; as the analysis below shows, the execution time can exceed 30,000 milliseconds.
After investigation, the cause turned out to be the large number of to_json calls in the SQL, which degrade PostgreSQL's query performance.
Is there a solution to this problem?
Problematic SQL:
explain analyze select to_json((__local_2__."id")) as "id", to_json((__local_2__."shop_id")) as "shopId", to_json((__local_2__."name")) as "name", to_json((__local_2__."published")) as "published", to_json((__local_2__."body_html")) as "bodyHtml", to_json((__local_2__."product_type")) as "productType", to_json((__local_2__."created_at")) as "createdAt", to_json((__local_2__."updated_at")) as "updatedAt"
from "public"."products" as __local_2__
where (TRUE) and (TRUE) and (TRUE)
order by __local_2__."id" ASC
OFFSET 11500610
limit 10
Analysis result:
Limit (cost=636146.05..636146.61 rows=10 width=260) (actual time=31879.550..31879.568 rows=7 loops=1)
-> Index Scan using products_pkey on products __local_2__ (cost=0.43..636148.05 rows=11500646 width=260) (actual time=0.054..31343.420 rows=11500617 loops=1)
Planning time: 0.079 ms
Execution time: 31879.592 ms
The same SQL with to_json removed:
explain analyze select __local_2__."id" as "id", __local_2__."shop_id" as "shopId", __local_2__."name" as "name", __local_2__."published" as "published", __local_2__."body_html" as "bodyHtml", __local_2__."product_type" as "productType", __local_2__."created_at" as "createdAt", __local_2__."updated_at" as "updatedAt"
from "public"."products" as __local_2__
where (TRUE) and (TRUE) and (TRUE)
order by __local_2__."id" ASC
OFFSET 11500610
limit 10
Analysis result:
Limit (cost=406133.85..406134.21 rows=10 width=553) (actual time=3453.729..3453.732 rows=7 loops=1)
-> Index Scan using products_pkey on products __local_2__ (cost=0.43..406135.12 rows=11500646 width=553) (actual time=0.033..2948.304 rows=11500617 loops=1)
Planning time: 0.076 ms
Execution time: 3453.750 ms
Removing some of the to_json calls is an optimisation that's on my todo list; we can remove it for very simple types like int and string, but we cannot remove it for most other types, otherwise nested queries would return different data when the field is nested versus when it is at the root level.
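As a concrete illustration (a minimal example, not from the original queries): for a timestamptz, PostgreSQL's plain text output and its to_json rendering differ, so dropping to_json would change the value the client receives:

-- Plain text output uses PostgreSQL's own timestamp format:
select now()::text;
-- e.g. 2019-05-01 12:34:56.789012+00

-- to_json renders the same value as an ISO 8601 JSON string:
select to_json(now());
-- e.g. "2019-05-01T12:34:56.789012+00:00"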
I strongly suggest you move to cursor-based pagination rather than limit/offset; cursor pagination enables PostgreSQL to use its indexes efficiently, resulting in significantly better performance. In this case I would expect the query to take on the order of tens of milliseconds with cursor pagination.
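For reference, cursor pagination (exposed through PostGraphile connections' first/after arguments) boils down to keyset filtering; here is a minimal sketch against the products table, written as a prepared statement where $1 is a parameter holding the id of the last row of the previous page:

-- Page 1: no cursor yet.
select id, shop_id, name
from products
order by id asc
limit 10;

-- Later pages: seek straight to the cursor through the primary-key index,
-- instead of scanning and discarding millions of rows the way OFFSET does.
select id, shop_id, name
from products
where id > $1  -- $1 = id of the last row from the previous page
order by id asc
limit 10;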
You could also try upgrading PostgreSQL.
Also, have you benchmarked this fairly? I.e. have you run the first query again after the second query? PostgreSQL has probably cached things from the first execution which make the second execution faster, so it's unlikely the tests are fair unless you interleave them.
(I ask because the query plans are the same but the estimated costs differ.)
Excellent question -- I was going to ask the same thing. At first glance the benchmark does not seem fair. I would suggest the following (see the explain buffers sketch after the list for an extra sanity check):
Stop Postgres, start it again, then run the same queries 4 times:
1 == With to_json
2 == Without to_json
3 == With to_json
4 == Without to_json
THEN -- stop/start Postgres again and run:
1 == Without to_json
2 == With to_json
3 == Without to_json
4 == With to_json
And let's compare everything.
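As an extra sanity check (an addition, not part of the procedure above), running the statements with the buffers option makes caching visible: each run reports how many pages came from PostgreSQL's shared buffer cache versus the OS/disk, so a warm-cache run is easy to spot:

explain (analyze, buffers)
select id
from products
order by id asc
offset 11500610
limit 10;
-- In the output, "Buffers: shared hit=..." counts pages served from
-- PostgreSQL's cache, while "read=..." counts pages fetched from disk.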
@benjie
Thank you for your reply.
Both queries were benchmarked fairly, following the procedure @sjmcdowall described, so there is no caching effect.
The version of PostgreSQL currently in use is 10.5.
I look forward to the completion of this optimization; thanks again for providing the powerful PostGraphile.
I am Chinese and my English is not very good; thank you for your patience in reading.
So, having looked into this more, it seems to be related to how the PostgreSQL query planner currently works. It's possible that future versions of PostgreSQL will resolve these issues. In the meantime it's recommended that you work around this by using cursor-based pagination, which allows PostgreSQL to utilise its indexes efficiently; limit/offset pagination is expensive in general.
I'm closing this for now because it will involve a large amount of work for quite an edge-case gain. If you'd like to sponsor this work, please get in touch.
This is fixed in v4.4.
create table products (
  id serial primary key, -- primary key needed for the products_pkey index used in the plans above
  shop_id int not null default floor(random() * 10000000),
  name text not null,
  published boolean not null default true,
  body_html text default '<html><head><title>Foo</title></head><body>Bar</body></html>',
  product_type text default 'fork',
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);
insert into products (name) select i::text from generate_series(1, 10000000) i;
{
allProducts(first: 10, offset: 9000000) {
edges {
node {
id
shopId
name
bodyHtml
published
productType
createdAt
updatedAt
}
}
}
}
4.0.1 took 33.19s:
npx postgraphile@4.0.1 -c issue-845 -p 5111
4.4.0-beta.5 took 988ms:
npx postgraphile@next -c issue-845 -p 5112