I am experiencing performance problems with row-level security and the to_json function. The issue may be related to #380. Here's my setup: I have a table articles with about 100k rows, and two roles (root and viewer). root is allowed to query all articles, while viewer may only query the ones that are published. Therefore I enabled row-level security and added this policy:
CREATE POLICY articles_read_viewer ON articles FOR SELECT TO viewer USING ((status = 'PUBLISHED'::status));
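(RLS itself is enabled on the table with something like the following; the GRANT line is an assumption, shown only for completeness.)
ALTER TABLE articles ENABLE ROW LEVEL SECURITY;
GRANT SELECT ON articles TO viewer;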
When querying as viewer, this request is slower (100ms vs 8ms) than as root, which is fine. But as soon as to_json is added, the response time goes from 100ms to 8 seconds (!).
Without to_json:
explain analyze select * from latest_articles() as a where true offset 0 limit 10;
Limit (cost=166838.48..166838.60 rows=10 width=1674) (actual time=92.209..92.214 rows=10 loops=1)
-> Sort (cost=166838.48..167072.71 rows=93694 width=1674) (actual time=92.208..92.213 rows=10 loops=1)
Sort Key: articles.publication_date DESC
Sort Method: top-N heapsort Memory: 45kB
-> Seq Scan on articles (cost=0.00..21734.44 rows=93694 width=1674) (actual time=0.500..71.917 rows=94549 loops=1)
Filter: ((status = 'PUBLISHED'::status) AND (publication_date < (now())::timestamp without time zone))
Rows Removed by Filter: 5223
Planning time: 3.484 ms
Execution time: 92.255 ms
With to_json:
explain analyze select to_json(a) as value from latest_articles() as a where true offset 0 limit 10;
Limit (cost=166838.48..166838.63 rows=10 width=32) (actual time=7709.402..7709.629 rows=10 loops=1)
-> Subquery Scan on __local_0__ (cost=166838.48..168243.89 rows=93694 width=32) (actual time=7709.401..7709.628 rows=10 loops=1)
-> Sort (cost=166838.48..167072.71 rows=93694 width=1674) (actual time=7709.324..7709.326 rows=10 loops=1)
Sort Key: articles.publication_date DESC
Sort Method: external merge Disk: 130496kB
-> Seq Scan on articles (cost=0.00..21734.44 rows=93694 width=1674) (actual time=0.369..77.828 rows=94549 loops=1)
Filter: ((status = 'PUBLISHED'::status) AND (publication_date < (now())::timestamp without time zone))
Rows Removed by Filter: 5223
Planning time: 2.762 ms
Execution time: 7796.936 ms
For reference, here is the latest_articles function:
CREATE FUNCTION latest_articles(before_date timestamp without time zone DEFAULT now()) RETURNS SETOF articles
LANGUAGE sql STABLE
AS $$
select *
from articles
WHERE publication_date < before_date
order by publication_date DESC
$$;
Create an index on publication_date and the problem should go away, I hope?
There is an index on it:
CREATE INDEX articles_publication_date_idx ON articles USING btree (publication_date);
Is something wrong with that?
Ok, I could solve the problem by increasing work_mem: SET work_mem = '500MB';
However, it looks like the index isn't used? Right?
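For reference, that setting doesn't have to be server-wide; it can be applied per session or persisted for a specific role (role name taken from the setup above, the value is simply what worked here):
SET work_mem = '500MB';                    -- current session only
ALTER ROLE viewer SET work_mem = '500MB';  -- persists for future sessions of this role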
Definitely looks like the index isn't used; have you done a VACUUM ANALYZE articles; recently?
I just ran VACUUM ANALYZE articles; but it didn't change anything.
For a closer comparison, maybe try
explain analyze select a from latest_articles() as a where true offset 0 limit 10;
N.B. select a versus select *
As for that index not being used, it may have to do with a subtly mismatched type. I'm integrating full-text search into another project and I kept running into this (a small illustration follows below).
If you try explain analyze select id from articles order by publication_date; you should see Index Scan using articles_publication_date_idx....
I'd start from there!
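To illustrate the kind of subtle mismatch I mean, here's a throwaway example (not from this issue): with enough rows, comparing an integer column to a numeric literal forces the column to be cast, so the btree index can't be used, whereas an integer literal can use it.
create table t (n integer);
create index t_n_idx on t (n);
explain select * from t where n = 1.0;  -- column cast to numeric: sequential scan
explain select * from t where n = 1;    -- exact type match: index scan possible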
@danielbuechele I'm not sure if the results I found looking into a similar issue are useful to you (sorry, it's a lot of query plans for subtly different queries and it's not very readable due to being too wide): https://gist.github.com/benjie/02df5fdbd7ee8315712fff83007f3cd1
Somewhat out of nowhere, indexes are now used in my queries, at least when no policy is applied. When a row-level-security policy is applied, the index is not used anymore. Not sure if there is a way to make Postgres use the index even when the policy is applied.
But even in the case of row-level security with no index being used, there are two scenarios. The current implementation needs a lot of memory, so the sort has to spill to disk (which can be fixed by increasing work_mem). The implementation from #396 needs far less memory, so work_mem doesn't need to be increased.
Just wanted to add my two cents here: I have also observed the same problem when prototyping with PostGraphQL on a table with 100k+ generated rows. Very similar symptoms: once row-level security is enabled and a condition is used, seq scans start ruling the world.
I even experimented by creating a SQL function and using it in the row-level check; same problem - seq scan. BUT if I ran the effective query (by effective query I mean my original query plus the check, sketched below), all indexes were used correctly.
Which brought me to the question: maybe this is a PostgreSQL planner / row-level-security issue?
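Roughly, by the effective query I mean something like this (reusing the articles example from the top of the thread and inlining the policy's USING condition by hand instead of letting RLS add it):
explain analyze
select *
from articles
where publication_date < now()::timestamp without time zone
  and status = 'PUBLISHED'::status  -- the RLS policy condition, inlined manually
order by publication_date desc
limit 10;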
Certainly sounds like it. I think the performance issues in the gist I posted above are also PostgreSQL query planner issues (it should filter the rows before formatting the data, like it does for the indexed version). I'm running 9.6.2 and still have these issues, but 10.0.0 is just around the corner - it's possible some might be addressed? We should probably post something to the postgres mailing list.
I believe these kind of performance issues are solved in V4, but it's possible we need a few more CTEs to fully solve them. (I'm loathe to add more than necessary because they complicate the generated queries.)
There are RLS performance issues in Postgres 9.5 and 9.6 which are solved in Postgres 10 - I'm quite excited for its release!
One of the PostGraphQL users has also stated that a very large query they run now takes 1/4 of the time to execute and 1/10 of the RAM with v4; so our performance improvements are also promising. Would love to get some more stats on this from others.
Not sure if this is relevant, but since it seems close to the subject at hand and this issue is still open, I thought I might post a quick problem that I would really like to get resolved.
I'm using Postgres 9.6.6, currently on postgraphile 4.0.0-beta.5. A very simple to_json after executing an UPDATE statement takes 45s, and all the mutation asked for is one field...
The table being returned is a join table, which means it contains two ids from two different tables plus its own id. I'm really not sure why this happens; it seems to slow down only for this table.
CREATE OR REPLACE FUNCTION remove_staff_from_slot(
id uuid
) RETURNS slot AS $$
UPDATE slot SET staff_id = null WHERE slot.id = id;
SELECT * FROM slot WHERE slot.id = id;
$$ language sql volatile;
Slot
CREATE TABLE slot
(
id uuid NOT NULL PRIMARY KEY DEFAULT uuid_generate_v4(),
project_id id_type NOT NULL REFERENCES project (id) ON DELETE CASCADE,
role_id id_type NOT NULL REFERENCES role (id),
staff_id id_type REFERENCES staff (id),
description text
);
Log
postgraphile:request GraphQL query request has begun. +4m
postgraphile:request GraphQL query is parsed. +1ms
postgraphile:request GraphQL query is validated. +2ms
postgraphile:postgres begin +6ms
postgraphile:postgres SAVEPOINT graphql_mutation +1ms
postgraphile:postgres
postgraphile:postgres with __local_0__ as (
postgraphile:postgres select __local_1__.* from "remove_staff_from_slot"($1) __local_1__
postgraphile:postgres )
postgraphile:postgres select (__local_0__)::text from __local_0__ +0ms
postgraphile:postgres
postgraphile:postgres with __local_0__ as (
postgraphile:postgres select (str::"slot").*
postgraphile:postgres from unnest(($1)::text[]) str
postgraphile:postgres )
postgraphile:postgres
postgraphile:postgres select to_json((__local_0__."id")) as "id"
postgraphile:postgres from __local_0__ as __local_0__
postgraphile:postgres
postgraphile:postgres where (not (__local_0__ is null)) and (TRUE) and (TRUE)
postgraphile:postgres
postgraphile:postgres
postgraphile:postgres
postgraphile:postgres +45s
postgraphile:postgres RELEASE SAVEPOINT graphql_mutation +3ms
postgraphile:postgres commit +0ms
postgraphile:request GraphQL query has been executed. +1ms
postgraphile:request GraphQL query request finished. +1ms
0 error(s) in 44683.24ms :: mutation RemoveTest($input: RemoveStaffFromSlotInput!) { removeStaffFromSlot(input: $input) { slot { id } } }
45s?! That's extremely unexpected. If you run PostGraphile as a superuser database user, does it run faster?
Btw a more efficient implementation of your function would be:
CREATE OR REPLACE FUNCTION remove_staff_from_slot(
id uuid
) RETURNS slot AS $$
UPDATE slot SET staff_id = null WHERE slot.id = id returning *;
$$ language sql volatile;
I had that statement before; it performs about the same. I'm just trying a different form to see if it makes any difference.
The DB I run locally has only one user, so by default I believe it is running as superuser.
I ran it multiple times; it oscillates between 25s, 45s, etc. Dunno why.
What happens if you just run the remove_staff_from_slot function manually via psql? Note, to be fair, that you must use psql as the same user as postgraphile uses; e.g. if you use
postgraphile -c "postgres://user:pass@localhost/mydb"
then you should run
psql "postgres://user:pass@localhost/mydb"
Yes, that was the first thing I did when I noticed it, and fortunately it runs instantly. I even copied the UPDATE statement and ran it as well, and it also runs instantly. The part that slows down, as you can see in the log, is where it tries to run to_json on the return value.
Btw, the URL I use locally is postgres://localhost/mydb; I don't run with a specific user.
Here is my command:
DEBUG=postgraphile*,-postgraphile:graphql,-postgraphile:postgres:explain postgraphile --secret the_secret_word --cors --token jwt_token --dynamic-json --connection ${DATABASE_URL-postgres://localhost:5432/mydb} --watch
So what's the smallest statement that reproduces the issue when typed into psql?
(And also what's the output of that statement?)
I don't know. As you can see, the slowdown is in the query generated by postgraphile that does the to_json. I'm not sure how to reproduce that query in psql.
You should be able to run it verbatim except replacing $1 with the function argument.
I see a +0ms right after the statement that calls remove_staff_from_slot.
Is the +45s a measurement of the next statement, ... unnest(($1)::text[])? If so, what should one use for $1 there, @benjie?
Ah, good catch; I did miss that.
If you run:
with __local_0__ as (
select __local_1__.* from "remove_staff_from_slot"(PUT_YOUR_ARGUMENT_TO_THIS_FUNCTION_HERE) __local_1__
)
select (__local_0__)::text from __local_0__;
Then that will return a string. The string it returns should then be fed back into Postgres as $1 of the next statement. The reason for this is that we cannot create temporary tables, but we need the mutation to take place first and the select to run afterwards, which CTEs do not guarantee due to certain optimisations.
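A sketch of doing that by hand in psql, assuming a placeholder uuid (replace it with a real slot id); \gset captures the text form of the row into a psql variable so it can stand in for $1 in the second statement:
with __local_0__ as (
  select __local_1__.* from "remove_staff_from_slot"('00000000-0000-0000-0000-000000000000') __local_1__
)
select (__local_0__)::text as row_text from __local_0__ \gset

with __local_0__ as (
  select (str::"slot").*
  from unnest((array[:'row_text'])::text[]) str
)
select to_json((__local_0__."id")) as "id"
from __local_0__ as __local_0__
where (not (__local_0__ is null)) and (TRUE) and (TRUE);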
I think I found the culprit. I forgot I had two triggers covering this table. One of them creates an audit record based on the changes; the other generates the tsvector column in the table. Of course both add to the cost of the UPDATE statement, so it turns out to be very slow (a sketch of how to confirm this is below).
I somehow always thought triggers were async, but in PostgreSQL they all run synchronously.
Regardless, a very minor thing I would like to mention: in the log, the statement shown as taking 0ms is actually the one taking the longest (the CTE calling remove_staff_from_slot); somehow the timing doesn't line up with the way the statements are logged.
Therefore I think my issue is solved; I really appreciate all the help.
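For anyone hitting something similar, one way to confirm it is the triggers would be to run EXPLAIN ANALYZE on the UPDATE itself, which reports per-trigger timings (trigger names below are hypothetical, and the uuid is a placeholder):
begin;
explain analyze
update slot set staff_id = null
where slot.id = '00000000-0000-0000-0000-000000000000';
rollback;
-- The plan output ends with per-trigger lines such as:
--   Trigger audit_slot_changes: time=... calls=1
--   Trigger slot_tsv_update: time=... calls=1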
The time comes from the debug module; I think we log before running the SQL so that makes sense.
It is really weird: I have now switched to the generated mutation (updateSlotById) to imitate what the function was doing, and it is blazing fast!
Just wondering, but besides the to_json issue mentioned in this thread, is there any optimization that the mutations postgraphile generates get which manual stored procedures don't? I really can't see what I, as a developer and user of this framework, can do to make operations like this faster.
If this is not relevant to the thread I can move this to a new thread.
PostGraphile just runs the SQL you see and then pretty much passes the results to the end user; the performance problems tend to come from within the database. If you can make a reproducible test case where the SQL PostGraphile generates is really slow but the equivalent SQL that you would write is fast, I'd love to see it, as a key focus for me is performance. But in my performance testing we achieve pretty good results!
Ah, that's not what I was asking; I'm just curious why my own version is so slow while the one generated by postgraphile is so fast, even though it seems to be doing the same thing in the end.
[semi-automated message] Hi, there has been no activity in this issue for over 8 weeks so I'm closing it to keep the issues/pull request manageable. If this is still an issue, please re-open with a comment explaining why.
