I am experiencing performance problems with row-level security and the to_json function. The issue may be related to #380. Here's my setup: I have a table articles with about 100k rows, and two roles (root and viewer). root is allowed to query all articles, while viewer may only query the ones that are published. Therefore I enabled row-level security and added this policy:
CREATE POLICY articles_read_viewer ON articles FOR SELECT TO viewer USING ((status = 'PUBLISHED'::status));
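(RLS itself is enabled on the table with something like the following; the GRANT line is an assumption, shown only for completeness.)
ALTER TABLE articles ENABLE ROW LEVEL SECURITY;
GRANT SELECT ON articles TO viewer;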
When querying as viewer, this request is slower (100ms vs 8ms) than as root, which is fine. But as soon as to_json is added, the response time goes from 100ms to 8 seconds (!).
Without to_json:
explain analyze select * from latest_articles() as a where true offset 0 limit 10;
Limit (cost=166838.48..166838.60 rows=10 width=1674) (actual time=92.209..92.214 rows=10 loops=1)
-> Sort (cost=166838.48..167072.71 rows=93694 width=1674) (actual time=92.208..92.213 rows=10 loops=1)
Sort Key: articles.publication_date DESC
Sort Method: top-N heapsort Memory: 45kB
-> Seq Scan on articles (cost=0.00..21734.44 rows=93694 width=1674) (actual time=0.500..71.917 rows=94549 loops=1)
Filter: ((status = 'PUBLISHED'::status) AND (publication_date < (now())::timestamp without time zone))
Rows Removed by Filter: 5223
Planning time: 3.484 ms
Execution time: 92.255 ms
With to_json:
explain analyze select to_json(a) as value from latest_articles() as a where true offset 0 limit 10;
Limit (cost=166838.48..166838.63 rows=10 width=32) (actual time=7709.402..7709.629 rows=10 loops=1)
-> Subquery Scan on __local_0__ (cost=166838.48..168243.89 rows=93694 width=32) (actual time=7709.401..7709.628 rows=10 loops=1)
-> Sort (cost=166838.48..167072.71 rows=93694 width=1674) (actual time=7709.324..7709.326 rows=10 loops=1)
Sort Key: articles.publication_date DESC
Sort Method: external merge Disk: 130496kB
-> Seq Scan on articles (cost=0.00..21734.44 rows=93694 width=1674) (actual time=0.369..77.828 rows=94549 loops=1)
Filter: ((status = 'PUBLISHED'::status) AND (publication_date < (now())::timestamp without time zone))
Rows Removed by Filter: 5223
Planning time: 2.762 ms
Execution time: 7796.936 ms
For reference, here is the latest_articles function:
CREATE FUNCTION latest_articles(before_date timestamp without time zone DEFAULT now()) RETURNS SETOF articles
LANGUAGE sql STABLE
AS $$
select *
from articles
WHERE publication_date < before_date
order by publication_date DESC
$$;
Create an index on publication_date and the problem should go away, I hope?
There is an index on it:
CREATE INDEX articles_publication_date_idx ON articles USING btree (publication_date);
Is something wrong with that?
Ok, I could solve the problem by increasing work_mem: SET work_mem = '500MB';
However, it looks like the index isn't used? Right?
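For reference, that setting doesn't have to be server-wide; it can be applied per session or persisted for a specific role (role name taken from the setup above, the value is simply what worked here):
SET work_mem = '500MB';                    -- current session only
ALTER ROLE viewer SET work_mem = '500MB';  -- persists for future sessions of this role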
Definitely looks like the index isn't used; have you done a VACUUM ANALYZE articles; recently?
I just ran VACUUM ANALYZE articles; but it didn't change anything.
For a closer comparison, maybe try
explain analyze select a from latest_articles() as a where true offset 0 limit 10;
N.B. select a versus select *
As for that index not being used, it may have to do with a subtly mismatched type. I'm integrating full-text search into another project and I kept running into this (a small illustration follows below).
If you try explain analyze select id from articles order by publication_date; you should see Index Scan using articles_publication_date_idx....
I'd start from there!
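To illustrate the kind of subtle mismatch I mean, here's a throwaway example (not from this issue): with enough rows, comparing an integer column to a numeric literal forces the column to be cast, so the btree index can't be used, whereas an integer literal can use it.
create table t (n integer);
create index t_n_idx on t (n);
explain select * from t where n = 1.0;  -- column cast to numeric: sequential scan
explain select * from t where n = 1;    -- exact type match: index scan possible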
@danielbuechele I'm not sure if the results I found looking into a similar issue are useful to you (sorry, it's a lot of query plans for subtly different queries and it's not very readable due to being too wide): https://gist.github.com/benjie/02df5fdbd7ee8315712fff83007f3cd1
Somewhat out of nowhere, indexes are now used in my queries, at least when no policy is applied. When a row-level-security policy is applied, the index is not used anymore. Not sure if there is a way to make Postgres use the index even when the policy is applied.
But even in the case of row-level security with no index being used, there are two scenarios. The current implementation needs a lot of memory, so the sort has to spill to disk (which can be fixed by increasing work_mem). The implementation from #396 needs far less memory, so work_mem doesn't need to be increased.
Just wanted to add my two cents here: I have also observed the same problem when prototyping with PostGraphQL on a table with 100k+ generated rows. Very similar symptoms: once row-level security is enabled and a condition is used, seq scans start ruling the world.
I even experimented by creating a SQL function and using it in the row-level check; same problem - seq scan. BUT if I ran the effective query (by effective query I mean my original query plus the check, sketched below), all indexes were used correctly.
Which brought me to the question: maybe this is a PostgreSQL planner / row-level-security issue?
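Roughly, by the effective query I mean something like this (reusing the articles example from the top of the thread and inlining the policy's USING condition by hand instead of letting RLS add it):
explain analyze
select *
from articles
where publication_date < now()::timestamp without time zone
  and status = 'PUBLISHED'::status  -- the RLS policy condition, inlined manually
order by publication_date desc
limit 10;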
Certainly sounds like it. I think the performance issues in the gist I posted above are also PostgreSQL query planner issues (it should filter the rows before formatting the data, like it does for the indexed version). I'm running 9.6.2 and still have these issues, but 10.0.0 is just around the corner - it's possible some might be addressed? We should probably post something to the postgres mailing list.
I believe these kind of performance issues are solved in V4, but it's possible we need a few more CTEs to fully solve them. (I'm loathe to add more than necessary because they complicate the generated queries.)
There are RLS performance issues in Postgres 9.5 and 9.6 which are solved in Postgres 10 - I'm quite excited for its release!
One of the PostGraphQL users has also stated that a very large query they run now takes 1/4 of the time to execute and 1/10 of the RAM with v4; so our performance improvements are also promising. Would love to get some more stats on this from others.
Not sure if this is relevant, but since it seems close to the subject at hand and this issue is still open, I thought I might post a quick problem that I would really like to get resolved.
I'm using Postgres 9.6.6, currently on postgraphile 4.0.0-beta.5. A very simple to_json after executing an UPDATE statement takes 45s, and all the mutation asked for is one field...
The table being returned is a join table, which means it contains two ids from two different tables plus its own id. I'm really not sure why this happens; it seems to slow down only for this table.
CREATE OR REPLACE FUNCTION remove_staff_from_slot(
id uuid
) RETURNS slot AS $$
UPDATE slot SET staff_id = null WHERE slot.id = id;
SELECT * FROM slot WHERE slot.id = id;
$$ language sql volatile;
Slot
CREATE TABLE slot
(
id uuid NOT NULL PRIMARY KEY DEFAULT uuid_generate_v4(),
project_id id_type NOT NULL REFERENCES project (id) ON DELETE CASCADE,
role_id id_type NOT NULL REFERENCES role (id),
staff_id id_type REFERENCES staff (id),
description text
);
Log
postgraphile:request GraphQL query request has begun. +4m
postgraphile:request GraphQL query is parsed. +1ms
postgraphile:request GraphQL query is validated. +2ms
postgraphile:postgres begin +6ms
postgraphile:postgres SAVEPOINT graphql_mutation +1ms
postgraphile:postgres
postgraphile:postgres with __local_0__ as (
postgraphile:postgres select __local_1__.* from "remove_staff_from_slot"($1) __local_1__
postgraphile:postgres )
postgraphile:postgres select (__local_0__)::text from __local_0__ +0ms
postgraphile:postgres
postgraphile:postgres with __local_0__ as (
postgraphile:postgres select (str::"slot").*
postgraphile:postgres from unnest(($1)::text[]) str
postgraphile:postgres )
postgraphile:postgres
postgraphile:postgres select to_json((__local_0__."id")) as "id"
postgraphile:postgres from __local_0__ as __local_0__
postgraphile:postgres
postgraphile:postgres where (not (__local_0__ is null)) and (TRUE) and (TRUE)
postgraphile:postgres
postgraphile:postgres
postgraphile:postgres
postgraphile:postgres +45s
postgraphile:postgres RELEASE SAVEPOINT graphql_mutation +3ms
postgraphile:postgres commit +0ms
postgraphile:request GraphQL query has been executed. +1ms
postgraphile:request GraphQL query request finished. +1ms
0 error(s) in 44683.24ms :: mutation RemoveTest($input: RemoveStaffFromSlotInput!) { removeStaffFromSlot(input: $input) { slot { id } } }
45s?! That's extremely unexpected. If you run PostGraphile as a superuser database user, does it run faster?
Btw a more efficient implementation of your function would be:
CREATE OR REPLACE FUNCTION remove_staff_from_slot(
id uuid
) RETURNS slot AS $$
UPDATE slot SET staff_id = null WHERE slot.id = id returning *;
$$ language sql volatile;
I had that statement before; it performs about the same. I'm just trying a different form to see if it makes any difference.
The DB I run locally has only one user, so by default I believe it is running as superuser.
I ran it multiple times; it oscillates between 25s, 45s, etc. Dunno why.
What happens if you just run the remove_staff_from_slot function manually via psql? Note, to be fair, that you must use psql as the same user as postgraphile uses; e.g. if you use
postgraphile -c "postgres://user:pass@localhost/mydb"
then you should run
psql "postgres://user:pass@localhost/mydb"
Yes, that was the first thing I did when I noticed it, and fortunately it runs instantly. I even copied the UPDATE statement and ran it as well, and it also runs instantly. The part that slows down, as you can see in the log, is where it tries to run to_json on the return value.
Btw, the URL I use locally is postgres://localhost/mydb; I don't run with a specific user.
Here is my command:
DEBUG=postgraphile*,-postgraphile:graphql,-postgraphile:postgres:explain postgraphile --secret the_secret_word --cors --token jwt_token --dynamic-json --connection ${DATABASE_URL-postgres://localhost:5432/mydb} --watch
So what's the smallest statement that reproduces the issue when typed into psql?
(And also what's the output of that statement?)
I don't know. As you can see, the slowdown is in the query generated by postgraphile that does the to_json. I'm not sure how to reproduce that query in psql.
You should be able to run it verbatim except replacing $1 with the function argument.
I see a +0ms right after the statement that calls remove_staff_from_slot.
Is the +45s a measurement of the next statement, ... unnest(($1)::text[])? If so, what should one use for $1 there, @benjie?
Ah, good catch; I did miss that.
If you run:
with __local_0__ as (
select __local_1__.* from "remove_staff_from_slot"(PUT_YOUR_ARGUMENT_TO_THIS_FUNCTION_HERE) __local_1__
)
select (__local_0__)::text from __local_0__;
Then that will return a string. The string it returns should then be fed back into Postgres as $1 of the next statement. The reason for this is that we cannot create temporary tables, but we need the mutation to take place first and the select to run afterwards, which CTEs do not guarantee due to certain optimisations.
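A sketch of doing that by hand in psql, assuming a placeholder uuid (replace it with a real slot id); \gset captures the text form of the row into a psql variable so it can stand in for $1 in the second statement:
with __local_0__ as (
  select __local_1__.* from "remove_staff_from_slot"('00000000-0000-0000-0000-000000000000') __local_1__
)
select (__local_0__)::text as row_text from __local_0__ \gset

with __local_0__ as (
  select (str::"slot").*
  from unnest((array[:'row_text'])::text[]) str
)
select to_json((__local_0__."id")) as "id"
from __local_0__ as __local_0__
where (not (__local_0__ is null)) and (TRUE) and (TRUE);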
I think I found the culprit. I forgot I had two triggers covering this table. One of them creates an audit record based on the changes; the other generates the tsvector column in the table. Of course both add to the cost of the UPDATE statement, so it turns out to be very slow (a sketch of how to confirm this is below).
I somehow always thought triggers were async, but in PostgreSQL they all run synchronously.
Regardless, a very minor thing I would like to mention: in the log, the statement shown as taking 0ms is actually the one taking the longest (the CTE calling remove_staff_from_slot); somehow the timing doesn't line up with the way the statements are logged.
Therefore I think my issue is solved; I really appreciate all the help.
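For anyone hitting something similar, one way to confirm it is the triggers would be to run EXPLAIN ANALYZE on the UPDATE itself, which reports per-trigger timings (trigger names below are hypothetical, and the uuid is a placeholder):
begin;
explain analyze
update slot set staff_id = null
where slot.id = '00000000-0000-0000-0000-000000000000';
rollback;
-- The plan output ends with per-trigger lines such as:
--   Trigger audit_slot_changes: time=... calls=1
--   Trigger slot_tsv_update: time=... calls=1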
The time comes from the debug module; I think we log before running the SQL so that makes sense.
It is really weird: I have now switched to the generated mutation (updateSlotById) to imitate what the function was doing, and it is blazing fast!
Just wondering, but besides the to_json issue mentioned in this thread, is there any optimization that the mutations postgraphile generates get which manual stored procedures don't? I really can't see what I, as a developer and user of this framework, can do to make operations like this faster.
If this is not relevant to the thread I can move this to a new thread.
PostGraphile just runs the SQL you see and then pretty much passes the results to the end user; the performance problems tend to come from within the database. If you can make a reproducible test case where the SQL PostGraphile generates is really slow but the equivalent SQL that you would write is fast, I'd love to see it, as a key focus for me is performance. But in my performance testing we achieve pretty good results!
Ah, that's not what I was asking; I'm just curious why my own version is so slow while the one generated by postgraphile is so fast, even though it seems to be doing the same thing in the end.
[semi-automated message] Hi, there has been no activity in this issue for over 8 weeks so I'm closing it to keep the issues/pull request manageable. If this is still an issue, please re-open with a comment explaining why.
