Graphql-engine: Performance problem with queries as a non-admin user

Created on 10 Jan 2020 · 15 comments · Source: hasura/graphql-engine

Hi all,

I currently have a query that takes much longer than it did previously (I have recently updated, but am not 100% confident this is a regression).

Essentially, when performing the query as an admin user, it completes in a fraction of a second; however, when switching to a regular user, the same query takes over 24 seconds. The regular user has a (relatively complex) permission attached, which I imagine is the discriminating difference between the regular user and the admin.

An interesting observation is that the Postgres instance is pinned at 100% CPU for the entirety of the 24 seconds. Both services run in Docker containers (essentially the DigitalOcean one-click docker-compose setup).

I am not sure why checking permissions is taking so long; the resulting response is only around 240 kB. My hunch is that Hasura is repeatedly requesting permission constraints (or data pertaining to those constraints) for each object returned, rather than caching them, but I don't know enough about the system to explore this myself.

Let me know if you need any more info!

Label: server

All 15 comments

Could you include which version of graphql-engine you’re using?

Stripped out the admin secret, and other environment variables, but here's my docker-compose:

version: '3.6'
services:
  postgres:
    image: postgres
    restart: always
    volumes:
    - db_data:/var/lib/postgresql/data
  graphql-engine:
    image: hasura/graphql-engine:v1.0.0
    depends_on:
    - "postgres"
    restart: always
    environment:
      # database url to connect
      HASURA_GRAPHQL_DATABASE_URL: postgres://postgres:@postgres:5432/postgres
      # enable the console served by server
      HASURA_GRAPHQL_ENABLE_CONSOLE: "true" # set "false" to disable console
    command:
    - graphql-engine 
    - serve
  caddy:
    image: abiosoft/caddy:0.11.0
    depends_on:
    - "graphql-engine"
    restart: always
    ports:
    - "80:80"
    - "443:443"
    volumes:
    - ./Caddyfile:/etc/Caddyfile
    - caddy_certs:/root/.caddy
volumes:
  db_data:
  caddy_certs:

It's version 1.0.0.

Edit: also worth mentioning that the previous version I had been running, which performed quickly, was v1.0.0-beta.9.

Thanks! Could you also share the query that’s taking too long? And, ideally, the relevant bits of the database schema? It would help a lot to reproduce the issue locally.

Sure thing - do you mind if I send it privately?

Sure thing: send me an email at [email protected], and I’ll take a look and/or pass it along.

Okay, done. I included the relevant permission rule, the query, and the generated SQL query. The database schema wasn't included, as there's no nice way to export it, but it should be implicit within the query string.

@pnappa Would be nice if you could provide _some_ information on this, so that others could also be aware of possible bottlenecks.

@webdeb What's a good minimum information set to provide? Unfortunately I can't provide the exact query/dataset, as it's my employer's information (I wish I could).

A good litmus test for whether your problem falls under this issue is whether a query takes much, much longer when run under role permissions than when x-hasura-admin-secret is supplied.

@pnappa sure, I'm not expecting you to provide your client's/employer's data.

but at least

  • how big are the table(s)
  • are you using joins
  • what kind of permission rules and combinations (_or / _eq / _and, etc.)
  • how many header/jwt arguments are used

that kind of information. I mean you are facing a 24x performance impact. A minimal reproducible example would be great.

If I’m able to figure out what the issue is, I’ll certainly share it here. However, though @pnappa graciously sent me the query, it’s complicated enough that it isn’t straightforward for me to replicate without some details about the actual schema.

@pnappa, if you could please send me a dump of the SQL schema (possibly restricted to the relevant tables), that would be very helpful. You can generate one using pg_dump with the -s/--schema-only option. Providing some test data as well would be even more helpful, but that isn’t absolutely necessary.

I sent you another email containing the schema @lexi-lambda.

@webdeb
Here's a brief sketch:

  • we are using joins (i.e. nested queries)
  • the tables involved have around 3-80 rows each, though the largest has about 800.
  • the relationships are mostly tree-like (i.e. I think they're non-cyclic)
  • only 2 jwt data values, and only one is used in the permission query (X-Hasura-User-Id).
  • the permission is pretty simple, it's a conjunction with two arguments.

As I said, it's not easy to make a minimal reproducible example. I haven't experienced any slow-down like this elsewhere in our application, and we have around 500(?) queries in total.

Alright, with the information @pnappa provided me, I think I’ve tracked down the issue. I don’t think it’s a problem with Hasura per se, but rather an issue with query planning. For some reason, on your database, Postgres thought the query you’re trying to run had an enormous cost! Here’s the first line of the EXPLAIN output:

Aggregate  (cost=816421.63..816421.64 rows=1 width=32)

I don’t know why this number was so large. On my local database, the number was far smaller! So I tried running ANALYZE… and that fixed the query plan. Now the query runs fast on your database again.


Why did this inflated cost number tank your query performance? Well, the issue is that your Postgres instance was compiled with JIT compilation support. JIT compiling a query has the potential to make it run faster, but the actual compilation process is expensive, so Postgres doesn’t use it unless it expects the extra time investment to pay off.

The default threshold for JITing is 100,000 “cost units,” and the default threshold for JITing with optimizations is 500,000 cost units. The cost number Postgres was coming up with blows right past both those thresholds, so Postgres was firing up the whole JIT pipeline for this query.
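Those thresholds are ordinary settings you can inspect (and, if you want, raise) yourself; a quick sketch:

-- Show the cost thresholds at which Postgres starts JIT-compiling
-- and JIT-optimizing queries (defaults: 100000 and 500000).
SHOW jit_above_cost;
SHOW jit_optimize_above_cost;

-- They can be raised per session if you want the JIT available but rarely used.
SET jit_above_cost = 1000000;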

That was, unfortunately, a very bad decision. I tweaked some settings on my local database to force it to JIT this query, and here’s what the EXPLAIN ANALYZE output told me:

 Planning Time: 28.567 ms
 JIT:
   Functions: 544
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 72.277 ms, Inlining 88.320 ms, Optimization 1939.653 ms, Emission 1275.987 ms, Total 3376.237 ms
 Execution Time: 3391.661 ms

This is absolutely terrible. For comparison, here's the EXPLAIN ANALYZE output without JITing:

 Planning Time: 10.840 ms
 Execution Time: 1.652 ms

The JIT ballooned this query's total time from about 12 ms to about 3.4 seconds, roughly a 270x increase. It was likely even worse on your instance, since the statistics were so far out of whack that the rest of the query plan was likely useless, too.

I don’t know how the statistics on your database were so far off, since usually the autovacuum daemon takes care of updating that automatically. I’d take a look to make sure that didn’t somehow get disabled. Also, more generally: it isn’t clear to me that the JIT is actually helpful in a lot of situations, even when the query plan is accurate. If you run EXPLAIN on a slow query, and you find it’s using the JIT, you might want to try disabling the JIT and seeing if that fixes your problem.

Anyway, I think this issue can be closed, since the performance seems fine now, but feel free to reopen it if you’re still having problems.

This seems to be it! However, it didn't seem to be fixed when I re-ran that query from the GraphiQL explorer. I had to run EXPLAIN ANALYZE on the offending query myself (although, to be able to execute it in the Run SQL page in Hasura, I had to turn it into a prepared statement, since the logged SQL uses positional parameters). Doing this makes it run pretty quickly again.

So, for anyone else reading this: turn on query logging for your Hasura instance so it prints executed queries (which includes the generated SQL). Then extract the generated SQL from the log entry (it's under the query key) and convert it into a prepared statement in the Run SQL tab:

PREPARE ec2(<arg types>) AS
<generated_sql extracted>;

EXPLAIN ANALYZE EXECUTE ec2(<args>);

Where <args> is the parameter list from the prepared_arguments key, and <arg types> are the types of those parameters. For me, these two roughly looked like:

arg_types -> json, uuid, boolean, boolean, boolean, boolean
args ->  '{"etc": "etc"}', '30725332-af7f-49c9-94ca-8fcf28102fce', false, false, false, false

It might be a good idea to drop the created prepared statement once you're done using it.
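Put together, a toy end-to-end version looks something like this (the table, columns, and query are placeholders; the real SQL and arguments come from your own query log):

-- Hypothetical table so the example is self-contained; in practice you'd
-- run this against your real schema with the SQL copied from the query log.
CREATE TEMP TABLE some_table (owner_id uuid, archived boolean);

-- $1, $2, ... correspond to the entries under prepared_arguments in the log.
PREPARE ec2(json, uuid, boolean) AS
  SELECT count(*) FROM some_table
  WHERE owner_id = $2
    AND archived = $3
    AND ($1 ->> 'x-hasura-role') = 'user';

EXPLAIN ANALYZE EXECUTE ec2('{"x-hasura-role": "user"}', '30725332-af7f-49c9-94ca-8fcf28102fce', false);

-- Drop the prepared statement when you're done.
DEALLOCATE ec2;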

I suppose I should look into turning off the JIT; I can't imagine there are many queries where it's useful for me. An interesting thought: Hasura could actually precompile all the queries when query whitelisting is enforced, which would presumably result in a modest speed improvement (by compiling the queries ahead of time, of course).

(It took me a while to get around to turning the JIT off, but I eventually ran into a query that took 144 seconds, and 143 of those were the JIT compiling and optimising...)

To turn it off in Docker, you need to add a command to configure postgres with the JIT off. So your docker-compose block might look something like this:

  postgres:
    image: postgres
    # see https://github.com/hasura/graphql-engine/issues/3672
    command: postgres -c jit=off
    restart: always
    ports:
    - "5432:5432"
    volumes:
    - db_data:/var/lib/postgresql/data
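Alternatively, if you'd rather not touch the container command, the same setting can be flipped from inside Postgres itself (it's persisted in postgresql.auto.conf and picked up by new sessions):

-- Disable the JIT server-wide without editing postgresql.conf.
ALTER SYSTEM SET jit = off;
SELECT pg_reload_conf();

-- Sanity check (from a new session).
SHOW jit;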

@pnappa in your comment above, what is the <generated_sql extracted> part? Is that the SQL that the Hasura console generates when you click the Analyze button, or something else? Thanks for writing all this up!

Also, based on @lexi-lambda's comments, I ran VACUUM FULL; on my Postgres DB, and after it completed my queries were MUCH faster!
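For what it's worth, it was most likely the refreshed statistics that did the trick; a plain ANALYZE (or VACUUM ANALYZE) achieves that without the exclusive lock and full table rewrite of VACUUM FULL:

-- Lighter-weight alternatives that still refresh the planner statistics.
ANALYZE;
VACUUM (ANALYZE);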
