Graphql-engine: Performance issue with large-ish nested query

Created on 11 Oct 2018 · 15 Comments · Source: hasura/graphql-engine

Hasura version: v1.0.0-alpha23

I have a 40k row table I'm joining to a 5.5m row table. Querying with Hasura is exceptionally slow, but Postgres isn't.

The GraphQL query looks like this:

{
   blocksummary_total (limit: 10){
    total_payments
    amount_due
    blockYearly {
      year
      violation_code
      violation_description
    }
  }
}

I've limited it to 10 results because with much more than that, Hasura hangs for so long I'm forced to give up. Running with a limit of 10 takes between 10 and 20 seconds. Here's an example log line for this query:

{"timestamp":"2018-10-11T15:07:54.667+0000","level":"info","type":"http-log","detail":{"status":200,"query_hash":"476106a5c31616b2b50f3e59cfb2d6874135ce56","http_version":"HTTP/1.1","query_execution_time":13.5884354,"request_id":null,"url":"/v1alpha1/graphql","hasura_metadata":null,"ip":"172.17.0.1","response_size":592159,"method":"POST","hasura_role":null,"detail":null}}

What I believe is the equivalent native SQL query (forgive my ignorance; I haven't yet figured out how to access the query Hasura is constructing):

explain analyze select
    t.geocoded_address,
    t.total_payments,
    t.amount_due,
    y.year,
    y.violation_code,
    y.violation_description
from blocksummary_total t
join blocksummary_yearly y
    on t.geocoded_address = y.geocoded_address
limit 10;

The query plan and actual execution for this query:

  ->  Nested Loop  (cost=0.41..2553486.26 rows=5510465 width=105) (actual time=0.353..0.849 rows=10 loops=1)
        ->  Seq Scan on blocksummary_yearly y  (cost=0.00..166756.65 rows=5510465 width=90) (actual time=0.011..0.012 rows=10 loops=1)
        ->  Index Scan using blocksummary_total_pk on blocksummary_total t  (cost=0.41..0.43 rows=1 width=57) (actual time=0.082..0.082 rows=1 loops=10)
              Index Cond: ((geocoded_address)::text = (y.geocoded_address)::text)
Planning time: 3.180 ms
Execution time: 1.417 ms

The same query on the full dataset in Postgres takes about 5 seconds. I haven't been able to get it to work at all in Hasura.

Am I doing something wrong? Did I find a bug?

Use case: I'm using Hasura with GatsbyJS to generate tens of thousands of pages for a news application about parking tickets in Chicago.

Labels: server, question

All 15 comments

@eads Hang tight. @0x777 will look into this and get back to you asap.

We're also adding a "View SQL" button on the console to make getting the SQL easier :)

Thanks! No rush on this end; for local development, I can just query a small subset of the data. But the performance disparity is certainly puzzling.

A view SQL button will be _amazing_.

@eads Can you check if this problem exists with v1.0.0-alpha22? The query generation logic changed in alpha23. The debug mode that prints the generated SQL statements isn't currently exposed. We'll add the view-SQL feature and an option to print the SQL in the logs ASAP.

Actually, it's quite a bit worse. I ran it a couple of times and it took between 65s and 85s.

My most recent:

{"timestamp":"2018-10-11T20:31:52.754+0000","level":"info","type":"http-log","detail":{"status":200,"query_hash":"8c8f84e47690c8495c3b235c583a9ebe4e012a9c","http_version":"HTTP/1.1","query_execution_time":79.853898269,"request_id":null,"url":"/v1alpha1/graphql","hasura_metadata":null,"ip":"172.17.0.1","response_size":104181,"method":"POST","hasura_role":null,"detail":null}}

FWIW, with both alpha22 and alpha23, queries without a join -- even ones returning hundreds of thousands of rows -- have been lightning fast for me. I just ran one that selects 500k rows and it took 180ms.

Out of curiosity, how might I start digging around the Hasura code to debug this? No worries if you don't have time to respond, but I'm definitely interested in how the engine works and whether this is something I could help troubleshoot a little more effectively.

This should roughly be the SQL query generated for your GraphQL query. I modified the table and column names from a similarly shaped query, so there could be errors in it. Can you check the EXPLAIN output of this query? It should be the same as the EXPLAIN output of your join query above, with a few extra aggregate nodes.

SELECT
  coalesce(json_agg("root"), '[]') AS "root"
FROM
  (
    SELECT
      json_build_object(
        'total_payments',
        "_0_root.base"."total_payments",
        'amount_due',
        "_0_root.base"."amount_due",
        'blockYearly',
        "_3_root.ar.root.blockYearly"."blockYearly"
      ) AS "root"
    FROM
      (
        SELECT
          *
        FROM
          "public"."blocksummary_total"
        WHERE
          ('true')
      ) AS "_0_root.base"
      LEFT OUTER JOIN LATERAL (
        SELECT
          coalesce(json_agg("blockYearly"), '[]') AS "blockYearly"
        FROM
          (
            SELECT
              json_build_object(
                'year',
                "_1_root.ar.root.blockYearly.base"."year",
                'violation_code',
                "_1_root.ar.root.blockYearly.base"."violation_code"
              ) AS "blockYearly"
            FROM
              (
                SELECT
                  *
                FROM
                  "public"."blocksummary_yearly"
                WHERE
                  (("_0_root.base"."geocoded_address") = ("geocoded_address"))
              ) AS "_1_root.ar.root.blockYearly.base"
          ) AS "_2_root.ar.root.blockYearly"
      ) AS "_3_root.ar.root.blockYearly" ON ('true')
  ) AS "_4_root_alias";

You can start looking here (https://github.com/hasura/graphql-engine/blob/master/server/src-lib/Hasura/RQL/DML/Select.hs#L612). This code will be a lot easier to follow when you're looking at the generated SQL. I'll push a dev build that outputs the generated SQL query today.

Here's the explain from your query:

Aggregate  (cost=7230749112.04..7230749112.05 rows=1 width=32)
  ->  Nested Loop Left Join  (cost=180533.98..7230748911.78 rows=40052 width=48)
        ->  Seq Scan on blocksummary_total  (cost=0.00..943.52 rows=40052 width=57)
        ->  Aggregate  (cost=180533.98..180533.99 rows=1 width=32)
              ->  Seq Scan on blocksummary_yearly  (cost=0.00..180532.81 rows=232 width=16)
                    Filter: ((blocksummary_total.geocoded_address)::text = (geocoded_address)::text)

(Running EXPLAIN ANALYZE on this one hung just as badly as running the query through GraphQL.)

Here's the explain from my query, dropping the limit:

Hash Join  (cost=1875.17..344969.33 rows=5510465 width=105)
  Hash Cond: ((y.geocoded_address)::text = (t.geocoded_address)::text)
  ->  Seq Scan on blocksummary_yearly y  (cost=0.00..166756.65 rows=5510465 width=90)
  ->  Hash  (cost=943.52..943.52 rows=40052 width=57)
        ->  Seq Scan on blocksummary_total t  (cost=0.00..943.52 rows=40052 width=57)

Is it possible for you to add an index on the geocoded_address column of blocksummary_yearly?

create index on blocksummary_yearly (geocoded_address);
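
In case it helps, a quick way to check which indexes already exist on the table (pg_indexes is a standard catalog view; adjust the table name as needed):

-- list the indexes currently defined on the table
select indexname, indexdef
from pg_indexes
where tablename = 'blocksummary_yearly';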

Ah damn, I thought I had, but I reloaded the DB and forgot to recreate the index. Will test.

With the index (with EXPLAIN ANALYZE):

Aggregate  (cost=36477367.70..36477367.71 rows=1 width=32) (actual time=23569.334..23569.335 rows=1 loops=1)
  ->  Nested Loop Left Join  (cost=910.69..36477167.43 rows=40052 width=48) (actual time=4.530..22878.824 rows=40052 loops=1)
        ->  Seq Scan on blocksummary_total  (cost=0.00..943.52 rows=40052 width=57) (actual time=0.310..13.602 rows=40052 loops=1)
        ->  Aggregate  (cost=910.69..910.70 rows=1 width=32) (actual time=0.569..0.569 rows=1 loops=40052)
              ->  Bitmap Heap Scan on blocksummary_yearly  (cost=10.36..909.53 rows=232 width=16) (actual time=0.062..0.254 rows=138 loops=40052)
                    Recheck Cond: ((blocksummary_total.geocoded_address)::text = (geocoded_address)::text)
                    Heap Blocks: exact=1658918
                    ->  Bitmap Index Scan on blocksummary_yearly_geocoded_address_idx  (cost=0.00..10.30 rows=232 width=0) (actual time=0.052..0.052 rows=138 loops=40052)
                          Index Cond: ((blocksummary_total.geocoded_address)::text = (geocoded_address)::text)
Planning time: 0.171 ms
Execution time: 23606.171 ms

Hallelujah! About 3x slower than the native query (which didn't see an improvement from adding the index) but also totally reasonable for my purposes.

That's great. Can you run vacuum on both tables? Ideally you should see an index scan.

vacuum blocksummary_total
vacuum blocksummary_yearly

After vacuuming, it's the same query plan:

Aggregate  (cost=36465528.12..36465528.13 rows=1 width=32) (actual time=24057.486..24057.487 rows=1 loops=1)
  ->  Nested Loop Left Join  (cost=910.69..36465327.92 rows=40039 width=48) (actual time=5.996..23318.386 rows=40052 loops=1)
        ->  Seq Scan on blocksummary_total  (cost=0.00..943.39 rows=40039 width=57) (actual time=0.009..13.102 rows=40052 loops=1)
        ->  Aggregate  (cost=910.69..910.70 rows=1 width=32) (actual time=0.580..0.580 rows=1 loops=40052)
              ->  Bitmap Heap Scan on blocksummary_yearly  (cost=10.36..909.53 rows=232 width=16) (actual time=0.061..0.269 rows=138 loops=40052)
                    Recheck Cond: ((blocksummary_total.geocoded_address)::text = (geocoded_address)::text)
                    Heap Blocks: exact=1658918
                    ->  Bitmap Index Scan on blocksummary_yearly_geocoded_address_idx  (cost=0.00..10.30 rows=232 width=0) (actual time=0.052..0.052 rows=138 loops=40052)
                          Index Cond: ((blocksummary_total.geocoded_address)::text = (geocoded_address)::text)
Planning time: 0.709 ms
Execution time: 24094.912 ms

I'll play with it more, though, to see if I can get the planner to do an index scan instead. I feel kind of dumb for not restoring the index, but on the other hand, the query worked fine the way I wrote it by hand, so it didn't occur to me that the way Hasura constructs the query would blow up. It will be a lot easier to troubleshoot once implementers can see the generated SQL queries.

The following typically influence the planner to pick index scans over bitmap scans:

  1. random_page_cost being closer to seq_page_cost (https://www.postgresql.org/docs/current/static/runtime-config-query.html)
  2. the correlation in pg_stats for the column being close to -1 or +1 (https://www.postgresql.org/docs/current/static/view-pg-stats.html)

The default value of random_page_cost is 4 and of seq_page_cost is 1. On an SSD, you can set random_page_cost close to 1.
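
For reference, a minimal sketch for checking both of these and trying a lower random_page_cost (the table and column names are the ones from this thread; persist the setting via postgresql.conf or ALTER SYSTEM if it helps):

-- current planner cost settings
show random_page_cost;
show seq_page_cost;

-- correlation for the join column; values near -1 or +1 favour plain index scans
select tablename, attname, correlation
from pg_stats
where tablename = 'blocksummary_yearly'
  and attname = 'geocoded_address';

-- try a lower random_page_cost for the current session only
set random_page_cost = 1.1;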

I did some experiments to see how graphql-engine performs when fetching really large data. I used the schema that your query suggested and generated a large dataset for it (600K rows in blockyearly_total, 8M rows in blockyearly_yearly).



setup script:


create table blockyearly_total (
  geocoded_address text not null,
  amount_due int not null
);

create table blockyearly_yearly (
  id serial,
  geocoded_address TEXT not null
);

insert into
  blockyearly_total (geocoded_address, amount_due)
select generate_series, ceil(random() * 1000)
from generate_series(1, 600000);

insert into
  blockyearly_yearly (geocoded_address)
select ceil(random() * 600000)
from generate_series(1,8000000);

alter table blockyearly_total add primary key (geocoded_address);
alter table blockyearly_yearly
  add primary key (id),
  add constraint geocoded_address_fk foreign key (geocoded_address)
  references blockyearly_total(geocoded_address);

create index on blockyearly_yearly (geocoded_address);

vacuum analyze;

This is the GraphQL query that we are interested in:

{
  blockyearly_total {
    geocoded_address
    amount_due
    blockyearly {
      id
    }
  }
}

hasura graphql-engine



The SQL query that Hasura generates for this GraphQL query:

explain analyze
SELECT
  coalesce(json_agg("root"), '[]') AS "root"
FROM
  (
    SELECT
      row_to_json(
        (
          SELECT
            "_5_e"
          FROM
            (
              SELECT
                "_0_root.base"."geocoded_address" AS "geocoded_address",
                "_0_root.base"."amount_due" AS "amount_due",
                "_4_root.ar.root.blockyearly"."blockyearly" AS "blockyearly"
            ) AS "_5_e"
        )
      ) AS "root"
    FROM
      (
        SELECT
          *
        FROM
          "public"."blockyearly_total"
        WHERE
          ('true')
      ) AS "_0_root.base"
      LEFT OUTER JOIN LATERAL (
        SELECT
          coalesce(json_agg("blockyearly"), '[]') AS "blockyearly"
        FROM
          (
            SELECT
              row_to_json(
                (
                  SELECT
                    "_2_e"
                  FROM
                    (
                      SELECT
                        ("_1_root.ar.root.blockyearly.base"."id") :: text AS "id"
                    ) AS "_2_e"
                )
              ) AS "blockyearly"
            FROM
              (
                SELECT
                  *
                FROM
                  "public"."blockyearly_yearly"
                WHERE
                  (
                    ("_0_root.base"."geocoded_address") = ("geocoded_address")
                  )
              ) AS "_1_root.ar.root.blockyearly.base"
          ) AS "_3_root.ar.root.blockyearly"
      ) AS "_4_root.ar.root.blockyearly" ON ('true')
  ) AS "_6_root_alias"

1. default

With the default setting for random_page_cost.

large_data=# select current_setting('random_page_cost');
 current_setting
-----------------
 4
(1 row)

Execution time: ~35-37s



Query plan:

Aggregate  (cost=38556633.73..38556633.74 rows=1 width=32) (actual time=37041.518..37041.518 rows=1 loops=1)
  ->  Nested Loop Left Join  (cost=64.20..38547633.73 rows=600000 width=42) (actual time=0.454..35674.381 rows=600000 loops=1)
        ->  Seq Scan on blockyearly_total  (cost=0.00..9243.00 rows=600000 width=10) (actual time=0.059..63.467 rows=600000 loops=1)
        ->  Aggregate  (cost=64.20..64.21 rows=1 width=32) (actual time=0.059..0.059 rows=1 loops=600000)
              ->  Bitmap Heap Scan on blockyearly_yearly  (cost=4.55..63.90 rows=15 width=8) (actual time=0.009..0.043 rows=13 loops=600000)
                    Recheck Cond: (blockyearly_total.geocoded_address = geocoded_address)
                    Heap Blocks: exact=7998796
                    ->  Bitmap Index Scan on blockyearly_yearly_geocoded_address_idx  (cost=0.00..4.54 rows=15 width=0) (actual time=0.005..0.005 rows=13 loops=600000)
                          Index Cond: (blockyearly_total.geocoded_address = geocoded_address)
              SubPlan 2
                ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.000..0.000 rows=1 loops=8000000)
  SubPlan 1
    ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.000..0.000 rows=1 loops=600000)
Planning time: 0.435 ms
Execution time: 37060.171 ms

2. modifying random_page_cost

large_data=# set random_page_cost = 1.1;
SET

Execution time: ~30-31s



Query plan:

Aggregate  (cost=14074718.45..14074718.46 rows=1 width=32) (actual time=31707.579..31707.579 rows=1 loops=1)
  ->  Nested Loop Left Join  (cost=23.40..14065718.45 rows=600000 width=42) (actual time=0.274..30378.586 rows=600000 loops=1)
        ->  Seq Scan on blockyearly_total  (cost=0.00..9243.00 rows=600000 width=10) (actual time=0.018..58.964 rows=600000 loops=1)
        ->  Aggregate  (cost=23.40..23.41 rows=1 width=32) (actual time=0.050..0.050 rows=1 loops=600000)
              ->  Index Scan using blockyearly_yearly_geocoded_address_idx on blockyearly_yearly  (cost=0.43..23.09 rows=15 width=8) (actual time=0.006..0.035 rows=13 loops=600000)
                    Index Cond: (blockyearly_total.geocoded_address = geocoded_address)
              SubPlan 2
                ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.000..0.000 rows=1 loops=8000000)
  SubPlan 1
    ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.000..0.000 rows=1 loops=600000)
Planning time: 0.517 ms
Execution time: 31727.912 ms

The planner now selects Index Scan over Bitmap Index Scan. The query is consistently faster by roughly 5-6 seconds.

hand written sql query

To compare the query that Hasura generates with a hand-written query, both need to produce the same shape of data (the shape requested by the GraphQL query), so a GROUP BY clause is needed.

option 1: outer group by

explain analyze
select
  json_agg(r.*)
from
  (
    select
      t.geocoded_address,
      t.amount_due,
      json_agg(y.*) as yearly
    from
      blockyearly_total t
      left outer join blockyearly_yearly y on (t.geocoded_address = y.geocoded_address)
    group by
      t.geocoded_address
  ) r;

Execution time: ~30-31s



Query plan:

 Aggregate  (cost=383283.22..383283.23 rows=1 width=32) (actual time=30756.805..30756.806 rows=1 loops=1)
   ->  Subquery Scan on r  (cost=0.86..381783.21 rows=600000 width=34) (actual time=0.148..29616.176 rows=600000 loops=1)
         ->  GroupAggregate  (cost=0.86..375783.21 rows=600000 width=42) (actual time=0.143..29460.526 rows=600000 loops=1)
               Group Key: t.geocoded_address
               ->  Merge Left Join  (cost=0.86..328283.52 rows=7999938 width=48) (actual time=0.028..21486.135 rows=8000000 loops=1)
                     Merge Cond: (t.geocoded_address = y.geocoded_address)
                     ->  Index Scan using blockyearly_total_pkey on blockyearly_total t  (cost=0.42..15547.42 rows=600000 width=10) (actual time=0.011..175.440 rows=600000 loops=1)
                     ->  Index Scan using blockyearly_yearly_geocoded_address_idx on blockyearly_yearly y  (cost=0.43..211252.87 rows=7999938 width=44) (actual time=0.013..19760.763 rows=800>
 Planning time: 0.510 ms
 Execution time: 30792.054 ms

option 2

An alternative way to write the query:

explain analyze
select
  json_agg(r.*)
from
  (
    select
      t.geocoded_address,
      t.amount_due,
      y.yearly
    from
      blockyearly_total t
      left outer join (
        select
          geocoded_address,
          json_agg(y.*) as yearly
        from
          blockyearly_yearly y
        group by
          y.geocoded_address
      ) y on (t.geocoded_address = y.geocoded_address)
  ) r

Execution time: ~28-29s



Query plan:

 Aggregate  (cost=288884.25..288884.26 rows=1 width=32) (actual time=29419.045..29419.046 rows=1 loops=1)
   ->  Merge Left Join  (cost=0.86..287384.25 rows=600000 width=42) (actual time=0.208..28231.284 rows=600000 loops=1)
         Merge Cond: (t.geocoded_address = y.geocoded_address)
         ->  Index Scan using blockyearly_total_pkey on blockyearly_total t  (cost=0.42..15547.42 rows=600000 width=10) (actual time=0.044..176.131 rows=600000 loops=1)
         ->  GroupAggregate  (cost=0.43..258068.37 rows=545265 width=38) (actual time=0.159..27627.969 rows=600000 loops=1)
               Group Key: y.geocoded_address
               ->  Index Scan using blockyearly_yearly_geocoded_address_idx on blockyearly_yearly y  (cost=0.43..211252.87 rows=7999938 width=44) (actual time=0.033..19700.111 rows=8000000 l>
 Planning time: 0.174 ms
 Execution time: 29458.561 ms

Summary

To summarize, with the correct indexes and random_page_cost set to an appropriate value (in the case of an SSD), there shouldn't be any significant difference in performance compared to a hand-written query.

Further, graphql-engine adds little to no overhead when streaming the response from Postgres back to the client. For example, the response size of the above GraphQL query on the generated dataset is roughly 384MB. While psql takes about 30-31s to fetch the response using the generated sql from Postgres, curl takes about 32-33s to fetch the response from GraphQL engine.
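
For reference, a rough sketch of how the curl timing above can be reproduced (the endpoint path comes from the logs earlier in this thread; the host and port are assumptions about the local setup):

curl -s -X POST http://localhost:8080/v1alpha1/graphql \
  -H 'Content-Type: application/json' \
  -d '{"query": "{ blockyearly_total { geocoded_address amount_due blockyearly { id } } }"}' \
  -o /dev/null \
  -w 'total time: %{time_total}s, bytes downloaded: %{size_download}\n'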

Sorry for the late reply, I got sick. This is awesome! Thank you.

For what it's worth, with 40k rows joining to about 5m rows, my by-hand queries take about 20s.

My random_page_cost is already set to 1.1 for my SSD. Sounds like my issue was primarily the missing index, and I was thrown off because dropping it didn't carry a major performance penalty for the query the way I was (naively) writing it by hand.

Now that I'm back, I'll take a shot at this today and will close the ticket if that sounds right to you.

Whoops, forgot to close. Thanks again.
