Postgraphile: Performance of computed columns in connections

Created on 6 Dec 2016  ·  11 Comments  ·  Source: graphile/postgraphile

I am lacking the proper terminology to express the problem precisely, but here goes.

For every row in a table, the computed column procedure is executed as a separate query, so each additional row contributes significantly to the execution time of the overall request. I am just guessing here, but could we maybe use WITH queries to 'inline' the computed columns? 🙈

You can find a 'test case' here https://github.com/ferdinandsalis/computed-columns-perf

@calebmer please let me know if you were expecting something else for the issue and the test case.

👨‍💻 Fix in v4 💅 enhancement 🚀 performance

Most helpful comment

I would be super happy with any performance boost. My requests currently take about 1360ms for 20 items with some other nested items.

All 11 comments

For the query:

{
  allPosts {
    edges {
      node {
        id
        body
        summary #computed column
      }
    }
  }
}

we could potentially solve it with a single SQL statement along the lines of:

select
  to_json(__local_0__) as value,
  json_build_object(
    'summary',
    to_json("perf"."post_summary"(__local_0__))
  ) as computed_value
from "perf"."post" as __local_0__
where "id" is not null
order by "id" using <
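
On the JavaScript side, combining the two columns could look like this minimal sketch (the row shape is assumed from the SQL above; this is not the actual PostGraphQL code):

```javascript
// Merge the plain row JSON with the separately-built computed columns.
// Each row from the query above has `value` (to_json of the whole record)
// and `computed_value` (the json_build_object of computed columns).
function mergeComputed(rows) {
  return rows.map(row => Object.assign({}, row.value, row.computed_value));
}

const merged = mergeComputed([
  { value: { id: 1, body: 'Hello world' }, computed_value: { summary: 'Hello' } },
]);
// merged[0] → { id: 1, body: 'Hello world', summary: 'Hello' }
```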

and then combine value and computed_value with Object.assign or similar. You could go further with explicit column lists (like what's required by #151), minimising the encoding of data that isn't used, e.g. something like:

select
  json_build_object(
    'id',
    to_json(__local_0__."id"),
    'body',
    to_json(__local_0__."body"),
    'summary',
    to_json("perf"."post_summary"(__local_0__))
  ) as computed_value
from "perf"."post" as __local_0__
where "id" is not null
order by "id" using <

(I'm not sure whether the to_json calls inside json_build_object are actually required; I suspect not.)

Also, if we were to do this we'd also be able to sort by computed columns as requested in #262 by @valorize:

select
  json_build_object(
    'id',
    to_json(__local_0__."id"),
    'body',
    to_json(__local_0__."body"),
    'summary',
    to_json("perf"."post_summary"(__local_0__))
  ) as computed_value
from "perf"."post" as __local_0__
where "id" is not null
order by "perf"."post_summary"(__local_0__) using <

(this would only work for computed columns that take no additional arguments, as the ones that take arguments cannot be expressed as static strings in our orderBy API currently)

So I spent an hour digging through how PostGraphQL works, thought it might be useful to share some of my scattered learnings. I'm a GraphQL newbie, so apologies if some of this is obvious.

We start in src/postgraphql/schema/createPostGraphQLSchema, which runs the introspection query against the PG database and then passes its learnings on to createGqlSchema. Of particular interest for this issue (#265) are the _hooks defined in this file for objectTypeFieldEntries, which call out to createPgProcedureObjectTypeGqlFieldEntry, whose resolve method is responsible for running the extra SQL queries referenced in @ferdinandsalis's test case.

The root of the schema generation is src/graphql/schema/createGqlSchema which recurses through and creates everything that's necessary based on the learnings from the postgresql introspection. It defines the query root node.

This then loops through, creating the type for each "collection" (table) via createCollectionQueryFieldEntries.

The allPosts in the test case is generated here, which passes out to createConnectionGqlField to fetch the data.

If you log out page toward the end of createConnectionGqlField, you'll get an object of the form:

{
  values: [
    {value: {id: 1, field: 'value', otherField: 'otherValue'}, cursor: [1]},
    {value: {...}, cursor: ...},
    ...
  ],
  hasNextPage: function...,
  hasPreviousPage: function...,
}

Where the value objects are the to_json(__local_0__).

It's only later that this data is fed back into the resolvers for the computed columns (via the _hooks mentioned above), which is why PostGraphQL feeds the data it has already fetched back into the database in order to run the procedures against it.
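
In other words, the current behaviour is the classic N+1 pattern; a simplified sketch of what each computed-column resolver effectively does (a node-postgres-style client and this exact SQL shape are assumptions, not the actual PostGraphQL code):

```javascript
// Simplified sketch of the current N+1 behaviour: for each row that has
// already been fetched, the computed-column resolver goes back to Postgres
// to run the procedure for that one row.
async function resolveSummaries(posts, client) {
  const summaries = [];
  for (const post of posts) {
    // One round trip per row: this per-row cost is what the issue is about.
    const { rows } = await client.query(
      'select "perf"."post_summary"(p) as summary from "perf"."post" as p where "id" = $1',
      [post.id]
    );
    summaries.push(rows[0].summary);
  }
  return summaries;
}
```

With 20 posts this means 20 extra round trips on top of the initial query, one per row.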


The computed columns are effectively added here: https://github.com/calebmer/postgraphql/blob/c43ab8d852b4f83704571287c64662b1a3e6d915/src/graphql/schema/collection/getCollectionGqlType.ts#L81-L84; the columns are then fetched later on in that file, here: https://github.com/calebmer/postgraphql/blob/c43ab8d852b4f83704571287c64662b1a3e6d915/src/graphql/schema/collection/getCollectionGqlType.ts#L108

What we optimally want to do is find the list of all the fields that are required for a particular row/collection and then run them all via one query, as above. It's not yet clear to me how to go about doing that.

@benjie thanks for sharing your learnings! This is very helpful to me.

What we optimally want to do is find the list of all the fields that are required for a particular row/collection and then run them all via one query, as above. It's not yet clear to me how to go about doing that.

Exactly. This has been on my mind for a while, because it also means we could do away with select * and just select the columns we need. It actually wouldn’t be too hard to implement this on the GraphQL side, it just becomes complex if we want to integrate it in a more general way. Some of the complexities are as follows:

  • If we don’t have all the columns, the isTypeOf checks, which currently do a structural check, break. However, this structural check could either be optimised by including a type tag with values, or removed entirely, as currently its only purpose is to resolve the type of Nodes, which could instead be done in a resolveType function on the Node interface.
  • Connection procedures are tricky to do this way because things start to get recursive. If you have a procedure person_latest_posts that returns setof post and you want to get the post_summary procedure on post things get tricky.

Another less efficient but easier to implement option would be to split this query @benjie wrote effectively into two queries:

select
  to_json(__local_0__) as value
from "perf"."post" as __local_0__
where "id" is not null
order by "id" using <

Then once we get that back:

select
  json_build_object(
    'summary',
    to_json("perf"."post_summary"(__local_0__))
  ) as computed_value
from "perf"."post" as __local_0__
where "id" in ($1)
order by "id" using <

Where $1 is the ids we got back from the first query.
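
The two-query flow could be sketched as follows (a node-postgres-style client is assumed, with = any($1) as the parameterised form of the in list; this is a sketch, not the actual PostGraphQL implementation):

```javascript
// Sketch of the two-query fallback: fetch the rows first, then run every
// computed column for exactly those rows in one second query.
async function fetchPostsWithSummary(client) {
  // Query 1: the plain rows.
  const first = await client.query(
    'select to_json(p) as value from "perf"."post" as p order by "id" using <'
  );
  const ids = first.rows.map(r => r.value.id);

  // Query 2: all computed columns for the fetched ids, in one round trip.
  const second = await client.query(
    'select "id", json_build_object(\'summary\', to_json("perf"."post_summary"(p))) as computed_value ' +
      'from "perf"."post" as p where "id" = any($1)',
    [ids]
  );

  // Merge the computed columns back onto their rows by id.
  const byId = new Map(second.rows.map(r => [r.id, r.computed_value]));
  return first.rows.map(r => Object.assign({}, r.value, byId.get(r.value.id)));
}
```

However many rows come back, the query count stays at two.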

Not as efficient as the optimal solution, but more efficient than what we have now. We should at least be able to optimize the most common case (returning a scalar from a procedure), and I have a PR that I’ve been sitting on which should make this easier to implement. I’ll see if I can clean it up and send it soon.

Excellent feedback @calebmer, thanks for helping us to understand.

Regarding isTypeOf, I think the type tag you suggest would be best; it's very easy to add that to the list of fields selected (select to_json(__local_0__) as value, 'perf.post' as type from ...) or mix it into the JSON object.
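
With such a type tag in place, a resolveType on the Node interface could be as simple as this hypothetical sketch (the tag values and GraphQL type names here are assumptions):

```javascript
// Hypothetical sketch: a 'type' tag selected alongside each row lets a
// resolveType function on the Node interface replace the structural
// isTypeOf check entirely.
const typeTagToGqlTypeName = {
  'perf.post': 'Post',
  'perf.person': 'Person',
};

function resolveNodeType(value) {
  const name = typeTagToGqlTypeName[value.type];
  if (!name) throw new Error('Unknown type tag: ' + value.type);
  return name;
}

resolveNodeType({ type: 'perf.post', id: 1 }); // → 'Post'
```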

My API has some large JSON objects and arrays so your suggested modification which doesn't pipe the fetched data straight back into postgres again would be considerably better than the current solution 👍 . Some of my queries take 800ms+ locally at the moment; this should reduce their duration significantly, not least because it goes from O(N) to O(1) queries.
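
A back-of-the-envelope round-trip count makes the O(N) vs O(1) claim concrete (the counts are illustrative, not measured):

```javascript
// Round trips today: one query for the rows, plus one per row per
// computed column. Batched: a constant two, regardless of row count.
const queriesToday = (rows, computedCols) => 1 + rows * computedCols;
const queriesBatched = () => 2;

queriesToday(20, 1); // → 21 round trips for the test case's 20 posts
queriesBatched();    // → 2
```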

I would be super happy with any performance boost. My requests currently take about 1360ms for 20 items with some other nested items.

I believe I've implemented the optimal solution in #340; there's still work to be done before it can be merged but the foundations are mostly there.

@valorize Are you able to try my latest stuff in #342 to see if it solves your 1360ms issue?

Check out master, install everything, then run DEBUG="" ./scripts/dev -c postgresql://localhost/YOUR_DATABASE_NAME --schema YOUR_SCHEMA_NAME and run your query in GraphiQL. Run it a few times so you can break in V8 a bit, and take the lowest runtime, ignoring the introspection query that might run when GraphiQL boots up.

Then git checkout bg/performance-experiment-2 (the dev server should automatically restart itself) and run the query again.

You can then repeat the process with the DEBUG="" part removed so that you can count the number of SQL queries that are executed.

Fixed in PostGraphQL v4 👍
