Postgraphile: Simple query causing massive memory usage (pushing over the half gigabyte level allowed by Heroku)

Created on 16 Jan 2017 · 26 Comments · Source: graphile/postgraphile

This is an issue that I'm digging into - I'm tracking my findings here so I can ask for help later if necessary.

The minimal query to trigger this is:

{
  sketchById(id: 11) {
    sketchIntegrationEventRunsBySketchId(orderBy: ID_DESC, first: 100) {
      edges {
        node {
          __id
        }
      }
    }
  }
}
😌 Fixed in v4

Most helpful comment

Unfortunately I timed out on this. I'm not sure when I'm going to get the chance to work on it again. In order to solve some of the major issues like this one I'm going to need more community support, either financially or in code contributions.

All 26 comments

So I've downloaded the data from my staging environment and am running the dev version of PostGraphQL locally. I have 16GB of RAM whereas Heroku only gives 512MB, so it's not a very fair comparison, and I have to sample the memory usage very frequently to catch the issue.

In order to do so, I've set up this loop (OSX, so using gdate from coreutils):

while sleep 0.001; do
  echo -n "`gdate +%s%3N` ";
  ps -o 'pid=,rss=' -p 45416;
done | tee log.txt

This has given me quite a high-resolution file exposing the memory usage. I'm going to build a graph from it and upload it; to convert the log to TSV:

cat log.txt | sed 's/ /'$'\t''/g' > memory.tsv

(graph: RSS memory usage over time)

I ran the query six times; first with gaps of a few seconds between each one, finally in quick succession. The x-axis is time in milliseconds, y-axis is RSS memory usage in kilobytes.

As can be seen, this simple query is allocating a hell of a lot of RAM.

I've now added --inspect to the ts-node command line and connected Chrome Canary as a debugger to the node process. I ran the query a few times to "warm up" the system; then used the "Record Allocation Profile" profiling tool for a single query execution to see where the RAM is being allocated:

screenshot 2017-01-16 11 07 33

Setting DEBUG="*" I get:

  postgraphql:request GraphQL query request has begun. +7s
  body-parser:json content-type "application/json" +11s
  body-parser:json content-encoding "identity" +2ms
  body-parser:json read body +1ms
  body-parser:json parse body +4ms
  body-parser:json parse json +0ms
  body-parser:urlencoded body already parsed +0ms
  body-parser:text body already parsed +1ms
  postgraphql:request GraphQL query is parsed. +13ms
  postgraphql:request GraphQL query is validated. +10ms
  postgraphql:graphql { sketchById(id: 11) { sketchIntegrationEventRunsBySketchId(orderBy: ID_DESC, first: 100) { edges { node { __id } } } } } +1ms
  postgraphql:postgres begin +1ms
  postgraphql:postgres select row_to_json(__local_0__) as object from "hookhaven_data"."sketch" as __local_0__ where "id" = any($1) limit $2 +8ms
  postgraphql:postgres select to_json(__local_0__) as value from "hookhaven_data"."sketch_integration_event_run" as __local_0__ where "id" is not null and true and true and ("sketch_id" = $1) order by "id" using > limit $2 +6ms
  postgraphql:postgres:explain
Limit  (cost=0.00..2.14 rows=1 width=32) (actual time=0.129..0.129 rows=1 loops=1)
  ->  Seq Scan on sketch __local_0__  (cost=0.00..2.14 rows=1 width=32) (actual time=0.128..0.128 rows=1 loops=1)
        Filter: (id = ANY ('{11}'::integer[]))
        Rows Removed by Filter: 3
Planning time: 0.052 ms
Execution time: 0.142 ms +1ms
  postgraphql:postgres commit +2s
  postgraphql:postgres:explain
Limit  (cost=222.36..222.59 rows=90 width=36) (actual time=419.363..438.374 rows=90 loops=1)
  ->  Sort  (cost=222.36..222.59 rows=90 width=36) (actual time=419.362..438.358 rows=90 loops=1)
        Sort Key: id DESC
        Sort Method: external merge  Disk: 69008kB
        ->  Bitmap Heap Scan on sketch_integration_event_run __local_0__  (cost=4.98..219.44 rows=90 width=36) (actual time=1.892..145.971 rows=90 loops=1)
              Recheck Cond: (sketch_id = 11)
              Filter: (id IS NOT NULL)
              Heap Blocks: exact=80
              ->  Bitmap Index Scan on index_sketch_integration_event_run__fk_sketch_id  (cost=0.00..4.96 rows=90 width=0) (actual time=0.019..0.019 rows=90 loops=1)
                    Index Cond: (sketch_id = 11)
Planning time: 0.088 ms
Execution time: 509.332 ms +219ms
  postgraphql:request GraphQL query has been executed. +1ms
  postgraphql:request GraphQL query request finished. +2ms
0 error(s) in 2487.72ms :: { sketchById(id: 11) { sketchIntegrationEventRunsBySketchId(orderBy: ID_DESC, first: 100) { edges { node { __id } } } } }

screenshot 2017-01-16 11 10 27

The important part of that seems to be:

  postgraphql:postgres begin +1ms
  postgraphql:postgres select row_to_json(__local_0__) as object from "hookhaven_data"."sketch" as __local_0__ where "id" = any($1) limit $2 +8ms
  postgraphql:postgres select to_json(__local_0__) as value from "hookhaven_data"."sketch_integration_event_run" as __local_0__ where "id" is not null and true and true and ("sketch_id" = $1) order by "id" using > limit $2 +6ms
  postgraphql:postgres:explain
Limit  (cost=0.00..2.14 rows=1 width=32) (actual time=0.129..0.129 rows=1 loops=1)
  ->  Seq Scan on sketch __local_0__  (cost=0.00..2.14 rows=1 width=32) (actual time=0.128..0.128 rows=1 loops=1)
        Filter: (id = ANY ('{11}'::integer[]))
        Rows Removed by Filter: 3
Planning time: 0.052 ms
Execution time: 0.142 ms +1ms
  postgraphql:postgres commit +2s

The commit (last line) didn't fire until 2 seconds after the explain, which suggests that the memory allocation/slow processing happens after Postgres has returned the data, while PostGraphQL is processing it.

Running a CPU profile over a single query yields:

screenshot 2017-01-16 11 19 39

Quite a lot of this time is spent in the pg module itself.

So running the second query returns 72MB of data.

$ echo 'select to_json(__local_0__) as value from "hookhaven_data"."sketch_integration_event_run" as __local_0__ where "id" is not null and true and true and ("sketch_id" = 11) order by "id" using > limit 100' | time psql bots_staging | wc -c
 72609787
psql bots_staging  6.36s user 0.13s system 75% cpu 8.646 total
wc -c  0.29s user 0.04s system 3% cpu 8.645 total

So it seems that the issue here could be fixed along with #265: basically we're requesting data that we don't need, and if that data is large then it can cause massive performance/memory allocation issues.

This issue seems to match what you're hitting: https://github.com/brianc/node-postgres/issues/1103

How many fields are you selecting? This may be fixable by not using * and only getting the fields you need.

You posted the comment before I did 😉

I'm going to look into fixing this. It's high priority so I'll see what I can do.

I've been looking into #265 for a while; it seems this is a good starting point. Basically we want to remove the createPgProcedureObjectTypeGqlFieldEntry workaround, and instead do something like:

resolve(source, args, {fieldASTs}) {
  // Look up SQL columns (and computed columns!) required by fieldASTs
  // and add them to the select right here
}

which instead of doing select row_to_json(${sql.identifier(aliasIdentifier)}) as object would do something more like

select json_build_object('col1', ${tbl}.col1, 'col2', ${tbl}.col2, 'computedField1', computedField1(${tbl}), 'computedField2', computedField2(${tbl}, 27, 'Foo')) only referencing the requested columns.

I've not yet dug into fieldASTs so I'm not sure how transparent/opaque it is.
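As a rough sketch of that json_build_object idea (buildJsonSelect is a hypothetical helper, not PostGraphQL's actual code; identifiers are assumed to be already validated, since real code would escape them properly, e.g. via the sql helper):

```javascript
// Hypothetical sketch: given the columns the client actually requested,
// emit a json_build_object(...) expression instead of row_to_json over
// the whole row, so unrequested (and potentially huge) columns never
// leave Postgres. Assumes identifiers are already validated/escaped.
function buildJsonSelect(tableAlias, columns) {
  const pairs = columns
    .map(col => `'${col}', ${tableAlias}."${col}"`)
    .join(', ');
  return `json_build_object(${pairs})`;
}

// e.g. buildJsonSelect('__local_0__', ['id', 'created_at'])
```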

@calebmer Why are we doing row_to_json(...) rather than merely returning * directly - were you expecting to add more "meta" columns later, or maybe it's for compatibility with the _hooks that send the data back to postgresql later?

(Using psql, replacing to_json(__local_0__) with __local_0__.* cuts the query time from ~8 seconds to ~6.8, so it's only about 15% faster - I don't think that's worth making the code changes for.)

We use row_to_json for easier composite type support. row_to_json will turn composite types into nested objects which is convenient.

The first fix I want to implement is just selecting the columns that are requested when there are no procedures. So for now at least a procedure column will cause a de-opt as we revert to the * case. Of course in the future I want to implement the solution we discussed in #265 for procedures, but for now I'm at least focusing on the more common case which is at least a little easier to fix. Does the query that ends up breaking your Heroku box use procedure columns?

I use computed columns heavily, but fortunately not in this one case. (I do have performance issues related to computed columns, but they're not as problematic as this one is.)

Ok. Computed columns are definitely something I really want to make fast as well given how great they are for the developer experience. Out of curiosity do you use a lot of connection computed columns? Those will be pretty tricky to optimize and may always require a de-opt.

I'm going to keep taking notes here as I dig further into understanding the codebase (something I've been meaning to do for a while).


Using the awesome ts-node --inspect I've added a debugger statement at the top of readPage in PgPaginatorOrderingAttributes; and am now stepping through the backtrace because I can't find the fieldASTs property. Here's how the data looks as it leaves the graphql module and enters postgraphql:

screenshot 2017-01-16 12 52 48

Digging deeper into the info object, we can see the fields that are being requested on sketchIntegrationEventRunsBySketchId; ultimately our { edges: { node: { __id } } } path:

screenshot 2017-01-16 12 57 31

So this is probably what fieldASTs has become?

Confirmed: in v0.8.0 fieldASTs was renamed to fieldNodes:

https://github.com/graphql/graphql-js/releases

Awesome :+1:

I will note the approach I'm taking here. Because the GraphQL and Postgres parts of the codebase are kept separate (and intentionally so), I'm going to add an optimizations object which the GraphQL adapter can pass through the interface. The optimizations object will contain all of the fields selected in GraphQL. My current designs for an optimizations object depend on the changes in https://github.com/calebmer/postgraphql/pull/281, so I'm rushing to finish that first.

Out of curiosity, does fieldNodes resolve fragment fields? Or will we have to resolve fragment fields?

Looks like resolve takes 4 arguments: source, args, context and info; and it's info.fieldNodes that we need.

Good question, I shall investigate...

New query, with an inline and an external fragment and a relation lookup:

{
  sketchById(id: 11) {
    sketchIntegrationEventRunsBySketchId(orderBy: ID_DESC, first: 100) {
      edges {
        node {
          __id
          ... on SketchIntegrationEventRun {
            createdAt
          }
          ... F0
        }
      }
    }
  }
}
fragment F0 on SketchIntegrationEventRun {
  id
  sketchBySketchId {
    ...F1
  }
}
fragment F1 on Sketch {
  updatedAt
}

It seems it does not look up the fragments for you; you'll need to resolve them yourself:

screenshot 2017-01-16 13 11 54

Looks like operation is the entire query, whereas fieldNodes is just the field currently being resolved.
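A minimal sketch of that fragment resolution (the node shapes mirror the AST objects graphql-js passes to resolvers; this is an illustration, not PostGraphQL code):

```javascript
// Hypothetical sketch: walk a selection set, expanding inline fragments
// and chasing named fragment spreads through the fragments map that
// graphql-js exposes on `info.fragments`. Collects the field names
// selected at this level only (it does not descend into a Field's own
// sub-selections).
function collectFieldNames(selectionSet, fragments, out = new Set()) {
  for (const sel of selectionSet.selections) {
    if (sel.kind === 'Field') {
      out.add(sel.name.value);
    } else if (sel.kind === 'InlineFragment') {
      collectFieldNames(sel.selectionSet, fragments, out);
    } else if (sel.kind === 'FragmentSpread') {
      // fragments[name] is the FragmentDefinition node
      collectFieldNames(fragments[sel.name.value].selectionSet, fragments, out);
    }
  }
  return out;
}
```

For the node selection in the query above this would yield __id, createdAt, id and sketchBySketchId.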

Unfortunately I timed out on this. I'm not sure when I'm going to get the chance to work on it again. In order to solve some of the major issues like this one I'm going to need more community support, either financially or in code contributions.

I'm guessing this is fixed in v4 so I'm going to tag it as so - if I accidentally close this without confirming please let me know!

nostalgia
