Postgraphile: Simple query causing massive memory usage (pushing over the half gigabyte level allowed by Heroku)

Created on 16 Jan 2017 · 26 Comments · Source: graphile/postgraphile

This is an issue that I'm digging into - I'm tracking my findings here so I can ask for help later if necessary.

The minimal query to trigger this is:

{
  sketchById(id: 11) {
    sketchIntegrationEventRunsBySketchId(orderBy: ID_DESC, first: 100) {
      edges {
        node {
          __id
        }
      }
    }
  }
}
😌 Fixed in v4

Most helpful comment

Unfortunately I timed out on this. I'm not sure when I'm going to get the chance to work on it again. In order to solve some of the major issues like this one I'm going to need more community support, either financially or in code contributions.

All 26 comments

So I've downloaded the data from my staging environment and am running the dev version of PostGraphQL locally. I have 16GB of RAM whereas Heroku only gives 512MB, so it's not a very fair comparison, and I have to sample the memory usage very frequently to catch the issue.

In order to do so, I've set up this loop (OSX, so using gdate from coreutils):

while sleep 0.001; do
  echo -n "`gdate +%s%3N` ";
  ps -o 'pid=,rss=' -p 45416;
done | tee log.txt

This has given me quite a high-resolution file exposing the memory usage. I'm going to build a graph from it and upload it; to convert the log to TSV:

cat log.txt | sed 's/ /'$'\t''/g' > memory.tsv

(graph: RSS memory usage over time)

I ran the query six times; first with gaps of a few seconds between each one, finally in quick succession. The x-axis is time in milliseconds, y-axis is RSS memory usage in kilobytes.

As can be seen, this simple query is allocating a hell of a lot of RAM.

I've now added --inspect to the ts-node command line and connected Chrome Canary as a debugger to the node process. I ran the query a few times to "warm up" the system; then used the "Record Allocation Profile" profiling tool for a single query execution to see where the RAM is being allocated:

screenshot 2017-01-16 11 07 33

Setting DEBUG="*" I get:

  postgraphql:request GraphQL query request has begun. +7s
  body-parser:json content-type "application/json" +11s
  body-parser:json content-encoding "identity" +2ms
  body-parser:json read body +1ms
  body-parser:json parse body +4ms
  body-parser:json parse json +0ms
  body-parser:urlencoded body already parsed +0ms
  body-parser:text body already parsed +1ms
  postgraphql:request GraphQL query is parsed. +13ms
  postgraphql:request GraphQL query is validated. +10ms
  postgraphql:graphql { sketchById(id: 11) { sketchIntegrationEventRunsBySketchId(orderBy: ID_DESC, first: 100) { edges { node { __id } } } } } +1ms
  postgraphql:postgres begin +1ms
  postgraphql:postgres select row_to_json(__local_0__) as object from "hookhaven_data"."sketch" as __local_0__ where "id" = any($1) limit $2 +8ms
  postgraphql:postgres select to_json(__local_0__) as value from "hookhaven_data"."sketch_integration_event_run" as __local_0__ where "id" is not null and true and true and ("sketch_id" = $1) order by "id" using > limit $2 +6ms
  postgraphql:postgres:explain
Limit  (cost=0.00..2.14 rows=1 width=32) (actual time=0.129..0.129 rows=1 loops=1)
  ->  Seq Scan on sketch __local_0__  (cost=0.00..2.14 rows=1 width=32) (actual time=0.128..0.128 rows=1 loops=1)
        Filter: (id = ANY ('{11}'::integer[]))
        Rows Removed by Filter: 3
Planning time: 0.052 ms
Execution time: 0.142 ms +1ms
  postgraphql:postgres commit +2s
  postgraphql:postgres:explain
Limit  (cost=222.36..222.59 rows=90 width=36) (actual time=419.363..438.374 rows=90 loops=1)
  ->  Sort  (cost=222.36..222.59 rows=90 width=36) (actual time=419.362..438.358 rows=90 loops=1)
        Sort Key: id DESC
        Sort Method: external merge  Disk: 69008kB
        ->  Bitmap Heap Scan on sketch_integration_event_run __local_0__  (cost=4.98..219.44 rows=90 width=36) (actual time=1.892..145.971 rows=90 loops=1)
              Recheck Cond: (sketch_id = 11)
              Filter: (id IS NOT NULL)
              Heap Blocks: exact=80
              ->  Bitmap Index Scan on index_sketch_integration_event_run__fk_sketch_id  (cost=0.00..4.96 rows=90 width=0) (actual time=0.019..0.019 rows=90 loops=1)
                    Index Cond: (sketch_id = 11)
Planning time: 0.088 ms
Execution time: 509.332 ms +219ms
  postgraphql:request GraphQL query has been executed. +1ms
  postgraphql:request GraphQL query request finished. +2ms
0 error(s) in 2487.72ms :: { sketchById(id: 11) { sketchIntegrationEventRunsBySketchId(orderBy: ID_DESC, first: 100) { edges { node { __id } } } } }

screenshot 2017-01-16 11 10 27

The important part of that seems to be:

  postgraphql:postgres begin +1ms
  postgraphql:postgres select row_to_json(__local_0__) as object from "hookhaven_data"."sketch" as __local_0__ where "id" = any($1) limit $2 +8ms
  postgraphql:postgres select to_json(__local_0__) as value from "hookhaven_data"."sketch_integration_event_run" as __local_0__ where "id" is not null and true and true and ("sketch_id" = $1) order by "id" using > limit $2 +6ms
  postgraphql:postgres:explain
Limit  (cost=0.00..2.14 rows=1 width=32) (actual time=0.129..0.129 rows=1 loops=1)
  ->  Seq Scan on sketch __local_0__  (cost=0.00..2.14 rows=1 width=32) (actual time=0.128..0.128 rows=1 loops=1)
        Filter: (id = ANY ('{11}'::integer[]))
        Rows Removed by Filter: 3
Planning time: 0.052 ms
Execution time: 0.142 ms +1ms
  postgraphql:postgres commit +2s

The commit (last line) didn't fire until 2 seconds after the explain, which suggests that the memory allocation/slow processing happens after Postgres has returned the data, while PostGraphQL is processing it.

Running a CPU profile over a single query yields:

screenshot 2017-01-16 11 19 39

Quite a lot of this time is spent in the pg module itself.

So running the second query returns 72MB of data.

$ echo 'select to_json(__local_0__) as value from "hookhaven_data"."sketch_integration_event_run" as __local_0__ where "id" is not null and true and true and ("sketch_id" = 11) order by "id" using > limit 100' | time psql bots_staging | wc -c
 72609787
psql bots_staging  6.36s user 0.13s system 75% cpu 8.646 total
wc -c  0.29s user 0.04s system 3% cpu 8.645 total

So it seems that the issue here could be fixed along with #265: basically we're requesting data that we don't need, and if that data is large then it can cause massive performance/memory allocation issues.

This issue seems to match what you're hitting: https://github.com/brianc/node-postgres/issues/1103

How many fields are you selecting? This may be fixable by not using * and only getting the fields you need.

You posted the comment before I did 😉

I'm going to look into fixing this. It's high priority so I'll see what I can do.

I've been looking into #265 for a while; it seems this is a good starting point. Basically we want to remove the createPgProcedureObjectTypeGqlFieldEntry workaround, and instead do something like:

resolve(source, args, {fieldASTs}) {
  // Look up SQL columns (and computed columns!) required by fieldASTs
  // and add them to the select right here
}

which instead of doing select row_to_json(${sql.identifier(aliasIdentifier)}) as object would do something more like

select json_build_object('col1', ${tbl}.col1, 'col2', ${tbl}.col2, 'computedField1', computedField1(${tbl}), 'computedField2', computedField2(${tbl}, 27, 'Foo')) only referencing the requested columns.

I've not yet dug into fieldASTs so I'm not sure how transparent/opaque it is.
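As a rough sketch of that json_build_object idea (buildJsonSelect is a hypothetical helper, not PostGraphQL's actual code; identifiers are assumed to be already validated, since real code would escape them properly, e.g. via the sql helper):

```javascript
// Hypothetical sketch: given the columns the client actually requested,
// emit a json_build_object(...) expression instead of row_to_json over
// the whole row, so unrequested (and potentially huge) columns never
// leave Postgres. Assumes identifiers are already validated/escaped.
function buildJsonSelect(tableAlias, columns) {
  const pairs = columns
    .map(col => `'${col}', ${tableAlias}."${col}"`)
    .join(', ');
  return `json_build_object(${pairs})`;
}

// e.g. buildJsonSelect('__local_0__', ['id', 'created_at'])
```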

@calebmer Why are we doing row_to_json(...) rather than merely returning * directly - were you expecting to add more "meta" columns later, or maybe it's for compatibility with the _hooks that send the data back to postgresql later?

(Using psql, replacing to_json(__local_0__) with __local_0__.* cuts the query time from ~8 seconds to ~6.8, so it's only about 15% faster - I don't think that's worth making the code changes for.)

We use row_to_json for easier composite type support. row_to_json will turn composite types into nested objects which is convenient.

The first fix I want to implement is just selecting the columns that are requested when there are no procedures. So for now at least a procedure column will cause a de-opt as we revert to the * case. Of course in the future I want to implement the solution we discussed in #265 for procedures, but for now I'm at least focusing on the more common case which is at least a little easier to fix. Does the query that ends up breaking your Heroku box use procedure columns?

I use computed columns heavily, but fortunately not in this one case. (I do have performance issues related to computed columns, but they're not as problematic as this one is.)

Ok. Computed columns are definitely something I really want to make fast as well given how great they are for the developer experience. Out of curiosity do you use a lot of connection computed columns? Those will be pretty tricky to optimize and may always require a de-opt.

I'm going to keep taking notes here as I dig further into understanding the codebase (something I've been meaning to do for a while).


Using the awesome ts-node --inspect I've added a debugger statement at the top of readPage in PgPaginatorOrderingAttributes; and am now stepping through the backtrace because I can't find the fieldASTs property. Here's how the data looks as it leaves the graphql module and enters postgraphql:

screenshot 2017-01-16 12 52 48

Digging deeper into the info object, we can see the fields that are being requested on sketchIntegrationEventRunsBySketchId; ultimately our { edges: { node: { __id } } } path:

screenshot 2017-01-16 12 57 31

So this is probably what fieldASTs has become?

Confirmed: in v0.8.0 fieldASTs was renamed to fieldNodes:

https://github.com/graphql/graphql-js/releases

Awesome :+1:

I will note the approach I'm taking here. Because the GraphQL and Postgres parts of the codebase are kept separate (and intentionally so), I'm going to add an optimizations object which the GraphQL adapter can pass through the interface. The optimizations object will contain all of the fields selected in GraphQL. My current designs for an optimizations object depend on the changes in https://github.com/calebmer/postgraphql/pull/281, so I'm rushing to finish that first.

Out of curiosity, does fieldNodes resolve fragment fields? Or will we have to resolve fragment fields?

Looks like resolve takes 4 arguments: source, args, context and info; and it's info.fieldNodes that we need.

Good question, I shall investigate...

New query, with an inline and an external fragment and a relation lookup:

{
  sketchById(id: 11) {
    sketchIntegrationEventRunsBySketchId(orderBy: ID_DESC, first: 100) {
      edges {
        node {
          __id
          ... on SketchIntegrationEventRun {
            createdAt
          }
          ... F0
        }
      }
    }
  }
}
fragment F0 on SketchIntegrationEventRun {
  id
  sketchBySketchId {
    ...F1
  }
}
fragment F1 on Sketch {
  updatedAt
}

It seems it does not look up the fragments for you; you'll need to resolve them yourself:

screenshot 2017-01-16 13 11 54

Looks like operation is the entire query, whereas fieldNodes is just the field currently being resolved.
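A minimal sketch of that fragment resolution (the node shapes mirror the AST objects graphql-js passes to resolvers; this is an illustration, not PostGraphQL code):

```javascript
// Hypothetical sketch: walk a selection set, expanding inline fragments
// and chasing named fragment spreads through the fragments map that
// graphql-js exposes on `info.fragments`. Collects the field names
// selected at this level only (it does not descend into a Field's own
// sub-selections).
function collectFieldNames(selectionSet, fragments, out = new Set()) {
  for (const sel of selectionSet.selections) {
    if (sel.kind === 'Field') {
      out.add(sel.name.value);
    } else if (sel.kind === 'InlineFragment') {
      collectFieldNames(sel.selectionSet, fragments, out);
    } else if (sel.kind === 'FragmentSpread') {
      // fragments[name] is the FragmentDefinition node
      collectFieldNames(fragments[sel.name.value].selectionSet, fragments, out);
    }
  }
  return out;
}
```

For the node selection in the query above this would yield __id, createdAt, id and sketchBySketchId.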

Unfortunately I timed out on this. I'm not sure when I'm going to get the chance to work on it again. In order to solve some of the major issues like this one I'm going to need more community support, either financially or in code contributions.

I'm guessing this is fixed in v4 so I'm going to tag it as so - if I accidentally close this without confirming please let me know!

nostalgia
