Updating from v1.0.0-beta.6 to v1.0.0-beta.7 results in this error when running sudo docker-compose up -d: ERROR: manifest for hasura/graphql-engine:v1.0.0-beta.7 not found: manifest unknown: manifest unknown
Updating from v1.0.0-beta.6 to v1.0.0-beta.8 results in 502 Bad Gateway in the browser when accessing the console. The UI fails too because all GraphQL calls get a 502 response.
Updating from v1.0.0-beta.6 to v1.0.0-beta.9 or v1.0.0-beta.10 results in the same error.
Reverting to v1.0.0-beta.6 instantly works after sudo docker-compose up -d.
What am I doing wrong?
Updating from v1.0.0-beta.6 to v1.0.0-beta.8 results in 502 Bad Gateway in the browser when accessing the console. The UI fails too because all GraphQL calls get a 502 response.
Can you provide any information that appears in the server logs when trying to start v1.0.0-beta.8 through v1.0.0-beta.10?
502 Bad Gateway usually means the server is failing to start, and whatever proxy is running in front of the server is unable to serve requests because the server isn’t actually running. Therefore, my guess is the server is probably crashing on startup for some reason or another, but I can’t say why without more information.
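For anyone following along, a quick way to confirm whether the graphql-engine container is actually up or crash-looping is to check its status on the droplet. A minimal sketch, assuming the compose file lives in /etc/hasura as in the one-click setup:
cd /etc/hasura
# shows the state of each service (Up, Restarting, Exit ...) defined in docker-compose.yaml
sudo docker-compose ps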
I have dug through lots of logs in /var/log/ but have found nothing suspicious.
But I am a complete noob when it comes to working with logs. Where, what, and how should I be looking?
This is a DigitalOcean one-click Hasura droplet.
So I have now found https://docs.hasura.io/1.0/graphql/manual/deployment/docker/logging.html.
I was looking in the wrong place, as this is my first time using Docker.
The log right after updating from beta.6 to beta.10 contains lots of output, but also this error: cannot continue due to new inconsistent metadata.
Here is the log: https://gist.github.com/barbalex/52a1ed5e9ea9d6bbde0631ed6c0f8283
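For reference, a minimal sketch of how those container logs can be inspected directly with Docker (the container id/name has to be looked up first; exact names depend on your setup):
# list running containers and note the graphql-engine container id or name
sudo docker ps
# tail and follow the last 200 log lines of that container (replace <container> accordingly)
sudo docker logs --tail 200 -f <container>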
I wonder if this could be due to problems with v1.0.0-beta.7. I had installed it before it was pulled, so I then had to revert to beta.6.
I looked up the docs and found that when downgrading I would have needed to run an SQL script to change the version set in hdb_catalog.hdb_version.
I then ran SELECT * FROM hdb_catalog.hdb_version and found that the version set there was 17, last set in June, which I can hardly believe, as I kept upgrading as new versions appeared.
This seems to be wrong. Could this be causing the issue?
That could be causing the issue, though it’s surprising that beta.6 would run at all if hdb_version reports 17. The catalog version used by beta.6 is 22 (you can see the full list of versions here). Something seems not right, but I’m not sure what—running beta.6 against your database should definitely change the version from 17!
If you want, you could try manually changing the version to 22 and running the latest version to see if that helps. Otherwise, I’m afraid I’m a little lost. If that doesn’t help, a possible next step would be for you to send one of us your Postgres schema (without any of the data in it), and we can try to debug things on our end.
It’s unfortunate that the inconsistent metadata error messages you are getting don’t actually include the queries that are causing the errors—we should, at the very least, improve the logging for those messages so that they provide a little more information. I’ve opened #3363 about that. In the meantime, if you want, you could try enabling query logging on your Postgres instance to see exactly which queries are failing, which could be illuminating.
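For anyone who wants to try the same things, here is a rough sketch of both suggestions. The connection string is a placeholder, the version value is the catalog version beta.6 expects, and ALTER SYSTEM needs superuser access plus a reload, so this may not be possible on a managed Postgres:
# check the catalog version Hasura has recorded
psql "$DATABASE_URL" -c 'SELECT * FROM hdb_catalog.hdb_version;'
# manually bump it to 22 (debugging step only)
psql "$DATABASE_URL" -c "UPDATE hdb_catalog.hdb_version SET version = '22';"
# log every statement so the failing queries show up in the Postgres logs
psql "$DATABASE_URL" -c "ALTER SYSTEM SET log_statement = 'all';"
psql "$DATABASE_URL" -c 'SELECT pg_reload_conf();'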
If you want, you could try manually changing the version to 22 and running the latest version to see if that helps
I tried that. Does not help.
If that doesn’t help, a possible next step would be for you to send one of us your Postgres schema (without any of the data in it), and we can try to debug things on our end
This app is in development. Data in it is just for testing. I would happily give you full access.
How would I send you the schema? (assume I have no idea how to do it)
If the data is just test data, the easiest solution would be to use pg_dump to just snapshot the whole database and send it to us. Specifically, you’ll want to run this command:
pg_dump postgres://user@host:port/db_name --format=custom --file=out.pg_dump
That will dump the entire database into out.pg_dump, both schema and data, and if you send it to me, I can take a look at it. You can either put a link in this issue, or you can send it via email to [email protected] if you want to keep it private. (There’s also the Discord server if you want to talk more directly than via GH issue thread.)
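For completeness, if only the schema (without any data) should be shared, pg_dump also supports a schema-only dump. Same placeholder connection string as above; the output filename is just an example:
# dump only the schema, no table contents
pg_dump postgres://user@host:port/db_name --schema-only --format=custom --file=schema_only.pg_dump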
O.k., will dump.
I just added the query-log flag to HASURA_GRAPHQL_ENABLED_LOG_TYPES, updated to beta.10, ran sudo docker-compose up -d then checked the logs. This is the result: https://gist.github.com/barbalex/aec9e4d14d345b3d8fc14b167976382c
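For anyone wanting to do the same: this is roughly what that setting looks like, shown here as a shell export. In the droplet setup it goes under the graphql-engine service's environment: block in docker-compose.yaml instead. The list of default log types is my recollection of the Hasura logging docs, so double-check it against the documentation for your version:
# enable query logging in addition to the default log types
export HASURA_GRAPHQL_ENABLED_LOG_TYPES='startup, http-log, webhook-log, websocket-log, query-log'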
Alright, after testing with the database you sent me, I think I’ve found the issue. When Hasura migrates to the latest version of the catalog, it drops and recreates certain views and triggers that it uses to implement permissions. However, it does this many more times than necessary, ending up dropping and recreating those views over a dozen times. This all happens within the span of a single transaction, which makes Postgres rather unhappy on a database with lots of permissions—it seems to be doing a lot of bookkeeping to track all those deletions and recreations, even though it really doesn’t have to (since it’s just dropping things created in the current transaction).
The fix is clearly to eliminate the duplication, but it isn’t immediately clear to me what the right approach is—it isn’t happening because of a simple mistake or anything like that. A temporary workaround is to crank up the value of Postgres’s max_locks_per_transaction configuration value. The default is 64, and increasing it to 128 was enough for the migration to complete successfully, though it took about 40 seconds to do so. If you’re interested in upgrading, that workaround should be otherwise benign, and you can even lower the value back to 64 after migrating if you’d like. Otherwise, we’ll try to fix this in the next release or two.
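For a self-managed Postgres, a minimal sketch of that workaround (connection string and service name are placeholders; max_locks_per_transaction only changes at server start, so a restart is required):
# raise the lock table size; takes effect only after a restart
psql "$DATABASE_URL" -c 'ALTER SYSTEM SET max_locks_per_transaction = 128;'
# restart Postgres, e.g. if it runs as a container in a docker-compose setup
sudo docker-compose restart postgres
Passing the setting on the server command line instead (command: postgres -c max_locks_per_transaction=128 in docker-compose.yaml, as in the compose file shared later in this thread) works as well.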
This is my first project to use a managed database (by DigitalOcean). So cranking up Postgres’s max_locks_per_transaction configuration value is most probably not an option, unless I move to a self-managed database on a droplet.
I will wait for a fix in an update. Thanks for caring!
~@barbalex In the meanwhile, I'd suggest resetting/squashing the migrations for this project if you cannot update max_locks setting. You can apply migrations on a local instance and then follow this guide to do so: https://blog.hasura.io/resetting-hasura-migrations/~
~You'll be creating a single migration from your current postgres state.~
@shahidhk
Sorry, I'm not entirely sure I'm understanding this correctly.
Does that mean to run TRUNCATE hdb_catalog.schema_migrations; on the server? (I am not using a local instance yet)
@barbalex Sorry, I mistook Hasura's internal catalog migrations for the schema/metadata migrations. My suggestion is not valid anymore.
We need to wait for @lexi-lambda to put in a fix. In the meanwhile, increasing max_locks_per_transaction is the only way.
Thanks for this very good and transparent information.
So I know that if I want to update, or if I run into any problems before the issue is solved, I will have to migrate the database to a self-managed instance on a droplet so I can increase max_locks_per_transaction.
@lexi-lambda Do you think starting with a clean slate on the latest version and then applying the migrations would be another workaround?
I’m not completely certain. It might help, but it might not, since even squashed migrations could trigger the issue (since they still have individual calls to things like track_table, IIUC?). If absolutely necessary, it would probably work to drop the catalog information and reapply the metadata in batches so that there aren’t too many individual query operations in each bulk batch (and therefore not too many query operations in a single transaction).
Hopefully I’ll have a less awkward solution available soon. I’ll update this issue once I have a development build available for testing.
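If someone did need to reapply metadata in smaller batches, a rough sketch against the Hasura 1.x /v1/query endpoint might look like the following. The host, admin secret, and table names are placeholders, and the real metadata would have to be split into several such bulk payloads:
# apply one small bulk batch of metadata operations, then repeat with the next batch
curl -s -X POST https://your-hasura-host/v1/query \
  -H 'Content-Type: application/json' \
  -H 'X-Hasura-Admin-Secret: secret' \
  -d '{
    "type": "bulk",
    "args": [
      { "type": "track_table", "args": { "schema": "public", "name": "event" } },
      { "type": "track_table", "args": { "schema": "public", "name": "kultur" } }
    ]
  }'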
I think #3381 is related.
I am a bit surprised that v1.0.0 was just published while this issue is not solved. My app does not seem to be such an unusual case: 14 tables, permissions for 3 roles. Not exactly big, is it?
As far as I understand, rebuilding my API with v1.0.0 will not help without increasing max_locks_per_transaction.
Have I missed something?
Just not to sound too critical: the reason I am uneasy is not just that my project should go live in two months. It is the fact that Hasura is absolutely great and I cannot wait to be able to use the features added since v1.0.0-beta.6, especially concerning permissions spanning tables.
I also have two other, larger projects that I will probably migrate to Hasura sooner or later, once this issue is solved.
Yes, my apologies—I was hoping to leave a comment on this thread yesterday, but one more unexpected issue came up that led me to hold off.
The good news: I have been working on a fix for this in #3394, and I think it basically works. It would be great if either of you could try the experimental build in https://github.com/hasura/graphql-engine/pull/3394#issuecomment-566198192 and let me know if it resolves your problem. I would have liked for this change to go into v1.0.0, but it’s a large change, and there are some outstanding subtleties, so I’ve been hoping to have some people try it out before merging it.
The bad news: the change should work fine, but there are some lingering performance issues that seem to stem primarily from a poor interaction with the parallel GC running on machines where the number of cores the OS reports are available is larger than the number of cores graphql-engine should probably reasonably be using. For example, on a Heroku free dyno, nproc reports 8, so graphql-engine currently defaults to running on 8 cores. That choice is not a good one, however, as Heroku free-tier dynos are shared, and this seems to create a significant performance hit.
I am still looking into the appropriate solution for that, but in the meantime, if you want to try the build, consider restricting the number of cores graphql-engine uses manually. The easiest way to do that is to set the GHCRTS=-N<x> environment variable, replacing <x> with the number of cores you’d like it to run on. Setting GHCRTS=-N1 is a particularly conservative choice, since it will disable parallelism completely, but it will certainly mitigate the pathological behavior.
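Concretely, restricting graphql-engine to a single core would look something like this, shown as a shell export; under docker-compose it would go into the graphql-engine service's environment: block instead:
# run the GHC runtime on one core only, disabling parallelism entirely
export GHCRTS='-N1'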
As a final point of note, the performance running the migrations is still not good—on the database you sent me, they take 15-20 seconds. However, they do eventually finish, and since migrating is a one-time cost, I haven’t worried about that too much yet. There are ways we can improve that number much further over time, it’s just a matter of work.
@lexi-lambda
I did:
cd /etc/hasura
sudo nano docker-compose.yaml
then replaced v1.0.0-beta.6 with pull3394-67093178
then
sudo docker-compose up -d
sudo docker-compose restart caddy
and it works 😄
Did you happen to notice how long the startup time took? There should be a log message that Hasura spits out that includes something like this:
{"time_taken":12.3456,"message":"starting API server"}
If you could tell me what value you saw for time_taken, that would be helpful!
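A quick way to fish that line out of the container logs (the container name hasura matches the docker-compose.yaml shared later in this thread; adjust if yours differs):
sudo docker logs hasura 2>&1 | grep 'starting API server'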
@lexi-lambda
root@hasura-vermehrung:/etc/hasura# docker logs ece3d9c11b3d reported this line:
{"type":"startup","timestamp":"2019-12-17T13:53:51.337+0000","level":"info","detail":{"kind":"server","info":{"time_taken":23.044833698,"message":"starting API server"}}}
and here the entire log: https://gist.github.com/barbalex/989d0e83260737ac623208a569e18319
Thanks! That number looks about right to me, so it looks like you didn’t hit the pathological behavior (which is obviously good).
@lexi-lambda please have yourself a beer - you made my day!
So if I do this upgrade, can I then upgrade to the newest released version on master?
@m0ngr31 I’m afraid not directly—the experimental build is incompatible with the version on master (and in fact might be incompatible with future versions of the same branch, though at this point that seems very unlikely since it’s basically finished). If you want to try it out, I’d recommend only doing it in a development/staging environment that you can throw away, not in production (and I’d recommend the same for any development builds unless stated otherwise).
Theoretically you could upgrade to the experimental version, then downgrade the schema to be compatible with v1.0.0. However, I haven’t written the down migration for these changes just yet, so that downgrade path does not currently exist. If you’re really eager to upgrade, I could do that very shortly and let you know, but hopefully the PR should be merged to master soon, so it might be easier to just wait for that to happen.
Okay, I'll just wait for it to get merged and then try it. Thanks for your hard work on this!
Thank you for your patience!
I am having problems now. I actually had very similar problems before the last update. They used to occur when changing insert permissions. Now they occur when inserting a value. PostgreSQL logs this happening:
WITH "public_event__mutation_result_alias" AS
(
INSERT INTO "hdb_views"."52c386eb018ed74878972f72ac5c350c37140f0de414dee984ec424c" ( "changed", "geplant", "changed_by", "teilkultur_id", "person_id", "beschreibung", "id", "kultur_id", "datum", "tsv" ) VALUES (DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT) RETURNING *
)
SELECT json_build_object(
'returning',
(
SELECT coalesce(json_agg("root" ), '[]' ) AS "root" FROM (
SELECT row_to_json((SELECT "_1_e" FROM (SELECT "_0_root.base"."id" AS "id", 'event' AS "__typename" ) AS "_1_e" ) ) AS "root" FROM (SELECT * FROM "public_event__mutation_result_alias" WHERE ('true') ) AS "_0_root.base"
) AS "_2_root"
),
'__typename',
'event_mutation_response'
)
And it results in this error:
postgres-error : relation "hdb_views.52c386eb018ed74878972f72ac5c350c37140f0de414dee984ec424c" does not exist
Would this issue be related or should I open a new issue?
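One way to check whether the generated view actually exists in the database is to list the views in the hdb_views schema (connection string is a placeholder):
# list the views Hasura generated for insert permissions
psql "$DATABASE_URL" -c "SELECT table_name FROM information_schema.views WHERE table_schema = 'hdb_views';"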
@barbalex That error does seem possibly related to me, but the fact that you said it was also happening to you before, just at a different time, makes me wonder if perhaps it’s a different underlying problem. Unfortunately, the hashes are a little opaque, aren’t they? I think it would probably help if you could share what query you’re running that triggers the error.
This is the query run:
mutation InsertDatasetForCreateNew2 {
insert_event(objects: [{}]) {
returning {
id
__typename
}
__typename
}
}
Every inserting query causes this issue now (or so it seems).
It is just like before, when every change to insert permissions seemed to cause the same error. But now I can change insert permissions.
As my project is in beta and not productive yet, I can probably give you any sort of access that may help you to debug the problem.
@barbalex To be honest, I am not certain whether the root cause of the issue you’re seeing is this change or something else… but I realized it doesn’t actually matter, because we’re getting rid of hdb_views for insert permissions entirely! See #3598.
I have now rebuilt my API from scratch with these changes: max_locks_per_transaction set to 2000. Here is my docker-compose.yaml in case this helps someone else:
version: '3.7'
services:
  postgres:
    image: mdillon/postgis
    # specify container name to make it easier to run commands.
    # for example, you could run docker exec -i postgres psql -U postgres postgres < schema.sql to run an SQL file against the Postgres database
    container_name: 'postgres'
    restart: always
    environment:
      POSTGRES_PASSWORD: secret
    ports:
      # make the Postgres database accessible from outside the Docker container on port 5432
      - '5432:5432'
    volumes:
      - db_data:/var/lib/postgresql/data
    # hasura needs higher max_locks_per_transaction
    command: postgres -c max_locks_per_transaction=2000
  graphql-engine:
    image: hasura/graphql-engine:v1.0.0
    container_name: 'hasura'
    ports:
      - '8080:8080'
    depends_on:
      - 'postgres'
    restart: always
    environment:
      # database url to connect
      HASURA_GRAPHQL_DATABASE_URL: postgres://user:secret@postgres:5432/postgres
      # enable the console served by server
      HASURA_GRAPHQL_ENABLE_CONSOLE: 'true' # set "false" to disable console
      HASURA_GRAPHQL_ADMIN_SECRET: secret
      HASURA_GRAPHQL_JWT_SECRET: 'secret'
    command:
      - graphql-engine
      - serve
  caddy:
    image: abiosoft/caddy:0.11.0
    container_name: 'caddy'
    depends_on:
      - 'graphql-engine'
    restart: always
    ports:
      - '80:80'
      - '443:443'
    volumes:
      - ./Caddyfile:/etc/Caddyfile
      - caddy_certs:/root/.caddy
volumes:
  db_data:
  caddy_certs:
Now datasets can be inserted again without any errors.
In the process I have found a few similar issues regarding the need for a higher max_locks_per_transaction. I will leave it to you, but feel free to close this one as a duplicate. Thanks for the great help.
So this should be fixed in the new beta then?
We decided to postpone the beta release to fix #3655 first, but the patch for that should land in the next few days, and we’ll release after that. This issue will be fixed in that release.
I can confirm that my app is working fine with v1.1.0-beta.2, without changing max_locks_per_transaction.
Thanks a lot for your hard work!
Actually no: https://github.com/hasura/graphql-engine/issues/3734