Updating from v1.0.0-beta.6 to v1.0.0-beta.7 results in this error when running sudo docker-compose up -d: ERROR: manifest for hasura/graphql-engine:v1.0.0-beta.7 not found: manifest unknown: manifest unknown
Updating from v1.0.0-beta.6 to v1.0.0-beta.8 results in 502 Bad Gateway in the browser when accessing the console. The UI fails too because all GraphQL calls get a 502 response.
Updating from v1.0.0-beta.6 to v1.0.0-beta.9 or v1.0.0-beta.10 results in the same error.
Reverting to v1.0.0-beta.6 instantly works after sudo docker-compose up -d.
What am I doing wrong?
Updating from v1.0.0-beta.6 to v1.0.0-beta.8 results in 502 Bad Gateway in the browser when accessing the console. The UI fails too because all GraphQL calls get a 502 response.
Can you provide any information that appears in the server logs when trying to start v1.0.0-beta.8 through v1.0.0-beta.10?
502 Bad Gateway usually means the server is failing to start, and whatever proxy is running in front of the server is unable to serve requests because the server isn’t actually running. Therefore, my guess is the server is probably crashing on startup for some reason or another, but I can’t say why without more information.
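For anyone following along, a quick way to confirm whether the graphql-engine container is actually up or crash-looping is to check its status on the droplet. A minimal sketch, assuming the compose file lives in /etc/hasura as in the one-click setup:
cd /etc/hasura
# shows the state of each service (Up, Restarting, Exit ...) defined in docker-compose.yaml
sudo docker-compose ps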
I have dug through lots of logs in /var/log/ but have found nothing suspicious.
But I am a complete noob when it comes to working with logs. Where, what, and how should I be looking?
This is a DigitalOcean one-click Hasura droplet.
So I have now found https://docs.hasura.io/1.0/graphql/manual/deployment/docker/logging.html.
I was looking in the wrong place, as this is my first time using Docker.
The log right after updating from beta.6 to beta.10 contains lots of output, but also this error: cannot continue due to new inconsistent metadata.
Here is the log: https://gist.github.com/barbalex/52a1ed5e9ea9d6bbde0631ed6c0f8283
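For reference, a minimal sketch of how those container logs can be inspected directly with Docker (the container id/name has to be looked up first; exact names depend on your setup):
# list running containers and note the graphql-engine container id or name
sudo docker ps
# tail and follow the last 200 log lines of that container (replace <container> accordingly)
sudo docker logs --tail 200 -f <container>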
I wonder if this could be due to problems with v1.0.0-beta.7. I had installed it before it was pulled, so I then had to revert to beta.6.
I looked up the docs and found that when downgrading I would have needed to run an SQL script to change the version set in hdb_catalog.hdb_version.
I then ran SELECT * FROM hdb_catalog.hdb_version and found that the version set there was 17, last set in June, which I can hardly believe, as I kept upgrading as new versions appeared.
This seems to be wrong. Could this be causing the issue?
That could be causing the issue, though it’s surprising that beta.6 would run at all if hdb_version reports 17. The catalog version used by beta.6 is 22 (you can see the full list of versions here). Something seems not right, but I’m not sure what—running beta.6 against your database should definitely change the version from 17!
If you want, you could try manually changing the version to 22 and running the latest version to see if that helps. Otherwise, I’m afraid I’m a little lost. If that doesn’t help, a possible next step would be for you to send one of us your Postgres schema (without any of the data in it), and we can try to debug things on our end.
It’s unfortunate that the inconsistent metadata error messages you are getting don’t actually include the queries that are causing the errors—we should, at the very least, improve the logging for those messages so that they provide a little more information. I’ve opened #3363 about that. In the meantime, if you want, you could try enabling query logging on your Postgres instance to see exactly which queries are failing, which could be illuminating.
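For anyone who wants to try the same things, here is a rough sketch of both suggestions. The connection string is a placeholder, the version value is the catalog version beta.6 expects, and ALTER SYSTEM needs superuser access plus a reload, so this may not be possible on a managed Postgres:
# check the catalog version Hasura has recorded
psql "$DATABASE_URL" -c 'SELECT * FROM hdb_catalog.hdb_version;'
# manually bump it to 22 (debugging step only)
psql "$DATABASE_URL" -c "UPDATE hdb_catalog.hdb_version SET version = '22';"
# log every statement so the failing queries show up in the Postgres logs
psql "$DATABASE_URL" -c "ALTER SYSTEM SET log_statement = 'all';"
psql "$DATABASE_URL" -c 'SELECT pg_reload_conf();'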
If you want, you could try manually changing the version to 22 and running the latest version to see if that helps
I tried that. Does not help.
If that doesn’t help, a possible next step would be for you to send one of us your Postgres schema (without any of the data in it), and we can try to debug things on our end
This app is in development. Data in it is just for testing. I would happily give you full access.
How would I send you the schema? (assume I have no idea how to do it)
If the data is just test data, the easiest solution would be to use pg_dump to just snapshot the whole database and send it to us. Specifically, you’ll want to run this command:
pg_dump postgres://user@host:port/db_name --format=custom --file=out.pg_dump
That will dump the entire database into out.pg_dump, both schema and data, and if you send it to me, I can take a look at it. You can either put a link in this issue, or you can send it via email to [email protected] if you want to keep it private. (There’s also the Discord server if you want to talk more directly than via GH issue thread.)
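For completeness, if only the schema (without any data) should be shared, pg_dump also supports a schema-only dump. Same placeholder connection string as above; the output filename is just an example:
# dump only the schema, no table contents
pg_dump postgres://user@host:port/db_name --schema-only --format=custom --file=schema_only.pg_dump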
O.k., will dump.
I just added the query-log flag to HASURA_GRAPHQL_ENABLED_LOG_TYPES, updated to beta.10, ran sudo docker-compose up -d then checked the logs. This is the result: https://gist.github.com/barbalex/aec9e4d14d345b3d8fc14b167976382c
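For anyone wanting to do the same: this is roughly what that setting looks like, shown here as a shell export. In the droplet setup it goes under the graphql-engine service's environment: block in docker-compose.yaml instead. The list of default log types is my recollection of the Hasura logging docs, so double-check it against the documentation for your version:
# enable query logging in addition to the default log types
export HASURA_GRAPHQL_ENABLED_LOG_TYPES='startup, http-log, webhook-log, websocket-log, query-log'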
Alright, after testing with the database you sent me, I think I’ve found the issue. When Hasura migrates to the latest version of the catalog, it drops and recreates certain views and triggers that it uses to implement permissions. However, it does this many more times than necessary, ending up dropping and recreating those views over a dozen times. This all happens within the span of a single transaction, which makes Postgres rather unhappy on a database with lots of permissions—it seems to be doing a lot of bookkeeping to track all those deletions and recreations, even though it really doesn’t have to (since it’s just dropping things created in the current transaction).
The fix is clearly to eliminate the duplication, but it isn’t immediately clear to me what the right approach is—it isn’t happening because of a simple mistake or anything like that. A temporary workaround is to crank up the value of Postgres’s max_locks_per_transaction configuration value. The default is 64, and increasing it to 128 was enough for the migration to complete successfully, though it took about 40 seconds to do so. If you’re interested in upgrading, that workaround should be otherwise benign, and you can even lower the value back to 64 after migrating if you’d like. Otherwise, we’ll try to fix this in the next release or two.
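For a self-managed Postgres, a minimal sketch of that workaround (connection string and service name are placeholders; max_locks_per_transaction only changes at server start, so a restart is required):
# raise the lock table size; takes effect only after a restart
psql "$DATABASE_URL" -c 'ALTER SYSTEM SET max_locks_per_transaction = 128;'
# restart Postgres, e.g. if it runs as a container in a docker-compose setup
sudo docker-compose restart postgres
Passing the setting on the server command line instead (command: postgres -c max_locks_per_transaction=128 in docker-compose.yaml, as in the compose file shared later in this thread) works as well.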
This is my first project to use a managed database (by DigitalOcean). So cranking up Postgres’s max_locks_per_transaction configuration value is most probably not an option, unless I move to a self-managed database on a droplet.
I will wait for a fix in an update. Thanks for caring!
~@barbalex In the meanwhile, I'd suggest resetting/squashing the migrations for this project if you cannot update max_locks setting. You can apply migrations on a local instance and then follow this guide to do so: https://blog.hasura.io/resetting-hasura-migrations/~
~You'll be creating a single migration from your current postgres state.~
@shahidhk
Sorry, I'm not entirely sure I'm understanding this correctly.
Does that mean to run TRUNCATE hdb_catalog.schema_migrations; on the server? (I am not using a local instance yet)
@barbalex Sorry, I mistook Hasura's internal catalog migrations for the schema/metadata migrations. My suggestion is not valid anymore.
We need to wait for @lexi-lambda to put in a fix. In the meanwhile, increasing max_locks_per_transaction is the only way.
Thanks for this very good and transparent information.
So I know that if I want to update, or if I run into any problems before the issue is solved, I will have to migrate the database to a self-managed instance on a droplet so I can increase max_locks_per_transaction.
@lexi-lambda Do you think starting with a clean slate on the latest version and then applying the migrations would be another workaround?
I’m not completely certain. It might help, but it might not, since even squashed migrations could trigger the issue (since they still have individual calls to things like track_table, IIUC?). If absolutely necessary, it would probably work to drop the catalog information and reapply the metadata in batches so that there aren’t too many individual query operations in each bulk batch (and therefore not too many query operations in a single transaction).
Hopefully I’ll have a less awkward solution available soon. I’ll update this issue once I have a development build available for testing.
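If someone did need to reapply metadata in smaller batches, a rough sketch against the Hasura 1.x /v1/query endpoint might look like the following. The host, admin secret, and table names are placeholders, and the real metadata would have to be split into several such bulk payloads:
# apply one small bulk batch of metadata operations, then repeat with the next batch
curl -s -X POST https://your-hasura-host/v1/query \
  -H 'Content-Type: application/json' \
  -H 'X-Hasura-Admin-Secret: secret' \
  -d '{
    "type": "bulk",
    "args": [
      { "type": "track_table", "args": { "schema": "public", "name": "event" } },
      { "type": "track_table", "args": { "schema": "public", "name": "kultur" } }
    ]
  }'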
I think #3381 is related.
I am a bit surprised that v1.0.0 was just published while this issue is not solved. My app does not seem to be such an unusual case: 14 tables, permissions for 3 roles. Not exactly big, is it?
As far as I understand, rebuilding my API with v1.0.0 will not help without increasing max_locks_per_transaction.
Have I missed something?
Just not to sound too critical: the reason I am uneasy is not just that my project should go live in two months. It is the fact that Hasura is absolutely great and I cannot wait to be able to use the features added since v1.0.0-beta.6, especially concerning permissions spanning tables.
I also have two other, larger projects that I will probably migrate to Hasura sooner or later, once this issue is solved.
Yes, my apologies—I was hoping to leave a comment on this thread yesterday, but one more unexpected issue came up that led me to hold off.
The good news: I have been working on a fix for this in #3394, and I think it basically works. It would be great if either of you could try the experimental build in https://github.com/hasura/graphql-engine/pull/3394#issuecomment-566198192 and let me know if it resolves your problem. I would have liked for this change to go into v1.0.0, but it’s a large change, and there are some outstanding subtleties, so I’ve been hoping to have some people try it out before merging it.
The bad news: the change should work fine, but there are some lingering performance issues that seem to stem primarily from a poor interaction with the parallel GC running on machines where the number of cores the OS reports are available is larger than the number of cores graphql-engine should probably reasonably be using. For example, on a Heroku free dyno, nproc reports 8, so graphql-engine currently defaults to running on 8 cores. That choice is not a good one, however, as Heroku free-tier dynos are shared, and this seems to create a significant performance hit.
I am still looking into the appropriate solution for that, but in the meantime, if you want to try the build, consider restricting the number of cores graphql-engine uses manually. The easiest way to do that is to set the GHCRTS=-N<x> environment variable, replacing <x> with the number of cores you’d like it to run on. Setting GHCRTS=-N1 is a particularly conservative choice, since it will disable parallelism completely, but it will certainly mitigate the pathological behavior.
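Concretely, restricting graphql-engine to a single core would look something like this, shown as a shell export; under docker-compose it would go into the graphql-engine service's environment: block instead:
# run the GHC runtime on one core only, disabling parallelism entirely
export GHCRTS='-N1'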
As a final point of note, the performance running the migrations is still not good—on the database you sent me, they take 15-20 seconds. However, they do eventually finish, and since migrating is a one-time cost, I haven’t worried about that too much yet. There are ways we can improve that number much further over time, it’s just a matter of work.
@lexi-lambda
I did:
cd /etc/hasura
sudo nano docker-compose.yaml
then replaced v1.0.0-beta.6 with pull3394-67093178
then
sudo docker-compose up -d
sudo docker-compose restart caddy
and it works 😄
Did you happen to notice how long the startup time took? There should be a log message that Hasura spits out that includes something like this:
{"time_taken":12.3456,"message":"starting API server"}
If you could tell me what value you saw for time_taken, that would be helpful!
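A quick way to fish that line out of the container logs (the container name hasura matches the docker-compose.yaml shared later in this thread; adjust if yours differs):
sudo docker logs hasura 2>&1 | grep 'starting API server'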
@lexi-lambda
root@hasura-vermehrung:/etc/hasura# docker logs ece3d9c11b3d reported this line:
{"type":"startup","timestamp":"2019-12-17T13:53:51.337+0000","level":"info","detail":{"kind":"server","info":{"time_taken":23.044833698,"message":"starting API server"}}}
and here the entire log: https://gist.github.com/barbalex/989d0e83260737ac623208a569e18319
Thanks! That number looks about right to me, so it looks like you didn’t hit the pathological behavior (which is obviously good).
@lexi-lambda please have yourself a beer - you made my day!
So if I do this upgrade, can I then upgrade to the newest released version on master?
@m0ngr31 I’m afraid not directly—the experimental build is incompatible with the version on master (and in fact might be incompatible with future versions of the same branch, though at this point that seems very unlikely since it’s basically finished). If you want to try it out, I’d recommend only doing it in a development/staging environment that you can throw away, not in production (and I’d recommend the same for any development builds unless stated otherwise).
Theoretically you could upgrade to the experimental version, then downgrade the schema to be compatible with v1.0.0. However, I haven’t written the down migration for these changes just yet, so that downgrade path does not currently exist. If you’re really eager to upgrade, I could do that very shortly and let you know, but hopefully the PR should be merged to master soon, so it might be easier to just wait for that to happen.
Okay, I'll just wait for it to get merged and then try it. Thanks for your hard work on this!
Thank you for your patience!
I am having problems now. I actually had very similar problems before the last update. They used to occur when changing insert permissions. Now they occur when inserting a value. PostgreSQL logs this happening:
WITH "public_event__mutation_result_alias" AS
(
INSERT INTO "hdb_views"."52c386eb018ed74878972f72ac5c350c37140f0de414dee984ec424c" ( "changed", "geplant", "changed_by", "teilkultur_id", "person_id", "beschreibung", "id", "kultur_id", "datum", "tsv" ) VALUES (DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT) RETURNING *
)
SELECT json_build_object(
'returning',
(
SELECT coalesce(json_agg("root" ), '[]' ) AS "root" FROM (
SELECT row_to_json((SELECT "_1_e" FROM (SELECT "_0_root.base"."id" AS "id", 'event' AS "__typename" ) AS "_1_e" ) ) AS "root" FROM (SELECT * FROM "public_event__mutation_result_alias" WHERE ('true') ) AS "_0_root.base"
) AS "_2_root"
),
'__typename',
'event_mutation_response'
)
And it results in this error:
postgres-error : relation "hdb_views.52c386eb018ed74878972f72ac5c350c37140f0de414dee984ec424c" does not exist
Would this issue be related or should I open a new issue?
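One way to check whether the generated view actually exists in the database is to list the views in the hdb_views schema (connection string is a placeholder):
# list the views Hasura generated for insert permissions
psql "$DATABASE_URL" -c "SELECT table_name FROM information_schema.views WHERE table_schema = 'hdb_views';"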
@barbalex That error does seem possibly related to me, but the fact that you said it was also happening to you before, just at a different time, makes me wonder if perhaps it’s a different underlying problem. Unfortunately, the hashes are a little opaque, aren’t they? I think it would probably help if you could share what query you’re running that triggers the error.
This is the query run:
mutation InsertDatasetForCreateNew2 {
insert_event(objects: [{}]) {
returning {
id
__typename
}
__typename
}
}
Every inserting query causes this issue now (or so it seems).
It is just like before, when every change to insert permissions seemed to cause the same error. But now I can change insert permissions.
As my project is in beta and not productive yet, I can probably give you any sort of access that may help you to debug the problem.
@barbalex To be honest, I am not certain whether the root cause of the issue you’re seeing is this change or something else… but I realized it doesn’t actually matter, because we’re getting rid of hdb_views for insert permissions entirely! See #3598.
I have now rebuilt my API from scratch with these changes: max_locks_per_transaction set to 2000. Here is my docker-compose.yaml in case this helps someone else:
version: '3.7'
services:
  postgres:
    image: mdillon/postgis
    # specify container name to make it easier to run commands.
    # for example, you could run docker exec -i postgres psql -U postgres postgres < schema.sql to run an SQL file against the Postgres database
    container_name: 'postgres'
    restart: always
    environment:
      POSTGRES_PASSWORD: secret
    ports:
      # make the Postgres database accessible from outside the Docker container on port 5432
      - '5432:5432'
    volumes:
      - db_data:/var/lib/postgresql/data
    # hasura needs higher max_locks_per_transaction
    command: postgres -c max_locks_per_transaction=2000
  graphql-engine:
    image: hasura/graphql-engine:v1.0.0
    container_name: 'hasura'
    ports:
      - '8080:8080'
    depends_on:
      - 'postgres'
    restart: always
    environment:
      # database url to connect
      HASURA_GRAPHQL_DATABASE_URL: postgres://user:secret@postgres:5432/postgres
      # enable the console served by server
      HASURA_GRAPHQL_ENABLE_CONSOLE: 'true' # set "false" to disable console
      HASURA_GRAPHQL_ADMIN_SECRET: secret
      HASURA_GRAPHQL_JWT_SECRET: 'secret'
    command:
      - graphql-engine
      - serve
  caddy:
    image: abiosoft/caddy:0.11.0
    container_name: 'caddy'
    depends_on:
      - 'graphql-engine'
    restart: always
    ports:
      - '80:80'
      - '443:443'
    volumes:
      - ./Caddyfile:/etc/Caddyfile
      - caddy_certs:/root/.caddy
volumes:
  db_data:
  caddy_certs:
Now datasets can be inserted again without any errors.
In the process I have found a few similar issues regarding the need for a higher max_locks_per_transaction. I will leave it to you, but feel free to close this one as a duplicate. Thanks for the great help.
So this should be fixed in the new beta then?
We decided to postpone the beta release to fix #3655 first, but the patch for that should land in the next few days, and we’ll release after that. This issue will be fixed in that release.
I can confirm that my app is working fine with v1.1.0-beta.2, without changing max_locks_per_transaction.
Thanks a lot for your hard work!
Actually no: https://github.com/hasura/graphql-engine/issues/3734