Graphql-engine: "table or event-trigger not found in schema cache" error in logs

Created on 25 Jul 2020  路  17Comments  路  Source: hasura/graphql-engine

Just upgraded from 1.2.2 to 1.3.0, noticed this following error message popping up more than once:

graphql-engine {"type":"event-trigger","timestamp":"2020-07-24T23:37:43.617+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}

How do we go about debugging this error / what would be the cause? Hasura still functions normally, but the log message spams almost every second or minute.

event-triggers high

Most helpful comment

This has happened to us on multiple occasions specifically when we delete event triggers. Hasura seems to go crazy with it until we clear the event logs.

To diagnose I recommend the following:

select  trigger_name, archived, count(*) from hdb_catalog.event_log
group by trigger_name, archived;

There should be no event triggers listed that have been removed or look out of place.

If there are then clear them individually to avoid clearing out too much data that you might need from other event triggers. Replace :

DELETE FROM hdb_catalog.event_invocation_logs
  WHERE event_id IN (
    SELECT id FROM hdb_catalog.event_log
    WHERE trigger_name = '<event_trigger_name>' );

DELETE FROM hdb_catalog.event_log
  WHERE trigger_name = '<event_trigger_name>';

These have fixed the issue for our case.

It'd be great for the Hasura team to understand what's really causing this but as you mentioned it seems related to infinite loops of retries or something.

All 17 comments

Update:

DELETE FROM hdb_catalog.event_invocation_logs;

DELETE FROM hdb_catalog.event_log;

Query seems to help, I haven't seen the error messages anymore.

  • It seems a failed event-trigger (it has 3 retries) may cause such an issue to happen

This has happened to us on multiple occasions specifically when we delete event triggers. Hasura seems to go crazy with it until we clear the event logs.

To diagnose I recommend the following:

select  trigger_name, archived, count(*) from hdb_catalog.event_log
group by trigger_name, archived;

There should be no event triggers listed that have been removed or look out of place.

If there are then clear them individually to avoid clearing out too much data that you might need from other event triggers. Replace :

DELETE FROM hdb_catalog.event_invocation_logs
  WHERE event_id IN (
    SELECT id FROM hdb_catalog.event_log
    WHERE trigger_name = '<event_trigger_name>' );

DELETE FROM hdb_catalog.event_log
  WHERE trigger_name = '<event_trigger_name>';

These have fixed the issue for our case.

It'd be great for the Hasura team to understand what's really causing this but as you mentioned it seems related to infinite loops of retries or something.

The error message definitely needs improvement.

This may happen when you have events in queue for an event trigger that you just deleted. But, it could also be some other race condition.

@tirumaraiselvan Thank you clarifying!

Would you be looking at fixing the issue, improving the error message, or both?

We would have to prevent this from happening as Hasura creates a very high number of DB connections until it maxes out. We use Cloud SQL which limits and cuts off other valid connections. Ultimately this results in intermittent functionality failures which is affects app functionality.

We were looking at implementing a mitigation that detects the issue and then automatically clears out the affected event logs but we'll skip it if you're looking at a fix.

We would have to prevent this from happening as Hasura creates a very high number of DB connections until it maxes out. We use Cloud SQL which limits and cuts off other valid connections. Ultimately this results in intermittent functionality failures which is affects app functionality.

You can always limit the number of connections Hasura creates via HASURA_GRAPHQL_PG_CONNECTIONS (default: 50)

We will definitely investigate the cause of this error as part of improving the error message. Currently, it is not clear why it is happening. The only likely situation that I can think of is that the events were getting processed while the event trigger was dropped in which case it is rather benign.

This may happen when you have events in queue for an event trigger that you just deleted. But, it could also be some other race condition.

For me it was an event trigger that had retried 4 times and had failed all times.

@tirumaraiselvan this error still appears after updating to 1.3.2

2020-09-07T23:39:06.453708315Z {"type":"event-trigger","timestamp":"2020-09-07T23:39:06.086+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:39:07.454725277Z {"type":"event-trigger","timestamp":"2020-09-07T23:39:06.086+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:39:07.454784718Z {"type":"event-trigger","timestamp":"2020-09-07T23:39:06.086+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:40:06.592350211Z {"type":"event-trigger","timestamp":"2020-09-07T23:40:06.204+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:40:06.592429803Z {"type":"event-trigger","timestamp":"2020-09-07T23:40:06.204+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:40:07.593581597Z {"type":"event-trigger","timestamp":"2020-09-07T23:40:06.204+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:41:06.748399829Z {"type":"event-trigger","timestamp":"2020-09-07T23:41:06.327+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:41:06.748460254Z {"type":"event-trigger","timestamp":"2020-09-07T23:41:06.327+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:41:07.749948355Z {"type":"event-trigger","timestamp":"2020-09-07T23:41:06.327+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:42:06.893702941Z {"type":"event-trigger","timestamp":"2020-09-07T23:42:06.455+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:42:06.893762517Z {"type":"event-trigger","timestamp":"2020-09-07T23:42:06.455+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:42:07.895079037Z {"type":"event-trigger","timestamp":"2020-09-07T23:42:06.455+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:43:07.038807559Z {"type":"event-trigger","timestamp":"2020-09-07T23:43:06.597+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:43:07.038866520Z {"type":"event-trigger","timestamp":"2020-09-07T23:43:06.597+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:43:08.040231802Z {"type":"event-trigger","timestamp":"2020-09-07T23:43:06.597+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:44:07.180090676Z {"type":"event-trigger","timestamp":"2020-09-07T23:44:06.742+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:44:07.180130289Z {"type":"event-trigger","timestamp":"2020-09-07T23:44:06.742+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:44:08.181308236Z {"type":"event-trigger","timestamp":"2020-09-07T23:44:06.742+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:45:07.319800084Z {"type":"event-trigger","timestamp":"2020-09-07T23:45:06.867+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:45:07.319856215Z {"type":"event-trigger","timestamp":"2020-09-07T23:45:06.867+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:45:08.320483837Z {"type":"event-trigger","timestamp":"2020-09-07T23:45:06.867+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:46:07.526498516Z {"type":"event-trigger","timestamp":"2020-09-07T23:46:06.970+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:46:07.526577878Z {"type":"event-trigger","timestamp":"2020-09-07T23:46:06.970+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:46:08.528339850Z {"type":"event-trigger","timestamp":"2020-09-07T23:46:06.970+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:47:07.691373407Z {"type":"event-trigger","timestamp":"2020-09-07T23:47:07.107+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:47:07.691451154Z {"type":"event-trigger","timestamp":"2020-09-07T23:47:07.107+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:47:08.692429740Z {"type":"event-trigger","timestamp":"2020-09-07T23:47:07.107+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:48:07.846698088Z {"type":"event-trigger","timestamp":"2020-09-07T23:48:07.246+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:48:08.848335482Z {"type":"event-trigger","timestamp":"2020-09-07T23:48:07.246+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:48:08.848402382Z {"type":"event-trigger","timestamp":"2020-09-07T23:48:07.246+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:49:07.999542997Z {"type":"event-trigger","timestamp":"2020-09-07T23:49:07.387+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:49:09.000834633Z {"type":"event-trigger","timestamp":"2020-09-07T23:49:07.387+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:49:09.000902762Z {"type":"event-trigger","timestamp":"2020-09-07T23:49:07.387+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:50:08.172744735Z {"type":"event-trigger","timestamp":"2020-09-07T23:50:07.501+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:50:09.174826859Z {"type":"event-trigger","timestamp":"2020-09-07T23:50:07.501+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:50:09.174894688Z {"type":"event-trigger","timestamp":"2020-09-07T23:50:07.501+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}

Unfortunately, the improved error msg in https://github.com/hasura/graphql-engine/pull/5718 didn't get included in v1.3.2. We will definitely try to incorporate it in the next one.

Meanwhile, pls check the comment https://github.com/hasura/graphql-engine/issues/5461#issuecomment-664463460 for getting to know the offending event trigger.

@tirumaraiselvan We've noticed a more serious issue related to these errors. It's now clear that these errors are leading to an overload of DB connections from Hasura.

It can lead to failures in deployments as a new instance can struggle to connect to the DB, getting rejected due to max connections. We use kubernetes with replicated instances and rolling updates so we run into this often unfortunately.

You can see evidence of this in these two charts:

Screenshot 2020-09-30 at 14 38 44
Screenshot 2020-09-30 at 14 38 58

DB Connections match up perfectly to the amount of these errors per second.

The gap in the middle is when we cleared up the errors as mentioned in previous comment (#5461)

Improving the logs is definitely a step in the right direction but ideally this would be fixed?

@petecorreia Thanks for adding more details. We are investigating this.

Just flagging that this is happened to us to after an upgrade. The log file flooded with gigabytes of the same event-trigger error and crashed the server.

After clearing logs as per https://github.com/hasura/graphql-engine/issues/5461#issuecomment-664463460 and manually deleting the log file things seem ok for now.

Hi @tirumaraiselvan - I work with @petecorreia and I have a little more detail to this issue which might help you track things down; specifically I think there is perhaps two separate but related bugs here that combine to cause this issue.

Before detailing the two issues we found, a little context on our setup:

  • we have two Hasura replicas connected to the same Cloud SQL DB
  • we use the cli-migrations/v2 image when deploying updates to run migrations and to update metadata because we only enable the graphql API for security purposes.
  • our Hasura instances are currently logging ~30 error messages per second caused by invalid event triggers and maxing out available connections to the DB which causes other processes to fail

Issue 1 - Metadata not applying properly

On investigating our DB which is currently suffering from this issue via psql we noticed that a set of triggers that should no longer exist were still present on some tables despite no longer being in the metadata files. It's not clear to us how this has happened. The process we followed to remove the triggers was to remove them via the Hasura console, commit the changes to tables.yml and then deploy an updated docker image containing the updated metadata using the cli-migrations/v2 which theoretically should apply that metadata before starting Hasura.

It seems at some point this metadata application on deploy failed, and then once this had happened Hasura wasn't able to detect that the triggers had not been removed (I understand this may be difficult because there may also be user created triggers within the DB).

So at this point we have the unfortunate situation that Postgres still has triggers attached to a table, but from Hasura's perspective these triggers no longer exist.

Issue 2 - Event invocations for missing triggers does not update tries on the hdb_catalog.event_log table

When inspecting the event_log table, looking at some of the invalid calls I noticed the following issue: when Hasura sees an event for an invalid trigger it does not seem to update the tries field on the event_log table. This means that each event never hit the max_retries value so Hasura keeps trying these bogus events forever.

id            | 6fe105d7-085e-447c-ab15-5a7c7eee9125
schema_name   | public
table_name    | <redacted_table_name>
trigger_name  | <redacted-event-trigger-name>
payload       | {"op": "DELETE", "data": {...}": {"x-hasura-role": "admin"}}
delivered     | f
error         | f
tries         | 0
created_at    | 2020-10-05 13:00:22.745178
locked        | f
next_retry_at | 2020-10-05 15:50:23.507937
archived      | f

In this example entry from the hdb_catalog.event_table you can see that this event was created at 13:00 and Hasura has been retrying this event every minute since then, but tries remains at 0.

I also note that the error field is false when perhaps that should also be true.

Thanks in advance for your help

@smulube

Issue 2 - Event invocations for missing triggers does not update tries on the hdb_catalog.event_log table

Thanks for the insight, I also checked event_log table on my end and I have the same problem. The number of logs show up drastically with the number of events that failed to deliver.

@hrgui - we have attempted to mitigate this by adding a manual migration to remove any invalid triggers along the lines of:

DROP TRIGGER IF EXISTS "notify_hasura_table_name-change_UPDATE" ON table_name CASCADE;
DROP TRIGGER IF EXISTS "notify_hasura_table_name-change_INSERT" ON table_name CASCADE;
DROP TRIGGER IF EXISTS "notify_hasura_table_name-change_DELETE" ON table_name CASCADE;

where those trigger names come from using psql to inspect each table and identify any triggers that should no longer exist.

We also added this function to delete invalid triggers:

CREATE OR REPLACE FUNCTION purge_invalid_events() RETURNS void AS $$
BEGIN
    WITH valid_trigger_names AS (
        SELECT DISTINCT name FROM hdb_catalog.event_triggers
    ), invalid_event_ids AS (
        SELECT id FROM hdb_catalog.event_log
        WHERE trigger_name NOT IN (SELECT * FROM valid_trigger_names)
    ), delete_event_invocations AS (
        DELETE FROM hdb_catalog.event_invocation_logs
        WHERE event_id IN (SELECT * FROM invalid_event_ids)
    )
    DELETE FROM hdb_catalog.event_log
    WHERE trigger_name NOT IN (SELECT * FROM valid_trigger_names);
END;
$$ LANGUAGE plpgsql;

which we call daily via a cron job. We're hoping this will mitigate the problem for the time being but hopefully there will be some cleaner way of fixing this than this workaround as we'll continue to need to check the database for any dangling invalid triggers.

I have been able to reproduce this, finally. Thanks for your details and patience.

Issue 2 - Event invocations for missing triggers does not update tries on the hdb_catalog.event_log table

This was done by design so as to not fail silently for an event which hitherto was known to be caused only for transient issues. We will see how we can improve this as well.

Basically, replace_metadata is broken when dropping event triggers. Repro:

  1. create table test(id serial primary key, name text);
  2. Initial metadata:
{ "type": "replace_metadata",
  "args":
{
  "version": 2,
  "tables": [
    {
      "table": {
        "schema": "public",
        "name": "test"
      },
      "event_triggers": [
        {
          "name": "test_trigger",
          "definition": {
            "enable_manual": false,
            "insert": {
              "columns": "*"
            }
          },
          "retry_conf": {
            "num_retries": 0,
            "interval_sec": 10,
            "timeout_sec": 60
          },
          "webhook": "http://httpbin.org/post"
        }
      ]
    }
  ]
}
}
  1. Final metadata:
{ "type": "replace_metadata",
  "args":
{
  "version": 2,
  "tables": [
    {
      "table": {
        "schema": "public",
        "name": "test"
      }
  }
  ]
}
}
  1. If you insert into test, then you can observe the errors. You can also see the trigger functions still exist in hdb_views schema.

Hey folks, while we roll out a fix for this in the next few days (v1.3.3). You can run this query to clear any invalid postgres triggers:

curl -d'{"type": "run_sql", "args": { "sql": "select 1", "check_metadata_consistency": true}}' localhost:8080/v1/query

Note that you may need to clear the events which might have been created due to these invalid triggers:

    WITH valid_trigger_names AS (
        SELECT DISTINCT name FROM hdb_catalog.event_triggers
    ), invalid_event_ids AS (
        SELECT id FROM hdb_catalog.event_log
        WHERE trigger_name NOT IN (SELECT * FROM valid_trigger_names)
    ), delete_event_invocations AS (
        DELETE FROM hdb_catalog.event_invocation_logs
        WHERE event_id IN (SELECT * FROM invalid_event_ids)
    )
    DELETE FROM hdb_catalog.event_log
    WHERE trigger_name NOT IN (SELECT * FROM valid_trigger_names);

Courtesy: https://github.com/hasura/graphql-engine/issues/5461#issuecomment-706095505

Another way is to run this SQL, before any metadata apply:

CREATE TEMP TABLE invalid_triggers AS
WITH valid_event_triggers AS (
  SELECT name FROM hdb_catalog.event_triggers
),
  archive_invalid_events AS (
  UPDATE hdb_catalog.event_log set archived = 't'
  WHERE trigger_name NOT IN (select name from valid_event_triggers)
)
SELECT routine_name FROM information_schema.routines
WHERE routine_type='FUNCTION' AND specific_schema='hdb_views' AND routine_name NOT IN (
  SELECT 'notify_hasura_' || name || '_INSERT' FROM valid_event_triggers
  UNION
  SELECT 'notify_hasura_' || name || '_UPDATE' FROM valid_event_triggers
  UNION
  select 'notify_hasura_' || name || '_DELETE' FROM valid_event_triggers
);

DO $$ DECLARE
r RECORD;
BEGIN
  FOR r IN (SELECT routine_name from invalid_triggers) LOOP
    EXECUTE 'DROP FUNCTION IF EXISTS hdb_views.' || quote_ident(r.routine_name) || ' CASCADE';
  END LOOP;
END $$;
Was this page helpful?
0 / 5 - 0 ratings

Related issues

egislook picture egislook  路  3Comments

rikinsk-zz picture rikinsk-zz  路  3Comments

codepunkt picture codepunkt  路  3Comments

sachaarbonel picture sachaarbonel  路  3Comments

tirumaraiselvan picture tirumaraiselvan  路  3Comments