Graphql-engine: Dead letter facility for event-trigger webhooks

Created on 30 Apr 2019 · 11 comments · Source: hasura/graphql-engine

We'd like a mechanism by which we can find out that webhooks (for event-triggers) are failing to run (after the max_retries and retry_interval based process is exhausted).

This "dead-letter queue" mechanism could perhaps be nominated on a global/installation basis.

Implemented how?

  • Brutalism: perhaps it is just a "well documented, semi-official SQL query" we can schedule to run regularly, which "selects" all failed event-triggers (see the sketch after this list). This query would run against the extended Hasura metadata part of the database, I assume. Easy, but exposing implementation details is never a good idea.
  • Better: perhaps this dead-letter-queue should itself be a webhook to which failed events can be POSTed. Then it would be up to me, a developer, to ensure that this dead-letter webhook actually works via an alternative infrastructure to that which is failing elsewhere.
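
For concreteness, here is a minimal sketch of the "brutalist" option: a script, run on a schedule, that selects exhausted events straight out of Hasura's internal catalog. The hdb_catalog.event_log table and its delivered/error/tries columns are assumptions about internal implementation details (which is exactly the downside noted above) and may change between releases.

```typescript
// Scheduled dead-event check -- a sketch, not a supported API.
// Assumes Hasura's internal hdb_catalog.event_log table, where an event
// whose retries are exhausted is left with delivered = false, error = true.
import { Client } from "pg";

async function findDeadEvents(): Promise<void> {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    const { rows } = await client.query(
      `SELECT id, trigger_name, tries, created_at
         FROM hdb_catalog.event_log
        WHERE delivered = false
          AND error = true
        ORDER BY created_at DESC
        LIMIT 100`
    );
    if (rows.length > 0) {
      // Surface however you like: logs, metrics, alerting...
      console.warn(`Found ${rows.length} dead events:`, rows);
    }
  } finally {
    await client.end();
  }
}

findDeadEvents().catch(console.error);
```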
Labels: event-triggers · server · enhancement · high · triag2-needs-rfc

All 11 comments

So, just to be clear: for webhooks to be robust, this feature is essential. event-triggers are an important part of our architecture, and it is "a bad thing" that they can potentially be failing silently (perhaps because we simply misconfigured the URL).

So, I wouldn't be labelling this issue as "an idea". I'd say it more strongly and claim that it is a bug that this mechanism isn't there already. Please forgive my pushiness.

It is a well-known pattern/need. For example, AWS's SQS supports the idea of a dead-letter queue.

@tirumaraiselvan and I had been discussing error handling for dead events too, so this is definitely on point.

@tirumaraiselvan @dsandip Let's take this up quickly!

@mike-thompson-day8 Yes, something like this has been in the plan.

For the time being, you may use the delivery_info in the payload, with which you can determine whether this is the last retry or not, and maybe do some custom error handling in the webhook itself (sketched below). This will obviously be doable only in limited cases.
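
As a sketch of that workaround (assuming the event payload shape where delivery_info carries current_retry and max_retries), a webhook could detect its own final retry and do its own dead-lettering:

```typescript
// Express webhook that performs its own dead-letter handling on the final
// retry. Only helps when the webhook itself is reachable and failing --
// it cannot catch a mistyped URL, which is the case this issue is about.
import express from "express";

const app = express();
app.use(express.json());

app.post("/webhook", (req, res) => {
  const { id, delivery_info } = req.body;
  try {
    handleEvent(req.body); // your actual event processing
    res.status(200).json({ ok: true });
  } catch (err) {
    if (delivery_info.current_retry >= delivery_info.max_retries) {
      // Last attempt: record the failure somewhere durable ourselves.
      console.error(`Event ${id} exhausted its retries`, err);
    }
    res.status(500).json({ ok: false });
  }
});

function handleEvent(payload: unknown): void {
  // ... domain logic goes here ...
}

app.listen(3000);
```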

@mike-thompson-day8 Some more thoughts on this:

How do you intend to use the dead letter queue? As a notification system to check for failures, or as another retry mechanism for a guaranteed ack? We must keep in mind that we cannot expect a 100% guaranteed ack from any dead letter endpoint either. Using a dead letter queue as a tool for guaranteed processing might only cause more obfuscation.

So, a dead letter endpoint might be useful for notification use-cases: it tells you to run diagnostics using your custom playbook. We can provide APIs to fetch dead events, remove events, and so on, which you would use in your playbook. Again, the notification is not 100% guaranteed, so it should be one of many alarms in your system.

Suppose I were to misconfigure an event trigger by mistyping the webhook URL: instead of http://... I put htt://..., or some other equally bad typo.

At the moment, max_retries and retry_interval won't help to deliver such an event. It will ultimately fail completely. But currently it fails silently: we'll never know there's a problem.

I'm looking for a way to be told there's a problem, via a channel like Slack or email.

BTW, after I fix the problem, I'd probably be wishing that there was a way to reprocess all failed triggers. But that's another matter.

Cool. So it's more of a notification system. In that case, are you okay with the "fire and forget" design of the dead letter endpoint? Whenever we encounter a failed event, we hit the notification endpoint once.

@mike-thompson-day8 We are beginning to work on this issue. Since a "fire and forget" notification would not work for reliability, we are thinking of providing a "monitoring" webhook instead. The monitoring webhook will periodically receive stats about the event trigger, including the number of pending and failed events, a breakdown by the hour, etc.

Since the monitoring webhook runs forever, you are effectively guaranteed to receive the notification at some point.

Yeah, I think that would work.

This monitoring endpoint could be a serverless endpoint somewhere that sends an email when it sees many failed webhooks, or many being retried, or something along those lines (see the sketch below). That satisfies my goal of "not failing silently".
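
Something like the following, say. The stats payload shape (trigger_name, pending_events, failed_events) is hypothetical, since the monitoring webhook is only being proposed in this thread; SLACK_WEBHOOK_URL is assumed to be a Slack incoming-webhook URL, and the runtime is assumed to provide a global fetch (Node 18+).

```typescript
// Hypothetical monitoring endpoint: receives periodic event-trigger stats
// from Hasura and forwards an alert to Slack when failures pile up.
import express from "express";

const app = express();
app.use(express.json());

const FAILURE_THRESHOLD = 10;

app.post("/monitor", async (req, res) => {
  // Payload shape is assumed -- the monitoring webhook is only a proposal.
  const { trigger_name, pending_events, failed_events } = req.body;
  if (failed_events >= FAILURE_THRESHOLD) {
    await fetch(process.env.SLACK_WEBHOOK_URL!, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        text: `Trigger ${trigger_name}: ${failed_events} failed, ${pending_events} pending events`,
      }),
    });
  }
  res.status(200).json({ ok: true });
});

app.listen(3000);
```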

So, questions:

  1. would it be one monitoring webhook per regular webhook?
  2. a configurable cadence? Once every hour? Once every 5 mins?

Aside: I'll mention again the related issue of "being able to request that all failures be retried again" (use case: after some configuration problem is fixed).
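
For what it's worth, a sketch of that "retry everything" step, leaning on the same undocumented hdb_catalog.event_log assumption as the earlier query: resetting the delivered/error/tries flags should make the event processor pick the events up again, but this is internal behaviour, not a supported API. The trigger name "notify_user" is just an illustration.

```typescript
// Re-queue all errored events for one trigger -- an unsupported sketch that
// assumes resetting delivered/error/tries in hdb_catalog.event_log causes
// the event processor to retry the events.
import { Client } from "pg";

async function requeueFailedEvents(triggerName: string): Promise<number> {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    const result = await client.query(
      `UPDATE hdb_catalog.event_log
          SET delivered = false, error = false, tries = 0
        WHERE trigger_name = $1
          AND error = true`,
      [triggerName]
    );
    return result.rowCount ?? 0;
  } finally {
    await client.end();
  }
}

// Hypothetical trigger name, used after fixing the webhook URL:
requeueFailedEvents("notify_user").then((n) => console.log(`Re-queued ${n} events`));
```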

I would like to add another issue. What if it's a network issue on the sender side? I think these events need to be written to a table primarily. We can have the notify webhook/endpoint, but dead letter events should be written to a table first, so they aren't dropped in the case of a network issue.

I haven't looked too closely at how events are implemented, but do events get lost if the server restarts while processing an event? If so, these two problems are closely related and should be fixable with the same mechanism.

@taylorhakes No, none of the events are lost as they are persisted for reliability.

@tirumaraiselvan What is the current state of this issue? Currently the possibility of silent failure keeps me from using event triggers. Thanks a lot in advance :)
