Prefect: Add exponential backoff to task retries

Created on 30 Jun 2018 · 3Comments · Source: PrefectHQ/prefect

enhancement needs description

Source

jlowin

👍1

Most helpful comment

An alternative implementation suggestion.

Accept an Iterable[TimeDelta] for retry_delay that will yield the next retry delay every time it is invoked. If this is not possible because the context doesn't allow it, it might be a Callable that receives the retry count and returns a retry delay.
Then, a set of retry strategies can be built on top of it.

bachsh on 21 May 2020

👍4

All 3 comments

+1, I use this a lot via https://github.com/litl/backoff.

evan-burke on 12 Feb 2020

I saw this has a needs description. I just came here because my first thought reading https://docs.prefect.io/core/tutorial/04-handling-failure.html#if-at-first-you-don-t-succeed was "I wonder if prefect also supports exponential backoff". I'd like propose a description, hope it helps:

Description of the problem

It's common for tasks that pull data from external systems to fail because of temporary unavailability of those systems. Those systems may, for example, include REST APIs, databases, file systems, or message brokers.

Some of the root causes for that unavailability may be unaffected by consumers putting load on the system, including:

temporary downtime due to maintenance like OS upgrades
temporary downtime due to data migrations
unreliable connection due to poor internet connectivity

In those cases, it makes sense to wait a bit and retry the tasks. prefect offers an easy way to do this with arguments max_retries and retry_delay to @task(). These arguments allow you to say "if this task fails, try it n more times and wait s seconds between each retry".

However, this type of logic could defeat its own purpose if the unavailability is because of a problem which is made worse by new load from consumers, including:

resource exhaustion
- example: 100% CPU utilization caused by a large number of requests or a few very-expensive requests (e.g. database queries that result in a full table scan)
quota limits reached
- example: many relational databases cap the number of connections which can be open simultaneously
routing bottleneck
- example: service sits behind a load balancer and the load balancer is receiving new requests faster than it can route them

In these situations, it's advisable to use a retry strategy which waits longer and longer after each failure. If all consumers do this, it will give the external system a better chance to recover. For one example reference (there are many), see this AWS blog post.

What it might look like to add this to `prefect`

Implementing exponential backoff means that the amount of time to wait between retries is a function of the number of retries so far.

It might look like this pseudocode:

# what is the shortest time you're willing to wait after a failed attempt?
wait_min = 1

# what is the longest you're willing to wait between retries?
wait_max = 10

# how fast do you want waiting time to scale relative to number of attempts?
wait_base = 2

# what is the most times you're willing to retry before saying a task failed?
max_attempts = 5

keep_trying = True
num_attempts_so_far = 0
while keep_retrying:
    result = task.run()
    num_attempts_so_far += 1
    if result == SUCCESS:
        keep_retrying = False
    elif num_attempts_so_far == max_attempts:
        keep_trying = False
    else:
        time_to_wait = max(
            wait_min,
            min(
                wait_max,
                random.uniform(0, wait_base * 2 ** num_attempts_so_far)
            )
        )
        time.sleep(time_to_wait)

To support this for prefect tasks, max_retries and retry_delay from the existing task() API could probably be reused. It might make sense to map retry_delay to wait_min. Then you'd have to give people the ability to add wait_base and wait_max, and probably one more keyword argument like retry_strategy="exponential_backoff".

I don't know enough about the prefect API to suggest other implementations but hopefully this description at least makes the problem and the value of solving it concrete.

Thanks for your time and consideration!