+1, I use this a lot via https://github.com/litl/backoff.
I saw this has a needs description. I just came here because my first thought reading https://docs.prefect.io/core/tutorial/04-handling-failure.html#if-at-first-you-don-t-succeed was "I wonder if prefect also supports exponential backoff". I'd like propose a description, hope it helps:
It's common for tasks that pull data from external systems to fail because of temporary unavailability of those systems. Those systems may, for example, include REST APIs, databases, file systems, or message brokers.
Some of the root causes for that unavailability may be unaffected by consumers putting load on the system, including:
In those cases, it makes sense to wait a bit and retry the tasks. prefect offers an easy way to do this with arguments max_retries and retry_delay to @task(). These arguments allow you to say "if this task fails, try it n more times and wait s seconds between each retry".
However, this type of logic could defeat its own purpose if the unavailability is because of a problem which is made worse by new load from consumers, including:
In these situations, it's advisable to use a retry strategy which waits longer and longer after each failure. If all consumers do this, it will give the external system a better chance to recover. For one example reference (there are many), see this AWS blog post.
prefectImplementing exponential backoff means that the amount of time to wait between retries is a function of the number of retries so far.
It might look like this pseudocode:
# what is the shortest time you're willing to wait after a failed attempt?
wait_min = 1
# what is the longest you're willing to wait between retries?
wait_max = 10
# how fast do you want waiting time to scale relative to number of attempts?
wait_base = 2
# what is the most times you're willing to retry before saying a task failed?
max_attempts = 5
keep_trying = True
num_attempts_so_far = 0
while keep_retrying:
result = task.run()
num_attempts_so_far += 1
if result == SUCCESS:
keep_retrying = False
elif num_attempts_so_far == max_attempts:
keep_trying = False
else:
time_to_wait = max(
wait_min,
min(
wait_max,
random.uniform(0, wait_base * 2 ** num_attempts_so_far)
)
)
time.sleep(time_to_wait)
To support this for prefect tasks, max_retries and retry_delay from the existing task() API could probably be reused. It might make sense to map retry_delay to wait_min. Then you'd have to give people the ability to add wait_base and wait_max, and probably one more keyword argument like retry_strategy="exponential_backoff".
I don't know enough about the prefect API to suggest other implementations but hopefully this description at least makes the problem and the value of solving it concrete.
Thanks for your time and consideration!
An alternative implementation suggestion.
Accept an Iterable[TimeDelta] for retry_delay that will yield the next retry delay every time it is invoked. If this is not possible because the context doesn't allow it, it might be a Callable that receives the retry count and returns a retry delay.
Then, a set of retry strategies can be built on top of it.
Most helpful comment
An alternative implementation suggestion.
Accept an
Iterable[TimeDelta]forretry_delaythat will yield the next retry delay every time it is invoked. If this is not possible because the context doesn't allow it, it might be aCallablethat receives the retry count and returns a retry delay.Then, a set of retry strategies can be built on top of it.