Airflow: Looping issue using Hashicorp Vault

Created on 7 Jul 2020 · 8Comments · Source: apache/airflow

Issue:
We encountered an issue that is mainly caused by overloading calls from Airflow to Vault while Vault was experiencing downtime.
Context:
A few days ago while Vault was down for 2 - 3 hours, all the worker pods in our Airflow cluster on K8s started retrying to pull credentials (URI) from Vault. This caused two things: the first one is that we found that airflow tried more than 500 calls per hour to call Vault and then lead to the second issue: our DevOps couldn't connect to Vault and figured out the root cause that brought down Vault. Thus Airflow became the target of blame...
Question:
Is there any built-in retry strategy to prevent this kind of issue happens again? To generalize this issue, it can occur while using any secret manager, like AWS secrets manager, Google sm, or even RDS.
If not and if you think it is worth being implemented, I would like to take this ticket to implement this kind of strategy to prevent this issue.

Thanks

feature

Source

benbenbang

All 8 comments

Currently there is no built-in retry strategy to prevent the issue you mentioned

I am happy for you to take on this ticket and implement a strategy to prevent this kind of issue.

Assigned the ticket to you, let me know how I can help.

kaxil on 9 Jul 2020

👍1

Thanks @kaxil, I am gonna deep dive to the source code first. I will bring more questions back to you.

benbenbang on 11 Jul 2020

👍1

Hello @potiuk, @kaxil

I need some suggestions on the default_var has been set case ( not sure is it fixed in 2.x or not ?)

if now we want to get variable by using Variable.get("X", default_var=123)
This is how the flow goes in 1.10.x:
↓ You want to get Var X and you give a default_var 123
------- Airflow Variable ------
↓ Go to Vault backend
------- Vault -----------------
↓ Vault Cannot find X
↓ log print out and return None
------- Airflow Variable ------
↓ find default_var 123
↓ return 123

Firstly, the log will be really confused. Since we cannot know will it return anything or not. Because in the Vault backend layer it doesn't know we did pass a default variable.

Secondly, is there a better practice to do the retry? I mean seems like keep trying to get the value it makes some sense but should we have an option that allows us to set a cooldown timer?

Thanks

benbenbang on 7 Aug 2020

My proposals:

1) I think we can implement tenacity's retrieval mechanism (https://pypi.org/project/tenacity/) with exponential back-off. We use it in a few places already and we could easily set a default sane behaviour and make sure in the interface we define a "Temporary" problem that needs retry. This needs to be done at the "backend" level (so that retrieving multiple variables from the backend are also treated this way).

2) I think also - I do not think having a "local" default" as fallback is a good idea in general. The way it is implemented now was implemented to make it easy to introduce by users of the previous variable mechanisms, but I think we should do it differently long term. I think we should have a way to define whether certain variable is "expected" to be retrieved from the secret backend ("secret=true") either at the DAG level (parameter) or possibly at the definition of the variable in the metastore (so reverse the check - first we check in metastore if a given variable has "expect_secet") flag set and then check tt in the secret.). I do not see any case where something that is expected to be secured in the Secret Backend has a reasonable local fallback value.

@kaxil WDYT?

potiuk on 18 Aug 2020

Thanks, @potiuk

To be more clear on the variable part, actually, I am only talking about Variable no connection at all. So basically these values shouldn't have any credentials involved, it will be purely some job-based parameters like some kind of dictionary that define some tasks as keys and the values are the details that need to be done, for example.

benbenbang on 24 Aug 2020

Hello @potiuk, @kaxil
We got the same issue today, airflow sent 12k per sec to Vault. And the outrage brought down Vault for more than 30 minutes. Are we the only one who has this issue?

benbenbang on 15 Oct 2020

@benbenbang This definitely needs to be addressed, one thing I am thinking of is a "ttl" for a local cache that will significantly reduce the load and the second thing is what Jarek mentioned around using retry mechanism

kaxil on 15 Oct 2020

Alright, I will start to do the dev asap.

For ref, this is insane:
grafana