Description
Allow running Google Cloud operators as a service account without having to provide key material when Airflow runs on GCP. If the service account of the Compute instance on which Airflow is running has been granted the "Service Account Token Creator" role on the target service account with which I want to run my operator, I do not need to download or provide any key material for the impersonation to happen. This is a much more secure way to impersonate service accounts.
Use case / motivation
https://github.com/googleapis/google-auth-library-python/blob/master/docs/user-guide.rst#impersonated-credentials
Related Issues
None
Hi. It looks interesting. Would you like to start working on this change?
I talked to the Google team about this feature. We had serious security concerns. The implementation of this feature in the current Airflow architecture meant that DAG or the operator could access the access key or service account file that allows you to log in to any other account. This is unacceptable. We must think about how to provide this feature without introducing such a security risk. Ideally, the scheduler would not have access to any object from Connection and would only communicate using the API. However, this is unlikely to happen in the near future.
Another solution is to create a separate component that generates access tokens based on an allow list. Such a component could be based on Hashicorp Vault (https://www.vaultproject.io/docs/secrets/gcp#access-tokens) or a similar tool.
Another solution is to create a worker for each service account and use Workload Identity to provide the access token.
Each of these would probably require a lot of work.
The easiest option is to generate keys and add them to a secrets backend.
Thanks Kamil. I don't think this statement is accurate in the context I was proposing - "The implementation of this feature in the current Airflow architecture meant that DAG or the operator could access the access key or service account file that allows you to log in to any other account. This is unacceptable."
The account used by the airflow worker can impersonate another service account only if granted the appropriate permissions through IAM, so I don't think you can log into any account. Secondly, even when using a secrets backend, your DAGs still have permissions to access any secret in there (unless I'm missing something).
This is the relevant documentation.
Not having to deal with key material has several benefits:
1: There are no key rotations to deal with.
2: When Airflow operations are centralized in an organization, no coordination is required for key management and transfer during setup; everything is controlled through IAM.
3: Controlling IAM access through Terraform becomes easier, since no key generation, transfer, or loading is required.
I may be missing something, of course!
Unless I am mistaken, account impersonation does not really solve any of the problems you mentioned (like key rotation or key management), because the main service account can do anything via impersonation, and you still have access to this account. So my first thought is that there is no added value in using impersonation for the purpose you described.
For example if someone steals the main "service account" credentials, that someone can still impersonate any of the other service accounts and do whatever those service accounts can do. You still have to manage the main service account key I believe and rotate it, and additionally you do not have separate access for each key, instead you have one "uber" service account that can impersonate any other service account and do everything. Which is not a good idea I think.
But maybe I do not fully understand what exactly you want to achieve and how this all plays with the different roles you have in mind (like admin/dag user etc.). I'd love to understand more from you and maybe see a diagram (not sure if I can ask for it) showing what the key management and service account structure would look like.
Thanks for the response Jarek. I should state my assumptions :)
Assume that long-lived keys were generated for the main service account used by Airflow, and that these were compromised.
The risk of access to all accounts remains the same in both models. They can access all connections (in the first model) and retrieve key information, or impersonate all other service accounts that have granted permissions to the main account.
The resolution is harder in the current model: the compromised main account key will need to be disabled, and there's a risk that the key material for all other accounts was extracted and thus compromised as well. So the keys used in all connections will need to be disabled too.
In the proposed model, only the main account key will need to be disabled.
Hope this makes my request clearer.
Crystal-clear! Thank you. It does look reasonable. I will run it by a few people to see what they think and come back to you!
I like this idea, which I have already discussed in detail, including the API flow, with @mik-laj in a long conversation. Similar solutions are currently available from other cloud providers, among others AssumeRole at AWS and TokenExchange at HyperOne. I would note that interoperability should be strongly taken into account when designing this element, because, taken to the extreme, we could end up with an Airflow that works effectively in GCP only.
I'm thinking of the user interface that needs to be developed to integrate impersonation with Apache Airflow. In this issue, I have the following questions about GCP, which seem to me crucial for designing the appropriate change in Apache Airflow:
@olchas Do you want to work on it? It looks like this is a task for you.
Dear @olchas, I will be very happy to support you to implement this in an effective way. I have spent hours analyzing these kinds of mechanisms at different cloud providers. I do not have deep knowledge about Airflow, but @mik-laj is on the issue and he will surely support us with his excellent knowledge about Airflow.
@mik-laj sure. Could you assign me to this issue, please?
@ad-m, thanks for the offer. I am uncertain about supporting impersonation via an environment variable. As far as I know, there is no mechanism to provide the value of an environment variable specifically for a single Task Instance, so with this approach all Task Instances would be impersonating the same account (for example, GOOGLE_APPLICATION_CREDENTIALS is used only if no other account details have been provided in the gcp connection).
I was looking into the GoogleBaseHook implementation and I was thinking about specifying the chain of accounts to impersonate in the extras field of connection used by the hook, similarly to how other fields, like scopes or key_path, are provided. This way we would avoid the necessity to modify every hook derived from GoogleBaseHook and operators that are using them, as the information would be provided in the connection, not in operator definition. I guess we could follow the same solution in hooks dedicated for other cloud providers but I haven't looked at them yet.
However, this would still require a separate connection for every impersonated account, even if all of them were using the same service account as a source. You would not have to generate or rotate keys for impersonated accounts, but with a lot of accounts being impersonated by the same source account, the effort of keeping connections consistent would fall on the team managing Airflow.
WDYT about this approach?
@olchas It seems to me that we should define it at the task level. From the user's point of view, this should be as easy to use as in gcloud.
gcloud \
    --account=[email protected] \
    --impersonate-service-account=test-kamil@polidea-airflow.iam.gserviceaccount.com \
    auth print-access-token
There is only one difference. Instead of using the --account option, we have gcp_conn_id.
If you want to play around with it then you can use the script below.
MAIN_ACCOUNT="[email protected]"
SECONDARY_ACCOUNT="[email protected]"
# Token issued for the main account itself.
ACCESS_TOKEN="$(gcloud \
    --account="${MAIN_ACCOUNT}" \
    auth print-access-token)"
curl -q "https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=${ACCESS_TOKEN}"
# Token issued when the main account impersonates the secondary account.
ACCESS_TOKEN="$(gcloud \
    --account="${MAIN_ACCOUNT}" \
    --impersonate-service-account="${SECONDARY_ACCOUNT}" \
    auth print-access-token)"
curl -q "https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=${ACCESS_TOKEN}"
Remember that you need to have the appropriate permissions to use this feature.
If you are using gcloud then you might want to enable the options below as well, which will allow you to better understand the flow.
gcloud config set core/log_http true
gcloud config set core/log_http_redact_token false
Please note that the second option is not described in the public documentation, so be careful.
@mik-laj I agree, for users it will definitely be better to be able to specify it at task level.
I was looking at the implementation of the google.auth.impersonated_credentials module, and there I found that you can actually specify a chain of service accounts leading to the final one, which grants the access token used for the request. So I think the new argument for operators and hooks should accept either a string with a single service account or a list. In the list case, the last account is used as target_principal, while the rest are used as delegates. I think keeping one argument for operators/hooks designated for impersonation should suffice.
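To make the string-or-list idea concrete, here is a minimal sketch of how such an argument could be split into a target principal and delegates. The helper name split_impersonation_chain is hypothetical and is not part of Airflow or google-auth; only the target_principal and delegates names come from the discussion above.

```python
from typing import List, Optional, Sequence, Tuple, Union


def split_impersonation_chain(
    impersonation_chain: Union[str, Sequence[str], None],
) -> Tuple[Optional[str], Optional[List[str]]]:
    """Split a chain into (target_principal, delegates).

    Hypothetical helper for illustration only; it is an assumption about
    how the argument could be handled, not the actual implementation.
    """
    # No impersonation requested: use the source credentials directly.
    if impersonation_chain is None:
        return None, None
    # A single account given as a string is the target, with no delegates.
    if isinstance(impersonation_chain, str):
        return impersonation_chain, None
    chain = list(impersonation_chain)
    if not chain:
        raise ValueError("impersonation_chain must not be empty")
    # The last account is the target; all earlier ones are delegates.
    *delegates, target_principal = chain
    return target_principal, delegates or None
```

The two returned values could then be passed as the target_principal and delegates arguments of google.auth.impersonated_credentials.Credentials, together with the source credentials and the target scopes.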
@olchas Sounds good to me. Can you prepare a POC with one operator and no unit tests?
@amithmathew We are still working on the documentation, but could you please have a look if the current implementation looks good for you?
Will do, will take a look at it this week.
Took a look and did some quick and dirty tests with the impersonation_chain parameter. Looks good to me.
I would love to see this implemented for Dataflow as well, will follow #10596.
@olchas Is it done? Is there anything else to do that is not described in #10596?
Hi, @mik-laj, sorry for not responding. I guess one more thing not covered in https://github.com/apache/airflow/issues/10596 might be adding an example dag showing the usage of impersonation.
@olchas I am closing this ticket. Can you create a ticket about the missing DAG example?