Airflow: Impersonate service accounts while running GCP Operators without key material (if airflow is running on GCP)

Created on 10 May 2020 · 20 Comments · Source: apache/airflow

Description
Allow running Google Cloud operators using Service Accounts, without having to provide key material while running on GCP. If the Service Account of the Compute instance on which Airflow is running has been granted the "Service Account Token Creator" role on the target Service Account with which I want to run my operator, I do not need to download or provide any key material for the impersonation to happen. This is a much more secure way to impersonate service accounts.

Use case / motivation

https://github.com/googleapis/google-auth-library-python/blob/master/docs/user-guide.rst#impersonated-credentials

Related Issues

None

feature Google

All 20 comments

Thanks for opening your first issue here! Be sure to follow the issue template!

Hi. It looks interesting. Would you like to start working on this change?

I talked to the Google team about this feature. We had serious security concerns. The implementation of this feature in the current Airflow architecture meant that DAG or the operator could access the access key or service account file that allows you to log in to any other account. This is unacceptable. We must think about how to provide this feature without introducing such a security risk. Ideally, the scheduler would not have access to any Connection object and would only communicate using the API. However, that is unlikely to happen in the near future.

Another solution is to create a separate component that issues access tokens based on an allow list. Such a component could be based on Hashicorp Vault (https://www.vaultproject.io/docs/secrets/gcp#access-tokens) or a similar tool.

Another solution is to create a worker for each service account and use Workload Identity to give each worker its access token.

Probably, each of them would require a lot of work.

The easiest will be to generate keys and add them to secret backends.

Thanks Kamil. I don't think this statement is accurate in the context I was proposing - "The implementation of this feature in the current Airflow architecture meant that DAG or the operator could access the access key or service account file that allows you to log in to any other account. This is unacceptable."

The account used by the airflow worker can impersonate another service account only if granted the appropriate permissions through IAM, so I don't think you can log into any account. Secondly, even when using a secrets backend, your DAGs still have permissions to access any secret in there (unless I'm missing something).

This is the relevant documentation.

Not having to deal with key material means:
1: No key rotations to deal with.
2: When Airflow operations are centralized in an organization, no coordination is needed for key management and transfer during setup; everything is controlled through IAM.
3: Controlling IAM access through Terraform becomes easier: no key generation, transfer, or loading is required.

I may be missing something, of course!

Unless I am mistaken, account impersonation does not really solve any of the problems you mentioned (like key rotation or key management), because the main service account that you have can do anything via impersonation, and you continue to have access to this account. So my first thought is that there is no added value in using impersonation for the purpose you described.

For example if someone steals the main "service account" credentials, that someone can still impersonate any of the other service accounts and do whatever those service accounts can do. You still have to manage the main service account key I believe and rotate it, and additionally you do not have separate access for each key, instead you have one "uber" service account that can impersonate any other service account and do everything. Which is not a good idea I think.

But maybe I do not fully understand what exactly you want to achieve and how this all plays with the different roles you have in mind (like admin/dag user etc.). I'd love to understand more from you and maybe see a diagram (not sure if I can ask for it) showing what the key management and service account structure would look like.

Thanks for the response Jarek. I should state my assumptions :)

Assumptions

  1. Airflow is running on GCP (so workers use the instance accounts) - no long lived keys generated for the main service account.
  2. Airflow is centralized and lives in its own project - DAGs can be run in other projects, and owned by other teams in the organization.

Current model using connections and a secrets backend

  1. The DAG owner/user should generate a long lived key.
  2. Either store it in the secrets backend themselves or transfer it to the team managing airflow to set it up.
  3. The Airflow connection needs to be created.

Challenges

  1. Long lived keys are generated.
  2. To rotate the key used by the connection, the DAG user must coordinate with the team managing airflow (if that is how the org is set up).
  3. If using terraform for these steps, the key transfer and connection setup end up adding manual steps to the process.

Proposed model using IAM permissions for impersonation

  1. The DAG owner/user determines whether to grant permissions to the Airflow service account.
  2. No long lived keys are generated or need to be managed.
  3. Service Account user permissions (required for impersonation) can be controlled within the DAG owner/user's project itself and do not require cross-GCP-project or cross-team coordination.

Risks:

Assumption:

Long lived keys were generated for the main service account used by airflow, and these were compromised.

Detail

The risk of access to all accounts remains the same in both models. They can access all connections (in the first model) and retrieve key information, or impersonate all other service accounts that have granted permissions to the main account.

Current Model (using connections)

The resolution is harder in the current model: the compromised main account key will need to be disabled, and there's a risk that the key material for all other accounts was extracted and thus compromised as well. So the keys used in all connections will need to be disabled too.

Proposed Model (using IAM and impersonation)

In the proposed model, only the main account key will need to be disabled.

Hope this makes my request clearer.

Crystal-clear! Thank you. It does look reasonable. I will circle it back with a few people to see what they think and come back to you !

I like this idea, which I have already discussed in detail with @mik-laj, including the API flow, in a long conversation. Similar solutions are currently available from other cloud providers, among others AssumeRole at AWS and TokenExchange at HyperOne. I note that interoperability should be strongly taken into account when designing this element because, taken to the extreme, we would end up with Airflow working effectively only on GCP.

I'm thinking of the user interface that needs to be developed to integrate impersonation with Apache Airflow. In this issue, I have the following questions about GCP, which seem crucial to me for designing the appropriate change in Apache Airflow:

  • Which projects in the GCP ecosystem currently support impersonation? How did they solve the user interface?
  • Is support for impersonation planned through an environment variable similar to GOOGLE_APPLICATION_CREDENTIALS? Amazon already uses the AWS_ROLE_ARN environment variable.

@olchas Do you want to work on it? It looks like this is a task for you.

Dear @olchas, I will be very happy to support you to implement this in an effective way. I have spent hours analyzing these kinds of mechanisms at different cloud providers. I do not have deep knowledge about Airflow, but @mik-laj is on the issue and he will surely support us with his excellent knowledge about Airflow.

@mik-laj sure. Could you assign me to this issue, please?

@ad-m, thanks for the offer. I am uncertain about supporting impersonation via environment variable. As far as I know, there is no mechanism to provide a value of environment variable specifically for a single Task Instance, so with this approach all Task Instances would be impersonating the same account (for example, GOOGLE_APPLICATION_CREDENTIALS is used only if no other account details have been provided in gcp connection).

I was looking into the GoogleBaseHook implementation and I was thinking about specifying the chain of accounts to impersonate in the extras field of connection used by the hook, similarly to how other fields, like scopes or key_path, are provided. This way we would avoid the necessity to modify every hook derived from GoogleBaseHook and operators that are using them, as the information would be provided in the connection, not in operator definition. I guess we could follow the same solution in hooks dedicated for other cloud providers but I haven't looked at them yet.

However, this would still require a separate connection for every impersonated account, even if all of them were using the same service account as a source. You would not have to generate or rotate keys for impersonated accounts, but with a lot of accounts being impersonated by the same source account, this would put the effort to keep connections consistent on team managing airflow.
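A minimal sketch of the connection-extras idea, assuming a hypothetical extras key named `impersonation_chain` and a hypothetical helper `chain_from_extras`; neither is an existing Airflow field or function, and the account name is a placeholder:

```python
import json
from typing import List, Optional


def chain_from_extras(extra_json: Optional[str]) -> List[str]:
    """Return the accounts to impersonate listed in a connection's extras.

    The "impersonation_chain" key is hypothetical, used only for illustration.
    """
    if not extra_json:
        return []
    extras = json.loads(extra_json)
    chain = extras.get("impersonation_chain", [])
    # Accept a single account given as a plain string.
    if isinstance(chain, str):
        return [chain]
    return list(chain)


# One connection per impersonated account, all sharing the same source account.
conn_team_a = '{"impersonation_chain": "team-a@example-project.iam.gserviceaccount.com"}'
print(chain_from_extras(conn_team_a))
```

With this layout, every impersonated account needs its own connection entry, which is the maintenance cost mentioned above.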

WDYT about this approach?

@olchas It seems to me that we should define it at the task level. From the user's point of view, this should be as easy to use as in gcloud.

gcloud \
--account=[email protected] \
--impersonate-service-account=test-kamil@polidea-airflow.iam.gserviceaccount.com \
auth print-access-token

There is only one difference. Instead of using the --account option, we have gcp_conn_id.

If you want to play around with it then you can use the script below.

MAIN_ACCOUNT="[email protected]"
SECONDARY_ACCOUNT="[email protected]"

# Token issued directly for the main account.
ACCESS_TOKEN="$(gcloud \
    --account="${MAIN_ACCOUNT}" \
    auth print-access-token)"
curl -q "https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=${ACCESS_TOKEN}"

# Token issued for the secondary account, obtained by impersonation.
ACCESS_TOKEN="$(gcloud \
    --account="${MAIN_ACCOUNT}" \
    --impersonate-service-account="${SECONDARY_ACCOUNT}" \
    auth print-access-token)"
curl -q "https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=${ACCESS_TOKEN}"

Remember that you need the appropriate permissions to use this feature:

  • The main account has access to the secondary account. You can set it up in the permissions of the secondary account.
  • The main account has the "roles/iam.serviceAccountTokenCreator" role.
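Taken together, the two requirements amount to an IAM policy binding on the secondary account. A minimal sketch of that binding as a policy document, assuming a placeholder account name and a hypothetical helper `grant_token_creator`; sending the policy to the IAM API is left out:

```python
import copy
from typing import Any, Dict

TOKEN_CREATOR = "roles/iam.serviceAccountTokenCreator"


def grant_token_creator(policy: Dict[str, Any], main_account: str) -> Dict[str, Any]:
    """Return a copy of `policy` in which `main_account` holds the Token Creator role."""
    updated = copy.deepcopy(policy)
    member = f"serviceAccount:{main_account}"
    bindings = updated.setdefault("bindings", [])
    for binding in bindings:
        if binding.get("role") == TOKEN_CREATOR:
            # Reuse the existing binding and avoid duplicate members.
            if member not in binding.setdefault("members", []):
                binding["members"].append(member)
            return updated
    bindings.append({"role": TOKEN_CREATOR, "members": [member]})
    return updated


# Policy attached to the secondary account (the one being impersonated).
secondary_policy = grant_token_creator({}, "main@example-project.iam.gserviceaccount.com")
print(secondary_policy)
```

The binding lives on the secondary account, so its owner decides who may impersonate it.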

If you are using gcloud then you might want to enable the options below as well, which will allow you to better understand the flow.

gcloud config set core/log_http true
gcloud config set core/log_http_redact_token false

Please note that the second option is not described in the public documentation, so be careful.

@mik-laj I agree, for users it will definitely be better to be able to specify it at task level.

I was looking at the implementation of the google.auth.impersonated_credentials module and found that you can actually specify a chain of service accounts leading to the final one, which is the account that grants the access token used for the request. So I think the new argument for operators and hooks should accept either a string with a single service account or a list, in which case the last entry is used as target_principal and the rest are used as delegates. I think one argument for operators/hooks dedicated to impersonation should suffice.
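A minimal sketch of how such an argument could be mapped onto google.auth.impersonated_credentials. The helper names `split_chain` and `impersonated_credentials_for`, the default scope, and the account names are assumptions for illustration, not the implementation that was merged:

```python
from typing import List, Sequence, Tuple, Union

CLOUD_PLATFORM_SCOPE = "https://www.googleapis.com/auth/cloud-platform"


def split_chain(chain: Union[str, Sequence[str]]) -> Tuple[str, List[str]]:
    """Split an impersonation chain into (target_principal, delegates)."""
    if isinstance(chain, str):
        return chain, []
    accounts = list(chain)
    if not accounts:
        raise ValueError("impersonation chain must not be empty")
    # The last account issues the token; the earlier ones are intermediate delegates.
    return accounts[-1], accounts[:-1]


def impersonated_credentials_for(chain, scopes=(CLOUD_PLATFORM_SCOPE,)):
    """Build impersonated credentials from the ambient (e.g. instance) credentials."""
    # Imported lazily so split_chain can be used without google-auth installed.
    import google.auth
    from google.auth import impersonated_credentials

    source_credentials, _project_id = google.auth.default()
    target_principal, delegates = split_chain(chain)
    return impersonated_credentials.Credentials(
        source_credentials=source_credentials,
        target_principal=target_principal,
        delegates=delegates,
        target_scopes=list(scopes),
    )


print(split_chain("final@example-project.iam.gserviceaccount.com"))
print(split_chain(["first@example-project.iam.gserviceaccount.com",
                   "final@example-project.iam.gserviceaccount.com"]))
```

Accepting both a string and a list keeps the common single-account case short while still allowing multi-hop chains.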

@olchas Sounds good to me. Can you prepare a POC with one operator and no unit tests?

@amithmathew We are still working on the documentation, but could you please check whether the current implementation looks good to you?

Will do, will take a look at it this week.

Took a look and did some quick and dirty tests with the impersonation_chain parameter. Looks good to me.

I would love to see this implemented for Dataflow as well, will follow #10596.

@olchas Is it done? Is there anything else to do that is not described in #10596?

Hi, @mik-laj, sorry for not responding. I guess one more thing not covered in https://github.com/apache/airflow/issues/10596 might be adding an example dag showing the usage of impersonation.

@olchas I am closing this ticket. Can you create a ticket about the missing DAG example?
