Argo: Saving output artifacts sometimes fails when using Workload Identity on a GKE cluster

Created on 24 Nov 2020 · 8 comments · Source: argoproj/argo

Setup

Running Argo 2.10.2; the log bucket is a GCS bucket relying on Workload Identity.

Issue

It seems that the default service account is used when retrieving logs from an archived workflow. Two things make me think that:

  • the error message is the following:
Get https://storage.googleapis.com/storage/v1/b/xxxxx-test-logs/o?alt=json&delimiter=&pageToken=&prefix=artifact-logs%2F2020.11.24%xxxx-test-llfwg%xxxx-test-llfwg-2987507332%2Fmain.log&prettyPrint=false&projection=full&versions=false: metadata: GCE metadata "instance/service-accounts/default/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdevstorage.full_control" not defined

It contains instance/service-accounts/default/token, which refers to the default SA.

  • When I annotate the default SA with iam.gke.io/gcp-service-account: [email protected], which is the same annotation the argo-server and workflow SAs carry, I can access the logs from the UI. Meaning the access relies on this SA (default).

The workflow spec is correctly using the workflow SA and the argo-server is correctly using the argo-server SA.

Is it possible that somehow the argo-server (where the logs are fetched, I guess) uses the default SA to make a remote call?

bug

All 8 comments

Note that the argo controller does not have permission to upload the logs. Is the controller supposed to be able to upload logs to a bucket?

I guess my question is: which service accounts are supposed to have the correct credentials to read/write logs? argo (used by the controller), argo-server (used by the server), and/or workflow (used by the workflow's pod)? It looks like the answer is all of them. Am I correct?

The Argo Server, in server auth mode, uses the argo-server service account. In client auth mode, it uses the service account whose token you provided. Could the latter be a possibility?

I am currently using client auth mode with the argo-server token, and the UI confirms the argo-server SA is actually used. I have also occasionally noticed the same issue when uploading the logs as a workflow terminates.

So if I understand correctly, the one and only process that reads/writes logs is tied to the SA provided in client auth mode? Do you confirm that neither the argo-server nor the controller reads/writes logs using their own SAs? (The question applies to archived workflows as well.)

For kubectl logs:

  • Workflow logs are always created by the workflow pod and therefore using the service account the workflow pod is configured to use.
  • The controller never reads or writes logs (except writing its own obviously).
  • The server only ever reads logs (except writing its own log).

For logs stored in your artifact repository (e.g. main.log):

  • The workflow pods write these logs, looking up the repo creds via their own SAs.
  • The controller never reads or writes these logs.
  • The argo-server only reads these logs.
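
Given that breakdown, on GKE with Workload Identity both the workflow KSA and the argo-server KSA would need the annotation plus the IAM binding to the GSA. A sketch of that setup follows; the GSA name, project, and namespace are placeholders of mine, not values from this thread:

```shell
# Placeholders -- substitute your own GSA, project, and namespace.
GSA="argo-logs@my-project.iam.gserviceaccount.com"
PROJECT="my-project"
NS="argo"

# The Workload Identity annotation both KSAs (workflow and argo-server) need:
ANNOTATION="iam.gke.io/gcp-service-account=${GSA}"

# Apply it to each KSA (commented out; requires a live cluster):
# kubectl -n "$NS" annotate serviceaccount workflow    "$ANNOTATION"
# kubectl -n "$NS" annotate serviceaccount argo-server "$ANNOTATION"
#
# And each KSA must be allowed to impersonate the GSA:
# gcloud iam service-accounts add-iam-policy-binding "$GSA" \
#   --role roles/iam.workloadIdentityUser \
#   --member "serviceAccount:${PROJECT}.svc.id.goog[${NS}/workflow]"
```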

Thanks a lot, @alexec, this is what I had in mind without being sure. Now it's crystal clear. I will investigate on my side then and report back here anything interesting related to GKE and Workload Identity (likely the culprit here).

That being said, I am still confused about this default SA that makes it work when I annotate it with the Workload Identity tag...

From the Workload Identity documentation:

The GKE metadata server takes a few seconds to start to run on a newly created Pod. Therefore, attempts to authenticate or authorize using Workload Identity made within the first few seconds of a Pod's life may fail. Retrying the call will resolve the problem.
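
If that startup window is the culprit, retrying the first token fetch should paper over it. A minimal shell retry wrapper is sketched below; the function name and defaults are mine, not part of Argo:

```shell
# retry CMD...: run CMD up to MAX_TRIES times, sleeping DELAY seconds
# between attempts; returns the last exit status if all attempts fail.
retry() {
  max="${MAX_TRIES:-10}"
  delay="${DELAY:-2}"
  i=1
  while [ "$i" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: wait for the GKE metadata server to serve a token
# (this is its documented token endpoint; commented out, needs GKE):
# retry curl -sf -H 'Metadata-Flavor: Google' \
#   http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token
```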

It "could" explain why sometimes saving the logs in the default artifacts sometimes works and sometimes failed. Here is the error I am getting from the gke-metadata-server-xxx pod when the saving step in the wait container fails:

metadata.go:151] Handler returns generic::not_found: GenerateAccessToken("[email protected]"):
googleapi: Error 404: Requested entity was not found., notFound, Reason: IAM,
UserMessage: Unable to generate access token; IAM returned 404 Not Found: Requested entity was not found.

Two things confused me:

  • First, if the delay is the issue, then adding a simple sleep 3m at the top of a workflow step should fix it. But I see the exact same random failures when adding the sleep. Maybe that's because the sleep runs in the main container while saving happens in the wait container? Having a sleep or retry mechanism in the wait container could help confirm that.

  • Second, this random failure to write to the GCS bucket only happens in a workflow. When running a pod with kubectl, I can write to the log bucket without any problem (using the same K8s SA as when running a workflow):

kubectl run --rm -ti --image google/cloud-sdk --serviceaccount workflow --namespace argo -- bash
echo "hello" > test.txt && gsutil cp test.txt gs://my-log-bucket/test.txt

A temporary fix that makes it work is to add the SA's JSON key as a secret and configure Argo to use it in the workflow-controller-configmap (Workload Identity is then bypassed):

gcs:
    bucket: my-log-bucket
    serviceAccountKeySecret:
      name: temp-gcs-logs
      key: serviceAccountKey
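
For completeness, the secret referenced above can be created from a downloaded key file along these lines; the key filename is my assumption, while temp-gcs-logs and serviceAccountKey match the config above:

```shell
# Assumes the GSA key was exported locally first, e.g. with:
#   gcloud iam service-accounts keys create key.json --iam-account "$GSA"
SECRET_NAME="temp-gcs-logs"
SECRET_KEY="serviceAccountKey"

# Create the secret Argo's configmap points at (commented out; needs a cluster):
# kubectl -n argo create secret generic "$SECRET_NAME" \
#   --from-file="${SECRET_KEY}=key.json"
```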

In that sense, Workload Identity is part of the problem, but it does not explain why it works with kubectl run while failing from an Argo workflow.

After testing on a new cluster, it seems to work OK. Since the GCP service account had been recreated during the lifetime of the previous cluster, I suspect the metadata server was confused by this (same name but probably a different UUID). Anyway, with a new service account and a new cluster, it works. Hope this is useful to someone else.

(That being said, I am still confused about why it worked sometimes and failed other times... so it's not impossible that this delay is an issue.)
