Pipelines: Cannot run pipeline samples in GCP IAP Deployment

Created on 25 Dec 2019 · 17 Comments · Source: kubeflow/pipelines

What happened:
We cannot run pipeline samples.
It seems that gcloud-related commands cannot obtain the workload identity correctly.
The error messages are:

ERROR: (gsutil) timed out
This may be due to network connectivity issues. Please check your network settings, and the status of the service you are trying to reach.

What did you expect to happen:
Pipeline samples should run smoothly.

What steps did you take:
Created a run and an experiment.

Anything else you would like to add:
I tried this implementation and still cannot get the correct result:
https://github.com/kubeflow/pipelines/blob/master/samples/core/secret/secret.py

bruce3557

Most helpful comment

When I retried the deployment, the message changed to:

AccessDeniedException: 403 Primary: /namespaces/dcard-data.svc.id.goog with additional claims does not have storage.objects.list access to dcard--bruce.

bruce3557 on 25 Dec 2019

👍2

All 17 comments

Not sure whether this is related, but when I run the secret sample, I get the following messages.
It seems that the Cloud SDK cannot connect to metadata.google.internal.

Traceback (most recent call last):
  File "<string>", line 6, in <module>
  File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 212, in _items_iter
    for page in self._page_iter(increment=False):
  File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 243, in _page_iter
List of buckets:
    page = self._next_page()
  File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 369, in _next_page
    response = self._get_next_page_response()
  File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 419, in _get_next_page_response
    method=self._HTTP_METHOD, path=self.path, query_params=params
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/_http.py", line 417, in api_request
    timeout=timeout,
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/_http.py", line 275, in _make_request
    method, url, headers, data, target_object, timeout=timeout
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/_http.py", line 313, in _do_request
    url=url, method=method, headers=headers, data=data, timeout=timeout
  File "/usr/local/lib/python2.7/dist-packages/google/auth/transport/requests.py", line 277, in request
    self.credentials.before_request(auth_request, method, url, request_headers)
  File "/usr/local/lib/python2.7/dist-packages/google/auth/credentials.py", line 124, in before_request
    self.refresh(request)
  File "/usr/local/lib/python2.7/dist-packages/google/auth/compute_engine/credentials.py", line 102, in refresh
    six.raise_from(new_exc, caught_exc)
  File "/usr/lib/python2.7/dist-packages/six.py", line 737, in raise_from
    raise value
google.auth.exceptions.RefreshError: HTTPConnectionPool(host='metadata.google.internal', port=80): Read timed out. (read timeout=120)

bruce3557 on 26 Dec 2019
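
The traceback above ends in a read timeout against metadata.google.internal. One way to separate a metadata-server problem from a permissions problem is to query the metadata endpoint directly from inside the failing pod. The following is a minimal sketch using only the Python 3 standard library (the traceback itself comes from Python 2.7). The function names and the 5-second timeout are illustrative choices, not part of any sample in this thread.

```python
import urllib.request

# Standard metadata path that returns the email of the default service account.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/email"
)


def build_request(url=METADATA_URL):
    """Build a metadata request; the server requires the Metadata-Flavor header."""
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def probe_metadata_server(timeout=5.0, opener=urllib.request.urlopen):
    """Return (ok, detail): the service account email on success, the error otherwise.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(build_request(), timeout=timeout) as resp:
            return True, resp.read().decode("utf-8").strip()
    except OSError as exc:
        # URLError, connection errors, and socket timeouts are all OSError subclasses.
        return False, repr(exc)
```

Calling `probe_metadata_server()` inside the container returns `(True, <service account email>)` when the metadata server answers and `(False, <error>)` when it does not, which helps tell a hanging metadata server apart from a 403 raised by Cloud Storage.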

I found this issue also: https://github.com/googleapis/google-auth-library-python/issues/211

bruce3557 on 26 Dec 2019

After binding workload identity to pipeline-runner in the kubeflow namespace,
I can read data via the gcloud command, but it still times out after a few minutes.

bruce3557 on 26 Dec 2019
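
The binding described above can be sketched as two commands: one grants the Kubernetes service account permission to act as a Google service account, and one annotates the Kubernetes service account with that Google service account. The sketch below only builds and prints the commands by default. The `pipeline-runner` and `kubeflow` names come from this thread; the project ID, the Google service account email, and the helper names are hypothetical placeholders.

```python
import shlex
import subprocess


def workload_identity_commands(project_id, gsa_email,
                               namespace="kubeflow", ksa="pipeline-runner"):
    """Return the two commands that link a Kubernetes SA to a Google SA."""
    # Workload identity members have the form PROJECT.svc.id.goog[NAMESPACE/KSA].
    member = f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa}]"
    bind = [
        "gcloud", "iam", "service-accounts", "add-iam-policy-binding",
        gsa_email,
        "--role", "roles/iam.workloadIdentityUser",
        "--member", member,
    ]
    annotate = [
        "kubectl", "annotate", "serviceaccount", ksa,
        "--namespace", namespace,
        f"iam.gke.io/gcp-service-account={gsa_email}",
        "--overwrite",
    ]
    return [bind, annotate]


def apply_commands(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

For example, `apply_commands(workload_identity_commands("my-project", "my-gsa@my-project.iam.gserviceaccount.com"))` prints both commands without running them. The Google service account still needs its own storage permissions on the bucket.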

I can read data via gcloud command but still timeout after few minutes.

Maybe this has to do with how gcloud obtains/refreshes credentials? Even when using the old secret method (e.g. .apply(gcp.use_gcp_secret("user-gcp-sa"))), I still get the timeouts and have to rely on setting the retry attempts for the component.

parthmishra on 27 Dec 2019
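
The retry approach mentioned above can be sketched generically. This is not the Kubeflow Pipelines retry setting itself, only a plain Python helper showing the idea of retrying a call that hits a transient metadata-server timeout. The name `retry_with_backoff` and its parameters are hypothetical.

```python
import time


def retry_with_backoff(fn, attempts=3, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fn() up to `attempts` times, doubling the delay after each failure.

    Only exceptions listed in `retry_on` trigger a retry; the last one is
    re-raised when every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on as exc:
            last_exc = exc
            # Do not sleep after the final failed attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_exc
```

In practice `retry_on` would be narrowed to the timeout exceptions seen in this thread, so that real permission errors such as a 403 fail immediately instead of being retried.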

About the timeout problem, I think it is a GKE problem. It uses the default credential client, and the credential times out after around 1 hour.
But I think binding workload identity to pipeline-runner is workable for Kubeflow.

@parthmishra I tried that, but it didn't work because of the gcloud SDK implementation.

bruce3557 on 28 Dec 2019

@bruce3557, also running into this on some training experiments (using Katib outside pipelines). I end up with that same error when trying to download training data:

google.auth.exceptions.TransportError: HTTPConnectionPool(host='metadata.google.internal', port=80): Read timed out. (read timeout=120)

Please post back if you find a fix

wronk on 3 Jan 2020

@wronk I found a workaround for this problem in kubeflow issue 4607.
You can restart the metadata server pods regularly (around every half hour).
The command is:
kubectl delete pods -n kube-system --selector=k8s-app=gke-metadata-server

Until GCP fixes the issue, I don't think there is anything else we can do.
The related GCP issue is here:
https://issuetracker.google.com/issues/146622472

bruce3557 on 3 Jan 2020
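
The periodic restart described above can be automated with a small loop. This is a minimal sketch that assumes `kubectl` is installed and already pointed at the affected cluster. The command is the one quoted in the comment, the 30-minute interval follows the "around half hour" suggestion, and the function name and its parameters are hypothetical.

```python
import subprocess
import time

# Command quoted in the thread: deletes the metadata server pods so they restart.
RESTART_CMD = [
    "kubectl", "delete", "pods",
    "-n", "kube-system",
    "--selector=k8s-app=gke-metadata-server",
]


def restart_periodically(interval_seconds=1800, iterations=None,
                         run=subprocess.run, sleep=time.sleep):
    """Run the restart command every `interval_seconds`.

    With iterations=None the loop runs until interrupted; otherwise it stops
    after that many restarts and returns the number performed.
    """
    count = 0
    while iterations is None or count < iterations:
        run(RESTART_CMD, check=True)
        count += 1
        # Skip the wait after the final restart.
        if iterations is None or count < iterations:
            sleep(interval_seconds)
    return count
```

This is a stopgap only: it trades brief metadata-server downtime for fewer hung credential refreshes, and it should be removed once the cluster runs a fixed GKE version.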

Did you try the workarounds mentioned in the GCP issue?

2 workarounds:

1) Disable workload identity
2) Downgrade GKE to a version that uses version 0.2.13 of the GKE metadata server (1.14.8-gke.18)

Workaround 2) has been working well for me, using the following command:
gcloud container clusters upgrade <cluster-name> --master --cluster-version 1.14.8-gke.17

Bobgy on 20 Jan 2020

@Bobgy I get the following error when trying to downgrade

Master of cluster [xxxxx] will be upgraded from version [1.14.9-gke.2] to version [1.14.8-gke.17]. This operation is long-running and will block other operations on the cluster (including
delete) until it has run to completion.
Do you want to continue (Y/n)?
ERROR: (gcloud.container.clusters.upgrade) ResponseError: code=400, message=Master version "1.14.8-gke.17" is unsupported.

yantriks-edi-bice on 27 Jan 2020

But I think binding workload identity to pipeline-runner is workable for kubeflow ~

I don't yet understand how all of Kubeflow is set up, but I am wondering about the effect such a change would have on the other components. Would they continue to work, assuming pipelines work?

yantriks-edi-bice on 27 Jan 2020

AFAIK there is an ongoing issue related to a recent GKE release. I will keep this thread updated.

numerology on 28 Jan 2020

ERROR: (gcloud.container.clusters.upgrade) ResponseError: code=400, message=Master version "1.14.8-gke.17" is unsupported.

It means a new patch version has been released. The new 1.14.8-gke.x probably already has the fix.

Bobgy on 28 Jan 2020

👍1
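
Because older patch versions stop being accepted once a new one is released, it can help to pick the newest supported patch in a series programmatically. The sketch below assumes the list of valid master versions has already been obtained (for example from `gcloud container get-server-config`, which reports the master versions valid for a zone). The helper names are hypothetical, and the version strings used in the usage note are the ones mentioned in this thread.

```python
import re

# Matches GKE version strings such as "1.14.8-gke.33".
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-gke\.(\d+)$")


def parse_gke_version(version):
    """Split a GKE version string into a tuple of ints for numeric comparison."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized GKE version: {version!r}")
    return tuple(int(part) for part in match.groups())


def latest_in_series(valid_versions, series):
    """Return the newest version in `series` (e.g. "1.14.8"), or None if absent."""
    candidates = [v for v in valid_versions if v.startswith(series + "-gke.")]
    if not candidates:
        return None
    # Compare numerically so that gke.17 sorts above gke.9.
    return max(candidates, key=parse_gke_version)
```

For instance, `latest_in_series(["1.14.9-gke.2", "1.14.8-gke.33", "1.14.8-gke.18"], "1.14.8")` selects `"1.14.8-gke.33"`, which could then be passed to `--cluster-version` in the upgrade command quoted earlier.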

@Bobgy thanks. I found that the latest in the 1.14.8 series is 1.14.8-gke.33 and used your command to upgrade from the earlier Kubeflow 0.7 default version. I am still getting this error, though, and cluster-user has the Storage Admin role:

  File "kfp_component/google/dataflow/_launch_python.py", line 58, in launch_python
    job_id, location = read_job_id_and_location(storage_client, staging_location)
  File "kfp_component/google/dataflow/_common_ops.py", line 99, in read_job_id_and_location
    if job_blob.exists():
  File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 404, in exists
    _target_object=None,
  File "/usr/local/lib/python2.7/site-packages/google/cloud/_http.py", line 319, in api_request
    raise exceptions.from_http_response(response)
google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/storage/v1/b/edi_bice/o/kubeflow%2Fpipelines%2F378a9083ca79da0fc8b315b96dd965d8%2Fkfp%2Fdataflow%2Flaunch_python%2Fjob.txt?fields=name: Primary: /namespaces/xxx-xx-xxx.svc.id.goog with additional claims does not have storage.objects.get access to edi_bice/kubeflow/pipelines/378a9083ca79da0fc8b315b96dd965d8/kfp/dataflow/launch_python/job.txt.

yantriks-edi-bice on 28 Jan 2020

@yantriks-edi-bice Sorry for the late notice. You probably also need to upgrade your google/cloud-sdk client versions, as mentioned in https://github.com/kubeflow/pipelines/issues/3069#issuecomment-595047150

Bobgy on 5 Mar 2020

It seems the original issue is a GKE workload identity problem, closing now.
/close

Bobgy on 5 Mar 2020

@Bobgy: Closing this issue.