What happened:
We cannot run pipeline samples.
It seems that gcloud-related commands cannot obtain the workload identity correctly.
The error message is:
ERROR: (gsutil) timed out
This may be due to network connectivity issues. Please check your network settings, and the status of the service you are trying to reach.
What did you expect to happen:
Pipeline samples should run without errors.
What steps did you take:
Created a run and an experiment.
Anything else you would like to add:
I tried this implementation and still cannot get the correct result:
https://github.com/kubeflow/pipelines/blob/master/samples/core/secret/secret.py
When I retried the deployment, the message changed to:
AccessDeniedException: 403 Primary: /namespaces/dcard-data.svc.id.goog with additional claims does not have storage.objects.list access to dcard--bruce.
I am not sure whether this is related, but when I run the secret sample I get the messages below.
It seems that the Cloud SDK cannot connect to metadata.google.internal:
Traceback (most recent call last):
File "<string>", line 6, in <module>
File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 212, in _items_iter
for page in self._page_iter(increment=False):
File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 243, in _page_iter
List of buckets:
page = self._next_page()
File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 369, in _next_page
response = self._get_next_page_response()
File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 419, in _get_next_page_response
method=self._HTTP_METHOD, path=self.path, query_params=params
File "/usr/local/lib/python2.7/dist-packages/google/cloud/_http.py", line 417, in api_request
timeout=timeout,
File "/usr/local/lib/python2.7/dist-packages/google/cloud/_http.py", line 275, in _make_request
method, url, headers, data, target_object, timeout=timeout
File "/usr/local/lib/python2.7/dist-packages/google/cloud/_http.py", line 313, in _do_request
url=url, method=method, headers=headers, data=data, timeout=timeout
File "/usr/local/lib/python2.7/dist-packages/google/auth/transport/requests.py", line 277, in request
self.credentials.before_request(auth_request, method, url, request_headers)
File "/usr/local/lib/python2.7/dist-packages/google/auth/credentials.py", line 124, in before_request
self.refresh(request)
File "/usr/local/lib/python2.7/dist-packages/google/auth/compute_engine/credentials.py", line 102, in refresh
six.raise_from(new_exc, caught_exc)
File "/usr/lib/python2.7/dist-packages/six.py", line 737, in raise_from
raise value
google.auth.exceptions.RefreshError: HTTPConnectionPool(host='metadata.google.internal', port=80): Read timed out. (read timeout=120)
I also found this issue: https://github.com/googleapis/google-auth-library-python/issues/211
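One way to check whether the metadata server itself is the part that hangs is to request a token from it directly inside the affected pod. The following is only a minimal sketch: the endpoint path and the Metadata-Flavor header are the standard GCE/GKE metadata conventions, while the function names, the 5 second timeout, and the injectable `opener` parameter are assumptions made for illustration.

```python
import urllib.error
import urllib.request

# Standard metadata endpoint for the default service account token.
TOKEN_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/token"
)


def build_token_request(url=TOKEN_URL):
    """Build the metadata request; the Metadata-Flavor header is required."""
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def probe_metadata_server(timeout_seconds=5.0, opener=urllib.request.urlopen):
    """Return (ok, detail) describing whether a token request succeeded.

    `opener` is a hypothetical injection point so the function can be
    exercised outside a cluster; by default it performs a real HTTP call.
    """
    try:
        with opener(build_token_request(), timeout=timeout_seconds) as response:
            return True, "HTTP %d" % response.status
    except (urllib.error.URLError, OSError) as exc:
        # Read timeouts, DNS failures and HTTP errors all end up here.
        return False, "%s: %s" % (type(exc).__name__, exc)
```

Calling `probe_metadata_server()` from a pod that shows the traceback above should report a timeout if the metadata server is unresponsive, which separates this failure from a permissions problem.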
After binding workload identity to pipeline-runner in the kubeflow namespace,
I can read data via the gcloud command, but it still times out after a few minutes.
Maybe this has to do with how gcloud obtains/refreshes credentials? Even when using the old secret method (e.g. .apply(gcp.use_gcp_secret("user-gcp-sa"))), I still get the timeouts and have to rely on setting the retry attempts for the component.
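For code that calls GCP from inside a component, the same retry idea can be applied in-process. This is a rough sketch and not part of the KFP SDK: `call_with_retries`, its parameters, and the exponential backoff schedule are all assumptions chosen for illustration.

```python
import time


def call_with_retries(func, attempts=5, base_delay_seconds=2.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or `attempts` tries are used up.

    The delay doubles after each failure (2s, 4s, 8s, ... by default).
    The exception from the final attempt is re-raised unchanged.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(base_delay_seconds * 2 ** (attempt - 1))
```

In practice `retry_on` would be narrowed to the transient errors seen in this thread, for example `google.auth.exceptions.TransportError`, so that real permission errors such as a 403 still fail fast.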
About the timeout problem, I think it is a GKE problem. It uses the default credential client, and the credential expires after around 1 hour.
But I think binding workload identity to pipeline-runner is workable for kubeflow ~
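For reference, binding workload identity to a Kubernetes service account usually takes two steps: an IAM policy binding on the Google service account and an annotation on the Kubernetes service account. The sketch below only builds the two commands. The member format, the roles/iam.workloadIdentityUser role, and the iam.gke.io/gcp-service-account annotation follow the GKE Workload Identity documentation; the project ID and Google service account email in the usage note are hypothetical placeholders.

```python
def workload_identity_member(project_id, namespace, ksa_name):
    """Return the IAM member string for a Kubernetes service account."""
    return "serviceAccount:%s.svc.id.goog[%s/%s]" % (
        project_id, namespace, ksa_name)


def binding_commands(project_id, namespace, ksa_name, gsa_email):
    """Return the gcloud and kubectl commands as argument lists."""
    bind = [
        "gcloud", "iam", "service-accounts", "add-iam-policy-binding",
        gsa_email,
        "--role", "roles/iam.workloadIdentityUser",
        "--member", workload_identity_member(project_id, namespace, ksa_name),
    ]
    annotate = [
        "kubectl", "annotate", "serviceaccount", ksa_name,
        "--namespace", namespace,
        "iam.gke.io/gcp-service-account=" + gsa_email,
    ]
    return [bind, annotate]
```

For example, `binding_commands("my-project", "kubeflow", "pipeline-runner", "my-gsa@my-project.iam.gserviceaccount.com")` yields the two commands for the pipeline-runner account discussed here. A 403 whose principal reads "/namespaces/<project>.svc.id.goog with additional claims" often suggests that one of these two steps is missing for the service account the pod runs as.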
@parthmishra I tried that, but it didn't work because of the gcloud SDK implementation.
@bruce3557, also running into this on some training experiments (using Katib outside pipelines). I end up with that same error when trying to download training data:
google.auth.exceptions.TransportError: HTTPConnectionPool(host='metadata.google.internal', port=80): Read timed out. (read timeout=120)
Please post back if you find a fix
@wronk I found a workaround for this problem in Kubeflow issue 4607.
You can restart the metadata pods regularly (about every half hour).
The command is:
kubectl delete pods -n kube-system --selector=k8s-app=gke-metadata-server
Until GCP fixes the issue, I don't think we can do anything else.
The related GCP issue is here:
https://issuetracker.google.com/issues/146622472
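The periodic restart described above could be automated with a small loop like the one below. It is a sketch only: the kubectl command is the one quoted in this thread, while the function name, the 1800 second interval, and the injectable `run` and `sleep` parameters are assumptions. It also assumes kubectl is on the PATH with permission to delete pods in kube-system.

```python
import subprocess
import time

# Command quoted in the thread for restarting the GKE metadata server pods.
RESTART_COMMAND = [
    "kubectl", "delete", "pods", "-n", "kube-system",
    "--selector=k8s-app=gke-metadata-server",
]


def restart_metadata_pods_periodically(interval_seconds=1800, iterations=None,
                                       run=subprocess.run, sleep=time.sleep):
    """Run the restart command every `interval_seconds`.

    With iterations=None the loop never ends; otherwise it stops after that
    many restarts and returns the number performed.
    """
    count = 0
    while iterations is None or count < iterations:
        # check=True raises CalledProcessError if kubectl exits non-zero.
        run(RESTART_COMMAND, check=True)
        count += 1
        if iterations is None or count < iterations:
            sleep(interval_seconds)
    return count
```

A Kubernetes CronJob would be the more conventional place to run such a command; the loop is kept here only to show the timing.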
As mentioned in the GCP issue, did you try the workarounds?
2 workarounds:
1) Disable workload identity
2) Downgrade GKE to a version that uses 0.2.13 of GKE Metadata server (1.14.8-gke.18)
Workaround 2) has been working well for me, using the following command:
gcloud container clusters upgrade <cluster-name> --master --cluster-version 1.14.8-gke.17
@Bobgy I get the following error when trying to downgrade:
Master of cluster [xxxxx] will be upgraded from version [1.14.9-gke.2] to version [1.14.8-gke.17]. This operation is long-running and will block other operations on the cluster (including
delete) until it has run to completion.
Do you want to continue (Y/n)?
ERROR: (gcloud.container.clusters.upgrade) ResponseError: code=400, message=Master version "1.14.8-gke.17" is unsupported.
But I think binding workload identity to pipeline-runner is workable for kubeflow ~
I don't yet understand how all of Kubeflow is set up, but I am wondering what effect such a change would have on the other components. Would they continue to work, assuming pipelines work?
AFAIK there is an ongoing issue related to a recent GKE release. I will keep this thread updated.
ERROR: (gcloud.container.clusters.upgrade) ResponseError: code=400, message=Master version "1.14.8-gke.17" is unsupported.
It means a new patch version has been released. The new 1.14.8-gke.x probably already has the fix.
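Since supported patch versions change over time, it can help to pick the newest patch in a series from the list of currently valid master versions (which `gcloud container get-server-config` should report). The sketch below only does the version comparison; the helper names are assumptions, and the version list in the usage note is a hypothetical example, not actual server output.

```python
import re

# Matches versions such as "1.14.8-gke.17".
_GKE_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)-gke\.(\d+)$")


def parse_gke_version(version):
    """Split a GKE version into a tuple of four integers."""
    match = _GKE_VERSION.match(version)
    if match is None:
        raise ValueError("unrecognized GKE version: %r" % (version,))
    return tuple(int(part) for part in match.groups())


def latest_patch(valid_versions, series):
    """Return the newest version in `series` (e.g. "1.14.8"), or None."""
    prefix = tuple(int(part) for part in series.split("."))
    candidates = [v for v in valid_versions
                  if parse_gke_version(v)[:3] == prefix]
    if not candidates:
        return None
    # Compare numerically so that gke.33 sorts above gke.9.
    return max(candidates, key=parse_gke_version)
```

For instance, `latest_patch(["1.14.8-gke.9", "1.14.8-gke.33", "1.14.9-gke.2"], "1.14.8")` selects "1.14.8-gke.33", which can then be passed to the `--cluster-version` flag of the upgrade command.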
@Bobgy thanks. I found that the latest in the 1.14.8 series is 1.14.8-gke.33 and used your command to upgrade from the earlier Kubeflow 0.7 default version. I am still getting this error, though, and cluster-user has the Storage Admin role:
File "kfp_component/google/dataflow/_launch_python.py", line 58, in launch_python
job_id, location = read_job_id_and_location(storage_client, staging_location)
File "kfp_component/google/dataflow/_common_ops.py", line 99, in read_job_id_and_location
if job_blob.exists():
File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 404, in exists
_target_object=None,
File "/usr/local/lib/python2.7/site-packages/google/cloud/_http.py", line 319, in api_request
raise exceptions.from_http_response(response)
google.api_core.exceptions.Forbidden: 403 GET
https://www.googleapis.com/storage/v1/b/edi_bice/o/kubeflow%2Fpipelines%2F378a9083ca79da0fc8b315b96dd965d8%2Fkfp%2Fdataflow%2Flaunch_python%2Fjob.txt?fields=name
: Primary: /namespaces/xxx-xx-xxx.svc.id.goog with additional claims does not have storage.objects.get access to edi_bice/kubeflow/pipelines/378a9083ca79da0fc8b315b96dd965d8/kfp/dataflow/launch_python/job.txt.
@yantriks-edi-bice Sorry for the late notice; you probably also need to upgrade your google/cloud-sdk client versions, as mentioned in https://github.com/kubeflow/pipelines/issues/3069#issuecomment-595047150
It seems the original issue is a GKE workload identity problem, closing now.
/close