Kubeflow: MNIST tutorial: no access to Cloud Storage bucket

Created on 9 Jan 2020 · 3 Comments · Source: kubeflow/kubeflow

/kind bug

What steps did you take and what happened:
After installing Kubeflow (0.7.0), I followed this tutorial (mnist) on GKE : https://www.kubeflow.org/docs/gke/gcp-e2e/

I encountered the following issues :
1- When I train the model (cd $WORKDIR/training/GCS and kustomize build . |kubectl apply -f - ), no workload appears on the GKE workload page, contrary to what the tutorial describes.

After some investigation, I saw that the namespace generated by kustomize in the YAML is kubeflow. I decided to replace it with kubeflow-jal (my namespace), and a workload was then created in the namespace kubeflow-jal.

2- However, when running in namespace kubeflow-jal, the workload crashes after several minutes with error 401 'Anonymous caller does not have storage.buckets.get access to dfy-bac-a-sable-kubeflow-bucket'. The pods apparently cannot access the Cloud Storage bucket.

My cloud storage bucket is still empty after the job crashes. In the logs, I see the errors (see log below).
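
The manual namespace replacement described in step 1 could be scripted roughly as follows. This is only a minimal sketch: the function name `retarget_namespace` is hypothetical, it works on plain text rather than parsed YAML, and it assumes every `namespace:` field sits on its own line with an unquoted value.

```python
import re


def retarget_namespace(manifest, old="kubeflow", new="kubeflow-jal"):
    """Rewrite 'namespace: <old>' lines in rendered YAML text to use <new>.

    Hypothetical helper for illustration only. It matches whole lines whose
    value is exactly <old>, so a value such as 'kubeflow-system' is untouched.
    """
    pattern = re.compile(
        r"^([ \t]*namespace:[ \t]*)" + re.escape(old) + r"[ \t]*$",
        re.MULTILINE,
    )
    # Keep the original indentation and key, swap only the value.
    return pattern.sub(lambda m: m.group(1) + new, manifest)
```

Applied to the text produced by `kustomize build .`, the result can then be passed to `kubectl apply -f -` as before. Setting the namespace in the kustomization itself is likely the cleaner route; the sketch only mirrors the manual edit described above.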

I am new to Kubeflow.
Did I miss something ?
Thanks for your help !!!

What did you expect to happen:
I should see the model data written in the cloud storage bucket but the cloud storage bucket is empty.

Anything else you would like to add:
I don't know if it helps, but when connecting to the pod, I expected to find a credentials file /var/secrets/user-gcp-sa.json. There is no such file.
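
A quick check like the one below, run inside the training container, can confirm whether any service account key is visible to the process. It is a hedged sketch: the function name `diagnose_gcp_credentials` is hypothetical, the default path is the one mentioned above, and `GOOGLE_APPLICATION_CREDENTIALS` is the standard environment variable read by Google's Application Default Credentials. Whether the tutorial's manifests are supposed to set it is an assumption.

```python
import os


def diagnose_gcp_credentials(env=None, expected_path="/var/secrets/user-gcp-sa.json"):
    """Return a list of findings about GCP credentials visible to this process.

    Hypothetical helper for illustration; it only inspects the environment
    and the filesystem and never contacts Google Cloud.
    """
    env = os.environ if env is None else env
    findings = []

    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path:
        findings.append("GOOGLE_APPLICATION_CREDENTIALS is not set")
    elif not os.path.isfile(key_path):
        findings.append("GOOGLE_APPLICATION_CREDENTIALS points to a missing file: " + key_path)
    else:
        findings.append("key file found: " + key_path)

    # The mounted secret may be absent even when the variable is set.
    if not os.path.isfile(expected_path):
        findings.append("expected key file is missing: " + expected_path)
    return findings
```

Calling `diagnose_gcp_credentials()` and printing each finding gives a first hint. If the variable is unset and the file is missing, requests to Cloud Storage would go out without credentials, which is consistent with the "Anonymous caller" message.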

Environment:

  • Kubeflow version: (version number can be found at the bottom left corner of the Kubeflow dashboard): 0.7.0
  • kfctl version: (use kfctl version): kfctl v0.7.0
  • Kubernetes platform: (e.g. minikube) GKE
  • Kubernetes version: (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-gke.2", GitCommit:"0f206d1d3e361e1bfe7911e1e1c686bc9a1e0aa5", GitTreeState:"clean", BuildDate:"2019-11-25T19:35:58Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}

  • OS (e.g. from /etc/os-release): Ubuntu 18.04.3 LTS

logsKubeflow.txt

area/applications kind/bug platform/gcp priority/p2

All 3 comments

I'm having the exact same issue. I was following this guide - https://www.kubeflow.org/docs/gke/gcp-e2e/.

First of all I ran into an issue when trying to train on GKE:
kustomize build . |kubectl -n kubeflow apply -f -

The workload wasn't created because of a missing editor role. I fixed it using this tip from the troubleshooting guide: https://www.kubeflow.org/docs/notebooks/troubleshoot/#note-for-gcp-users

kubectl create sa default-editor
kubectl create clusterrolebinding cluster-admin-binding --clusterrole cluster-admin --user default-editor

Afterwards container mnist-train-dist-chief-0 failed with:

{
 insertId: "40sqa6wfmmdxg9246"
 labels: {…}
 logName: "projects/<my-project>/logs/stderr"
 receiveTimestamp: "2020-01-14T11:12:04.156213649Z"
 resource: {
  labels: {
   cluster_name: "kf-test"
   container_name: "tensorflow"
   location: "europe-west1-b"
   namespace_name: "kubeflow"
   pod_name: "mnist-train-dist-chief-0"
   project_id: "<my-project>"
  }
  type: "k8s_container"
 }
 severity: "ERROR"
 textPayload: "WARNING:tensorflow:PermissionDeniedError: Error executing an HTTP request (HTTP response code 403, error code 0, error message ''), response '{
"
 timestamp: "2020-01-14T11:11:17.220535047Z"
}
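
The two reports in this thread show different status codes (401 in the original post, 403 here), and the difference is a useful first clue. The sketch below pulls the code out of a message and attaches the usual reading of it. The function name `explain_storage_error` and the wording of the explanations are illustrative assumptions, not output from any Google or TensorFlow tool.

```python
import re

# Matches both "HTTP response code 403" (TensorFlow log) and "error 401" (original post).
_STATUS_RE = re.compile(r"(?:HTTP response code|error) (\d{3})")


def explain_storage_error(message):
    """Return (status_code, hint) for a storage error message, or None.

    Hypothetical helper; the hints describe the usual meaning of each
    status code and are not a definitive diagnosis.
    """
    match = _STATUS_RE.search(message)
    if match is None:
        return None
    code = int(match.group(1))
    if code == 401:
        hint = "unauthenticated: the request carried no usable credentials"
    elif code == 403:
        hint = "forbidden: the caller is usually identified but lacks permission"
    else:
        hint = "other HTTP error"
    return code, hint
```

A 401 usually points at missing credentials in the pod, while a 403 usually points at the roles granted to the identity that was used.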

Like @jal06, I'm completely new to Kubeflow, and trying to debug issues like these while following an official hello world tutorial is quite overwhelming.

It looks like this issue is solved: https://github.com/kubeflow/kubeflow/issues/4642

I uninstalled Kubeflow and reinstalled it using the new script proposed by @kunmingg, and it works for me now :-)
