/kind bug
What steps did you take and what happened:
After installing Kubeflow (0.7.0), I followed this tutorial (mnist) on GKE: https://www.kubeflow.org/docs/gke/gcp-e2e/
I encountered the following issues:
1- When I train the model (cd $WORKDIR/training/GCS and kustomize build . | kubectl apply -f -), no workload appears in the GKE workload page, contrary to what the tutorial describes.
After some investigation, I saw that the namespace generated by kustomize in the YAML is kubeflow. I decided to replace it with kubeflow-jal (my namespace), and a workload was then created in the kubeflow-jal namespace.
2- However, when running the workload in namespace kubeflow-jal, the workload crashes after several minutes with error 401 'Anonymous caller does not have storage.buckets.get access to dfy-bac-a-sable-kubeflow-bucket'. The pods seem unable to access the Cloud Storage bucket.
My cloud storage bucket is still empty after the job crashes. In the logs, I see the errors (see log below).
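For the namespace change described in step 1, a minimal sketch of doing it through kustomize instead of editing the generated YAML by hand. The helper name apply_in_namespace is hypothetical; it assumes kustomize and kubectl are on PATH and that the current directory (for example $WORKDIR/training/GCS) contains a kustomization.yaml.

```shell
# Hypothetical helper (not part of the tutorial): build the manifests in the
# current directory and apply them into the namespace given as $1.
apply_in_namespace() {
  ns="$1"
  # Rewrites the namespace field of kustomization.yaml in place.
  kustomize edit set namespace "$ns" || return 1
  # Same build-and-apply pipeline as in the tutorial.
  kustomize build . | kubectl apply -f -
}
```

Usage would look like: cd $WORKDIR/training/GCS, then apply_in_namespace kubeflow-jal. Note that this only moves the workload; it does not by itself give the pods access to the bucket.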
I am new to Kubeflow.
Did I miss something?
Thanks for your help!!!
What did you expect to happen:
I should see the model data written in the cloud storage bucket but the cloud storage bucket is empty.
Anything else you would like to add:
I don't know if it helps, but when connecting to the pod, I was expecting to find a credentials file /var/secrets/user-gcp-sa.json. There is no such file.
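The missing credentials file can be checked from outside the pod. The sketch below is only an illustration: the helper name check_gcp_credentials is hypothetical, and the secret name user-gcp-sa is an assumption inferred from the file name above; it may differ in other setups.

```shell
# Hypothetical helper: $1 is the namespace, $2 is the pod name.
# Returns 1 if the secret is missing, 2 if the file is not mounted in the pod.
check_gcp_credentials() {
  ns="$1"
  pod="$2"
  # Assumption: the service-account key is stored in a secret named user-gcp-sa.
  if ! kubectl -n "$ns" get secret user-gcp-sa >/dev/null 2>&1; then
    echo "secret user-gcp-sa not found in namespace $ns"
    return 1
  fi
  # Path taken from the report above.
  if ! kubectl -n "$ns" exec "$pod" -- ls /var/secrets/user-gcp-sa.json >/dev/null 2>&1; then
    echo "/var/secrets/user-gcp-sa.json not mounted in pod $pod"
    return 2
  fi
  echo "credentials present"
}
```

For example: check_gcp_credentials kubeflow-jal mnist-train-dist-chief-0. If the secret exists only in the kubeflow namespace, a job moved to another namespace would have nothing to mount, which would be consistent with the 'Anonymous caller' error.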
Environment:
kfctl version (use kfctl version): kfctl v0.7.0
Kubernetes platform (e.g. minikube): GKE
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-gke.2", GitCommit:"0f206d1d3e361e1bfe7911e1e1c686bc9a1e0aa5", GitTreeState:"clean", BuildDate:"2019-11-25T19:35:58Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
OS (e.g. from /etc/os-release):Ubuntu 18.04.3 LTS
logsKubeflow.txt
I'm having the exact same issue. I was following this guide - https://www.kubeflow.org/docs/gke/gcp-e2e/.
First of all I ran into an issue when trying to train on GKE:
kustomize build . |kubectl -n kubeflow apply -f -
The workload wasn't being created because the editor role was missing. I fixed it using this tip from the troubleshooting guide: https://www.kubeflow.org/docs/notebooks/troubleshoot/#note-for-gcp-users
kubectl create sa default-editor
kubectl create clusterrolebinding cluster-admin-binding --clusterrole cluster-admin --user default-editor
Afterwards container mnist-train-dist-chief-0 failed with:
{
insertId: "40sqa6wfmmdxg9246"
labels: {…}
logName: "projects/<my-project>/logs/stderr"
receiveTimestamp: "2020-01-14T11:12:04.156213649Z"
resource: {
labels: {
cluster_name: "kf-test"
container_name: "tensorflow"
location: "europe-west1-b"
namespace_name: "kubeflow"
pod_name: "mnist-train-dist-chief-0"
project_id: "<my-project>"
}
type: "k8s_container"
}
severity: "ERROR"
textPayload: "WARNING:tensorflow:PermissionDeniedError: Error executing an HTTP request (HTTP response code 403, error code 0, error message ''), response '{
"
timestamp: "2020-01-14T11:11:17.220535047Z"
}
Like @jal06, I'm completely new to kubeflow and trying to debug issues like these when following an official hello world tutorial is quite overwhelming.
It looks like this issue is solved: https://github.com/kubeflow/kubeflow/issues/4642
I uninstalled Kubeflow and reinstalled it applying the new script proposed by @kunmingg and it works for me now :-)