Hello, I am using Kubeflow 1.0.0 and I am trying to launch a TFJob from a pipeline using tfjob_launcher.py.
I noticed that while Katib works well and returns a JSON file with the tuned hyperparameters, the TFJob never succeeds and is killed when it hits the execution wall-clock limit.
This error arises when executing the pipeline in sample.py:
https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/launcher
The same error occurs when running this pipeline
https://github.com/kubeflow/pipelines/blob/master/samples/contrib/e2e-mnist/mnist-pipeline.ipynb
Katib runs, and then the pipeline gets stuck in the "train" component, which executes the TFJob.
Can this be related to this recent fix https://github.com/kubeflow/tf-operator/pull/1185/commits ?
Also, it is unclear to me how to re-inject tfjob outputs into the pipeline and pass those artifacts to other pipeline components.
Any comment or suggestion is very welcome and appreciated! Thanks.
This does not solve the problem, but I checked with the master branch and it seems to work there. I will try to test it on the latest stable version of Kubeflow Pipelines later.
@NikeNano thank you very much for the test! Could my problem be specific to v1.0.0 rather than the latest version?
Yes, I will check the Kubeflow Pipelines version that corresponds to Kubeflow 1.0.0 tomorrow and see if it works.
@NikeNano I also encountered this issue, so are you confirming that tfjob_launcher.py works well with Kubeflow 1.0.4 but not with 1.0.0?
I have tried with Kubeflow Pipelines 1.0.4 but it doesn't work; it times out when I lock the tf-job version to https://github.com/kubeflow/manifests/tree/v1.1.0/tf-training. I think the problem could be on the tf-job side, but I need to debug this further, @f-wole and @pretidav. I will update with what I find.
This seems to be related to the tf-operator and not to Kubeflow Pipelines or the launcher component. I deployed the latest version of the tf-operator from here (I used commit 338b1241a2214e898a0f9cd5a70ee11e9556c272) and got the component working.
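
For anyone hitting the same timeout, here is a minimal sketch of one way to check whether the operator ever marks the job as finished. It assumes the job object was dumped with `kubectl get tfjob <name> -n <namespace> -o json` and that `status.conditions` is a list of entries with `type` and `status` fields, as in the TFJob v1 API. The function names are hypothetical and not part of the launcher. As far as I understand, the launcher waits for a terminal condition, so if the operator never writes one the component will run until it times out.

```python
import json
from typing import Optional

# Condition types treated as terminal in this sketch (assumption based on
# the TFJob v1 status layout).
TERMINAL_TYPES = ("Succeeded", "Failed")


def tfjob_terminal_state(tfjob: dict) -> Optional[str]:
    """Return 'Succeeded' or 'Failed' if such a condition is set to "True".

    Returns None when the job has no terminal condition yet, which is the
    situation in which a waiting component would eventually time out.
    """
    conditions = tfjob.get("status", {}).get("conditions") or []
    # Walk from the newest condition to the oldest.
    for cond in reversed(conditions):
        if cond.get("type") in TERMINAL_TYPES and cond.get("status") == "True":
            return cond["type"]
    return None


def latest_condition_type(tfjob: dict) -> Optional[str]:
    """Return the type of the last recorded condition, or None if there is none."""
    conditions = tfjob.get("status", {}).get("conditions") or []
    return conditions[-1].get("type") if conditions else None


if __name__ == "__main__":
    # Hand-written example object, not real cluster output.
    sample = json.loads(
        '{"status": {"conditions": ['
        '{"type": "Created", "status": "True"},'
        '{"type": "Running", "status": "True"}]}}'
    )
    print("terminal state:", tfjob_terminal_state(sample))
    print("latest condition:", latest_condition_type(sample))
```

If the training pods have completed but this still reports no terminal state, that would point at the operator rather than at the pipeline or the launcher.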
Can we close this, or is it still an issue, @pretidav?