Pipelines: Release 1.7 - TFX taxi cab example failing the deploy step

Created on 16 Jan 2019 · 9 Comments · Source: kubeflow/pipelines

On a fresh deployment, the taxi cab TFX example fails the deploy step with:

++ kubectl get po taxi-cab-classification-model-tfx-taxi-cab-classification-5lxrp taxi-cab-classification-model-tfx-taxi-cab-classification-67xc9 --namespace kubeflow -o 'jsonpath={.status.containerStatuses[0].state.running}'
Error from server (NotFound): pods "taxi-cab-classification-model-tfx-taxi-cab-classification-67xc9" not found

'[' -z '' ']'
++ date +%s
current_time=1547611953
++ expr 1547611953 + 1 - 1547610952
elapsed_time=1002
[[ 1002 -gt 1000 ]]
echo timeout
timeout
exit 1
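
The trace above looks like the tail of a polling loop: the jsonpath query prints nothing, the elapsed time is compared against a 1000-second limit, and the step exits with 1. A minimal Python sketch of that pattern follows. It is not the deployer's actual script; `wait_until_running`, `check`, `clock`, and `sleep` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable


def wait_until_running(
    check: Callable[[], str],
    timeout_seconds: float = 1000,
    poll_interval: float = 5.0,
    clock: Callable[[], float] = time.time,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns a non-empty string or the timeout passes.

    `check` stands in for the kubectl jsonpath query in the trace, which
    prints nothing while the pod is not running or cannot be found.
    """
    start = clock()
    while True:
        if check():  # non-empty output: the container reports a running state
            return True
        if clock() - start > timeout_seconds:
            return False  # corresponds to "echo timeout" followed by "exit 1"
        sleep(poll_interval)
```

If the pod name passed to the query never matches an existing pod, `check` stays empty and the loop can only end through the timeout branch, which matches the behaviour reported here.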

SinaChavoshi

All 9 comments

Is this a recurring error, or does it happen only rarely? I think I've come across this error in the past.

gaoning777 on 16 Jan 2019

3 runs failed consistently. Kicking off the 4th run.

SinaChavoshi on 16 Jan 2019

Quick update: the 4th run also failed.

SinaChavoshi on 16 Jan 2019

I think this is what's happening: when you have multiple runs of the deployer step, the deployments are not unique. There is only one deployment (model-server-v1). So when the step tries to fetch the TFServing pod corresponding to your run, it lists all pods, including those created by previous runs. It then chooses the alphanumerically first pod, and since it can't find the pod corresponding to the run, the step fails.
This can be solved by deleting all model-server pods that were created by earlier runs before starting a fresh run. But this makes recurring / parallel / multiple runs hard.
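
To illustrate the diagnosis above: if every run reuses one deployment name, a lookup by that name can return pods from earlier runs, and picking the alphanumerically first one may not give the pod for the current run. The sketch below simulates this with plain lists. The pod names and the `pick_first_pod` helper are made up for illustration and do not come from the deployer.

```python
from typing import List


def pick_first_pod(pod_names: List[str]) -> str:
    """Return the alphanumerically first pod name, as the comment describes."""
    return sorted(pod_names)[0]


# Hypothetical pods that all belong to the shared model-server-v1 deployment.
pods_from_earlier_runs = ["model-server-v1-aaaaa", "model-server-v1-ccccc"]
pod_from_current_run = "model-server-v1-zzzzz"

chosen = pick_first_pod(pods_from_earlier_runs + [pod_from_current_run])
print(chosen == pod_from_current_run)  # False: a pod from an earlier run was selected
```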

swiftdiaries on 17 Jan 2019

A temporary fix is to delete the deployment.
Run "kubectl get deployment -n kubeflow" and look for the taxi-cab- deployment, then
run "kubectl delete deployment taxi-cab- -n kubeflow".

gaoning777 on 17 Jan 2019

In fact, the deployer already uses configurable names for the deployment, and the examples generate the name using {{workflow.name}}.

gaoning777 on 17 Jan 2019

Tried this. Passed the workflow name as a parameter using the --server-name flag and it doesn't happen anymore.

swiftdiaries on 18 Jan 2019

Found the bug: the deployer component truncates the deployment name to 64 bytes, which removes the distinct part of the workflow name and thus causes a naming collision.
Will send a PR to randomize the deployer name.
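
The collision can be reproduced in a few lines. This is a sketch under assumptions, not the code from the deployer or from the PR: the names are ASCII (so bytes and characters coincide), the workflow names are invented, and `truncated_name` and `randomized_name` are hypothetical helpers. It shows two long names that differ only at the end becoming identical after truncation, and one possible way to keep them distinct by reserving room for a random suffix.

```python
import uuid

MAX_NAME_LENGTH = 64  # the limit quoted in the comment above


def truncated_name(workflow_name: str) -> str:
    """Cut the name at the limit; a distinguishing tail past the limit is lost."""
    return workflow_name[:MAX_NAME_LENGTH]


def randomized_name(workflow_name: str) -> str:
    """Truncate first, leaving room for a random suffix that stays in the name."""
    suffix = "-" + uuid.uuid4().hex[:8]
    return workflow_name[: MAX_NAME_LENGTH - len(suffix)] + suffix


# Hypothetical workflow names: a 72-character shared prefix plus a unique tail.
common = "taxi-cab-classification-" * 3
run_a = common + "5lxrp"
run_b = common + "67xc9"

print(truncated_name(run_a) == truncated_name(run_b))  # True: the names collide
print(len(randomized_name(run_a)) <= MAX_NAME_LENGTH)  # True: still within the limit
```

Because the suffix is appended after truncation, two runs end up with different names unless the random parts happen to match, which is very unlikely with 8 hex characters.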

gaoning777 on 18 Jan 2019

solved in https://github.com/kubeflow/pipelines/pull/704

gaoning777 on 18 Jan 2019
