Pipelines: Release 1.7 - TFX taxi cab example failing the deploy step

Created on 16 Jan 2019 · 9 Comments · Source: kubeflow/pipelines

On a fresh deployment, the taxi cab TFX example fails at the deploy step with:

++ kubectl get po taxi-cab-classification-model-tfx-taxi-cab-classification-5lxrp taxi-cab-classification-model-tfx-taxi-cab-classification-67xc9 --namespace kubeflow -o 'jsonpath={.status.containerStatuses[0].state.running}'
Error from server (NotFound): pods "taxi-cab-classification-model-tfx-taxi-cab-classification-67xc9" not found

+ '[' -z '' ']'
++ date +%s
+ current_time=1547611953
++ expr 1547611953 + 1 - 1547610952
+ elapsed_time=1002
+ [[ 1002 -gt 1000 ]]
+ echo timeout
timeout
+ exit 1
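
For readers unfamiliar with the trace, it looks like a polling loop that gives up after about 1000 seconds. Below is a minimal sketch of such a loop, not the deployer's actual script: the helper name `wait_for` and its arguments are hypothetical, and only the elapsed-time arithmetic is taken from the trace above.

```shell
#!/usr/bin/env bash

# wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: polls COMMAND until it prints non-empty output,
# or reports "timeout" once the elapsed time exceeds TIMEOUT_SECONDS.
wait_for() {
  local timeout="$1"; shift
  local start_time current_time elapsed_time status
  start_time=$(date +%s)
  while true; do
    # A failing check (for example "pod not found") counts as "not ready yet".
    status=$("$@" 2>/dev/null) || status=""
    if [ -n "$status" ]; then
      echo "$status"
      return 0
    fi
    current_time=$(date +%s)
    # Same arithmetic as in the trace: current + 1 - start.
    elapsed_time=$(( current_time + 1 - start_time ))
    if [ "$elapsed_time" -gt "$timeout" ]; then
      echo "timeout"
      return 1
    fi
    sleep 1
  done
}
```

In the failing run the check would be the `kubectl get po ... -o 'jsonpath={.status.containerStatuses[0].state.running}'` call from the trace. Because the pod is never found, the status stays empty until the timeout is hit.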

All 9 comments

Is this a recurring error, or does it happen only rarely? I think I've come across this error in the past.

3 runs failed consistently. Kicking off the 4th run.

Quick update: the 4th run also failed.

I think this is what's happening: when you have multiple runs of the deployer step, you don't get a unique deployment per run. You have only one deployment (model-server-v1). So when the step tries to fetch the TFServing pod corresponding to your run, it lists all pods, including those created by previous runs. It then picks the alphanumerically first pod, and since that is not the pod corresponding to the current run, the step fails.
This can be worked around by deleting all model-server pods left over from earlier runs before starting a fresh run. But that makes recurring, parallel, or multiple runs hard.
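
To illustrate the suspected selection problem, here is a small simulation that needs no cluster. The pod names and the `pick_first_pod` helper are made up for the example. It only shows how "take the alphanumerically first pod" can return a pod from an earlier run.

```shell
#!/usr/bin/env bash

# Hypothetical pod names: one left over from an earlier run, one from the current run.
all_pods=$'model-server-v1-xyz89\nmodel-server-v1-abc12'
current_run_pod="model-server-v1-xyz89"

# Pick the alphanumerically first name from a newline-separated list.
pick_first_pod() {
  printf '%s\n' "$1" | LC_ALL=C sort | head -n 1
}

selected=$(pick_first_pod "$all_pods")
if [ "$selected" != "$current_run_pod" ]; then
  echo "selected a pod from another run: $selected"
fi
```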

A temporary fix is to delete the deployment.
Run "kubectl get deployment -n kubeflow" and look for the deployment whose name starts with taxi-cab-, then
run "kubectl delete deployment taxi-cab- -n kubeflow", completing taxi-cab- with the rest of the name from the listing.

In fact, the deployer already takes a configurable name for the deployment, and the examples generate that name from {{workflow.name}}.

Tried this. Passed the workflow name as a parameter using the --server-name flag and it doesn't happen anymore.

Found the bug: the deployer component truncates the deployment name to 64 bytes, which removes the distinct part of the workflow name and causes a naming collision.
Will send a PR to randomize the deployer name.
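
A rough sketch of the collision and of one possible way around it. This is not the code from the PR. The prefix `taxi-cab-classification-model-server` and the helpers `truncate_name`, `random_suffix` and `unique_name` are assumptions for illustration. Only the 64-byte limit and the two workflow suffixes come from this thread.

```shell
#!/usr/bin/env bash

MAX_LEN=64  # limit mentioned above (characters equal bytes for ASCII names)

# Naive truncation: keeps the first MAX_LEN characters, so a distinct tail is lost.
truncate_name() {
  printf '%s' "${1:0:$MAX_LEN}"
}

# Eight random hex characters read from /dev/urandom.
random_suffix() {
  head -c 4 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Shorten the base first, then append the suffix, so the suffix always survives.
unique_name() {
  local base="$1" suffix="$2"
  local keep=$(( MAX_LEN - ${#suffix} - 1 ))
  printf '%s-%s' "${base:0:$keep}" "$suffix"
}

# Hypothetical prefix; the two runs differ only in the trailing workflow suffix.
prefix="taxi-cab-classification-model-server"
run_a="${prefix}-tfx-taxi-cab-classification-5lxrp"
run_b="${prefix}-tfx-taxi-cab-classification-67xc9"

name_a=$(truncate_name "$run_a")
name_b=$(truncate_name "$run_b")
echo "$name_a"   # both lines print the same name: the run suffix was cut off
echo "$name_b"

# With a random suffix appended after shortening, the names stay distinct.
unique_name "$run_a" "$(random_suffix)"; echo
unique_name "$run_b" "$(random_suffix)"; echo
```

Shortening the base before appending the suffix keeps the result within the limit while preserving the part that distinguishes one run from another.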
