On a fresh deployment, the taxi cab TFX example is failing at the deploy step with:
++ kubectl get po taxi-cab-classification-model-tfx-taxi-cab-classification-5lxrp taxi-cab-classification-model-tfx-taxi-cab-classification-67xc9 --namespace kubeflow -o 'jsonpath={.status.containerStatuses[0].state.running}'
Error from server (NotFound): pods "taxi-cab-classification-model-tfx-taxi-cab-classification-67xc9" not found
Is this a recurring error, or does it happen only rarely? I think I've come across this error in the past.
3 runs failed consistently. Kicking off the 4th run.
Quick update: the 4th run also failed.
I think this is what's happening: when you have multiple runs of the deployer step, you don't get unique deployments. You have only one deployment (model-server-v1). So when the step tries to fetch the TFServing pod corresponding to your run, it lists all pods, including those created by previous runs. It then picks the alphanumerically first pod, and since it can't find the pod corresponding to the run, the step fails.
This can be solved by deleting all model-server pods left over from earlier runs before starting a fresh run. But this makes recurring, parallel, or multiple runs hard.
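To illustrate the failure mode described above, here is a rough toy model in Python. It is not the deployer's actual code: the pod names, the helper functions, and the idea of filtering by a per-run deployment name are all assumptions made for the sketch.

```python
def pick_first(pod_names):
    """Mimic the behaviour described above: take the alphanumerically first pod."""
    return sorted(pod_names)[0]


def pick_for_run(pod_names, deployment_name):
    """Hypothetical alternative: keep only pods that belong to this run's
    deployment, assuming every run uses its own deployment name."""
    matches = [p for p in pod_names if p.startswith(deployment_name + "-")]
    return sorted(matches)[0] if matches else None


# Hypothetical pod listing: one stale pod from an earlier run, one current pod.
shared = ["model-server-v1-zzzzz", "model-server-v1-aaaaa"]
print(pick_first(shared))  # picks the stale "aaaaa" pod, not the current one

# With a distinct deployment name per run, the lookup is unambiguous.
per_run = ["run-a-server-x1", "run-b-server-y2"]
print(pick_for_run(per_run, "run-b-server"))
```

With a single shared deployment name there is nothing in the pod name that ties a pod to a run, so any "first match" rule can land on a pod from an earlier run.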
A temporary fix is to delete the deployment:
run "kubectl get deployment -n kubeflow" and look for the deployment whose name starts with taxi-cab-, then
run "kubectl delete deployment taxi-cab- -n kubeflow", using the full deployment name from the previous command.
In fact, the deployer already supports configurable deployment names, and the examples generate the name using {{workflow.name}}.
Tried this. I passed the workflow name as a parameter using the --server-name flag, and the failure doesn't happen anymore.
Found the bug: the deployer component truncates the deployment name to 64 bytes, which cuts off the distinguishing part of the workflow name and causes a naming collision.
Will send a PR to randomize the deployer name.
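A minimal sketch of why truncation causes the collision, and of one way a randomized name could avoid it. The prefix, the workflow names, and both helper functions are hypothetical; only the 64-byte limit comes from the comment above, and the actual fix in the PR may look different.

```python
import uuid

MAX_NAME_LEN = 64  # limit reported in this thread


def truncated_name(prefix, workflow_name, max_len=MAX_NAME_LEN):
    """Naive naming: concatenate, then cut off everything past max_len."""
    return (prefix + workflow_name)[:max_len]


def randomized_name(prefix, max_len=MAX_NAME_LEN):
    """Hypothetical alternative: reserve room for a random suffix first,
    so the unique part always survives the length limit."""
    suffix = uuid.uuid4().hex[:8]
    head = prefix[: max_len - len(suffix) - 1].rstrip("-")
    return f"{head}-{suffix}"


prefix = "taxi-cab-classification-model-"  # hypothetical
run_a = "tfx-taxi-cab-classification-pipeline-example-abcde"  # hypothetical
run_b = "tfx-taxi-cab-classification-pipeline-example-fghij"  # hypothetical

# The two workflow names differ only in their last five characters,
# which fall beyond the limit, so both runs get the same deployment name.
print(truncated_name(prefix, run_a) == truncated_name(prefix, run_b))

# Randomized names stay within the limit and differ between calls.
print(randomized_name(prefix), randomized_name(prefix))
```

Putting the unique part at the front of the name, or hashing the workflow name, would be other ways to keep it within the limit.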
Solved in https://github.com/kubeflow/pipelines/pull/704