With the new multi-user pipelines feature, more pods are used which have emptyDir (local storage). Local storage causes issues for K8S clusters using the cluster-autoscaler, as it will refuse to evict pods which have local storage unless they have the following annotation:
cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
We should either make this anotations the default in the manifests, or make this a simple config to change for users.
After further investigation it is almost always the istio-sidecar which is causing this issue.
The following Pipeline Deployment resources have this issue:
metadata-writerml-pipelineml-pipeline-persistenceagentml-pipeline-scheduledworkflowml-pipeline-uiml-pipeline-viewer-crdml-pipeline-visualizationserverThese two Deployment resources are more problematic (because a new one is created for each Profile):
ml-pipeline-ui-artifactml-pipeline-visualizationserverTo fully fix this, we need to:
safe-to-evict safe-to-evict on all static resources (directly in the kubeflow/manifests repo)safe-to-evict on the Profile resources here: pipeline/installs/multi-user/pipelines-profile-controller/sync.py#L56Users can work around this issue (if they are installing istio with the istio helm chart), using the sidecarInjectorWebhook.injectedAnnotations in values.yaml to inject the safe-to-evict annotation.
WARNING: this is only a solution for users who are happy with applying safe-to-evict to all ALL workloads with an istio-sidecar, so we should still fix this in our manifests.
WARNING2: this will also affect Notebook pods, which is probably not the best, as users don't want their Pods randomly evicted from a Node
Totally makes sense to me.
Regarding notebook pods, do we have any other mechanism to prevent them from being evicted?
emptyDir will be provided by ephemeral storage. Having many pod using ephemeral storage is not a good practice. I think applying safe-to-evict is a short term fix. It would be better to understand why all pods need emptyDir. Is it necessary?
Totally makes sense to me.
Regarding notebook pods, do we have any other mechanism to prevent them from being evicted?
@Bobgy I was talking about the workaround, this is NOT an issue if Kubeflow Pipelines updates their manifests
However, I agree that the Notebooks owners should make an effort to support users who apply the workaround (as its probably not that uncommon for Istio users to apply safe-to-evict with sidecarInjectorWebhook.injectedAnnotations
emptyDirwill be provided by ephemeral storage. Having many pod using ephemeral storage is not a good practice. I think applyingsafe-to-evictis a short term fix. It would be better to understand why all pods needemptyDir. Is it necessary?
@Jeffwan the emptyDir is a feature of istio-sidecar, and is not specific to Kubeflow, see here: https://github.com/istio/istio/issues/19395
Until Changes are made in kubeflow manifests, a temporary workaround is to edit the configmap manually.
kubectl edit cm istio-sidecar-injector -n istio-system
Search for injectedAnnotations and add cluster-autoscaler.kubernetes.io/safe-to-evict: "true".
injectedAnnotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
@kubeflow/wg-pipeline-leads is there anyone looking at this issue currently?
Not yet, do you want to submit a PR?
/assign
Most helpful comment
After further investigation it is almost always the istio-sidecar which is causing this issue.
The following Pipeline
Deploymentresources have this issue:metadata-writerml-pipelineml-pipeline-persistenceagentml-pipeline-scheduledworkflowml-pipeline-uiml-pipeline-viewer-crdml-pipeline-visualizationserverThese two
Deploymentresources are more problematic (because a new one is created for eachProfile):ml-pipeline-ui-artifactml-pipeline-visualizationserverTo fully fix this, we need to:
safe-to-evictsafe-to-evicton all static resources (directly in thekubeflow/manifestsrepo)safe-to-evicton the Profile resources here: pipeline/installs/multi-user/pipelines-profile-controller/sync.py#L56