What happened:
When a worker node in the k8s cluster crashed, the daemon container running on it could not be restarted.
The daemon container ends up in the "Error" state, even though I have configured its retry strategy like this:
retryStrategy:
  limit: 10
  retryPolicy: Always
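For reference, a minimal sketch of a daemoned template with this retry strategy (the template name and image are placeholders, not from my real workflow):
- name: my-daemon            # placeholder template name
  daemon: true               # marks the template as a daemon container
  retryStrategy:
    limit: 10
    retryPolicy: Always
  container:
    image: nginx:1.17        # placeholder image
    readinessProbe:          # Argo waits for readiness before starting downstream nodes
      httpGet:
        path: /
        port: 80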
What you expected to happen:
The Argo controller should recreate the daemon container's pod.
How to reproduce it (as minimally and precisely as possible):
Just use kubectl delete to remove one of the running daemon container pods.
Anything else we need to know?:
In fact, I think the "daemon container" feature could be designed better.
One reason the daemon container cannot restart, I suspect, is that its pod IP has already been passed to the downstream DAG nodes. A better solution would be to bind a k8s Service object to the daemon container(s) and pass the Service's cluster IP to the downstream DAG nodes instead.
That would also enable load balancing across multiple daemon containers.
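As a rough illustration (all names and labels here are placeholders), such a Service would select the daemon pods by label, so its cluster IP stays stable across pod restarts and traffic is spread over all matching pods:
apiVersion: v1
kind: Service
metadata:
  name: my-daemon-svc          # placeholder name
spec:
  selector:
    app: my-daemon             # placeholder label carried by the daemon pod(s)
  ports:
  - port: 80
    targetPort: 80             # port the daemon container is assumed to listen on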
Environment:
Argo version:
v2.7.5
Kubernetes version:
v1.13.12
Message from the maintainers:
If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.
Good suggestion, I'll look into this
By the way, what do you think about managing daemon containers with k8s Deployment/StatefulSet objects? @simster7 I think daemon containers are designed for long-running background services, so some configuration options such as RetryStrategy.Limit are not very useful in such cases.
The limitation with using Deployments or StatefulSets is that the IP addresses of the Pods are not guaranteed to stay the same across restarts.
Such functionality would require Argo or the user to set up a Service in order to interact with them reliably.
Yes. I think a good design would be to use a Deployment or StatefulSet to provide the daemon container functionality, or to implement the daemon container's failover logic on the Argo side. Either way, we need to bind the pods to a Service object to guarantee connectivity during pod failover.
By the way, are you (or other contributors) already working on this feature? Our team is evaluating Argo, and if this is not a priority for you, we may contribute a patch. @simster7
I'm looking into this feature myself as there are some tangential changes I want to make. I'll let you know if that changes. Thanks, @cy-zheng
Hi @cy-zheng! Sorry for the delay on this. We had some internal discussions about this feature and I did some testing and we decided that the best way to support this use case is for users to manually create a StatefulSet and a Service using the resource template.
The example below:
- Creates a Service and a StatefulSet in parallel using resource templates
- Waits for the StatefulSet to be ready using another resource template
- Tests that the StatefulSet is accessible by sending a curl request
- Deletes the StatefulSet and Service in an onExit handler. The onExit handler is used to ensure that the resources get deleted regardless of whether the Workflow fails and stops, or is stopped manually with argo stop.
Keep in mind that for Argo to create the StatefulSet and the Service it will need K8s RBAC permissions to do so. For testing on my machine, I just gave the ServiceAccount that Argo was using admin permissions:
$ kubectl -n argo create rolebinding delete-me-admin --clusterrole=admin --serviceaccount=argo:default
(This exact line might change depending on where Argo is installed and what service account it's configured to use)
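If you'd rather not grant full admin, a narrower Role roughly like the following should be enough for the resource templates in this example (names are placeholders; adjust the namespace to wherever the workflow runs, and bind it to the workflow's ServiceAccount with a RoleBinding analogous to the command above):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workflow-sts-manager       # placeholder name
  namespace: argo                  # namespace where the Workflow runs
rules:
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["create", "get", "delete"]   # matches the resource actions used below; some setups may need more (e.g. patch)
- apiGroups: [""]
  resources: ["services"]
  verbs: ["create", "get", "delete"]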
Here is the full Workflow. Let me know if you have any questions.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daemoned-stateful-set-with-service-
spec:
  entrypoint: create-wait-and-test
  onExit: delete
  templates:
  - name: create-wait-and-test
    steps:
    - - name: create-service
        template: create-service
      - name: create-stateful-set
        template: create-stateful-set
    - - name: wait-stateful-set
        template: wait-stateful-set
    - - name: test
        template: test
  - name: delete # This is called as an onExit handler
    steps:
    - - name: delete-service
        template: delete-service
      - name: delete-stateful-set
        template: delete-stateful-set
  - name: create-service
    resource:
      action: create
      manifest: |
        apiVersion: v1
        kind: Service
        metadata:
          name: nginx
          labels:
            app: nginx
        spec:
          ports:
          - port: 80
            name: web
          clusterIP: None
          selector:
            app: nginx
  - name: create-stateful-set
    resource:
      action: create
      manifest: |
        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: web
        spec:
          selector:
            matchLabels:
              app: nginx # has to match .spec.template.metadata.labels
          serviceName: "nginx"
          template:
            metadata:
              labels:
                app: nginx # has to match .spec.selector.matchLabels
            spec:
              terminationGracePeriodSeconds: 10
              containers:
              - name: nginx
                image: k8s.gcr.io/nginx-slim:0.8
                ports:
                - containerPort: 80
                  name: web
  - name: wait-stateful-set
    resource:
      action: get
      successCondition: status.readyReplicas == 1
      manifest: |
        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: web
  - name: test
    container:
      image: curlimages/curl:latest
      command: [sh, -c]
      args: ["curl nginx"]
  - name: delete-service
    resource:
      action: delete
      flags: ["--ignore-not-found"]
      manifest: |
        apiVersion: v1
        kind: Service
        metadata:
          name: nginx
  - name: delete-stateful-set
    resource:
      action: delete
      flags: ["--ignore-not-found"]
      manifest: |
        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: web
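To try it out (assuming the manifest above is saved locally as daemoned-stateful-set.yaml; the file name is arbitrary, and adjust the namespace to match your installation):
$ argo submit -n argo --watch daemoned-stateful-set.yaml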
@cy-zheng Let me know if the solution works for you!
Cool, it works for me. The only question I have is:
Will the "daemon container" feature be removed in the future? It seems redundant if we start daemon services this way. Also, the RetryStrategy field inside a daemon container could confuse users. @simster7
Hi @simster7, there are some drawbacks to using the solution above.
To sum up, this solution is a good workaround for our problem, but there are things we could do to make Argo better. I think a good tool should hide these details from the user; maybe an enhancement to the daemon container feature is the best choice.
Anyway, thank you for the helpful solution!