What did you do?
I've created a new ansible operator using:
operator-sdk new gogs-operator --api-version=org.example.gogs/v1alpha1 --kind=Gogs --type=ansible --generate-playbook
What did you expect to see?
I expect to see a deployment object using liveness and readiness probe to check the health of my operator.
The file is gogs-operator/deploy/operator.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: gogs-operator
spec:
replicas: 1
selector:
matchLabels:
name: gogs-operator
template:
metadata:
labels:
name: gogs-operator
spec:
serviceAccountName: gogs-operator
containers:
- name: ansible
command:
- /usr/local/bin/ao-logs
- /tmp/ansible-operator/runner
- stdout
# Replace this with the built image name
image: "{{ REPLACE_IMAGE }}"
imagePullPolicy: "{{ pull_policy|default('Always') }}"
volumeMounts:
- mountPath: /tmp/ansible-operator/runner
name: runner
readOnly: true
- name: operator
# Replace this with the built image name
image: "{{ REPLACE_IMAGE }}"
imagePullPolicy: "{{ pull_policy|default('Always') }}"
volumeMounts:
- mountPath: /tmp/ansible-operator/runner
name: runner
env:
- name: WATCH_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: OPERATOR_NAME
value: "gogs-operator"
volumes:
- name: runner
emptyDir: {}
What did you see instead? Under which circumstances?
A deployment without liveness and readiness probe.
Environment
operator-sdk version v0.6.0
oc v3.11.88
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
Additional context
Additionally, operator-sdk should:
Do these requirements above make sense?
Do these requirements above make sense?
cc @shawn-hurley @jmrodri as it's an ansible operator enhancement issue ^
@LiliC
Maybe we should extend this issue to other kind of operators. All operators should have liveness and readiness out of the box IMO.
@Lili @luszczynski We actually used to have a default readiness check but it (in combination with leader election) interfered with rolling out operator deployment updates. See #920 and #932.
If we reintroduce these (or similar) checks, we need to make sure we don't have a regression with the operator deployment rollout issue.
@luszczynski Thanks for the request! I think this makes sense to tackle in a future sprint.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting
/reopen.
Mark the issue as fresh by commenting/remove-lifecycle rotten.
Exclude this issue from closing again by commenting/lifecycle frozen./close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @joelanford should it be closed? I am re-opening it since shows that it still making sense.
What did the checks look like when they were conflicting with the election?
EDIT: Found where this was originally discussed in #920.
I got asked this question at a workshops and it's actually a valid one. There should be SOME sort of default readiness/health check for an ansible operator
Is there a design proposal on how to handle this?
I haven't found a design proposal in my poking around.
I also haven't run into any problems simply using ansible --version since my operators are mostly Ansible. I've been thinking about developing a python script to do some more intense checking for permissions, dependencies and the like, but haven't yet. (Someday maybe I'll be a cool Go guy. Y'all do some great work.)
EDIT: Removed sentence about something I was looking for because I found it.
cc @chris-short
I see a health check as being as simple as running ansible --version and getting a 0 exit code. Yes, I'd love to see something that is a more accurate description of health but, I don't know what that would be at this point.
Any planning to add non-ansible operator liveness back?
Can we start the server at the default port (8686) under operator-sdk start first with the /healthz endpoint?
Liveness probe has been added: https://github.com/operator-framework/operator-sdk/pull/2936
Leaving this open to cover the readiness probe
Should we not add for all types? We might could to do it in upstream .. wdyt @asmacdo
IMO: to close this one we need to address it to upstream (go) and in SDK for the helm
@asmacdo @camilamacedo86 Short question, is liveness and readiness probe available for go operator ?
is it implemented in controller-runtime already? See: https://github.com/kubernetes-sigs/controller-runtime/issues/855
If yes I think we should use the controller-runtime implementation. So, IMO we need here:
@asmacdo I understand that we have liveness check for Ansible. So, could you please give a hand by supplementing here what is or not done for ansible and why? PS. would be nice to have the link for the code implementation and/or pr where it was addressed.
Just to keep it update. The controller-runtime provides features to address this need and it is addressed on upstream for v3-alpha plugin. See:https://github.com/kubernetes-sigs/kubebuilder/pull/1856
So, the next step would be we check if we can add it to v1+ ansible/helm plugin or we need to push it for v2+ plugin as well for them. Also. we might need to deprecate some specific implementation done for Ansible to address the same in favor of controller-runtime/upstream one.
@asmacdo have we any specific reason for not use the controller-runtime implementation to address it for Ansible as well?
c/c @joelanford @jmrodri @estroz @asmacdo
/lifecycle frozen
Most helpful comment
Hi @joelanford should it be closed? I am re-opening it since shows that it still making sense.