Operator-sdk: Health Check (Liveness and Readiness probe for Operator)

Created on 20 Mar 2019 · 24Comments · Source: operator-framework/operator-sdk

What did you do?

I've created a new ansible operator using:

operator-sdk new gogs-operator --api-version=org.example.gogs/v1alpha1 --kind=Gogs --type=ansible --generate-playbook

What did you expect to see?

I expect to see a deployment object using liveness and readiness probe to check the health of my operator.

The file is gogs-operator/deploy/operator.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gogs-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: gogs-operator
  template:
    metadata:
      labels:
        name: gogs-operator
    spec:
      serviceAccountName: gogs-operator
      containers:
        - name: ansible
          command:
          - /usr/local/bin/ao-logs
          - /tmp/ansible-operator/runner
          - stdout
          # Replace this with the built image name
          image: "{{ REPLACE_IMAGE }}"
          imagePullPolicy: "{{ pull_policy|default('Always') }}"
          volumeMounts:
          - mountPath: /tmp/ansible-operator/runner
            name: runner
            readOnly: true
        - name: operator
          # Replace this with the built image name
          image: "{{ REPLACE_IMAGE }}"
          imagePullPolicy: "{{ pull_policy|default('Always') }}"
          volumeMounts:
          - mountPath: /tmp/ansible-operator/runner
            name: runner
          env:
            - name: WATCH_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: OPERATOR_NAME
              value: "gogs-operator"
      volumes:
        - name: runner
          emptyDir: {}

What did you see instead? Under which circumstances?

A deployment without liveness and readiness probe.

Environment

operator-sdk version:

operator-sdk version v0.6.0

Kubernetes version information:

oc v3.11.88
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Kubernetes cluster kind:

Additional context

Additionally, operator-sdk should:

Create liveness and readiness in the Deployment object
Expose a rest endpoint to be checked by Openshift/Kubernetes

Do these requirements above make sense?

help wanted kinfeature languaggo languaghelm lifecyclfrozen needs discussion prioritimportant-soon

Source

luszczynski

👍9 ❤1

Most helpful comment

Hi @joelanford should it be closed? I am re-opening it since shows that it still making sense.

camilamacedo86 on 23 Sep 2019

👍4

All 24 comments

Do these requirements above make sense?

cc @shawn-hurley @jmrodri as it's an ansible operator enhancement issue ^

lilic on 22 Mar 2019

@LiliC
Maybe we should extend this issue to other kind of operators. All operators should have liveness and readiness out of the box IMO.

luszczynski on 22 Mar 2019

👍3

@Lili @luszczynski We actually used to have a default readiness check but it (in combination with leader election) interfered with rolling out operator deployment updates. See #920 and #932.

If we reintroduce these (or similar) checks, we need to make sure we don't have a regression with the operator deployment rollout issue.

joelanford on 22 Mar 2019

@luszczynski Thanks for the request! I think this makes sense to tackle in a future sprint.

shawn-hurley on 25 Mar 2019

👍2

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot on 25 Jul 2019

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot on 24 Aug 2019

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-bot on 23 Sep 2019

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot on 23 Sep 2019

Hi @joelanford should it be closed? I am re-opening it since shows that it still making sense.

camilamacedo86 on 23 Sep 2019

👍4

What did the checks look like when they were conflicting with the election?

EDIT: Found where this was originally discussed in #920.

flickerfly on 17 Oct 2019

I got asked this question at a workshops and it's actually a valid one. There should be SOME sort of default readiness/health check for an ansible operator

christianh814 on 15 Nov 2019

Is there a design proposal on how to handle this?

jeyaramashok on 5 Dec 2019

I haven't found a design proposal in my poking around.

I also haven't run into any problems simply using ansible --version since my operators are mostly Ansible. I've been thinking about developing a python script to do some more intense checking for permissions, dependencies and the like, but haven't yet. (Someday maybe I'll be a cool Go guy. Y'all do some great work.)

EDIT: Removed sentence about something I was looking for because I found it.

flickerfly on 5 Dec 2019

cc @chris-short

christianh814 on 5 Dec 2019

👍1

I see a health check as being as simple as running ansible --version and getting a 0 exit code. Yes, I'd love to see something that is a more accurate description of health but, I don't know what that would be at this point.

chris-short on 6 Dec 2019

👍2

Any planning to add non-ansible operator liveness back?

haoqing0110 on 14 Apr 2020

👍1

Can we start the server at the default port (8686) under operator-sdk start first with the /healthz endpoint?

krish7919 on 30 May 2020

Liveness probe has been added: https://github.com/operator-framework/operator-sdk/pull/2936

Leaving this open to cover the readiness probe

asmacdo on 10 Jul 2020

Should we not add for all types? We might could to do it in upstream .. wdyt @asmacdo

camilamacedo86 on 10 Jul 2020

👀1

IMO: to close this one we need to address it to upstream (go) and in SDK for the helm

camilamacedo86 on 1 Sep 2020

👍3

@asmacdo @camilamacedo86 Short question, is liveness and readiness probe available for go operator ?

akang0 on 24 Sep 2020

is it implemented in controller-runtime already? See: https://github.com/kubernetes-sigs/controller-runtime/issues/855
If yes I think we should use the controller-runtime implementation. So, IMO we need here:

Check if the above are features that can be used from controller runtime
To do it in upstream (Kubebuilder)
Path the solution for Helm/Ansible (we might need to review what is done for ansible already)

@asmacdo I understand that we have liveness check for Ansible. So, could you please give a hand by supplementing here what is or not done for ansible and why? PS. would be nice to have the link for the code implementation and/or pr where it was addressed.

camilamacedo86 on 30 Oct 2020

Just to keep it update. The controller-runtime provides features to address this need and it is addressed on upstream for v3-alpha plugin. See:https://github.com/kubernetes-sigs/kubebuilder/pull/1856

So, the next step would be we check if we can add it to v1+ ansible/helm plugin or we need to push it for v2+ plugin as well for them. Also. we might need to deprecate some specific implementation done for Ansible to address the same in favor of controller-runtime/upstream one.

@asmacdo have we any specific reason for not use the controller-runtime implementation to address it for Ansible as well?

c/c @joelanford @jmrodri @estroz @asmacdo

camilamacedo86 on 9 Nov 2020

/lifecycle frozen

camilamacedo86 on 9 Dec 2020

Was this page helpful?

0 / 5 - 0 ratings