Operator-sdk: Health Check (Liveness and Readiness probe for Operator)

Created on 20 Mar 2019  路  24Comments  路  Source: operator-framework/operator-sdk

What did you do?

I've created a new ansible operator using:

operator-sdk new gogs-operator --api-version=org.example.gogs/v1alpha1 --kind=Gogs --type=ansible --generate-playbook

What did you expect to see?

I expect to see a deployment object using liveness and readiness probe to check the health of my operator.

The file is gogs-operator/deploy/operator.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gogs-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: gogs-operator
  template:
    metadata:
      labels:
        name: gogs-operator
    spec:
      serviceAccountName: gogs-operator
      containers:
        - name: ansible
          command:
          - /usr/local/bin/ao-logs
          - /tmp/ansible-operator/runner
          - stdout
          # Replace this with the built image name
          image: "{{ REPLACE_IMAGE }}"
          imagePullPolicy: "{{ pull_policy|default('Always') }}"
          volumeMounts:
          - mountPath: /tmp/ansible-operator/runner
            name: runner
            readOnly: true
        - name: operator
          # Replace this with the built image name
          image: "{{ REPLACE_IMAGE }}"
          imagePullPolicy: "{{ pull_policy|default('Always') }}"
          volumeMounts:
          - mountPath: /tmp/ansible-operator/runner
            name: runner
          env:
            - name: WATCH_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: OPERATOR_NAME
              value: "gogs-operator"
      volumes:
        - name: runner
          emptyDir: {}

What did you see instead? Under which circumstances?

A deployment without liveness and readiness probe.

Environment

  • operator-sdk version:

operator-sdk version v0.6.0

  • Kubernetes version information:

oc v3.11.88
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

  • Kubernetes cluster kind:

Additional context

Additionally, operator-sdk should:

  1. Create liveness and readiness in the Deployment object
  2. Expose a rest endpoint to be checked by Openshift/Kubernetes

Do these requirements above make sense?

help wanted kinfeature languaggo languaghelm lifecyclfrozen needs discussion prioritimportant-soon

Most helpful comment

Hi @joelanford should it be closed? I am re-opening it since shows that it still making sense.

All 24 comments

Do these requirements above make sense?

cc @shawn-hurley @jmrodri as it's an ansible operator enhancement issue ^

@LiliC
Maybe we should extend this issue to other kind of operators. All operators should have liveness and readiness out of the box IMO.

@Lili @luszczynski We actually used to have a default readiness check but it (in combination with leader election) interfered with rolling out operator deployment updates. See #920 and #932.

If we reintroduce these (or similar) checks, we need to make sure we don't have a regression with the operator deployment rollout issue.

@luszczynski Thanks for the request! I think this makes sense to tackle in a future sprint.

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Hi @joelanford should it be closed? I am re-opening it since shows that it still making sense.

What did the checks look like when they were conflicting with the election?

EDIT: Found where this was originally discussed in #920.

I got asked this question at a workshops and it's actually a valid one. There should be SOME sort of default readiness/health check for an ansible operator

Is there a design proposal on how to handle this?

I haven't found a design proposal in my poking around.

I also haven't run into any problems simply using ansible --version since my operators are mostly Ansible. I've been thinking about developing a python script to do some more intense checking for permissions, dependencies and the like, but haven't yet. (Someday maybe I'll be a cool Go guy. Y'all do some great work.)

EDIT: Removed sentence about something I was looking for because I found it.

cc @chris-short

I see a health check as being as simple as running ansible --version and getting a 0 exit code. Yes, I'd love to see something that is a more accurate description of health but, I don't know what that would be at this point.

Any planning to add non-ansible operator liveness back?

Can we start the server at the default port (8686) under operator-sdk start first with the /healthz endpoint?

Liveness probe has been added: https://github.com/operator-framework/operator-sdk/pull/2936

Leaving this open to cover the readiness probe

Should we not add for all types? We might could to do it in upstream .. wdyt @asmacdo

IMO: to close this one we need to address it to upstream (go) and in SDK for the helm

@asmacdo @camilamacedo86 Short question, is liveness and readiness probe available for go operator ?

is it implemented in controller-runtime already? See: https://github.com/kubernetes-sigs/controller-runtime/issues/855
If yes I think we should use the controller-runtime implementation. So, IMO we need here:

  • Check if the above are features that can be used from controller runtime
  • To do it in upstream (Kubebuilder)
  • Path the solution for Helm/Ansible (we might need to review what is done for ansible already)

@asmacdo I understand that we have liveness check for Ansible. So, could you please give a hand by supplementing here what is or not done for ansible and why? PS. would be nice to have the link for the code implementation and/or pr where it was addressed.

Just to keep it update. The controller-runtime provides features to address this need and it is addressed on upstream for v3-alpha plugin. See:https://github.com/kubernetes-sigs/kubebuilder/pull/1856

So, the next step would be we check if we can add it to v1+ ansible/helm plugin or we need to push it for v2+ plugin as well for them. Also. we might need to deprecate some specific implementation done for Ansible to address the same in favor of controller-runtime/upstream one.

@asmacdo have we any specific reason for not use the controller-runtime implementation to address it for Ansible as well?

c/c @joelanford @jmrodri @estroz @asmacdo

/lifecycle frozen

Was this page helpful?
0 / 5 - 0 ratings