Operator-sdk: Ansible operator PVC creation and neverending reconcile loop

Created on 6 Apr 2020 · 9 Comments · Source: operator-framework/operator-sdk

Bug Report

What did you do?
I am creating a PVC resource in Ansible using its native k8s module.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kojihub
  labels:
    app: koji-hub
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi
  selector:
    matchLabels:
      app: "koji-hub"
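
For reference, a minimal sketch of how a manifest like this might be wrapped in a role task with the k8s module. The task name and the use of `meta.namespace` (a variable the Ansible operator passes to the role in SDK releases of this generation) are assumptions; the reporter's actual task is not shown in the issue.

```yaml
# Sketch only: the task name and namespace handling are assumptions,
# not copied from the reporter's role.
- name: Ensure the koji-hub PVC exists
  k8s:
    state: present
    definition:
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: kojihub
        namespace: "{{ meta.namespace }}"
        labels:
          app: koji-hub
      spec:
        accessModes:
          - ReadWriteOnce
        volumeMode: Filesystem
        resources:
          requests:
            storage: 10Gi
        selector:
          matchLabels:
            app: koji-hub
```

With `state: present` the module only reports a change when the live object differs from the definition, so a task like this should be a no-op on repeat runs once the PVC exists.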

What did you expect to see?
The CR condition status to be finished and waiting for the next reconcile loop/event.

What did you see instead? Under which circumstances?

The reconcile loop never finishes and stays in the "Running" state forever, even though both the PVC and PV statuses are Bound.

Environment

  • operator-sdk version:

operator-sdk version: "v0.15.2", commit: "ffaf278993c8fcb00c6f527c9f20091eb8dd3352", go version: "go1.13.6 linux/amd64"

  • go version:

go version go1.13.6 linux/amd64

  • Kubernetes version information:
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-07T21:20:10Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:07:13Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind: minikube and local sdk test environment

  • Are you writing your operator in ansible, helm, or go?

Ansible

language/ansible triage/support

All 9 comments

Hi @odra,

A reconcile that keeps running shows that a task in the role or playbook hit an error; because of this the reconcile is re-triggered and keeps running, which is the expected behaviour. If you check the CR status, you will be able to see the error and the reason the reconcile was re-triggered.

Example: [image: Screenshot 2020-04-06 at 23 17 20]

However, if you cannot see the error message in the CustomResource, I'd recommend upgrading your project to SDK 0.16, since one of the bugs fixed in that release ensures that the error status message appears in all scenarios for Ansible-based operators. See the Changelog:

Fix missing error status when the error faced in the Ansible do not return an event status. (#2661)

See the migration guide for some breaking changes affecting Ansible-based operators and how to fix them. You can also check the Memcached Ansible sample as an example.

I am therefore closing this one as resolved. However, if you still need help, please feel free to ping us here to re-open it, and please provide the following information:

  • Ensure that you are using version 0.16, which has the bug fix, and that your project has been upgraded properly. Then provide the content of your project's build/Dockerfile so that we can check it.
  • Enable the Ansible debug logs with the env var ANSIBLE_DEBUG_LOGS in the operator.yaml file, as done in the Ansible Memcached example.
  • Provide the full logs of your operator so that we can check both the operator logs and the Ansible ones while the reconcile is being re-triggered, and analyse the scenario.
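
A rough sketch of what enabling the debug logs could look like in operator.yaml. Only the variable name ANSIBLE_DEBUG_LOGS comes from this thread; the container name is a placeholder and the value `"True"` is an assumption based on how the sample project sets it, so check your own scaffolded file.

```yaml
# Fragment of the operator Deployment in operator.yaml.
# The container name is hypothetical; keep whatever your project already uses.
spec:
  template:
    spec:
      containers:
        - name: mb-koji-hub-operator   # hypothetical name
          env:
            # Ask the operator to include the Ansible task output in its logs.
            - name: ANSIBLE_DEBUG_LOGS
              value: "True"
```

After redeploying, the operator pod logs should include the per-task Ansible output, which makes it easier to spot a task that reports `changed` on every run.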

There are no Ansible playbook errors; the PVC is bound as expected, but the reconcile just keeps running and never finishes.

The cr does not show any kind of errors in the conditions list:

apiVersion: apps.fedoraproject.org/v1alpha1
kind: MBKojiHub
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps.fedoraproject.org/v1alpha1","kind":"MBKojiHub","metadata":{"annotations":{},"labels":{"app":"mb-koji-hub"},"name":"mb-koji-hub","namespace":"default"},"spec":{"cacert_insecure":true,"cacert_secret":"cacert","configmap":"koji-hub-configmap","fedmsg_configmap":"fedmsg-configmap","hostname":"","persistent":true,"replicas":1}}
  creationTimestamp: "2020-04-07T09:30:24Z"
  generation: 1
  labels:
    app: mb-koji-hub
  name: mb-koji-hub
  namespace: default
  resourceVersion: "3645"
  selfLink: /apis/apps.fedoraproject.org/v1alpha1/namespaces/default/mbkojihubs/mb-koji-hub
  uid: a0c3d10e-92a3-49e3-902c-502c331e10d9
spec:
  cacert_insecure: true
  cacert_secret: cacert
  configmap: koji-hub-configmap
  fedmsg_configmap: fedmsg-configmap
  hostname: ""
  persistent: true
  replicas: 1
status:
  conditions:
  - lastTransitionTime: "2020-04-07T09:30:24Z"
    message: Running reconciliation
    reason: Running
    status: "True"
    type: Running

I've updated the operator-sdk CLI to v0.16.0 and migrated the project to use the new container, but the problem persists.

Hi @odra,

Also, I will be working on a POC based on the Memcached sample to try to reproduce your scenario as soon as possible. It may take a few days, but I will keep you posted.

Could you please provide the full logs so that we can check them as well? Please follow these steps:

  • Enable the Ansible debug logs with the env var ANSIBLE_DEBUG_LOGS in the operator.yaml file, as done in the Ansible Memcached example.
  • Provide the full logs of your operator, from operator startup until the reconcile is re-triggered, so that we can check both the operator logs and the Ansible ones.

Dockerfile:

FROM quay.io/operator-framework/ansible-operator:v0.16.0

COPY requirements.yml ${HOME}/requirements.yml
RUN ansible-galaxy collection install -r ${HOME}/requirements.yml \
&& chmod -R ug+rwx ${HOME}/.ansible

COPY watches.yaml ${HOME}/watches.yaml

COPY roles/ ${HOME}/roles/

Logs: https://gist.github.com/odra/09c0991de02bbf67df4cae00a61b4ea7

I've created a sample repository which creates a pvc and replicates the error: https://github.com/odra/pvc-operator

lol looks like the sample repository works ^:O

I noticed that the status is updated as "completed" but it then starts the reconcile loop right after that, as if something changed.

quick question: how does the ansible operator decide that it should run again? is it based on the "changed" status from the run?
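
For context on that question: as far as I understand the Ansible operator, a new run is not driven by the `changed` result of the previous one. It is driven by watch events on the CR, by events on resources the role created (which get an owner reference back to the CR), and by the periodic `reconcilePeriod`. A watches.yaml sketch that makes those knobs explicit is below. The group, version and kind come from the CR shown earlier; the role path and the period value are hypothetical.

```yaml
# watches.yaml sketch; the role path and reconcilePeriod value are assumptions.
- version: v1alpha1
  group: apps.fedoraproject.org
  kind: MBKojiHub
  role: /opt/ansible/roles/mbkojihub   # hypothetical path
  # Re-run the role on a timer, independent of any watch event.
  reconcilePeriod: 5m
  # Also reconcile when a resource owned by the CR (PVC, Secret, ...) changes.
  # This is the default behaviour and is shown here only for clarity.
  watchDependentResources: True
```

Under that model, a task that modifies an owned resource on every run produces a fresh event each time, which would look exactly like a reconcile that restarts right after completing.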

OK, I found the issue: the playbook was creating a self-signed cert on every play, which resulted in a secret getting updated every time. Not an operator issue.

Closing it.
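
The thread does not show how the playbook was changed, so the following is only one possible way to make the certificate step idempotent: look up the secret first and generate a new self-signed cert only when it is missing. The secret name, file paths and common name are hypothetical. The module names are the Ansible 2.9 ones (newer releases ship them in collections under different names), and the openssl modules assume a suitable Python crypto library is available in the operator image.

```yaml
# Sketch only: secret name, paths and common name are hypothetical.
- name: Look up the existing certificate secret
  k8s_info:
    api_version: v1
    kind: Secret
    name: koji-hub-cert
    namespace: "{{ meta.namespace }}"
  register: cert_secret

- name: Generate and store a self-signed certificate only when the secret is missing
  when: cert_secret.resources | length == 0
  block:
    - name: Generate a private key
      openssl_privatekey:
        path: /tmp/koji-hub.key

    - name: Generate a certificate signing request
      openssl_csr:
        path: /tmp/koji-hub.csr
        privatekey_path: /tmp/koji-hub.key
        common_name: koji-hub

    - name: Self-sign the certificate
      openssl_certificate:
        path: /tmp/koji-hub.crt
        privatekey_path: /tmp/koji-hub.key
        csr_path: /tmp/koji-hub.csr
        provider: selfsigned

    - name: Create the secret from the generated files
      k8s:
        state: present
        definition:
          apiVersion: v1
          kind: Secret
          type: kubernetes.io/tls
          metadata:
            name: koji-hub-cert
            namespace: "{{ meta.namespace }}"
          data:
            tls.crt: "{{ lookup('file', '/tmp/koji-hub.crt') | b64encode }}"
            tls.key: "{{ lookup('file', '/tmp/koji-hub.key') | b64encode }}"
```

Once the secret exists, the whole block is skipped, so the secret is no longer rewritten on each play and should stop generating new events for the operator to react to.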
