Operator-sdk: ansible-operator hangs on k8s state: present

Created on 25 Oct 2019  路  41Comments  路  Source: operator-framework/operator-sdk

Bug Report

What did you do?
The ansible operator hangs in different places/on different resources on

- k8s:
      state: present.

The k8s request is successfully executed, but next operation is never started.

kubectl get/apply are working on the pod.

Presence of resources.requests and requests.limits make it happens more often, while memory reported is will under limits.

What did you expect to see?

The playbooks should continue on.

What did you see instead? Under which circumstances?
A clear and concise description of what you expected to happen (or insert a code snippet).

The latest log is on start of task

{"level":"info","ts":1571992864.8623073,"logger":"logging_event_handler","msg":"[playbook task]","name":"iscplatform","namespace":"isc-relite","gvk":"isc.ibm.com/v1, Kind=ISCSequence","event_type":"playbook_on_task_start","job":"8236951410584007958","EventData.Name":"icp.operator.actions : default deployment for inf-configstore"}

and the next operation never started.

Environment

  • operator-sdk version:

happens with v0.10.0 and v0.11.0

FROM quay.io/operator-framework/ansible-operator:v0.11.0
...

ENTRYPOINT ["/tini", "--", "bash", "-c", "${OPERATOR} run ansible --watches-file=/opt/ansible/watches.yaml --inject-owner-ref=false $@"]
ansible-operator version
operator-sdk version: "v0.11.0", commit: "39c65c36159a9c249e5f3c178205cc6e86c16f8d", go version: "go1.12.10 linux/amd64"
  • go version: go1.12.10 linux/amd64
  • Kubernetes version information:
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2019-09-12T23:41:09Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind:

Red Hat Openshift 3.11

  • Are you writing your operator in ansible, helm, or go?

ansible

Possible Solution

Additional context
Add any other context about the problem here.

help wanted languagansible

Most helpful comment

Just to be clear, the issue is in the client generator. Seems to generate code that does not respect the CPython contract, so the issue/fix will be to fix the code generator. In this sense, the bug is swaggers, not CPythons.

All 41 comments

Hmm, that's really bad. Can you show the output of ps aux in the operator container?

We've seen a bug in the past where Python garbage collection hangs when trying to clean up an object that references an open socket (like our kubernetes client object), though that shouldn't be occurring in the newer base images.

How reproducible is this failure?

It shows to be related to https://github.com/operator-framework/operator-sdk/issues/2094. Cannot be the python version in the env where it has been executed? If it is the case, it should work with the python3.

There is only python3.6 on quay.io/operator-framework/ansible-operator:v0.11.0
I'm building the operator with ps (its removed in v0.11) and will wait for next hanging process

@fabianvf the output of ps

Tue Oct 29 11:44:51 UTC 2019
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ansible       1  0.0  0.0   4352   412 ?        Ss   11:31   0:00 /tini -- bash -c ${OPERATOR} run ansible --watches-file=/opt/ansible/watches.yaml --inject-owner-ref=false $@
ansible       8  2.6  0.1 158972 56928 ?        Sl   11:31   0:21 /usr/local/bin/ansible-operator run ansible --watches-file=/opt/ansible/watches.yaml --inject-owner-ref=false
ansible     960  0.0  0.0  12036  1708 ?        Ss   11:33   0:00 bash -c while sleep 5; do date; ps auxww; done
ansible    5938  0.0  0.0  12036  1708 ?        Ss   11:37   0:00 bash -c while sleep 5; do date; ps 
auxww; done
ansible   13857 32.6  0.0 121368 25364 ?        S    11:41   0:57 /usr/bin/python3.6 /usr/local/bin/ansible-runner -vv --rotate-artifacts 20 -p /opt/ansible/playbooks/sequence.yml -i 1598098976185383115 run /tmp/ansible-operator/runner/isc.ibm.com/v1/ISCSequence/cp4s/iscplatform
ansible   13861 34.0  0.2 297380 87904 pts/0    Ssl+ 11:41   0:59 /usr/bin/python3.6 /usr/local/bin/ansible-playbook -i /tmp/ansible-operator/runner/isc.ibm.com/v1/ISCSequence/cp4s/iscplatform/inventory -e @/tmp/ansible-operator/runner/isc.ibm.com/v1/ISCSequence/cp4s/iscplatform/env/extravars -vv /opt/ansible/playbooks/sequence.yml
ansible   19336  8.0  0.2 299504 83624 pts/0    S+   11:44   0:00 /usr/bin/python3.6 /usr/local/bin/ansible-playbook -i /tmp/ansible-operator/runner/isc.ibm.com/v1/ISCSequence/cp4s/iscplatform/inventory -e @/tmp/ansible-operator/runner/isc.ibm.com/v1/ISCSequence/cp4s/iscplatform/env/extravars -vv /opt/ansible/playbooks/sequence.yml
ansible   19338  0.0  0.0  23032   948 ?        S    11:44   0:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 5
ansible   19347  1.0  0.0  12396  1756 pts/0    S+   11:44   0:00 /bin/sh -c /usr/bin/python3.6 /tmp/.ansible/tmp/ansible-tmp-1572349491.1019864-279436973975095/AnsiballZ_k8s.py && sleep 0
ansible   19348 42.0  0.0  90548 21960 pts/0    R+   11:44   0:00 /usr/bin/python3.6 /tmp/.ansible/tmp/ansible-tmp-1572349491.1019864-279436973975095/AnsiballZ_k8s.py
ansible   19350  0.0  0.0  43956  1748 ?        R    11:44   0:00 ps auxww

The AnsiballZ_k8s.py - is a k8s 'present' command.

If I'm sending kill 19348 the current ansible command returns 0 and the playbook continues

an addition - if the AnsiballZ_k8s.py is quite regular and could be handled by watchdog script, e.g.
get the pids of all AnsiballZ_k8s.py processes
sleep 1minute
kill all the AnsiballZ_k8s.py processes still running
we have time to time non-regular (happens something like once a week) pattern like this

 1 ?        Ssl    3:15 /usr/local/bin/ansible-operator run ansible --watches-file=/opt/ansible/watches.yaml --inject-owner-ref=false
     8 ?        S      0:00 /bin/bash /opt/ansible/init/watchdog.sh
  1994 ?        S      0:08 /usr/bin/python3.6 /usr/local/bin/ansible-runner -vv --rotate-artifacts 20 -p /opt/ansible/playbooks/sequence.yml -i 1976235410884491574 run /tmp/ansible-operator/runner/isc.ibm.com/v1/ISCSequence/cp4s/de
  1998 pts/0    Ssl+  58:44 /usr/bin/python3.6 /usr/local/bin/ansible-playbook -i /tmp/ansible-operator/runner/isc.ibm.com/v1/ISCSequence/cp4s/de/inventory -e @/tmp/ansible-operator/runner/isc.ibm.com/v1/ISCSequence/cp4s/de/env/extravars -vv /opt/ansible/playbooks/sequence.yml
  4613 pts/0    Sl+    0:00 /usr/bin/python3.6 /usr/local/bin/ansible-playbook -i /tmp/ansible-operator/runner/isc.ibm.com/v1/ISCSequence/cp4s/de/inventory -e @/tmp/ansible-operator/runner/isc.ibm.com/v1/ISCSequence/cp4s/de/env/extravars -vv /opt/ansible/playbooks/sequence.yml
 80832 ?        S      0:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 60
 80877 ?        Rs     0:00 ps axwww

and there is no obvious solution how to deal with such scenarios as there is no defined pattern to check.

So we have the exact same problem. Any update on this?

I've not been able to replicate this yet, though I did just get reminded of a similar bug that was filed in the ansible/ansible repo: https://github.com/ansible/ansible/issues/55554 , which unfortunately also doesn't have a resolution yet. Is anyone able to reliably replicate this in a local environment, and if so could you post the steps?

We have the same issue on all our OpenShift 3.11 and 4.1 clusters with operators using an ansible role. Our operators are namespaced.

The operator freezes at a random task in the role and we see that memory usage stays flatlined at his peak (Grafana). It can happen a few minutes after the operator was deployed, but also after two days. Eventually, it always freezes in all namespaces.
We tried to add additional cpu/resources and even disable quota's entirely, nothing helped so far.

We have the issue with v0.10 and v0.11. We don't have the issue with v0.8, but that one consumes a lot of CPU (1 full core) and Mem (almost 1gb), which is about ten times more than v0.10 and v0.11.

We prefer not to move to a Go operator because of the effort of refactoring, but it seems that this is much more stable. It also uses only one container iso two in the operator pod. Also, all the operators included in Openshift 4.x are based on Go. I wonder if RedHat is still investing in Ansible Operators.

image

I am seeing the same behaviour in the 0.12 version of the ansible operator. I m using the k8s_status module to update the status of the CR. What we are seeing is that the step runs OK but the status of the CR is left in "Running Reconciliation" and it never returns. At this point, the operator will stop listening to the watches and needs restarting.

As stated above - we do not want to move to the Go operator as we will still have to invoke Ansible scripts so the Ansible operator fits the process much better.

@accammer Indeed, I can confirm your observations. We also have the issue with v0.12.

Hi @fitex,

Note that OCP 3.11 is no longer compatibles with the latest versions of SDK since it has been using upper versions of k8s API which will not work with.

@mishindm, @fitex, @accammer could you help us by providing the detailed steps to reproduce it as a POC on top of Memcached Ansible Sample with Minikube? Would it be possible?

@camilamacedo86 memcached is probably bad example as it has too few objects created.
For example, we have 2 operators - in one of them ansible mainly invokes helm binary with few secrets/jobs created with k8s and it may work for a week without issues.
In comparison operator which creates about 400 objects during full deployment may be frozen once-twice during such deployment - and would be frozen nearly for sure in a day time.
If you have any idea of the extra diagnostics to produce I may try to produce such diagnostics locally.

As far as I see 0.12 works with 3.11 without issues except the one being discussed. Is there any pointers to the expected changes?

Hi @fitex,

Note that OCP 3.11 is no longer compatibles with the latest versions of SDK since it has been using upper versions of k8s API which will not work with.

@mishindm, @fitex, @accammer could you help us by providing the detailed steps to reproduce it as a POC on top of Memcached Ansible Sample with Minikube? Would it be possible?

I am running on microK8s 1.1.3 not OCP.

The main issue with this issue is that it is inconsistent and not easily reproduced although it is happening regularly

Hi @mishindm,

The k8s API changed and it is currently in 1.17+ when OCP 3.11 is using 1.11+ which as you can check in the README of SDK is lower than the prerequirements. See that ansible in the master is pinned kubernetes-1.16.2 which is not compatible with OCP 3.11 at all as the version 0.12 is using kubernetes-1.15.4.

I'd recommend you use minikube and/or OCP 4 since 3.11 is no longer supported by it. Are you able to move forward for OCP 4 and let us know if the issue still?

Hi @accammer,

What is the k8s API version used by microK8s 1.1.3?

As previously mentioned, we have the problem with v0.10, v0.11 and v0.12 on both OpenShift 3.11 and 4.1.

Hi @accammer,

What is the k8s API version used by microK8s 1.1.3?

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.6", GitCommit:"a21fdbd78dde8f5447f5f6c331f7eb6f80bd684e", GitTreeState:"clean", BuildDate:"2018-07-26T10:17:47Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.6", GitCommit:"abdda3f9fefa29172298a2e42f5102e777a8ec25", GitTreeState:"clean", BuildDate:"2019-05-08T13:46:28Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

We have back-ported our code to use operator-sdk 0.8.1.
For this we had to refactor our ansible role to make it backwards compatible with ansible 2.7.

This operator is already running 48h on both OpenShift 3.11 and Openshift 4.1 without freezing. So far it's looking good.

One of the differences is that ansible 2.7 uses python 2 (iso 3) and so the k8s module behaves differently then in ansible 2.8 and 2.9.

I still wonder if this bug can be fixed in a higher operator-sdk version so that we at least can use it in Openshift 4.x. For now we will stick to operator-sdk 0.8.1.

Hi @fitex,

One of the differences is that ansible 2.7 uses python 2 (iso 3) and so the k8s module behaves differently than in ansible 2.8 and 2.9.

It is definitely not the only difference. Note that the operator-sdk 0.8.1. use an old k8s API which is compatible with OCP 3.11. Also, see that OCP 4.1 is using an old k8s api version Kubernetes 1.13.

I truly believe that you would not able to see this issue using OCP 4.2 which is using Kubernetes 1.14 and SDK 0.12. If you or @accammer and or @mishindm check it with 4.2 or minikube will be great and please let us know.

c/c @fabianvf

Hi @camilamacedo86, as soon as we have a 4.2 cluster, we will try it out.

But so far, as the documentation is not clear which version of operator-sdk is compatible with which OpenShift version, we will stay with operator-sdk 0.8.1.

Also, according to the k8s module of Ansible 2.7/2.8/2.9, it is compatible with Openshift >= 0.6:
https://docs.ansible.com/ansible/latest/modules/k8s_module.html

Even if that k8s module is using Python OpenShift and Python Kubernetes underneath, I don't see why it would fail on OpenShift 3.11 or OpenShift 4.1. Keep also in mind that there are numerous of operator reconcile runs that work fine and it just freezes after a random number of reconciles.
If there would be a compatibility issue, then I guess we would see it from the first run.

Hi @fitex,

Ansible and python is one thing. But the operators use the k8s API which is another dependency and you are right currently the docs do not make it clear enough. It has an issue raised here to create a matrix with the supported versions.

Please, let me know if you are able to check it and would be great to get your confirmation that the scenario is not faced in the OCP 4.2 with SDK 0.12.

Hi @fitex,

Could you confirm that the issue does not occur with SDK 0.12 and OCP >= 4.2 ? Really thank you for your collab and contribution with.

Hi @camilamacedo86 ,

We have upgraded our cluster to OpenShift 4.2.9, so I will give it a try with SDK 0.12.
In the meantime I see that SDK 0.13 is released. Will that version also work with OpenShift 4.x?

Hi @fitex,

Really thank you for your reply.

Yes, it will. Please, feel free to check 0.12 or any upper version which in this case is 0.13.

@camilamacedo86
Unfortunately we have the same behaviour with OpenShift 4.2.9 and Operator-sdk 0.13.0.

image

also have the issue that my ansible based operator gets stuck after a while. Some reconciliation loops are fine, then after few minutes to hours the ansible process gets stuck blocking the operator from further processing.

I enabled ANSIBLE_DEBUG=1 to get bit more output. here are the last 30 lines:

#> cat ./stdout  | tail -n30
  5054 1576664654.12903: _low_level_execute_command() done: rc=0, stdout=/opt/ansible
, stderr=
  5054 1576664654.13001: _low_level_execute_command(): starting
  5054 1576664654.13204: _low_level_execute_command(): executing: /bin/sh -c '( umask 77 && mkdir -p "` echo /opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070 `" && echo ansible-tmp-1576664654.0683-190010863364070="` echo /opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070 `" ) && sleep 0'
  5054 1576664654.13292: in local.exec_command()
  5054 1576664654.13405: opening command with Popen()
  5054 1576664654.14975: done running command with Popen()
  5054 1576664654.15097: getting output with communicate()
  5054 1576664654.16764: done communicating
  5054 1576664654.16904: done with local.exec_command()
  5054 1576664654.17008: _low_level_execute_command() done: rc=0, stdout=ansible-tmp-1576664654.0683-190010863364070=/opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070
, stderr=
  5054 1576664654.17213: ANSIBALLZ: using cached module: /opt/ansible/.ansible/tmp/ansible-local-4889xiy1pbw_/ansiballz_cache/k8s-ZIP_DEFLATED
  5054 1576664654.17666: transferring module to remote /opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070/AnsiballZ_k8s.py
  5054 1576664654.17939: done transferring module to remote
  5054 1576664654.18086: _low_level_execute_command(): starting
  5054 1576664654.18329: _low_level_execute_command(): executing: /bin/sh -c 'chmod u+x /opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070/ /opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070/AnsiballZ_k8s.py && sleep 0'
  5054 1576664654.18476: in local.exec_command()
  5054 1576664654.18533: opening command with Popen()
  5054 1576664654.19800: done running command with Popen()
  5054 1576664654.19937: getting output with communicate()
  5054 1576664654.21163: done communicating
  5054 1576664654.21302: done with local.exec_command()
  5054 1576664654.21491: _low_level_execute_command() done: rc=0, stdout=, stderr=
  5054 1576664654.21596: _low_level_execute_command(): starting
  5054 1576664654.21701: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python3.6 /opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070/AnsiballZ_k8s.py && sleep 0'
  5054 1576664654.21843: in local.exec_command()
  5054 1576664654.21996: opening command with Popen()
  5054 1576664654.44408: done running command with Popen()
  5054 1576664654.44574: getting output with communicate()

This last log line is from here: https://github.com/ansible/ansible/blob/480b106d6535978ae6ecab68b40942ca4fa914a0/lib/ansible/plugins/connection/local.py#L127

I can make the thing "unstuck" by killing the python process. Loglines after I killed the process:

  5054 1576677442.60567: done communicating
  5054 1576677442.61731: done with local.exec_command()
  5054 1576677442.61878: _low_level_execute_command() done: rc=143, stdout=
{"changed": false, "result": {"kind": "NetworkPolicy", "apiVersion": .... }
, stderr=/usr/local/lib/python3.6/site-packages/kubernetes/config/kube_config.py:509: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config_dict=yaml.load(f),
/bin/sh: line 1:  5064 Terminated              /usr/bin/python3.6 /opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070/AnsiballZ_k8s.py
  5054 1576677442.62367: done with _execute_module (k8s, {'state': 'present', 'definition': {'apiVersion': 'networking.k8s.io/v1', 'kind': 'NetworkPolicy' ..., '_ansible_check_mode': False, '_ansible_no_log': False, '_ansible_debug': True, '_ansible_diff': False, '_ansible_verbosity': 2, '_ansible_version': '2.9.2', '_ansible_module_name': 'k8s', '_ansible_syslog_facility': 'LOG_USER', '_ansible_selinux_special_fs': ['fuse', 'nfs', 'vboxsf', 'ramfs', '9p', 'vfat'], '_ansible_string_conversion_action': 'warn', '_ansible_socket': None, '_ansible_shell_executable': '/bin/sh', '_ansible_keep_remote_files': False, '_ansible_tmpdir': '/opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070/', '_ansible_remote_tmp': '~/.ansible/tmp'})

Its random on which resource the ansible process gets stuck.

operator-sdk: 0.13
OpenShift 3.11

Managed to get an strace.

I can see that the command python3.6 ... AnsiballZ_k8s.py actually writes the response to STDOUT and that the ansible-playbook process also reads the response.

After the python3.6 process writes the output there is one getpid() call and afterwards only futex().

As the response is written it looks to me like k8s is actually finished but the process still waits on something.

@vinzent that's some great info, any chance you've got the strace dump available? If it's hanging after the response is already processed it could be an issue with garbage collection/cleanup (we saw an issue a while back on python2 that was caused by a GC bug when calling __del__ on an object that referenced an open socket, I wonder if this is something similar...)

@fabianvf

strace around python3.6 writing the response to stdout

the tail of the python3.6 ansiballz process: https://gist.github.com/vinzent/5c81ee3135f56979eafee28e3aeb8dbf#file-strace-49277-tail-n100

@fabianvf managed to get a gdb bt .

#0  0x00007f6cb506ae56 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007f6cb506af48 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f6cb536a105 in PyThread_acquire_lock_timed () from /lib64/libpython3.6m.so.1.0
#3  0x00007f6cb53e78a8 in lock_PyThread_acquire_lock () from /lib64/libpython3.6m.so.1.0
#4  0x00007f6cb53e99b3 in _PyCFunction_FastCallDict () from /lib64/libpython3.6m.so.1.0
#5  0x00007f6cb53e9d6d in call_function () from /lib64/libpython3.6m.so.1.0
#6  0x00007f6cb5428faa in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0
#7  0x00007f6cb537b7db in _PyEval_EvalCodeWithName () from /lib64/libpython3.6m.so.1.0
#8  0x00007f6cb53b57d0 in fast_function () from /lib64/libpython3.6m.so.1.0
#9  0x00007f6cb53e9e52 in call_function () from /lib64/libpython3.6m.so.1.0
#10 0x00007f6cb5428faa in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0
#11 0x00007f6cb537b7db in _PyEval_EvalCodeWithName () from /lib64/libpython3.6m.so.1.0
#12 0x00007f6cb53b57d0 in fast_function () from /lib64/libpython3.6m.so.1.0
#13 0x00007f6cb53e9e52 in call_function () from /lib64/libpython3.6m.so.1.0
#14 0x00007f6cb5428faa in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0
#15 0x00007f6cb53b561a in fast_function () from /lib64/libpython3.6m.so.1.0
#16 0x00007f6cb53e9e52 in call_function () from /lib64/libpython3.6m.so.1.0
#17 0x00007f6cb5428faa in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0
#18 0x00007f6cb537c89a in _PyFunction_FastCallDict () from /lib64/libpython3.6m.so.1.0
#19 0x00007f6cb537d27e in _PyObject_FastCallDict () from /lib64/libpython3.6m.so.1.0
#20 0x00007f6cb5388790 in _PyObject_Call_Prepend () from /lib64/libpython3.6m.so.1.0
#21 0x00007f6cb537cfeb in _PyObject_FastCallDict () from /lib64/libpython3.6m.so.1.0
#22 0x00007f6cb53f32ef in slot_tp_finalize () from /lib64/libpython3.6m.so.1.0
#23 0x00007f6cb5376f1b in collect () from /lib64/libpython3.6m.so.1.0
#24 0x00007f6cb544ebcd in collect_with_callback () from /lib64/libpython3.6m.so.1.0
#25 0x00007f6cb54847e8 in PyGC_Collect () from /lib64/libpython3.6m.so.1.0
#26 0x00007f6cb54988dd in Py_FinalizeEx () from /lib64/libpython3.6m.so.1.0
#27 0x00007f6cb5498a48 in Py_Exit () from /lib64/libpython3.6m.so.1.0
#28 0x00007f6cb5498b33 in handle_system_exit () from /lib64/libpython3.6m.so.1.0
#29 0x00007f6cb5498b96 in PyErr_PrintEx () from /lib64/libpython3.6m.so.1.0
#30 0x00007f6cb5499005 in PyRun_SimpleFileExFlags () from /lib64/libpython3.6m.so.1.0
#31 0x00007f6cb549a07b in Py_Main () from /lib64/libpython3.6m.so.1.0
#32 0x0000565493d4cc68 in main ()

couldn't do a py-bt (https://wiki.python.org/moin/DebuggingWithGdb) . there was some message from gdb that I should run dnf debuginfo-install platform-python-3.6.8-15.1.el8.x86_64 but that resulted in

Could not find debuginfo for package: platform-python-3.6.8-15.1.el8.x86_64
No debuginfo packages available to install

@fabianvf managed to get debuginfo installed. needed to pass --enablerepo="*" to get access to debug repos.

py-bt looks like its hung in Garbage collecting.

Traceback (most recent call first):
  <built-in method acquire of _thread.lock object at remote 0x7faa367603a0>
  File "/usr/lib64/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
  File "/usr/lib64/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 550, in join
    p.join()
  File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 78, in __del__
    self.pool.join()
  Garbage-collecting

py-bt and bt: https://gist.github.com/vinzent/5c81ee3135f56979eafee28e3aeb8dbf#file-gdb-backtrace-txt

Was able to force a hang with the following code:

import kubernetes

client = kubernetes.config.new_client_from_config()
thread = kubernetes.client.ApisApi(client).get_api_versions(async_req=True)
del client

Not super familiar with threading in Python so not sure what a fix would like yet, but for now filed this issue against the kubernetes client: https://github.com/kubernetes-client/python/issues/1037

To test I replaced Python3 with Python2 in the container: The ansible-operator didn't get stuck with Python 2.7 since friday.

Whereas with Python 3 it only takes few minutes until it hangs.

As a workaround I removed the __del__ method from api_client.py .

Hi @vinzent,

Its random on which resource the ansible process gets stuck.

operator-sdk: 0.13
OpenShift 3.11

Really thank you for the information provided and tests performed. Note that the OpenShift 3.11 is no longer supported and it is k8s API version is incompatible with the k8s version used in the SDK 0.13

Regards the workaround adopted:

As a workaround I removed the __del__ method from api_client.py ?

Please, could you add the links and provide more details over the steps? So, just to confirm; after you remove it then all worked well with 0.13 and python 3. Am I right?

@camilamacedo86 at least ot works fine with 0.13 and OpenShift 3.11 even though you say it's not supported. The error also isn't tied to any openshift version at all, because the problem can be reproduced without even connecting to a cluster at all.

I don't know what further details to provide. I removed the __del__ method in the ApiClient class (api_client.py) as a workaround. I think it's not an issue to remove it because every resource managed by ansible creates a process thats just does api calls for that specific resource and then exits again.

@camilamacedo86 please also have a look at https://github.com/kubernetes-client/python/issues/1037

Root cause seems to be a Python3 issue reported 6 days ago: https://bugs.python.org/issue39360 . We make use of the kubernetes-client/python project, which is generated by swagger-codegen.

Corresponding issue for swagger-codegen is here: https://github.com/swagger-api/swagger-codegen/issues/9991

Just to be clear, the issue is in the client generator. Seems to generate code that does not respect the CPython contract, so the issue/fix will be to fix the code generator. In this sense, the bug is swaggers, not CPythons.

Python dependencies have been fixed and bumped, this should finally be fixed

https://github.com/operator-framework/operator-sdk/pull/2652/files#diff-dd9d2b60820cd667310a1c6a16f46522R79

Was this page helpful?
0 / 5 - 0 ratings