Python: Python freeze/hang on exit

Created on 7 Dec 2017 · 20Comments · Source: kubernetes-client/python

Context

Hello,

In a batch manager project, we are using this python client to submit jobs to the Kubernetes API.
In one word, the python application loads the library, submit the job, watch the related events, clean/delete the job, then return the succeeded or failed status. But sometimes, the application hang at the exit.

After investigation, it seems the ThreadPool used in the ApiClient class is not properly clean on the python process exit.

Reproduce

The easiest way to reproduce is to run this snippet:

python_version=3.6
kubernetes_version=4.0.0

docker run --name testing --rm -it --entrypoint "" python:$python_version /bin/bash -c "
pip install 'kubernetes==$kubernetes_version'
while true; do echo ===;
  for i in {0..50}; do python -c '

from kubernetes import client
coreapi = client.CoreV1Api()
print(0)' &

  done
wait
done"

This will run Python in a Docker container, install the Kubernetes python module, then run the test indefinitely. The test starts a simple application 50 times in order to increase the probability. This application loads the Kubernetes python module, create a CoreV1Api, which creates its ApiClient (with Async enabled using ThreadPool), then print 0 showing the freeze occured during the python exit sequence.

To stop the test:

docker rm -f testing

Expected:

This code should run indefinitely.

Result:

The loop hang on list of 0 after some time.

Workaround:

To avoid this issue, we override the ApiClient class to disable Async / ThreadPool feature. It seems to work without any issues so far. Downside is we are loosing the Async mode.

Thank you.

Source

johnmarcou

👍9

Most helpful comment

This seems to be related to the __del__ method on ApiClient cleaning up its ThreadPool: https://github.com/kubernetes-incubator/client-python/blob/30b8ee44f4d14546e651dead91306719d53f8c37/kubernetes/client/api_client.py#L76-L78

This can cause a deadlock when the api clients are garbage collected as Python exits. I can reproduce with the following:

# deadlock.py
from multiprocessing.pool import ThreadPool

class Deadlocker:
    def __init__(self):
        self.pool = ThreadPool()

    def __del__(self):
        self.pool.close()
        self.pool.join()

d1 = Deadlocker()
d2 = Deadlocker()
print('exiting...')

and running:

while true; do python3 deadlock.py; done

On macOS 10.12.6 with Python 3.6.3, after 1-50 executions, it will print "exiting..." and stall. The python process won't ever terminate until you hit ctrl-c. I can also reproduce on Linux but it seems to be less frequent that on macOS.

This means a simple script like this:

from kubernetes import client
coreapi = client.CoreV1Api()
batchapi = client.BatchV1Api()

may never terminate because CoreV1Api and BatchV1Api will both instantiate ApiClients which have the problematic __del__ method. We can reduce the likelihood of a deadlock by creating a single ApiClient and passing it into CoreV1Api and BatchV1Api but the problem doesn't go away entirely. There are also some classes like Watch that always instantiate their own ApiClient.

I wonder whether the multiprocessing.pool.ThreadPool class is suitable for production use cases. According to a stackoverflow comment I came across:

The multiprocessing.pool.ThreadPool is not documented as its implementation has never been completed. It lacks tests and documentation.

RobbieClarken on 7 Dec 2017

👍3

All 20 comments

This can cause a deadlock when the api clients are garbage collected as Python exits. I can reproduce with the following:

# deadlock.py
from multiprocessing.pool import ThreadPool

class Deadlocker:
    def __init__(self):
        self.pool = ThreadPool()

    def __del__(self):
        self.pool.close()
        self.pool.join()

d1 = Deadlocker()
d2 = Deadlocker()
print('exiting...')

and running:

while true; do python3 deadlock.py; done

This means a simple script like this:

from kubernetes import client
coreapi = client.CoreV1Api()
batchapi = client.BatchV1Api()

I wonder whether the multiprocessing.pool.ThreadPool class is suitable for production use cases. According to a stackoverflow comment I came across:

The multiprocessing.pool.ThreadPool is not documented as its implementation has never been completed. It lacks tests and documentation.

RobbieClarken on 7 Dec 2017

👍3

I'm seeing what I suspect is another symptom of this problem. When I use the kubernetes Python module, I often see this when the program exits:

Exception ignored in: <bound method ApiClient.__del__ of <kubernetes.client.api_client.ApiClient object at 0x2ae3248da470>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/kubernetes/client/api_client.py", line 78, in __del__
  File "/usr/lib/python3.4/multiprocessing/pool.py", line 501, in join
  File "/usr/lib/python3.4/threading.py", line 1056, in join
TypeError: 'NoneType' object is not callable

It doesn't seem to cause any harm, since as the message says, the exception is ignored since it's from the __del__ method, but I still get these ugly messages to stderr.

I didn't see this symptom when using version 3.0.0.

jphx on 7 Dec 2017

I think the hang happens with the join() method, as it waits until the process is done, can someone verify this? If so, it has to be investigated why!

I introduced the __del__() as the ThreadPool isn't cleared automatically all the time when apiClient is destructed/ collected by GC. On rapid creation of many clients (bad way to do things) and stop using them, you can quickly run out of threads as the destruction of these apiClient instances doesn't necessarily close and join the ThreadPool. At least in C/ C++ sense, threads should be closed and joined appropriately, so in that sense, forcing the ThreadPool to close and join is correct. Lack of documentation of ThreadPool made it impossible to figure out the right way (without looking into its implementation)

@RobbieClarken You are correct on the ThreadPool's bad documentation and all. I googled a lot to figure out how to "correctly" deal with a threadpool instance.

@jphx a check can be done to see if the object is not none. Ugly but should work.

tuxbotix on 8 Feb 2018

Is there a recommended workaround for this issue? Currently I'm pursuing the approach of deploying a modified api_client.py with the pool code disabled.

Perhaps lazily creating the thread pool only when some call occurs with async=True would be a better approach than always creating the pool? While doing so doesn't necessarily resolve the bug, it would at least confine the bug to those using async=True, and prevent unnecessary churn of thread pool objects for those not using async=True.

sbconsulting on 23 May 2018

@johnmarcou @RobbieClarken I'm having exactly the same situation as @sbconsulting and I'm trying to find out a workaround to this issue. Would you mind sharing some code that you are using to temporally patch this issue? I can see that John reported that they override the APIClient class and disabled the threadpool feature, but would be good to know how you did it in case other people have the same issue as all us.

Anyways, thanks a lot for the discussion in this issue, I've been a week hunting this ghost bug!

Sturgelose on 16 Jul 2018

👍1

Hi @Sturgelose,

The client.CoreV1Api() object use dependencies injection to get its ApiClient.
We can use this to create our own ApiClient, then inject it.

To create our own K8sApiClient, we are using inheritance of the ApiClient, then we override some functions.

We are talking about this Class:
https://github.com/kubernetes-client/python/blob/v4.0.0/kubernetes/client/api_client.py#L32

Here comes the patched K8sApiClient:

class K8sApiClient(client.ApiClient):
    def call_api(self, *args, async=None, **kwargs):
        return super().call_api(*args, async=False, **kwargs)

    def __init__(self, configuration=None, header_name=None, header_value=None, cookie=None):
        if configuration is None:
            configuration = Configuration()
        self.configuration = configuration

#        self.pool = ThreadPool()
        self.rest_client = RESTClientObject(configuration)
        self.default_headers = {}
        if header_name is not None:
            self.default_headers[header_name] = header_value
        self.cookie = cookie
        # Set default User-Agent.
        self.user_agent = 'Swagger-Codegen/4.0.0/python'

    def __del__(self):
        pass

Then how to use it:

        config = client.Configuration()
        api_client = K8sApiClient(configuration=config)
        coreapi = client.CoreV1Api(api_client)
        batchapi = client.BatchV1Api(api_client)

It's just some copy/paste from a project, it's not tested code. But you should have everything to disable this ThreadPool issue.

johnmarcou on 16 Jul 2018

👍1

This is broken for me as well.

Seems like https://github.com/swagger-api/swagger-codegen/pull/8061 would fix it, since it only instantiates the problematic ThreadPool on first-request. If no requests in a session are async it should never create the pool.

007 on 5 Nov 2018

I'm wondering what the state of this is. Given that there's a series of MRs, some open and some merged, and that this issue is itself open, would anyone be willing to provide a general summary of the path forward, and what versions of the client will have a fix?

spacez320 on 29 Jan 2019

@spacez320 The workaround was added and it's available in the latest version 9.0.0a1.

tomplus on 29 Jan 2019

🎉1

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 30 Apr 2019

/remove-lifecycle stale

007 on 30 Apr 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 29 Jul 2019

@spacez320 could you please verify if the workaround works?

yliaog on 27 Aug 2019

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot on 26 Sep 2019

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

fejta-bot on 26 Oct 2019

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.