gRPC: Client-side Python fork support can hang on Python 3.7

Created on 15 Feb 2019 · 3 comments · Source: grpc/grpc

https://github.com/grpc/grpc/pull/16513 adds fork-without-exec tests for Python clients using GRPC_ENABLE_FORK_SUPPORT=1 and the multiprocessing library. The tests are flaky on Kokoro using Python 3.7, leading to what appear to be hanging child processes.
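
For reference, the shape of the usage these tests exercise looks roughly like the following sketch (the address, timeout, and use of channel_ready_future are placeholders, not the actual test code from #16513):

    import multiprocessing
    import os

    # GRPC_ENABLE_FORK_SUPPORT=1 must be set in the environment before gRPC's
    # core is initialized (e.g. before the first channel is created).
    os.environ.setdefault('GRPC_ENABLE_FORK_SUPPORT', '1')

    import grpc


    def _child(address):
        # The child relies on the post-fork handlers having reset gRPC's
        # state, then creates its own channel and issues its own RPCs.
        with grpc.insecure_channel(address) as channel:
            grpc.channel_ready_future(channel).result(timeout=10)


    if __name__ == '__main__':
        address = 'localhost:50051'
        with grpc.insecure_channel(address) as channel:
            grpc.channel_ready_future(channel).result(timeout=10)
            # multiprocessing defaults to the fork start method on Linux, so
            # this forks while the parent's channel and gRPC threads are live.
            worker = multiprocessing.Process(target=_child, args=(address,))
            worker.start()
            worker.join()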

I have been unable to reproduce the problem locally to attach a debugger and determine the exact cause, but based on a series of trials on Kokoro it looks like the child's post-fork Python handler is deadlocking either when obtaining the GIL or at some point before it begins cleaning up inherited channels. This was determined by inserting calls to C++'s abort() or Python's os._exit at various points within the post-fork handler: aborting the process before obtaining the GIL deterministically avoided the deadlock, but even simply exiting the process after obtaining the GIL (before doing anything potentially suspicious, such as acquiring a condition variable's lock) could still result in a hung child process.
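
As a rough Python-level analogue of that bisection technique (the real probes were inserted directly into the Cython/C++ post-fork handler, not registered like this):

    import os


    def _probe(tag):
        # Report progress with raw syscalls and exit without running any more
        # Python, so the probe itself cannot deadlock on interpreter state.
        os.write(2, ('reached: %s\n' % tag).encode())
        os._exit(0)


    # os.register_at_fork child callbacks run after the C-level pthread_atfork
    # child handlers, so reaching this probe means the library's own post-fork
    # handler completed.
    os.register_at_fork(after_in_child=lambda: _probe('after child fork handlers'))

    if __name__ == '__main__':
        pid = os.fork()
        if pid == 0:
            _probe('child body')  # unreachable: the registered probe exits first
        else:
            os.waitpid(pid, 0)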

Since the deadlock appears to be present on Python 3.7 but not on Python 2, and the hang does not appear to result from an error in our fork handlers (such as trying to acquire a lock held by another thread pre-fork), I suspect the issue stems from our pre-fork handlers only pausing active gRPC Python threads in block_if_fork_in_progress, where threads block on a condition variable until the fork completes. My guess is that these threads, since they are abruptly terminated at fork, are still holding some internal Python lock or otherwise leave the interpreter's state broken post-fork. I couldn't find a corresponding known issue on the Python issue tracker, nor have I confirmed this to be the case, but it seems the most likely cause at this point. It's also not clear that this would even indicate a bug in Python, as the documentation is vague about whether it is intended to be safe to fork with living (but blocked) threads via the multiprocessing module.
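
For context, the pattern in question looks roughly like this sketch (simplified names, not gRPC Python's actual fork_posix implementation):

    # Illustrative sketch of the "pause threads across fork" pattern described
    # above; names are simplified and this is not gRPC Python's actual code.
    import threading

    _fork_in_progress = False
    _fork_cond = threading.Condition()


    def block_if_fork_in_progress():
        # Called by gRPC-owned worker threads at safe points: if a fork has
        # been announced, park on the condition variable until it completes.
        with _fork_cond:
            while _fork_in_progress:
                _fork_cond.wait()


    def pre_fork():
        global _fork_in_progress
        with _fork_cond:
            _fork_in_progress = True
        # ...wait for active worker threads to reach block_if_fork_in_progress...


    def post_fork_parent():
        global _fork_in_progress
        with _fork_cond:
            _fork_in_progress = False
            _fork_cond.notify_all()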

One option, both to test the above hypothesis and hopefully to fix the underlying issue, would be to change gRPC Python's fork handlers to actually stop (join) its threads pre-fork and resume them post-fork. This is the behavior of gRPC core's fork handlers. It is a somewhat invasive change, as, unlike core, gRPC Python has no executor/queue concept in which a thread's ongoing work can be parked while the thread itself is joined and then re-created in the parent's post-fork handler. A lighter-weight approach that may also suffice is to have the threads simply release the GIL prior to fork, block, then reacquire the GIL after the fork in the parent.
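
At the Python level, the heavier option could look roughly like the following sketch (illustrative only; the real handlers are in Cython, and _Worker is a hypothetical stand-in for gRPC Python's polling/callback threads):

    import os
    import queue
    import threading


    class _Worker:
        def __init__(self):
            self._work = queue.Queue()   # pending work outlives the thread itself
            self._stop = threading.Event()
            self._thread = None

        def start(self):
            self._stop.clear()
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()

        def _run(self):
            while not self._stop.is_set():
                try:
                    task = self._work.get(timeout=0.1)
                except queue.Empty:
                    continue
                task()

        def stop_and_join(self):
            self._stop.set()
            self._thread.join()


    _worker = _Worker()
    _worker.start()

    # Join the thread before fork so no Python thread dies mid-operation, then
    # re-create it afterwards in the parent (and child).
    os.register_at_fork(before=_worker.stop_and_join,
                        after_in_parent=_worker.start,
                        after_in_child=_worker.start)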

The spawn start method added to multiprocessing in Python 3.4 may also present an alternative to the current fork-handler-based fork support for Python 3.7 users, and would work around the above issue entirely.
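
For example, something like the following sketch forces the spawn start method so the child starts from a fresh interpreter with no inherited gRPC threads (the address and RPC are placeholders):

    import multiprocessing

    import grpc


    def _child(address):
        # With spawn, this runs in a fresh interpreter: no gRPC threads,
        # channels, or fork handlers are inherited from the parent.
        with grpc.insecure_channel(address) as channel:
            grpc.channel_ready_future(channel).result(timeout=10)


    if __name__ == '__main__':
        ctx = multiprocessing.get_context('spawn')
        worker = ctx.Process(target=_child, args=('localhost:50051',))
        worker.start()
        worker.join()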

kind/bug lang/Python priority/P2

All 3 comments

I was able to reproduce the issue in a locally running Docker container matching our Kokoro test runs. It took repeated runs of the new fork tests over a couple of hours to produce the first hang, but artificially increasing the CPU load on my desktop made it much easier to reproduce, which would help explain why it happens so frequently on our CI builds but was hard to reproduce locally.
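
Roughly, the reproduction setup was along these lines (a hypothetical helper; the test command is a placeholder for however the fork tests are invoked in your environment):

    import multiprocessing
    import subprocess
    import sys


    def _burn():
        # Busy-loop to saturate a core and increase scheduling pressure.
        while True:
            pass


    if __name__ == '__main__':
        burners = [multiprocessing.Process(target=_burn, daemon=True)
                   for _ in range(multiprocessing.cpu_count())]
        for b in burners:
            b.start()
        try:
            while True:
                # Placeholder command: run the new fork tests in a loop until
                # one of them hangs and trips the timeout.
                subprocess.run([sys.executable, '-m', 'tests.fork._fork_interop_test'],
                               timeout=300, check=True)
        finally:
            for b in burners:
                b.terminate()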

The backtrace from attaching to one of the hanging processes:

#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x8a6bfc <_PyRuntime+1340>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x8a6c00 <_PyRuntime+1344>, cond=0x8a6bd0 <_PyRuntime+1296>)
    at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x8a6bd0 <_PyRuntime+1296>, mutex=0x8a6c00 <_PyRuntime+1344>)
    at pthread_cond_wait.c:655
#3  0x000000000054ccb1 in drop_gil.lto_priv () at ../Python/ceval_gil.h:183
#4  0x0000000000553fd1 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:979
#5  0x000000000054a8d2 in PyEval_EvalFrameEx (throwflag=0, f=0x7f79f5b785d0) at ../Python/ceval.c:547
#6  _PyEval_EvalCodeWithName () at ../Python/ceval.c:3930
#7  0x00000000005d7750 in _PyFunction_FastCallDict () at ../Objects/call.c:367
#8  0x00000000005910b3 in _PyObject_FastCallDict (kwargs=0x0, nargs=1, args=0x7ffe6be3cbf0, 
    callable=0x7f79f5b38268) at ../Objects/call.c:98
#9  _PyObject_Call_Prepend (kwargs=0x0, args=0x7f79f6308048, obj=<optimized out>, callable=0x7f79f5b38268)
    at ../Objects/call.c:904
#10 slot_tp_init () at ../Objects/typeobject.c:6610
#11 0x0000000000597231 in type_call.lto_priv () at ../Objects/typeobject.c:949
#12 0x00007f79f539a2a4 in __Pyx_PyObject_Call (kw=0x0, arg=0x7f79f6308048, func=0xce4138)
    at src/python/grpcio/grpc/_cython/cygrpc.cpp:68112
#13 __Pyx_PyObject_CallNoArg (func=0xce4138) at src/python/grpcio/grpc/_cython/cygrpc.cpp:3178
#14 0x00007f79f53d2406 in __pyx_f_4grpc_7_cython_6cygrpc___postfork_child ()
    at src/python/grpcio/grpc/_cython/cygrpc.cpp:50541
#15 0x00007f79f6484008 in __run_fork_handlers (who=who@entry=atfork_run_child) at register-atfork.c:134
#16 0x00007f79f6443b9a in __libc_fork () at ../sysdeps/nptl/fork.c:137
#17 0x000000000060794b in os_fork_impl (module=<optimized out>) at ../Modules/posixmodule.c:5438

Frame #14 is our Python postfork handler (cygrpc.cpp:50541), and this code is part of the block that corresponds to the following line of fork_posix.pyx.pxi:

            _fork_state.active_thread_count = _ActiveThreadCount()

The stack then proceeds into drop_gil.lto_priv () at ../Python/ceval_gil.h:183, and that is where the hang occurs.

From the above file's "Notes on the implementation" (of the CPython GIL, which I would think is always a bad sign if your stack trace leads you there...):

  • When a thread releases the GIL and gil_drop_request is set, that thread
    ensures that another GIL-awaiting thread gets scheduled.
    It does so by waiting on a condition variable (switch_cond) until
    the value of last_holder is changed to something else than its
    own thread state pointer, indicating that another thread was able to
    take the GIL.

That certainly makes it sound like the thread is waiting until another thread grabs the GIL and notifies the condition variable - unfortunately, since we just forked, there is no other thread. It would appear that the hang occurs when the parent process has a Python thread which sets the GIL's gil_drop_request to true just before forking, which triggers the hang when the sole surviving thread post-fork attempts to dutifully yield the GIL to the requester.
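
If that reading is right, the failure mode shouldn't depend on gRPC at all, and a standalone reproducer would look something like the sketch below: a busy sibling thread keeps generating GIL drop requests while the main thread forks with a Python-level child fork handler registered. (This is a sketch of the hypothesis, not a confirmed reproducer.)

    import os
    import sys
    import threading


    def _busy():
        # Pure-Python spinning keeps requesting the GIL, so gil_drop_request
        # is frequently set at the instant the main thread forks.
        n = 0
        while True:
            n += 1


    def _child_handler():
        # Any Python execution in the child's fork handler may try to yield
        # the GIL to a requesting thread that no longer exists after the fork.
        sum(range(1000))


    if __name__ == '__main__':
        os.register_at_fork(after_in_child=_child_handler)
        threading.Thread(target=_busy, daemon=True).start()
        for _ in range(10000):
            pid = os.fork()
            if pid == 0:
                os._exit(0)
            os.waitpid(pid, 0)
            sys.stdout.write('.')
            sys.stdout.flush()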

In conversation with @gnossen today, it came up that our handling of the _consume_request_iterator thread in _channel.py on fork may not be 100% safe: since this gRPC Python thread may block indefinitely in user code, we have to ignore it for fork-accounting purposes via cygrpc.enter_user_request_generator() and cygrpc.return_from_user_request_generator(), so at fork time this thread may not actually be blocking on anything other than user-provided code. It's possible that this explains the not-quite-always failing tests, if the hang depends on how far the consume-request-iterator thread has gotten when fork is invoked. I will update the tests to guarantee this thread is blocking when fork is invoked and see whether that has any impact on the deadlock. However, the test has previously deadlocked on non-bidi-streaming tests, which suggests that the existence of any other threads when fork is called, regardless of whether they are blocking, is unsafe.
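
Concretely, pinning that thread could look something like this sketch (names are placeholders, not the actual test code):

    import os
    import threading

    _fork_done = threading.Event()


    def blocking_request_iterator(first_request):
        yield first_request
        # The channel's consume-request-iterator thread is now parked inside
        # user-provided code and will stay there across the fork.
        _fork_done.wait()


    def fork_after_first_request():
        pid = os.fork()
        if pid == 0:
            os._exit(0)
        os.waitpid(pid, 0)
        _fork_done.set()  # let the iterator (and the RPC) make progress again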

We're planning on looking into this issue in Q3. We've determined some acceptance criteria:

  • Determine whether CPython 3.8 is affected.
  • Understand the nature of the problem and post a summary here (if any more information is gleaned).
  • (Stretch goal) Resolve the issue for all versions of CPython.

Is there any update on whether this affects CPython 3.8?
