mxnet-mkl hangs indefinitely when trying to spawn subprocesses. This is a recent issue we are observing with Sockeye and may be related to #8532, but it can be reproduced without Sockeye (see below).
conda install mkl ; conda install numpy
The following code reliably reproduces the deadlock/indefinite hang in the main process.
It creates a minimal module and 'trains' for 500 iterations, spawning a subprocess every 100 iterations. The main process is supposed to wait until the subprocess finishes before starting the next one.
code.py:
import subprocess
import sys
import mxnet as mx
if __name__ == '__main__':
    if len(sys.argv) > 1:
        print("TESTING")
        test = True
        iterations = 50
    else:
        print("TRAINING")
        test = False
        iterations = 500
    x = mx.sym.Variable('x')
    y = mx.sym.Variable('y')
    sym = mx.sym.FullyConnected(x, num_hidden=5)
    sym = mx.sym.SoftmaxOutput(sym, y)
    x_data = mx.nd.uniform(0, 1, (32, 16))
    y_data = mx.nd.zeros((32, 5))
    batch = mx.io.DataBatch(data=[x_data], label=[y_data])
    mod = mx.mod.Module(sym, data_names=['x'], label_names=['y'])
    mod.bind(data_shapes=[mx.io.DataDesc('x', shape=x_data.shape)],
             label_shapes=[mx.io.DataDesc('y', shape=y_data.shape)],
             for_training=True, grad_req='write' if not test else 'null')
    mod.init_params()
    mod.init_optimizer()
    process = None
    for i in range(iterations):
        mod.forward(batch)
        if not test:
            mod.backward()
            mod.update()
        if i % 100 == 0 and i > 0:
            print(i)
            if not test:
                if process:
                    print("Waiting for process")
                    process.wait()
                cmd = [sys.executable, sys.argv[0], 'test']
                print("Starting process: '%s'" % " ".join(cmd))
                process = subprocess.Popen(cmd)
    if process:
        process.wait()
conda install mkl
conda install numpy
pip install mxnet-mkl --no-deps
python3 code.py
Replacing mxnet-mkl with mxnet, or conda's numpy with pip-installed numpy (conda uninstall numpy; conda uninstall mkl; pip install numpy), resolves the issue and the output is as expected:
100
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
200
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
300
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
400
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
TESTING
@mxnet-label-bot [MKL]
Actually, there is no need for the subprocess to be a Python process. Replacing cmd = [sys.executable, sys.argv[0], 'test'] with cmd = ['ls'] produces the same hang.
The hang does not occur if one comments out the following lines in the code:
mod.forward(batch)
mod.backward()
mod.update()
So it seems that mxnet-mkl is somehow preventing any subprocess forking.
If the above code example is run in a debugger, the hang occurs in the call to self._execute_child(...) in subprocess.py, line 1268.
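For anyone who wants to confirm where the main process is stuck without attaching a debugger, here is a minimal sketch using the standard library's faulthandler module. The helper names arm_hang_watchdog and disarm_hang_watchdog and the 30-second default are hypothetical choices for illustration, not part of the original script.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=30.0, out=sys.stderr):
    """Dump the stacks of all threads to `out` if still armed after timeout_s."""
    # `out` must be a real file object with a file descriptor.
    # exit=False keeps the process alive so the hang can still be inspected.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=out)


def disarm_hang_watchdog():
    """Cancel the pending dump once the risky call has returned."""
    faulthandler.cancel_dump_traceback_later()
```

In the reproduction script one would call arm_hang_watchdog() right before subprocess.Popen(cmd) and disarm_hang_watchdog() right after it. If Popen never returns, the dump should show the frame the main thread is blocked in.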
Thanks @fhieber for raising this issue. I will take a look after the China holiday from October 1st to 7th.
Thanks for looking into the issue!
Just as a note: in the installation steps above, one needs to add --no-deps to the mxnet installation so that the conda numpy version, which uses MKL, is not overwritten by the pip version:
pip install mxnet-mkl --no-deps
We have started looking at the issue and will be back soon :)
Thank you! We are currently experimenting with a workaround that can be found here:
https://github.com/awslabs/sockeye/tree/forkserver
The bottom line is to create a forkserver with a clean python interpreter process before MXNet is imported. If we use that forkserver for forking our decoder process we do not observe the behavior. That said, it is still concerning that one can no longer fork after MXNet with MKL was imported.
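The idea can be sketched with the standard multiprocessing module. This is only an illustration under stated assumptions, not the code in the Sockeye branch: the helper names start_clean_pool, run_command and main are hypothetical, and the mxnet import is left as a comment so the sketch stays self-contained.

```python
import multiprocessing as mp
import subprocess
import sys


def start_clean_pool(processes=1):
    """Create workers that are forked from a fresh fork-server interpreter."""
    # The fork server is a separate, clean Python process, so its children
    # do not inherit MKL/OpenMP state from the process that imports MXNet.
    ctx = mp.get_context("forkserver")
    return ctx.Pool(processes=processes)


def run_command(pool, cmd):
    """Run cmd inside a clean worker and return its exit code."""
    # subprocess.call is importable in the worker, so it pickles by reference.
    return pool.apply(subprocess.call, (cmd,))


def main():
    # Step 1: bring up the fork server and its worker before any heavy import.
    pool = start_clean_pool()
    # Step 2: the MKL-backed import would go here, e.g. `import mxnet as mx`.
    # Step 3: launch external commands through the clean worker only.
    exit_code = run_command(pool, [sys.executable, "-c", "print('TESTING')"])
    pool.close()
    pool.join()
    return exit_code
```

main() should be called under an if __name__ == '__main__': guard, since the forkserver start method re-imports the main module in its workers. The fork server is not available on Windows.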
@tdomhan we can reproduce the issue on macOS; there is no problem on Linux.
As you mentioned, the issue occurs when conda's MKL-backed numpy is combined with fork, while the regular version of numpy is fine. If we use os.system() to execute the cmd, it also works.
Still debugging, but it looks like this issue involves both the platform and the interaction between several software packages. I will contact the MKL team for feedback.
Hi folks,
When you're using mxnet with MKL, could you please set the environment variable MKL_VERBOSE=1 and share the output?
Best regards,
Alexander
Hi folks,
@tdomhan, can you please run the application with MKL_VERBOSE=1? This will help us determine the MKL version and threading layer.
I suspect your issue may be related to Intel OpenMP + fork(), see https://github.com/numpy/numpy/issues/10060.
So you can also try the workaround: set KMP_INIT_AT_FORK to false.
Please, let me know what you find out!
Best regards,
Maria
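If exporting the variable in the shell is inconvenient, one possible way to apply the workaround from inside the script is sketched below. It assumes the OpenMP runtime reads KMP_INIT_AT_FORK when it is loaded, so the variable has to be set before numpy or mxnet are imported. The helper name apply_fork_workaround is hypothetical.

```python
import os


def apply_fork_workaround(env=os.environ):
    """Set KMP_INIT_AT_FORK=FALSE unless a value was already chosen."""
    # setdefault keeps any value the user exported explicitly.
    env.setdefault("KMP_INIT_AT_FORK", "FALSE")
    return env["KMP_INIT_AT_FORK"]


# Assumption: this runs before any MKL/OpenMP-backed library is loaded.
apply_fork_workaround()

# The MKL-backed imports (numpy, mxnet) would follow here.
```

Subprocesses started afterwards inherit the setting through the environment, so it does not need to be passed to them separately.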
Hi @mzhukova, thanks for the workaround with KMP_INIT_AT_FORK=false! This seems to fix the hanging issue for me.
Here's the version information when running with MKL_VERBOSE=1:
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180710 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, OSX 2.30GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x7f9cc9432f80,1,0x7f9cc9432f80,1) 8.17ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:2
MKL_VERBOSE Intel(R) MKL 2018.0 Update 3 Product build 20180406 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, OSX 2.30GHz lp64 intel_thread
MKL_VERBOSE SGEMM(T,N,5,32,16,0x700005cf0738,0x7f9ccb85ddc0,16,0x7f9cca4e4c00,16,0x700005cf0740,0x7f9ccb9d8a40,5) 175.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:2
MKL_VERBOSE SAXPY(5,0x700005cf0738,0x7f9ccb85df00,1,0x7f9ccb9d8a40,1) 10.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:2
[...followed by an endless stream of lines similar to the last one above...]
Hi @fhieber ,
So MKL does indeed use OpenMP threading in this case, which is the root cause of the hang that you observe.
numpy can be forced to use sequential or TBB threading through the corresponding MKL_THREADING_LAYER setting. However, since mxnet uses libmklml, which supports only intel_thread, the better option here is the KMP_INIT_AT_FORK=false workaround.
You can also check the version of intel-openmp and perhaps try updating it, as this issue may already be fixed in one of the latest releases.
Best regards,
Maria
Really appreciate the help, @mzhukova @akalinki.
Closing this issue for now. Please feel free to reopen it if you run into more problems with it.