Incubator-mxnet: Subprocess Deadlock with mxnet-mkl

Created on 1 Oct 2018 · 14 Comments · Source: apache/incubator-mxnet

Description

mxnet-mkl hangs indefinitely when trying to spawn subprocesses. This is a recent issue we are observing with Sockeye and may be related to #8532, but it can be reproduced without Sockeye (see below).

Environment info (Required)

  • Python 3.6.6
  • macOS
  • mxnet-mkl==1.3.0.post0
  • Anaconda Numpy (with MKL optimization): conda install mkl ; conda install numpy

Minimum reproducible example

The following code reliably reproduces the deadlock/indefinite hang in the main process.
It creates a minimal module and 'trains' for 500 iterations, spawning a subprocess every 100 iterations. The main process is supposed to wait until the subprocess finishes before starting the next one.

code.py:

import subprocess
import sys

import mxnet as mx

if __name__ == '__main__':

    if len(sys.argv) > 1:
        print("TESTING")
        test = True
        iterations = 50
    else:
        print("TRAINING")
        test = False
        iterations = 500

    x = mx.sym.Variable('x')
    y = mx.sym.Variable('y')

    sym = mx.sym.FullyConnected(x, num_hidden=5)
    sym = mx.sym.SoftmaxOutput(sym, y)

    x_data = mx.nd.uniform(0, 1, (32, 16))
    y_data = mx.nd.zeros((32, 5))
    batch = mx.io.DataBatch(data=[x_data], label=[y_data])

    mod = mx.mod.Module(sym, data_names=['x'], label_names=['y'])
    mod.bind(data_shapes=[mx.io.DataDesc('x', shape=x_data.shape)],
             label_shapes=[mx.io.DataDesc('y', shape=y_data.shape)],
             for_training=True, grad_req='write' if not test else 'null')
    mod.init_params()
    mod.init_optimizer()
    process = None
    for i in range(iterations):
        mod.forward(batch)
        if not test:
            mod.backward()
            mod.update()
        if i % 100 == 0 and i > 0:
            print(i)
            if not test:
                if process:
                    print("Waiting for process")
                    process.wait()
                cmd = [sys.executable, sys.argv[0], 'test']
                print("Starting process: '%s'" % " ".join(cmd))
                process = subprocess.Popen(cmd)
    if process:
        process.wait()

Steps to reproduce

  1. conda install mkl
  2. conda install numpy
  3. pip install mxnet-mkl --no-deps
  4. python3 code.py

What have you tried to solve it?

Replacing mxnet-mkl with mxnet or conda's numpy with pip-installed numpy (conda uninstall numpy; conda uninstall mkl; pip install numpy) resolves the issue and the output is as expected:

100
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
200
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
300
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
400
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
TESTING

All 14 comments

@mxnet-label-bot [MKL]

Actually, there is no need for the subprocess to be a Python process. Replacing cmd = [sys.executable, sys.argv[0], 'test'] with cmd = ['ls'] produces the same hang.

The hang does not occur if one comments out the following lines in the code:

mod.forward(batch)
mod.backward()
mod.update()

So it seems that, once mxnet-mkl has run these computations, the process can no longer fork a subprocess.
If the above code example is run in a debugger, the hang occurs in the call to self._execute_child(...) in subprocess.py, line 1268.
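The location of the hang can also be captured without an interactive debugger. The following is a minimal sketch, assuming Python's standard faulthandler module; the helper name popen_with_watchdog and the 30 second default are hypothetical choices, not part of the original report. If Popen has not returned within the timeout, the stacks of all threads are written to stderr.

```python
import faulthandler
import subprocess
import sys


def popen_with_watchdog(cmd, timeout_s=30.0):
    """Start cmd; dump all thread stacks if Popen itself does not return in time.

    Hypothetical helper for diagnosing the hang. It does not fix anything.
    """
    # Arm a watchdog: after timeout_s seconds the tracebacks of all threads
    # are written to the original stderr, and the process keeps running.
    faulthandler.dump_traceback_later(timeout_s, file=sys.__stderr__, exit=False)
    try:
        return subprocess.Popen(cmd)
    finally:
        # Popen returned (or raised), so the watchdog is no longer needed.
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    process = popen_with_watchdog(["ls"])
    process.wait()
```

In code.py, the call process = subprocess.Popen(cmd) could be swapped for popen_with_watchdog(cmd) to see which frame is active when the main process stops making progress.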

Thanks @fhieber for raising this issue. I will take a look after the China holiday, which runs from 1 to 7 October.

Thanks for looking into the issue!

Just as a note: in the installation steps above, one needs to add --no-deps to the mxnet installation command so that the conda numpy version, which uses MKL, is not overwritten by the pip version:

  1. conda install mkl
  2. conda install numpy
  3. pip install mxnet-mkl --no-deps
  4. python3 code.py

We have started looking at the issue and will be back soon :)

Thank you! We are currently experimenting with a workaround that can be found here:
https://github.com/awslabs/sockeye/tree/forkserver

The bottom line is to create a forkserver with a clean python interpreter process before MXNet is imported. If we use that forkserver for forking our decoder process we do not observe the behavior. That said, it is still concerning that one can no longer fork after MXNet with MKL was imported.
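The forkserver idea can be sketched with Python's standard multiprocessing module. This is not the code from the Sockeye branch linked above, only a minimal illustration; the helper names start_clean_launcher and run_in_clean_process are hypothetical, and the sketch assumes a Unix platform, where the forkserver start method is available. The mxnet import is left as a comment and deferred until after the launcher exists, on the assumption that this keeps MKL/OpenMP code out of the launcher processes.

```python
import multiprocessing as mp
import subprocess
import sys


def start_clean_launcher():
    """Create a one-worker pool whose worker is forked from a fork server.

    Hypothetical helper: call it before mxnet is imported, so that neither
    the fork server nor the worker has executed any MKL/OpenMP code.
    """
    ctx = mp.get_context("forkserver")
    return ctx.Pool(processes=1)


def run_in_clean_process(launcher, cmd):
    """Ask the clean worker to run cmd and return its exit code."""
    # subprocess.call executes inside the worker, so the fork for cmd
    # happens there and not in the calling (training) process.
    return launcher.apply(subprocess.call, (cmd,))


def main():
    launcher = start_clean_launcher()
    # import mxnet as mx  # deferred on purpose: import only after the launcher exists
    # ... the training loop would go here ...
    code = run_in_clean_process(launcher, [sys.executable, "-c", "print('TESTING')"])
    launcher.close()
    launcher.join()
    return code


if __name__ == "__main__":
    main()
```

The design point is that the training process never calls fork itself after the heavy import; it only sends a request to a worker that was created earlier.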

@tdomhan, we can reproduce the issue on macOS; there is no problem on Linux.

As you mentioned, the issue happens when fork is combined with conda's MKL-enabled numpy, while the regular pip version of numpy is OK. If we use os.system() to execute the cmd, it's also fine.

We are still debugging, but it looks like the issue depends on the combination of platform and software. I will contact the MKL team for feedback.

Hi folks,
If you're using mxnet with MKL, can you please set the environment variable MKL_VERBOSE=1 and share the output?

Best regards,
Alexander

Hi folks,

@tdomhan, can you please run the application with MKL_VERBOSE=1? This will help us determine the MKL version and threading layer.
I guess your issue may be related to Intel OpenMP + fork(); see https://github.com/numpy/numpy/issues/10060.
So, you can also try the workaround: set KMP_INIT_AT_FORK to false.

Please, let me know what you find out!

Best regards,
Maria

Hi @mzhukova, thanks for the workaround with KMP_INIT_AT_FORK=false! This seems to fix the hanging issue for me.
Here's the version information when running with MKL_VERBOSE=1:

MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180710 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, OSX 2.30GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x7f9cc9432f80,1,0x7f9cc9432f80,1) 8.17ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:2
MKL_VERBOSE Intel(R) MKL 2018.0 Update 3 Product build 20180406 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, OSX 2.30GHz lp64 intel_thread
MKL_VERBOSE SGEMM(T,N,5,32,16,0x700005cf0738,0x7f9ccb85ddc0,16,0x7f9cca4e4c00,16,0x700005cf0740,0x7f9ccb9d8a40,5) 175.07us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:2
MKL_VERBOSE SAXPY(5,0x700005cf0738,0x7f9ccb85df00,1,0x7f9ccb9d8a40,1) 10.67us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:2
[...followed by an endless stream of lines similar to the last one above...]
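The output above contains two header lines that report different MKL builds. As a rough sketch, the header lines can be pulled out of a long MKL_VERBOSE log with a small parser. The regular expression is an assumption based only on the two header lines shown in this thread, and the function name mkl_runtimes is hypothetical.

```python
import re

# Pattern inferred from the two header lines in this thread (an assumption,
# not a documented format). The last two tokens are taken to be the
# interface (lp64) and the threading layer (intel_thread).
HEADER = re.compile(
    r"^MKL_VERBOSE Intel\(R\) MKL (?P<version>.+?) Product build (?P<build>\d+) "
    r"for .* (?P<interface>\S+) (?P<threading>\S+)$"
)


def mkl_runtimes(lines):
    """Return the distinct (version, build, threading layer) triples found in a log."""
    seen = []
    for line in lines:
        match = HEADER.match(line.strip())
        if match is None:
            continue  # per-call lines such as SDOT(...) or SGEMM(...) are skipped
        info = (match.group("version"), match.group("build"), match.group("threading"))
        if info not in seen:
            seen.append(info)
    return seen


sample = [
    "MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180710 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, OSX 2.30GHz lp64 intel_thread",
    "MKL_VERBOSE SDOT(2,0x7f9cc9432f80,1,0x7f9cc9432f80,1) 8.17ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:2",
    "MKL_VERBOSE Intel(R) MKL 2018.0 Update 3 Product build 20180406 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, OSX 2.30GHz lp64 intel_thread",
]

for version, build, threading_layer in mkl_runtimes(sample):
    print(version, build, threading_layer)
```

Both header lines in the sample report intel_thread as the threading layer, which is the piece of information requested earlier in the thread.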

Hi @fhieber ,
So, MKL indeed uses OpenMP threading in this case, which is the root cause of the hang that you observe.
numpy can be forced to use sequential or TBB threading via the corresponding MKL_THREADING_LAYER setting. However, as mxnet uses libmklml, which supports only intel_thread, the best option here is the KMP_INIT_AT_FORK=false workaround.
You can also check the version of intel-openmp and maybe try updating it, as this issue may already be fixed in one of the latest releases.

Best regards,
Maria
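For scripts that cannot rely on the shell environment, the workaround can presumably also be applied from Python. The sketch below assumes that the variable has to be in place before the Intel OpenMP runtime is loaded, so it is set before mxnet or numpy is imported; the helper name apply_fork_workaround is hypothetical, and the mxnet import is shown only as a comment.

```python
import os


def apply_fork_workaround(environ=os.environ):
    """Set KMP_INIT_AT_FORK=false unless the caller already chose a value.

    Hypothetical helper. It returns the value that ends up in the environment.
    """
    environ.setdefault("KMP_INIT_AT_FORK", "false")
    return environ["KMP_INIT_AT_FORK"]


# Apply the setting first, then do the heavy imports.
apply_fork_workaround()
# import mxnet as mx  # assumption: must come after the variable is set
```

Using setdefault keeps any value exported in the shell, so KMP_INIT_AT_FORK can still be overridden from outside when comparing behavior with and without the workaround.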

Really appreciate the help, @mzhukova @akalinki

Closing this issue for now. Please feel free to reopen it if you run into more problems.
