Incubator-mxnet: importing mxnet causing subprocess to crash

Created on 14 Jan 2019 · 23 comments · Source: apache/incubator-mxnet

may be related to #13831

Description

importing mxnet causes OSErrors in subprocesses

Environment info (Required)

Scientific Linux 7.5
Python 3.6.3
MXNet 1.5.0 (from packages)
(tried on multiple computers running different CUDA builds)

Error Message:

See the traceback under the minimal reproducible example below.

Minimum reproducible example

Using the following script (or just running the equivalent commands):

import mxnet
import subprocess

# Repeatedly spawn a trivial subprocess; with mxnet imported, one of
# these calls eventually fails with OSError: [Errno 14] Bad address.
n = 0
while True:
    if not n % 1000:
        print("RUN", n)
    ret = subprocess.call(['ls', '/tmp'], stdout=subprocess.PIPE)
    n += 1

will eventually give this error message:

Traceback (most recent call last):
  File "subcrash.py", line 13, in <module>
    ret = subprocess.call(['ls', '/'], stdout=subprocess.PIPE)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py", line 267, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 14] Bad address: 'ls'

Doesn't seem to matter which executable.
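
For context, errno 14 on Linux is EFAULT ("Bad address"), i.e. a bad pointer handed to a system call during the fork/exec, not a problem with the executable itself. A quick way to confirm the mapping (illustration only, not part of the repro):

import errno
import os

# On Linux, errno 14 is EFAULT ("Bad address").
print(errno.EFAULT)               # 14
print(os.strerror(errno.EFAULT))  # Bad address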

What have you tried to solve it?

Don't even know where to start.
If you try putting in a stack trace or pdb, it won't break.

Labels: Build, MKL

All 23 comments

@dabraude Thank you for submitting the issue! I'm labeling it so the MXNet community members can help resolve it.
@mxnet-label-bot add [Build]

I tried running the script you provided locally on my Mac (so essentially a non-CUDA build).
I did not face any crashes. My MXNet version is 1.5.0b20190112.

Can you try this on your machine and then run the script?
Run: pip install -U mxnet --pre

I'm trying to see if it's something to do with the CUDA builds of MXNet.
Also, what's the CUDA version that you tried it on?

OK, we will try running that script and get back to you.

We are running CUDA 10

It still crashes with the --pre build.

We have found that it only happens with the MKL version:
mxnet-cu100mkl-1.5.0b20190115 - crashing (Intel or AMD CPU)
mxnet-cu100-1.5.0b20190115 - stable (up to ~4 million calls)
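
(As an aside, one way to confirm which build variant is actually loaded is MXNet's runtime feature detection; a sketch, assuming the nightly being tested is new enough to ship mxnet.runtime:)

import mxnet
from mxnet import runtime

# Query the compile-time features of the loaded libmxnet.
features = runtime.Features()
print(features.is_enabled('MKLDNN'))  # True on the *mkl wheels
print(features.is_enabled('CUDA'))    # True on the cu100* wheels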

@mxnet-label-bot update [Build, MKL]

@azai91 @mseth10 Can we have a look at this crash?

I'd like to note that the website CI pipeline has been intermittently failing with subprocess errors ever since the MKLDNN merge. This is when it started:
http://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/website/job/master/141/
It's really important that we have the website check in CI, but right now it is turned off because of the failures.

@dabraude @aaronmarkham thanks for reporting the issues.

We will take a look at the potential issue @ZhennanQin @TaoLv

@aaronmarkham I guess by "the MKLDNN merge" you mean #13681, right? But I tried the script @dabraude shared in this issue, and it also crashes with mxnet-mkl==1.3.1.

@dabraude said this issue only happens with the MKL build. But the website build is not using MKL or MKL-DNN. So I'm afraid they are not the same issue.

BTW, @dabraude have you ever tried it with Python 2?

@dabraude Please try export KMP_INIT_AT_FORK=false before running your script. Let me know if it works for you. Thank you.
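
If it's easier to test, the variable can also be set from inside Python, as long as that happens before import mxnet loads libiomp5.so. A minimal sketch (assuming the variable is read when the library is loaded):

import os

# Must be set before `import mxnet`, since libiomp5.so reads the
# environment when it is loaded (assumption).
os.environ['KMP_INIT_AT_FORK'] = 'false'

import mxnet
import subprocess

# Same repro loop as above; with the variable set, the OSError
# should no longer appear.
for i in range(100000):
    subprocess.call(['ls', '/tmp'], stdout=subprocess.PIPE)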

I can confirm it happens with Python 2, and that
export KMP_INIT_AT_FORK=false seems to stop it, but with intermittent errors I can't be 100% sure it did

Researching the build logs from the first crash... I see that mkldnn is set to 0 in some of the earlier build routines, but when making the docs, the mkldnn source files are still being compiled:

+ make docs SPHINXOPTS=-W
/work/mxnet /work/mxnet
make -C docs html
make[1]: Entering directory '/work/mxnet/docs'
export BUILD_VER=
Env var set for BUILD_VER: 
sphinx-build -b html -d _build/doctrees  -W . _build/html
Running Sphinx v1.5.6
making output directory...
Building version default
Document sets to generate:
scala_docs    : 1
java_docs     : 1
clojure_docs  : 1
doxygen_docs  : 1
r_docs        : 0
Building MXNet!
Building Doxygen!
Building Scala!
Building Scala Docs!
Building Java Docs!
Building Clojure Docs!
loading pickled environment... not yet created
make[2]: Entering directory '/work/mxnet'
g++ -std=c++11 -c -DMSHADOW_FORCE_STREAM -Wall -Wsign-compare -g -O0 -I/work/mxnet/3rdparty/mshadow/ -I/work/mxnet/3rdparty/dmlc-core/include -fPIC -I/work/mxnet/3rdparty/tvm/nnvm/include -I/work/mxnet/3rdparty/dlpack/include -I/work/mxnet/3rdparty/tvm/include -Iinclude -funroll-loops -Wno-unused-parameter -Wno-unknown-pragmas -Wno-unused-local-typedefs -msse3 -mf16c -DMSHADOW_USE_CUDA=0 -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_MKL=0 -I/include -DMSHADOW_RABIT_PS=0 -DMSHADOW_DIST_PS=0 -DMSHADOW_USE_PASCAL=0 -DMXNET_USE_OPENCV=1 -I/usr/include/opencv -fopenmp -DMXNET_USE_OPERATOR_TUNING=1 -DMXNET_USE_LAPACK  -DMXNET_USE_NCCL=0 -DMXNET_USE_LIBJPEG_TURBO=0 -MMD -c \
src/operator/subgraph/mkldnn/mkldnn_conv_property.cc -o build/src/operator/subgraph/mkldnn/mkldnn_conv_property.o

Then further down I see more...

ar crv lib/libmxnet.a build/src/operator/subgraph/mkldnn/mkldnn_conv_property.o 
build/src/operator/subgraph/mkldnn/mkldnn_conv_post_quantize_property.o build/src/operator/subgraph/mkldnn/mkldnn_conv.o build/src/operator/nn/mkldnn/mkldnn_convolution.o build/src/operator/nn/mkldnn/mkldnn_concat.o build/src/operator/nn/mkldnn/mkldnn_base.o build/src/operator/nn/mkldnn/mkldnn_act.o build/src/operator/nn/mkldnn/mkldnn_softmax.o build/src/operator/nn/mkldnn/mkldnn_deconvolution.o build/src/operator/nn/mkldnn/mkldnn_copy.o 
...
a - build/src/operator/subgraph/mkldnn/mkldnn_conv_property.o
a - build/src/operator/subgraph/mkldnn/mkldnn_conv_post_quantize_property.o
a - build/src/operator/subgraph/mkldnn/mkldnn_conv.o
a - build/src/operator/nn/mkldnn/mkldnn_convolution.o
a - build/src/operator/nn/mkldnn/mkldnn_concat.o
...

So would this help reveal why the docs build is experiencing the same kind of crashing?

@dabraude Can you confirm if the issue is still there with the environment variable?

@aaronmarkham Seeing the mkldnn files in the build log is expected even with USE_MKLDNN=0, because USE_MKLDNN is only used as a C macro inside those files, not for source-file selection in the Makefile. In other words, USE_MKLDNN doesn't change which source files are collected for the build; it changes what the compiler sees inside the mkldnn files.

@TaoLv It didn't crash when running overnight, so I assume it is working.

Hi all,
this bug keeps biting us. It is easily reproducible, in the sense that it occurs randomly but with pretty high frequency (always within a few hundred attempts), though it is non-deterministic.
The code I'm using (essentially the same as above):

import mxnet
import subprocess

for i in range(1000):
    if not i % 100:
        print(i)
    try:
        ret = subprocess.call(["ls", "/tmp"], stdout=subprocess.PIPE)
    except Exception as e:
        # Typically OSError: [Errno 14] Bad address: 'ls'
        print(i, e)
        exit()

and you always get a nice
OSError: [Errno 14] Bad address: 'ls'

I managed to isolate some requirements to recreate a conda environment where this issue occurs.
This is obtained with conda + pip as follows (I have conda 4.7.12):

conda create -n mxnet-test --file env_conda.txt
conda activate mxnet-test
# make sure we're using the pip in the env
echo $(which pip)
pip install -r env_pip.txt

where the content of the files is:

env_conda.txt:

mkl=2019.0=118
numpy=1.16.4=py36h99e49ec_0

env_pip.txt:

mxnet-cu80mkl==1.5.0

and this is the output of conda list -e:

# platform: linux-64
_libgcc_mutex=0.1=main
blas=1.0=openblas
ca-certificates=2019.5.15=1
certifi=2019.9.11=py36_0
chardet=3.0.4=pypi_0
idna=2.8=pypi_0
intel-openmp=2019.4=243
libedit=3.1.20181209=hc058e9b_0
libffi=3.2.1=hd88cf55_4
libgcc-ng=9.1.0=hdf63c60_0
libgfortran-ng=7.3.0=hdf63c60_0
libopenblas=0.3.6=h5a2b251_1
libstdcxx-ng=9.1.0=hdf63c60_0
mkl=2019.0=118
mxnet-cu80mkl=1.5.0=pypi_0
ncurses=6.1=he6710b0_1
numpy=1.16.4=py36h99e49ec_0
numpy-base=1.16.4=py36h2f8d375_0
openssl=1.1.1d=h7b6447c_1
pip=19.2.2=py36_0
python=3.6.9=h265db76_0
python-graphviz=0.8.4=pypi_0
readline=7.0=h7b6447c_5
requests=2.22.0=pypi_0
setuptools=41.0.1=py36_0
sqlite=3.29.0=h7b6447c_0
tk=8.6.8=hbc83047_0
urllib3=1.25.5=pypi_0
wheel=0.33.4=py36_0
xz=5.2.4=h14c3975_4
zlib=1.2.11=h7b6447c_3

I was able to reproduce this issue both with and without GPU (mxnet-mkl), on Ubuntu 16.04, also inside Docker containers.

Note: this is non-deterministic also at "build time", in the sense that, creating environments with _exactly the same requirements and exactly the same installed libraries_, you can randomly end up with an environment where the issue does not occur.

This does seem related to (or really, the same thing as) https://github.com/numpy/numpy/issues/10060 and https://github.com/apache/incubator-mxnet/issues/12710,
and setting KMP_INIT_AT_FORK=FALSE as suggested seems to fix the issue.
I'm not sure whether the libraries that use MKL, such as mxnet, could do something to warn about this behavior and how to prevent it.
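
As a rough sketch of what such a warning could look like (a hypothetical check, not actual mxnet code), a library that bundles libiomp5 could inspect the variable at import time:

import os
import warnings

def _warn_about_kmp_fork_bug():
    # Hypothetical helper: warn about the known libiomp5 fork issue
    # when the documented workaround is not in place.
    if os.environ.get('KMP_INIT_AT_FORK', '').lower() != 'false':
        warnings.warn(
            "This MKL build links libiomp5, which can make forked "
            "subprocesses fail with OSError (errno 14). Consider "
            "setting KMP_INIT_AT_FORK=false before importing.",
            RuntimeWarning)

_warn_about_kmp_fork_bug()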

@sbebo I consider this issue a defect of the OpenMP library, and the library will be excluded in the next minor release.

@szha @eric-haibin-lin this bug report describes the cause of the OSErrors that happen from time to time on ci.mxnet.io (the CI used by gluon-nlp.mxnet.io, gluon-cv.mxnet.io, ...)

Note that we didn't face the OSError related crashes anymore after upgrading to Ubuntu 18.04 (more specifically, using the following Docker container https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/docker/Dockerfile )

Hi @leezu, is it possible for you to try export KMP_INIT_AT_FORK=false in your CI environment?

> Note that we didn't face the OSError related crashes anymore after upgrading to Ubuntu 18.04 (more specifically, using the following Docker container https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/docker/Dockerfile )

Oh, will try to move to Ubuntu 18.04 if possible. Thanks!

@TaoLv the issue occurred only rarely (a few times a month) for us and has not occurred during recent months. What would be the expectation of setting export KMP_INIT_AT_FORK=false? Should it fix the issue, or are you asking to confirm whether setting the env variable reintroduces the problem?

The env variable fixed the problem reported in this issue. So if GluonNLP CI is facing the same issue, I think it can be fixed by this env variable too.

@leezu, the same issue and same fix as #14979. I'm closing this issue as:

  1. libiomp5.so has been removed from the pip releases of MXNet;
  2. The problem should have been fixed in a newer version of libiomp5.so.

Feel free to reopen if you have any questions. Thanks!
