Incubator-mxnet: [BUG] Using a package with MKL and GPU versions, using python to open a new process will cause an error

Created on 17 May 2019 · 51 comments · Source: apache/incubator-mxnet

Hardware and version information:

----------Python Info----------
Version : 3.6.8
Compiler : GCC 7.3.0
Build : ('default', 'Dec 30 2018 01:22:34')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 19.1.1
Directory : /home/bird/miniconda3/envs/test/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version : 1.4.1
Directory : /home/bird/miniconda3/envs/test/lib/python3.6/site-packages/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Linux-4.15.0-50-generic-x86_64-with-debian-buster-sid
system : Linux
node : ctmp
release : 4.15.0-50-generic
version : #54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Stepping: 3
CPU MHz: 800.218
CPU max MHz: 4000.0000
CPU min MHz: 800.0000
BogoMIPS: 6816.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d

Python package version

Package        Version 
-------------- --------
certifi        2019.3.9
chardet        3.0.4   
gluonnlp       0.6.0   
graphviz       0.8.4   
idna           2.8     
mxnet-cu100mkl 1.4.1   
numpy          1.14.6  
pip            19.1.1  
requests       2.22.0  
setuptools     41.0.1  
urllib3        1.25.2  
wheel          0.33.4

With the GPU package built with MKL, creating a new process in Python and then loading data with multiple worker processes causes an error.

from multiprocessing import Process
import gluonnlp as nlp
import numpy as np
from gluonnlp.data import SQuAD
from mxnet import nd,gluon
import mxnet as mx
from mxnet.gluon import nn

class Transform(object):
    def __init__(self):
        pass

    def __call__(self, record_index, question_id, question, context, answer_list,
                 answer_start_list):
        return np.ones((100,1)),np.ones((100,3))

def train():
    train_data = SQuAD('train')
    dataloader = gluon.data.DataLoader(train_data.transform(Transform()),batch_size=128, shuffle=True, num_workers=4)
    net = nn.HybridSequential()
    net.add(nn.Dense(10))
    net.initialize(mx.init.Xavier(), ctx=mx.gpu(0))
    print(net)

p = Process(target=train)
p.start()
p.join()
Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0) /home/bird/miniconda3/envs/test/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3f935a) [0x7ff39d25735a]
[bt] (1) /home/bird/miniconda3/envs/test/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3513b36) [0x7ff3a0371b36]
[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7ff3e124ff20]
[bt] (3) /home/bird/miniconda3/envs/test/lib/python3.6/site-packages/mxnet/libiomp5.so(+0xa9ea5) [0x7ff3dce09ea5]
[bt] (4) /home/bird/miniconda3/envs/test/lib/python3.6/site-packages/mxnet/libiomp5.so(+0xa9ba4) [0x7ff3dce09ba4]
[bt] (5) /home/bird/miniconda3/envs/test/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2da4d13) [0x7ff39fc02d13]
[bt] (6) /home/bird/miniconda3/envs/test/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2db56c8) [0x7ff39fc136c8]
[bt] (7) /home/bird/miniconda3/envs/test/lib/python3.6/site-packages/mxnet/libmxnet.so(std::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1> > mxnet::common::LazyAllocArray<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1> >::Get<mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2}>(int, mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2})+0x251) [0x7ff39fc18501]
[bt] (8) /home/bird/miniconda3/envs/test/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2dbd359) [0x7ff39fc1b359]
[bt] (9) /home/bird/miniconda3/envs/test/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2da9428) [0x7ff39fc07428]

If you change the MXNet package to mxnet-cu100 1.4.1, there is no error.
Similarly, mxnet-cu100mkl 1.5.0b20190516 fails while mxnet-cu100 1.5.0b20190516 does not.

In addition, none of the following three cases produces the error:
Using the CPU
Removing the num_workers parameter
Not creating a new process

Labels: Bug, MKL, MKLDNN

All 51 comments

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Installation

@fierceX I'm not sure and don't know why but can you try the magic below? ;)

export KMP_INIT_AT_FORK=false
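If setting the variable in the shell is inconvenient, a minimal Python-level sketch is to set it before the first MXNet import (assumption: libiomp5.so reads the variable when it is loaded, so it must be set before importing mxnet):

import os

# Assumption: KMP_INIT_AT_FORK is read when libiomp5.so is loaded, so this
# must run before the first `import mxnet`.
os.environ["KMP_INIT_AT_FORK"] = "false"

import mxnet as mx  # imported only after the environment is prepared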

@TaoLv Yes, with that set there is no error, but this still looks like a bug. What is the root cause?

@mxnet-label-bot add [question, MKL]

Looks like an OpenMP-related problem. Since the stack trace includes libc, I suspect we are re-entering MXNet in pthread_atfork handlers due to the Python multiprocessing interaction. Since you are using multiprocessing, this could be handled above the Python level to avoid the situation (see the sketch below).
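For illustration, a minimal sketch of handling it above the Python level is to use the "spawn" start method, so the child runs a fresh interpreter instead of inheriting a forked copy of libmxnet (whether this fits the original training script is an assumption):

from multiprocessing import get_context

def train():
    # Import MXNet inside the child so the library initializes from scratch
    # in the spawned interpreter rather than inheriting forked state.
    import mxnet as mx
    from mxnet.gluon import nn
    net = nn.Dense(10)
    net.initialize(mx.init.Xavier())
    print(net)

if __name__ == "__main__":
    ctx = get_context("spawn")   # start method that avoids fork()
    p = ctx.Process(target=train)
    p.start()
    p.join()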

I would suggest reproducing with debug symbols, as the stack trace does not include the function names.

ping

I tried with the GPU version; there is also no crash in debug mode.

In [2]: mx.runtime.Features()                                                      
Out[2]: [✔ CUDA, ✔ CUDNN, ✖ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✔ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✔ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✖ DIST_KVSTORE, ✖ CXX14, ✖ INT64_TENSOR_SIZE, ✔ SIGNAL_HANDLER, ✔ DEBUG, ✖ TVM_OP]

Revision 9d7fc7cbee09de2694022995d0601cb4316e4988

I could reproduce with binary distribution cu101mkl

(py3_env_pip) piotr@ip-172-31-21-159:0: ~> python test.py
Downloading /home/piotr/.mxnet/datasets/squad/train-v1.1.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/squad/train-v1.1.zip...

Segmentation fault: 11

Stack trace:
  [bt] (0) /home/piotr/py3_env_pip/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2f9cf20) [0x7fd2107a7f20]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7fd25813cf20]
  [bt] (2) /home/piotr/py3_env_pip/lib/python3.6/site-packages/mxnet/libiomp5.so(+0xac19c) [0x7fd253c7119c]
  [bt] (3) /home/piotr/py3_env_pip/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2648ea3) [0x7fd20fe53ea3]
  [bt] (4) /home/piotr/py3_env_pip/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x265c798) [0x7fd20fe67798]
  [bt] (5) /home/piotr/py3_env_pip/lib/python3.6/site-packages/mxnet/libmxnet.so(std::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1> > mxnet::common::LazyAllocArray<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1> >::Get<mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2}>(int, mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2})+0x241) [0x7fd20fe6d741]
  [bt] (6) /home/piotr/py3_env_pip/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x26650c4) [0x7fd20fe700c4]
  [bt] (7) /home/piotr/py3_env_pip/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x264dc6e) [0x7fd20fe58c6e]
  [bt] (8) /home/piotr/py3_env_pip/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::CopyFromTo(mxnet::NDArray const&, mxnet::NDArray const&, int, bool)+0xa39) [0x7fd2100732c9]

Reproduced with a release CMake build
Also reproduced with a release Make build.

(py3_venv) piotr@ip-172-31-21-159:0: ~/mxnet [master]> python ~/test.py 

Segmentation fault: 11

Stack trace:
  [bt] (0) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(+0x345d9d9) [0x7fc3e00189d9]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7fc4055e0f20]
  [bt] (2) /home/piotr/mxnet/build/3rdparty/openmp/runtime/src/libomp.so(+0x34250) [0x7fc3b863c250]
  [bt] (3) /home/piotr/mxnet/build/3rdparty/openmp/runtime/src/libomp.so(+0x34d3e) [0x7fc3b863cd3e]
  [bt] (4) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::OpenMP::set_reserve_cores(int)+0x6d) [0x7fc3dff68d5d]
  [bt] (5) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2}::operator()() const+0x4f) [0x7fc3dff79c0f]
  [bt] (6) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(std::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1> > mxnet::common::LazyAllocArray<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1> >::Get<mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2}>(int, mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2})+0x414) [0x7fc3dff7b0f4]
  [bt] (7) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x481) [0x7fc3dff7c871]
  [bt] (8) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x1a8) [0x7fc3dff6d358]

Flags:

USE_CUDA: "ON" # Build with CUDA support
USE_OLDCMAKECUDA: "OFF" # Build with old cmake cuda
USE_NCCL: "OFF" # Use NVidia NCCL with CUDA
USE_OPENCV: "ON" # Build with OpenCV support
USE_OPENMP: "ON" # Build with Openmp support
USE_CUDNN: "ON" # Build with cudnn support) # one could set CUDNN_ROOT for search path
USE_SSE: "ON" # Build with x86 SSE instruction support IF NOT ARM
USE_F16C: "ON" # Build with x86 F16C instruction support) # autodetects support if "ON"
USE_LAPACK: "ON" # Build with lapack support
USE_MKL_IF_AVAILABLE: "ON" # Use MKL if found
USE_MKLML_MKL: "ON" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
USE_MKLDNN: "ON" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
USE_OPERATOR_TUNING: "ON" # Enable auto-tuning of operators IF NOT MSVC
USE_GPERFTOOLS: "ON" # Build with GPerfTools support (if found)
USE_JEMALLOC: "ON" # Build with Jemalloc support
USE_PROFILER: "ON" # Build with Profiler support
USE_DIST_KVSTORE: "OFF" # Build with DIST_KVSTORE support
USE_PLUGINS_WARPCTC: "OFF" # Use WARPCTC Plugins
USE_PLUGIN_CAFFE: "OFF" # Use Caffe Plugin
USE_CPP_PACKAGE: "OFF" # Build C++ Package
USE_MXNET_LIB_NAMING: "ON" # Use MXNet library naming conventions.
USE_GPROF: "OFF" # Compile with gprof (profiling) flag
USE_CXX14_IF_AVAILABLE: "OFF" # Build with C++14 if the compiler supports it
USE_VTUNE: "OFF" # Enable use of Intel Amplifier XE (VTune)) # one could set VTUNE_ROOT for search path
ENABLE_CUDA_RTC: "ON" # Build with CUDA runtime compilation support
BUILD_CPP_EXAMPLES: "ON" # Build cpp examples
INSTALL_EXAMPLES: "OFF" # Install the example source files.
USE_SIGNAL_HANDLER: "ON" # Print stack traces on segfaults.
USE_TENSORRT: "OFF" # Enable infeference optimization with TensorRT.
USE_ASAN: "OFF" # Enable Clang/GCC ASAN sanitizers.
ENABLE_TESTCOVERAGE: "OFF" # Enable compilation with test coverage metric output
CMAKE_BUILD_TYPE: "Release"
CMAKE_CUDA_COMPILER_LAUNCHER: "ccache"
CMAKE_C_COMPILER_LAUNCHER: "ccache"
CMAKE_CXX_COMPILER_LAUNCHER: "ccache"

Can't reproduce with Debug builds.

RelWithDebSymbols:

Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
*** invalid %N$ use detected ***

No crash.

@fierceX forking an initial process is not supported in MXNet. The first Process creation should not be done, as the state of the library after fork is inconsistent. The code in the train function is never executed.

With respect to the crash, after investigating this I believe it is caused by calling setenv in the pthread_atfork handler. I will refactor this code so unsafe calls to setenv are not made during forking.

Additionally, we can detect that we are in a forked state and emit additional errors in MXNet, for example during the use of DataLoader.
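As an illustration of that idea, here is a Python-level sketch of such a check (the helper names are hypothetical; the real check would live inside MXNet itself, e.g. at the start of DataLoader):

import os

# PID recorded when the module is first imported, i.e. in the parent process.
_IMPORT_PID = os.getpid()

def _check_not_forked():
    # Hypothetical helper: if the current PID differs from the import-time PID,
    # MXNet state was inherited through fork() and cannot be trusted.
    if os.getpid() != _IMPORT_PID:
        raise RuntimeError(
            "MXNet was imported before fork(); the engine state in the child "
            "is inconsistent. Initialize MXNet inside the child process instead.")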

The crash still happens when using multiprocessing.

In [6]: from multiprocessing import Process
   ...: import gluonnlp as nlp
   ...: import numpy as np
   ...: from gluonnlp.data import SQuAD
   ...: from mxnet import nd,gluon
   ...: import mxnet as mx
   ...: from mxnet.gluon import nn
   ...: 
   ...: class Transform(object):
   ...:     def __init__(self):
   ...:         pass
   ...: 
   ...:     def __call__(self, record_index, question_id, question, context, answer_list,
   ...:                  answer_start_list):
   ...:         return np.ones((100,1)),np.ones((100,3))
   ...: 
   ...: def train():
   ...:     train_data = SQuAD('train')
   ...:     dataloader = gluon.data.DataLoader(train_data.transform(Transform()),batch_size=128, shuffle=True, num_workers=4)
   ...:     net = nn.HybridSequential()
   ...:     net.add(nn.Dense(10))
   ...:     net.initialize(mx.init.Xavier(), ctx=mx.gpu(0))
   ...:     print(net)
   ...: 
   ...: p = Process(target=train)
   ...: p.start()
   ...: p.join()
Downloading /home/piotr/.mxnet/datasets/squad/train-v1.1.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/squad/train-v1.1.zip...

Segmentation fault: 11

Stack trace:
  [bt] (0) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(+0x37ecd89) [0x7f8b09ab6d89]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f8bda440f20]
  [bt] (2) /home/piotr/mxnet/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so(+0xac19c) [0x7f8ac40d919c]
  [bt] (3) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::OpenMP::set_reserve_cores(int)+0x81a) [0x7f8b09a07fda]
  [bt] (4) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2}::operator()() const+0x4f) [0x7f8b09a1659f]
  [bt] (5) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(std::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1> > mxnet::common::LazyAllocArray<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1> >::Get<mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2}>(int, mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2})+0x3c2) [0x7f8b09a17cc2]
  [bt] (6) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x481) [0x7f8b09a192e1]
  [bt] (7) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x19f) [0x7f8b09a0b50f]
  [bt] (8) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0x155) [0x7f8b09a08be5]

Managed to attach to the train process by putting a delay just after the fork:

(gdb) bt 5
#0  0x00007f5cfb7f819c in __kmp_set_num_threads () from /home/piotr/mxnet/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so
#1  0x00007f5c4c8e7d8a in mxnet::engine::OpenMP::set_reserve_cores (this=this@entry=0x7f5c5c129b20 <mxnet::engine::OpenMP::Get()::openMP>, cores=1) at ../src/engine/openmp.cc:77
#2  0x00007f5c4c8f62df in mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2}::operator()() const (__closure=__closure@entry=0x7ffcec860730)
at ../src/engine/threaded_engine_perdevice.cc:140
#3  0x00007f5c4c8f7a02 in mxnet::common::LazyAllocArray<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1> >::Get<mxnet::engine::ThreadedEnginePerDevice
::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2}>(int, mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2}) (this=this@entry=0x4
8e0148, index=<optimized out>, creator=...) at ../src/engine/../common/lazy_alloc_array.h:113
#4  0x00007f5c4c8f9021 in mxnet::engine::ThreadedEnginePerDevice::PushToExecute (this=0x48dfda0, opr_block=<optimized out>, pusher_thread=<optimized out>) at ../src/engine/threaded_engine_p
erdevice.cc:149
(More stack frames follow...)
(gdb) 


@pengzhao-intel any ideas from the libiomp5.so side? There are no debug symbols available, right?

@larroy thanks for the heads up. In the next release, mklml will be removed, so the binary situation will be cleaner :)

@TaoLv @xinyu-intel could you try to reproduce this locally?

@larroy Could you please share the detailed steps to reproduce? Which version of mxnet should I use?

Found this related info:

https://stackoverflow.com/questions/25986091/telling-gcc-to-not-link-libgomp-so-it-links-libiomp5-instead

Added a sleep to be able to attach gdb to the "train pid":

from multiprocessing import Process
import gluonnlp as nlp
import numpy as np
from gluonnlp.data import SQuAD
from mxnet import nd,gluon
import mxnet as mx
from mxnet.gluon import nn
import os
import time

class Transform(object):
    def __init__(self):
        pass

    def __call__(self, record_index, question_id, question, context, answer_list,
                 answer_start_list):
        return np.ones((100,1)),np.ones((100,3))

def train():
    print("train pid: {}".format(os.getpid()))
    print("10 9...")
    time.sleep(10)
    print("go")
    train_data = SQuAD('train')
    dataloader = gluon.data.DataLoader(train_data.transform(Transform()),batch_size=128, shuffle=True, num_workers=4)
    net = nn.HybridSequential()
    net.add(nn.Dense(10))
    net.initialize(mx.init.Xavier(), ctx=mx.gpu(0))
    print(net)

print("parent pid: {}".format(os.getpid()))
p = Process(target=train)
p.start()
p.join()


Used this branch to make sure only Intel OMP is used:

https://github.com/larroy/mxnet/tree/omp_chooser

piotr@ip-172-31-22-252:0: ~/mxnet [omp_chooser]> ldd build/libmxnet.so | grep omp
libiomp5.so => /home/piotr/mxnet/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so (0x00007f3507f63000)

Build config:

piotr@ip-172-31-22-252:0: ~/mxnet [omp_chooser]> cat cmake_options.yml
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

--- # CMake configuration
USE_CUDA: "ON" # Build with CUDA support
USE_OLDCMAKECUDA: "OFF" # Build with old cmake cuda
USE_NCCL: "OFF" # Use NVidia NCCL with CUDA
USE_OPENCV: "ON" # Build with OpenCV support
USE_OPENMP: "ON" # Build with Openmp support
USE_CUDNN: "ON" # Build with cudnn support) # one could set CUDNN_ROOT for search path
USE_SSE: "ON" # Build with x86 SSE instruction support IF NOT ARM
USE_F16C: "ON" # Build with x86 F16C instruction support) # autodetects support if "ON"
USE_LAPACK: "ON" # Build with lapack support
USE_MKL_IF_AVAILABLE: "ON" # Use MKL if found
USE_MKLML_MKL: "ON" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
USE_MKLDNN: "ON" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
USE_OPERATOR_TUNING: "ON" # Enable auto-tuning of operators IF NOT MSVC
USE_GPERFTOOLS: "ON" # Build with GPerfTools support (if found)
USE_JEMALLOC: "ON" # Build with Jemalloc support
USE_PROFILER: "ON" # Build with Profiler support
USE_DIST_KVSTORE: "OFF" # Build with DIST_KVSTORE support
USE_PLUGINS_WARPCTC: "OFF" # Use WARPCTC Plugins
USE_PLUGIN_CAFFE: "OFF" # Use Caffe Plugin
USE_CPP_PACKAGE: "OFF" # Build C++ Package
USE_MXNET_LIB_NAMING: "ON" # Use MXNet library naming conventions.
USE_GPROF: "OFF" # Compile with gprof (profiling) flag
USE_CXX14_IF_AVAILABLE: "OFF" # Build with C++14 if the compiler supports it
USE_VTUNE: "OFF" # Enable use of Intel Amplifier XE (VTune)) # one could set VTUNE_ROOT for search path
ENABLE_CUDA_RTC: "ON" # Build with CUDA runtime compilation support
BUILD_CPP_EXAMPLES: "ON" # Build cpp examples
INSTALL_EXAMPLES: "OFF" # Install the example source files.
USE_SIGNAL_HANDLER: "ON" # Print stack traces on segfaults.
USE_TENSORRT: "OFF" # Enable infeference optimization with TensorRT.
USE_ASAN: "OFF" # Enable Clang/GCC ASAN sanitizers.
ENABLE_TESTCOVERAGE: "OFF" # Enable compilation with test coverage metric output
CMAKE_BUILD_TYPE: "RelWithDebInfo"
CMAKE_CUDA_COMPILER_LAUNCHER: "ccache"
CMAKE_C_COMPILER_LAUNCHER: "ccache"
CMAKE_CXX_COMPILER_LAUNCHER: "ccache"

used ./dev_menu.py build

source py3_venv/bin/activate.fish
pip install gluonnlp

(py3_venv) piotr@ip-172-31-22-252:1: ~/mxnet [omp_chooser]> python test.py
parent pid: 31483
train pid: 31660
10 9...






go
pid: 31702
pid: 31711
pid: 31702
pid: 31720

Segmentation fault: 11

Process id: 31660
Stack trace:
  [bt] (0) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(+0x37efa99) [0x7f5c4c996a99]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f5d010b1f20]
  [bt] (2) /home/piotr/mxnet/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so(+0xac19c) [0x7f5cfb7f819c]
  [bt] (3) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::OpenMP::set_reserve_cores(int)+0x81a) [0x7f5c4c8e7d8a]
  [bt] (4) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2}::operator()() const+0x4f) [0x7f5c4c8f62df]
  [bt] (5) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(std::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1> > mxnet::common::LazyAllocArray<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1> >::Get<mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2}>(int, mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2})+0x3c2) [0x7f5c4c8f7a02]
  [bt] (6) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x481) [0x7f5c4c8f9021]
  [bt] (7) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x19f) [0x7f5c4c8eb2bf]
  [bt] (8) /home/piotr/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0x155) [0x7f5c4c8e8995]
cgdb /home/piotr/mxnet/py3_venv/bin/python
attach <PID printed above before sleep continues>

Added this patch to the data loader, but it is not needed:

diff --git a/3rdparty/dmlc-core b/3rdparty/dmlc-core
--- a/3rdparty/dmlc-core
+++ b/3rdparty/dmlc-core
@@ -1 +1 @@
-Subproject commit f1ff6cc117f4e95169a9f62be549c8fe3e15c20f
+Subproject commit f1ff6cc117f4e95169a9f62be549c8fe3e15c20f-dirty
diff --git a/python/mxnet/gluon/data/dataloader.py b/python/mxnet/gluon/data/dataloader.py
index 4dfa94f72..c58fd22e2 100644
--- a/python/mxnet/gluon/data/dataloader.py
+++ b/python/mxnet/gluon/data/dataloader.py
@@ -462,6 +462,9 @@ class _MultiWorkerIter(object):

     def __iter__(self):
         return self
+def f(*args):
+    import os
+    print("pid: {}".format(os.getpid()))


 class DataLoader(object):
@@ -562,6 +565,7 @@ class DataLoader(object):
             else:
                 self._worker_pool = multiprocessing.Pool(
                     self._num_workers, initializer=_worker_initializer, initargs=[self._dataset])
+                self._worker_pool.map(f, range(self._num_workers))
         if batchify_fn is None:
             if num_workers > 0:
                 self._batchify_fn = default_mp_batchify_fn

I think it might be a bug in the Intel OMP library or some sort of interaction; with GNU OpenMP everything works as expected (I compiled without MKL).

(py3_venv) piotr@ip-172-31-22-252:0: ~/mxnet_other [master]> python test.py
parent pid: 42977
train pid: 43182
10 9...
go
HybridSequential(
  (0): Dense(None -> 10, linear)
)
(py3_venv) piotr@ip-172-31-22-252:0: ~/mxnet_other [master]> 

(py3_venv) piotr@ip-172-31-22-252:0: ~/mxnet_other [master]> ldd build/libmxnet.so | grep omp
        libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007ff7a8b48000)

Seems mixing omp implementations might not be safe after all.

Can a committer please reopen this? @pengzhao-intel @TaoLv

At @larroy's request, I am reopening this issue.

The MKL dependency has now been removed, so I think we can try this again.
@lebeg @TaoLv

Closing since we removed the MKL dependency.

Thanks @pengzhao-intel. I observe that the issue still happens when compiling MXNet master with llvm-openmp and MKLDNN. It is my understanding that llvm-openmp and Intel OpenMP share the same codebase, so I suppose this is expected (reference)? What do you think?

For now I'll reopen this issue to facilitate further investigation.

You can build the MXNet master version with cmake -DUSE_CUDA=1 -DUSE_MKLDNN=1 -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DBUILD_CYTHON_MODULES=1 .. and you will observe that the issue still happens.

This only affects the CMake build, because we only use the llvm openmp in the cmake build (see https://github.com/apache/incubator-mxnet/pull/8730/).
I suggest we revert https://github.com/apache/incubator-mxnet/pull/8730 to fix this issue.
@pengzhao-intel do you think that is sensible?

Segmentation fault: 11

Stack trace:
  [bt] (0) /home/ubuntu/src/mxnet-dc/python/mxnet/../../build/libmxnet.so(+0x14ab799) [0x7f4e85a66799]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f4f1324bf20]
  [bt] (2) /home/ubuntu/src/mxnet-dc/build/3rdparty/openmp/runtime/src/libomp.so(+0x34250) [0x7f4e84327250]
  [bt] (3) /home/ubuntu/src/mxnet-dc/build/3rdparty/openmp/runtime/src/libomp.so(+0x34d3e) [0x7f4e84327d3e]
  [bt] (4) /home/ubuntu/src/mxnet-dc/python/mxnet/../../build/libmxnet.so(mxnet::engine::OpenMP::set_reserve_cores(int)+0x8ca) [0x7f4e8598e2fa]
  [bt] (5) /home/ubuntu/src/mxnet-dc/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2}::operator()() const+0x4f) [0x7f4e859a05bf]
  [bt] (6) /home/ubuntu/src/mxnet-dc/python/mxnet/../../build/libmxnet.so(std::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1> > mxnet::common::LazyAllocArray<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1> >::Get<mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2}>(int, mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2})+0x414) [0x7f4e859a1b14]
  [bt] (7) /home/ubuntu/src/mxnet-dc/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x481) [0x7f4e859a3291]
  [bt] (8) /home/ubuntu/src/mxnet-dc/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x1a8) [0x7f4e859939b8]

Built just with gomp (deleting the 3rdparty/openmp folder), things work:

HybridSequential(
  (0): Dense(None -> 10, linear)
)

There are currently two hypotheses about the root cause of this error (https://github.com/apache/incubator-mxnet/issues/14979#issuecomment-525103793): a) bug in llvm / intel openmp b) interaction between gomp and llvm / intel openmp.

I did some more investigation and conclude we can rule out option b. In particular, I compile CC=clang-8 CXX=clang++-8 cmake -DUSE_CUDA=1 -DUSE_MKLDNN=1 -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DBUILD_CYTHON_MODULES=1 -DUSE_OPENCV=0 ...

We can investigate the shared library dependencies of the resulting libmxnet.so:

% readelf -Wa libmxnet.so | grep NEEDED
 0x0000000000000001 (NEEDED)             Shared library: [libnvToolsExt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libopenblas.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libjemalloc.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [liblapack.so.3]
 0x0000000000000001 (NEEDED)             Shared library: [libcublas.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libcufft.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libcusolver.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libcurand.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libnvrtc.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libcuda.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libomp.so.5]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]

Among those, libopenblas.so.0 is provided by the system and depends on libgomp.so. (If we compiled with OpenCV, OpenCV would also transitively depend on libgomp.so, so I just disable it for the purpose of this test.) We can see it shows up among the transitive shared library dependencies:

% ldd libmxnet.so
        linux-vdso.so.1 (0x00007ffd382ca000)
        libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007efdc9594000)
        libopenblas.so.0 => /usr/local/lib/libopenblas.so.0 (0x00007efdc85fb000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007efdc83f3000)
        libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x00007efdc81bd000)
        liblapack.so.3 => /usr/lib/x86_64-linux-gnu/liblapack.so.3 (0x00007efdc78fe000)
        libcublas.so.10.0 => /usr/local/cuda/lib64/libcublas.so.10.0 (0x00007efdc3368000)
        libcufft.so.10.0 => /usr/local/cuda/lib64/libcufft.so.10.0 (0x00007efdbceb4000)
        libcusolver.so.10.0 => /usr/local/cuda/lib64/libcusolver.so.10.0 (0x00007efdb47cd000)
        libcurand.so.10.0 => /usr/local/cuda/lib64/libcurand.so.10.0 (0x00007efdb0666000)
        libnvrtc.so.10.0 => /usr/local/cuda/lib64/libnvrtc.so.10.0 (0x00007efdaf04a000)
        libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007efdaded3000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007efdadccf000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007efdadab0000)
        libomp.so.5 => /usr/lib/x86_64-linux-gnu/libomp.so.5 (0x00007efe411b4000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007efdad727000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007efdad389000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007efdad171000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007efdacd80000)
        /lib64/ld-linux-x86-64.so.2 (0x00007efe410a8000)
        libgfortran.so.4 => /usr/lib/x86_64-linux-gnu/libgfortran.so.4 (0x00007efdac9a1000)
        libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007efdac772000)
        libblas.so.3 => /usr/lib/x86_64-linux-gnu/libblas.so.3 (0x00007efdac1b0000)
        libnvidia-fatbinaryloader.so.418.87.01 => /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.418.87.01 (0x00007efdabf62000)
        libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007efdabd22000)

Thus I recompile OpenBLAS with clang. Then we can investigate the transitive dependencies while replacing the system OpenBLAS with the llvm-openmp based OpenBLAS:

% LD_PRELOAD=/home/ubuntu/src/OpenBLAS/libopenblas.so ldd libmxnet.so
        linux-vdso.so.1 (0x00007ffd8eac5000)
        /home/ubuntu/src/OpenBLAS/libopenblas.so (0x00007f06ee33a000)
        libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007f06ee131000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f06edf29000)
        libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x00007f06edcf3000)
        liblapack.so.3 => /usr/lib/x86_64-linux-gnu/liblapack.so.3 (0x00007f06ed434000)
        libcublas.so.10.0 => /usr/local/cuda/lib64/libcublas.so.10.0 (0x00007f06e8e9e000)
        libcufft.so.10.0 => /usr/local/cuda/lib64/libcufft.so.10.0 (0x00007f06e29ea000)
        libcusolver.so.10.0 => /usr/local/cuda/lib64/libcusolver.so.10.0 (0x00007f06da303000)
        libcurand.so.10.0 => /usr/local/cuda/lib64/libcurand.so.10.0 (0x00007f06d619c000)
        libnvrtc.so.10.0 => /usr/local/cuda/lib64/libnvrtc.so.10.0 (0x00007f06d4b80000)
        libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007f06d3a09000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f06d3805000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f06d35e6000)
        libomp.so.5 => /usr/lib/x86_64-linux-gnu/libomp.so.5 (0x00007f0766c79000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f06d325d000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f06d2ebf000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f06d2ca7000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f06d28b6000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f0766b6d000)
        libgfortran.so.4 => /usr/lib/x86_64-linux-gnu/libgfortran.so.4 (0x00007f06d24d7000)
        libblas.so.3 => /usr/lib/x86_64-linux-gnu/libblas.so.3 (0x00007f06d1f15000)
        libnvidia-fatbinaryloader.so.418.87.01 => /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.418.87.01 (0x00007f06d1cc7000)
        libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f06d1a87000)

and you find that libmxnet.so doesn't depend on libgomp.so anymore.
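As a complementary runtime check, a small Linux-only sketch (assuming /proc/self/maps is available) can print which OpenMP runtimes are actually mapped into the Python process after importing MXNet:

import mxnet as mx  # load libmxnet.so and its transitive dependencies first

omp_libs = set()
with open("/proc/self/maps") as maps:  # Linux-specific
    for line in maps:
        fields = line.split()
        # The sixth field, when present, is the file backing the mapping.
        if len(fields) >= 6 and any(k in fields[5] for k in ("libgomp", "libomp", "libiomp")):
            omp_libs.add(fields[5])

for path in sorted(omp_libs):
    print(path)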

So let's see if the test case by @fierceX still crashes:

LD_PRELOAD=/home/ubuntu/src/OpenBLAS/libopenblas.so python3 ~/test.py

Stack trace:
  [bt] (0) /home/ubuntu/src/mxnet/python/mxnet/../../build/libmxnet.so(+0x186faeb) [0x7f653ffcfaeb]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f65cf785f20]
  [bt] (2) /usr/lib/x86_64-linux-gnu/libomp.so.5(+0x3d594) [0x7f65cd145594]
  [bt] (3) /home/ubuntu/src/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::OpenMP::set_reserve_cores(int)+0xf5) [0x7f653fed5255]
  [bt] (4) /home/ubuntu/src/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2}::operator()() const+0x42) [0x7f653fee8752]
  [bt] (5) /home/ubuntu/src/mxnet/python/mxnet/../../build/libmxnet.so(std::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1> > mxnet::common::LazyAllocArray<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1> >::Get<mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2}>(int, mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#2})+0x487) [0x7f653fee5b87]
  [bt] (6) /home/ubuntu/src/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x223) [0x7f653fee12f3]
  [bt] (7) /home/ubuntu/src/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x1dc) [0x7f653fed625c]
  [bt] (8) /home/ubuntu/src/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0x212) [0x7f653fed64d2]

As the crash remains, we can conclude this is due to a bug in libomp.so, i.e. the LLVM OpenMP runtime.

As @fierceX's use-case is common and important among the MXNet users, we can thus conclude that we must not default to llvm openmp until this issue is fixed.

On a side note, forking in a multithreaded environment is, according to the POSIX standard, largely undefined behavior (you're only allowed to call exec afterwards). So it's not really a bug in llvm-openmp (as its behavior is undefined). However, as it is an important use-case, and as it works with gomp, I suggest we just use gomp. You can also take a look at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035 for some more background.

@cjolivier01 please let me know if you see any issue with this investigation.

PS: To compile with clang, a small change to dmlc-core is required https://github.com/dmlc/dmlc-core/compare/master...leezu:omp

what is the source file and line number of that crash in libmxnet.so? What’s the line of code crashing?

@leezu, not sure if the LLVM OMP problem is the same one as described at the beginning of this issue. I simply took the original issue as a problem with iomp5, which has been removed from all binary releases of MXNet. Hence the issue was closed.
If you have further interest, you can reproduce the original issue on a GPU with the code snippet and package provided in the description. Then please try to replace the libiomp5.so under the installation folder with the one released along with Intel compiler 2019.0 update 5. I expect the problem can be addressed with the new version of iomp5.

Is the iomp source code based on llvm? What version of llvm omp would the 2019.0 update correspond to? Or is the source different? I'll try it.

Which line? The stack trace you listed shows libmxnet.so at stack level 0 rather than libomp.so, so it wouldn't be in the OMP calls here, correct? Is the CHECK_GE failing?

The function referred to above is bt [3] in the backtrace. CHECK_GE is not failing, because bt [2] is part of libomp.so. But I'm not sure whether it is due to the omp_set_num_threads(1); call or the omp_set_num_threads(omp_thread_max_ - reserve_cores_); call. addr2line just prints ?? if I try to look up the address of bt [3].

libmxnet.so is on bt [0] because it has a segfault handler:

https://github.com/apache/incubator-mxnet/blob/e48ff96ef4375f3d7c505e152b73b1f15a8b7afe/src/initialize.cc#L62-L68

Specifically Line 65.

Thanks, now the stack trace makes sense. Maybe libomp isn't built with -O0 in debug mode, assuming this is a debug build.

Is this problem gone with the upgrade?

If it’s gone with the upgrade, then fine.

However, if it’s not, and since it also happened with official dist of libiomp5 (and if still happening, also official llvm dist) then considering that llvm omp is in HUGE distribution globally being part of clang and all, then it seems pretty unlikely to me that it’s a bug in the openmp causing this. Especially since I wrote most of this omp-related stuff in mxnet that is in that stack trace, and I definitely didn’t test it specifically with forking — it wasn’t a use case at the time. in fact, at the time i wrote that, it was known that trying to use omp at all (with libgomp specifically) would hang if attempted in a forked process (there’s an issue+PR in there somewhere fixing the issue by avoiding using a kernel that used omp i seem to recall — it was a long time ago and before llvm openmp was added — it was noted that it didn’t happen in mkl build which used libiomp5 instead). generally this wasn’t a problem because the OMP_NUM_THREADS gets set to 1 in the atfork call by the engine code.

However, if mxnet is loaded after the fork, then that environment variable was never set because the engine code never ran to hook the fork call before then. I think it’s possible there’s a bug in the (my) mxnet omp code since this wasn’t a use-case considered. This would mean it would likely still occur with clang builds (assuming it’s not intermittent and hard to reproduce).

Regarding the libgomp hang I noted above, apparently it's a known issue with libgomp and forking that I am surprised to see still occurring today. The reporter lists gcc 8.

https://github.com/pytorch/pytorch/issues/17199

Yes, this crash still happens after the upgrade of llvm openmp. It also happens both when compiling with gcc and when compiling with llvm.

The only case where the crash does not happen is when compiling with gcc and libgomp instead of libomp.

The gcc hang you refer to above, is it https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035? I'm aware of that bug report and therefore quite surprised that the crash doesn't occur with gcc libgomp, but rather in all other settings.

I will see if I can reproduce this week.

They link to the gcc issue in that pytorch issue link. It's not that one you linked, which is a pull request of sorts. (Correction: I thought it was the PR one while looking on my phone, but now on my PC I see it isn't.) Yes, it seems it's that issue as well.

it’s weird because i saw that behavior maybe three years ago using gcc 3.x ish, I think, so I assumed that libgomp had been corrected to handle fork properly since then. I am surprised that the same behavior is being reproduced in such a new gcc version. i want to try to reprocess that issue this week as well.

LLVM OpenMP, as you can see from the comments in the pytorch issue, is known to handle forking correctly.

Another thing that pytorch issue mentions is CUDA after a fork. While it's reasonable to assume it's illegal to use CUDA, then fork, and also use CUDA in the forked process, I wonder if it works OK if you fork before using CUDA for the first time, as in this issue.

Is the iomp source code based on llvm? What version of llvm omp would the 2019.0 update correspond to? Or is the source different? I'll try it.

It should be a different code base.

  • I root-caused the crash mentioned directly above (not the assert). I did this in an actual debugger :), but the answer is actually in the stack trace:
[bt] (2) /usr/lib/x86_64-linux-gnu/libomp.so.5(+0x3d594) [0x7f65cd145594]

This is loading the wrong omp library, not the one that it was just built against. That library comes with the libomp5 .deb package (on Ubuntu). The proper one would be in cmake-build-debug/3rdparty/openmp at build time. Why it is loading that other one I did not track down, because the problem went away when I linked to the proper library in libmxnet.so's dir. I also uninstalled the libomp5 package on my machine in the course of testing. It might be getting pulled in because the cython compile uses a different "toolchain" (which may or may not map back to the same compiler, which on my machine is just blindly running x86_64-linux-gnu-gcc in the path). Even if this is not the cause, it should be looked at, because with more than one toolchain on a lot of dev boxes these days, this is a recipe for trouble. Since the cython library has libmxnet as a dependency, it is conceivable that in some use-cases it gets first stab at loading whatever shared object it wants, and so if not using the same toolchain, this could get pretty nasty (i.e. imagine libmxnet.so is forced, at load time, to link against libstdc++ from gcc 3.6 when mxnet was compiled with gcc 8). I know they have version tags in the symbols, but you get the idea, right? This should be looked into, imho.

btw this is why the location for the omp stack trace was ??? -- no debug info for "/usr/lib/x86_64-linux-gnu/libomp.so.5".

At any rate, there are a number of ways to resolve this, just as one would resolve the wrong opencv library being loaded -- it's not rocket science :)

Summary: No evidence found suggesting that this is a libomp5/libomp bug (the "upgrade" wasn't actually necessary, but it doesn't hurt anything, so it's good to leave it in).

  • The reason libgomp doesn't hang (@leezu's query) is because the environment gets set to one OMP thread at fork time (atfork is hooked at libmxnet.so's static init), so the forked process never tries to use OMP. (A Python-level analogue of this hook is sketched below.)
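For illustration only, here is a Python-level analogue of that hook (the real handler is registered via pthread_atfork inside libmxnet.so; os.register_at_fork needs Python 3.7+):

import os

def _limit_omp_in_child():
    # Mirror what the engine's atfork hook does in C++: keep the forked child
    # at one OpenMP thread so it never touches a thread pool it did not create.
    os.environ["OMP_NUM_THREADS"] = "1"

if hasattr(os, "register_at_fork"):  # available in Python 3.7+
    os.register_at_fork(after_in_child=_limit_omp_in_child)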

By the way, I don't care if it's used or not, but on the MXNet cython branch, I did some cython stuff that, in the cmake files, uses the mxnet toolchain to build the cython library. That's one approach, there are other approaches -- there's pros and cons for each.

@cjolivier01 thank you for looking into this. I notice that the crash also happens when using the system llvm openmp at compile time (i.e. delete 3rdparty/openmp before the build). I describe that in https://github.com/apache/incubator-mxnet/issues/14979#issuecomment-562926756
Thus it seems the cause you mention isn't the root cause?

BTW, the update of the 3rdparty/openmp is about fixing the debug assertion #10856. It indeed doesn't claim anything about the current issue.


Assert issue is fixed in referenced PR above.

The cython setup script apparently uses forking which is causing the problem during compile.


You can see in the newer version, they set in the atfork handler:

__kmp_atfork_child() 
{
...
__kmp_team_pool = NULL;
...
}

This is why the assert goes away, but the assert remains harmless even in the old version.

Thanks for looking into it. Even when harmless, it's annoying when using a debug build. Thus it's good to make it go away.

The cython setup script apparently uses forking which is causing the problem during compile.

When using system llvm openmp instead of 3rdparty/openmp, why does the crash reported in the current issue still happen if it is due to a linking problem? I'm not clear about your reasoning here.

This seems fixed now.
