getenv() calls in libc are not thread-safe, according to:
https://rachelbythebay.com/w/2017/01/30/env/
and
https://github.com/xianyi/OpenBLAS/issues/716
There are indirect calls to dmlc::GetEnv() all across the mxnet codebase; here are a few:
https://github.com/apache/incubator-mxnet/blob/266de6bef4da5769431557288d41fab2a02e52ca/src/engine/threaded_engine_perdevice.cc#L79
or
https://github.com/apache/incubator-mxnet/blob/5a83b6b563211f430688e41eab4752c6de4ecf22/src/executor/graph_executor.cc#L1194
/lib64/libc.so.6(+0x35250) [0x7fdc5b99b250]
/lib64/libc.so.6(getenv+0xad) [0x7fdc5b99e0cd]
/opt/amazon/lib/libmxnet.so(_ZN4dmlc6GetEnvIbEET_PKcS1_+0x1b) [0x7fdc4a1bedab]
/opt/amazon/lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor10InitOpSegsEv+0x1cb) [0x7fdc4a1b6e0b]
/opt/amazon/lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor15FinishInitGraphEN4nnvm6SymbolENS2_5GraphEPNS_8ExecutorERKSt13unordered_mapINS2_9NodeEntryENS_7NDArrayENS2_13NodeEntryHashENS2_14NodeEntryEqualESaISt4pairIKS8_S9_EEE+0x71b) [0x7fdc4a1b760b]
/opt/amazon/lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor4InitEN4nnvm6SymbolERKNS_7ContextERKSt3mapISsS4_St4lessISsESaISt4pairIKSsS4_EEERKSt6vectorIS4_SaIS4_EESL_SL_RKSt13unordered_mapISsNS2_6TShapeESt4hashISsESt8equal_toISsESaISA_ISB_SN_EEERKSM_ISsiSP_SR_SaISA_ISB_iEEES11_RKSH_INS_9OpReqType
ESaIS12_EERKSt13unordered_setISsSP_SR_SaISsEEPSH_INS_7NDArrayESaIS1C_EES1F_S1F_PSM_ISsS1C_SP_SR_SaISA_ISB_S1C_EEEPNS_8ExecutorERKSM_INS2_9NodeEntryES1C_NS2_13NodeEntryHashENS2_14NodeEntryEqualESaISA_IKS1M_S1C_EEE+0x75d) [0x7fdc4a1b958d]
/opt/amazon/lib/libmxnet.so(_ZN5mxnet8Executor10SimpleBindEN4nnvm6SymbolERKNS_7ContextERKSt3mapISsS3_St4lessISsESaISt4pairIKSsS3_EEERKSt6vectorIS3_SaIS3_EESK_SK_RKSt13unordered_mapISsNS1_6TShapeESt4hashISsESt8equal_toISsESaIS9_ISA_SM_EEERKSL_ISsiSO_SQ_SaIS9_ISA_iEEES10_RKSG_INS_9OpReqTypeESaI
S11_EERKSt13unordered_setISsSO_SQ_SaISsEEPSG_INS_7NDArrayESaIS1B_EES1E_S1E_PSL_ISsS1B_SO_SQ_SaIS9_ISA_S1B_EEEPS0_+0x1a6) [0x7fdc4a1b9f46]
/opt/amazon/lib/libmxnet.so(MXExecutorSimpleBind+0x1e38) [0x7fdc4a1031a8]
/opt/amazon/python2.7/lib/python2.7/lib-dynload/_ctypes.so(ffi_call_unix64+0x4c) [0x7fdc5af9b858]
@samskalicky Thanks for identifying the thread safety issue
@mxnet-label-bot add [Backend, Bug]
I found a set/get in MXNet for MXNET_CPU_WORKER_NTHREADS:

So if we're running multiple programs using MXNet on the same machine, it's possible that one will be setting MXNET_CPU_WORKER_NTHREADS while the other is getting it, and since they both use the same libc, we're going to hit this issue.
FYI @KellenSunderland and @larroy
@samskalicky I don't think the problem is running different processes; one process shouldn't affect the other in terms of environment, given the separation of process address spaces. The problem is using getenv/setenv from the same process in different threads, as is the case in the code you linked.
Could you decode the stack trace via c++filt?
Looks like the evidence points to where we suspected: initialize.cc. We should rework that code.
/lib64/libc.so.6(+0x35250) [0x7fdc5b99b250]
/lib64/libc.so.6(getenv+0xad) [0x7fdc5b99e0cd]
/opt/amazon/lib/libmxnet.so(_ZN4dmlc6GetEnvIbEET_PKcS1_+0x1b) [0x7fdc4a1bedab]
/opt/amazon/lib/libmxnet.so(mxnet::exec::GraphExecutor::InitOpSegs()+0x1cb) [0x7fdc4a1b6e0b]
/opt/amazon/lib/libmxnet.so(mxnet::exec::GraphExecutor::FinishInitGraph(nnvm::Symbol, nnvm::Graph, mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, nnvm::NodeEntryHash, nnvm::NodeEntryEqual, std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > const&)+0x71b) [0x7fdc4a1b760b]
/opt/amazon/lib/libmxnet.so(mxnet::exec::GraphExecutor::Init(nnvm::Symbol, mxnet::Context const&, std::map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, mxnet::Context, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mxnet::Context> > > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::unordered_map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, nnvm::TShape, std::hash<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, nnvm::TShape> > > const&, std::unordered_map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > > const&, std::unordered_map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::unordered_set<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::unordered_map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, mxnet::NDArray, std::hash<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mxnet::NDArray> > >*, mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, nnvm::NodeEntryHash, nnvm::NodeEntryEqual, std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > const&)+0x75d)
[0x7fdc4a1b958d] /opt/amazon/lib/libmxnet.so(_ZN5mxnet8Executor10SimpleBindEN4nnvm6SymbolERKNS_7ContextERKSt3mapISsS3_St4lessISsESaISt4pairIKSsS3_EEERKSt6vectorIS3_SaIS3_EESK_SK_RKSt13unordered_mapISsNS1_6TShapeESt4hashISsESt8equal_toISsESaIS9_ISA_SM_EEERKSL_ISsiSO_SQ_SaIS9_ISA_iEEES10_RKSG_INS_9OpReqTypeESaIS11_EERKSt13unordered_setISsSO_SQ_SaISsEEPSG_INS_7NDArrayESaIS1B_EES1E_S1E_PSL_ISsS1B_SO_SQ_SaIS9_ISA_S1B_EEEPS0_+0x1a6) [0x7fdc4a1b9f46]
/opt/amazon/lib/libmxnet.so(MXExecutorSimpleBind+0x1e38) [0x7fdc4a1031a8]
/opt/amazon/python2.7/lib/python2.7/lib-dynload/_ctypes.so(ffi_call_unix64+0x4c) [0x7fdc5af9b858]
What if we set a process-wide mutex here: https://github.com/dmlc/dmlc-core/blob/600bc28bc476a1b7866cfe9f444c91b6d49760b7/include/dmlc/parameter.h#L1062
That should fix the issue, right? No deadlocks? We'd have to make sure the mutex wasn't thread-pool scoped, as the variable is getting set in a new thread pool.
As you said, what about libraries like OpenMP, which might access the environment variables without a lock? We could try monkey-patching getenv with our own wrapper at link time or runtime. Best is not to use setenv after we have multiple threads running.
I don't think we need any mutex. We can just pass any config to the thread as data if we need to. We definitely shouldn't be using the environment to configure workers.
In an attempt to reproduce the issue (assuming it's coming from simultaneous processes initializing at the same time), I created a cpp engine unit test by modifying tests/cpp/engine/threaded_engine_test.cc and adding the following two functions:
// headers needed by the handler (in addition to the test file's existing includes)
#include <csignal>     // signal, SIGSEGV
#include <cstdio>      // fprintf
#include <cstdlib>     // exit
#include <execinfo.h>  // backtrace, backtrace_symbols_fd
#include <unistd.h>    // STDERR_FILENO

void handler(int sig) {
  void *array[10];
  size_t size;
  // get void*'s for all entries on the stack
  size = backtrace(array, 10);
  // print out all the frames to stderr
  fprintf(stderr, "Error: signal %d:\n", sig);
  backtrace_symbols_fd(array, size, STDERR_FILENO);
  exit(1);
}

TEST(Engine, envsegfault) {
  signal(SIGSEGV, handler);
  mxnet::Engine* engine = mxnet::engine::CreateThreadedEnginePerDevice();
  for (int i = 0; i < 10000; ++i) {
    engine->Stop();
    engine->Start();
  }
}
The handler attempts to capture the segfault and print the stack trace, and the envsegfault test creates a threaded engine object and repeatedly stops and starts it to trigger the SetEnv call in initialize.cc and GetEnv call in threaded_engine_perdevice.cc.
Then I wrote this python script to run multiple independent processes doing this:
import multiprocessing
import time
import os

def mxnet_worker():
    for i in range(100):
        ret = os.system('./mxnet_unit_tests --gtest_filter=Engine.envsegfault')
        if ret != 0:
            print('Failed@@@@@@@@@@@@@@@@')
            exit(2)

read_process = [multiprocessing.Process(target=mxnet_worker) for i in range(10)]
for p in read_process:
    p.daemon = True
    p.start()
for p in read_process:
    p.join()
I was able to trigger a segfault, but I hadn't added the handler yet to print the stack trace. I need to run it again and try to reproduce.
Do you think the multiple processes have any impact? The environment is not shared across processes.
process != thread. In this case I think you mean thread.
Other than that I agree with your analysis.
I think we need to understand whether not using this variable, as in https://github.com/apache/incubator-mxnet/pull/13472, causes unwanted threads to be spawned inside the engine workers, because it might actually not be the case.
Depending on that, we need to see if we need a different mechanism in the engine to control OMP threads, and then run performance regression tests.
As you can see from my PR, not changing the environment is functionally correct, but I don't know if it has a performance impact, for the reason explained above.
I would suggest using an atomic counter in pthread_atfork that is incremented whenever we create a thread. That way we can see how many threads we are creating, print it out, and compare two runs with and without the variables I removed to see if there's any difference; that would be a quick check.
Here's the relevant code from the version of libc (2.17) that was being used in the failure:
getenv.c: https://pastebin.com/gj4XZgdE
setenv.c: https://pastebin.com/NuTfDdDp
It looks like the issue is the __environ variable being iterated over in lines 15, 32, 55. This variable is defined in setenv.c:
#if !_LIBC
# define __environ environ
# ifndef HAVE_ENVIRON_DECL
extern char **environ;
# endif
#endif
So there is this nasty global pointer to the environment. While setenv has a lock around changing the environment, there's no lock for getenv, so (as mentioned in the blog post linked in the description):
"this is a char*, so it's supposed to have a bunch of pointers to char buffers, one for each var=val pairing. What if that region has been reused and now contains a bogus pointer into the weeds? Yep, segfault."
So I think @KellenSunderland's suggestion of adding a lock/mutex in dmlc parameter.h's GetEnv and SetEnv should prevent any failure due to MXNet's various threads accessing the environment simultaneously. Although without a reproducible test case, we're unable to validate this.
Anybody have any ideas for reproducing something similar to this code from the blog post with mxnet's engine tests so that we have a reproducible test case?
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

static void* worker(void* arg) {
  for (;;) {
    int i;
    char var[256], *p = var;
    for (i = 0; i < 8; ++i) {
      *p++ = 65 + (random() % 26);
    }
    *p++ = '\0';
    setenv(var, "test", 1);
  }
  return NULL;
}

int main() {
  pthread_t t;
  printf("start\n");
  setenv("foo", "bar", 0);
  pthread_create(&t, NULL, worker, 0);
  for (;;) {
    getenv("foo");
  }
  return 0;
}
I tried simplifying the start_stop test https://github.com/apache/incubator-mxnet/blob/master/tests/cpp/engine/threaded_engine_test.cc#L124 to just the threaded_engine_perdevice and stopped/started it a few hundred times, and it didn't cause a segfault.
We should not use locking in the pthread_atfork handlers; it will likely have a negative impact in terms of lock contention and performance, and architecturally it is not a good solution. We should find a well-architected way to pass the number of threads to the engine. Threads get a data pointer where you can put configuration or pass any other data at creation time.
@larroy We were already using locks inside the handlers: setenv uses a lock in libc around modifying the environment. This doesn't change anything.
Are you suggesting that we don't use an env var to pass OMP_NUM_THREADS=1 to subsequent processes? What do you think the impact will be on other processes using MXNet with more threads?
Was there a conscious decision to use an env var to limit OMP_NUM_THREADS in subsequent processes?
The problem, as described in the article linked in this issue and the related article here: https://rachelbythebay.com/w/2017/01/30/env/, is that if the main thread spawns another thread which calls setenv, and the process is forked while that setenv is in progress, the mutex is in a locked state in the child process and will never be unlocked, since there is no thread left to release the lock; this causes the hang.
This can be replicated in MXNet in the following way. Pull the code from https://github.com/anirudh2290/mxnet/tree/setenv_issue and build it similar to the following:
cd build && cmake VERBOSE=1 -DUSE_CUDA=ON -DUSE_CUDNN=ON -DUSE_MKLDNN=ON -DUSE_OPENMP=ON -DUSE_OPENCV=OFF -DCMAKE_BUILD_TYPE=Debug -GNinja ..
Run the following script:
import multiprocessing
import os
import sys
import mxnet as mx

def mxnet_worker():
    print 'inside mxnet_worker'
    mx.base._LIB.MXStartBackgroundThread(mx.base.c_str("dummy"))

read_process = [multiprocessing.Process(target=mxnet_worker) for i in range(8)]
for p in read_process:
    p.daemon = True
    p.start()
    p.join()
Now run the script; you will see that the process hangs.
When I attach gdb to the process I see the following:
#0 __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007fc0fabab99c in __add_to_environ (name=0x7fc093a935fc "MXNET_CPU_WORKER_NTHREADS", value=0x7fffec2eff10 "1", combined=0x0,
replace=1) at setenv.c:133
which means it is stuck trying to acquire the lock: https://github.com/lattera/glibc/blob/master/stdlib/setenv.c#L133
I checked the mxnet codebase to see if we are calling SetEnv anywhere else, and we don't seem to be calling it anywhere except here. Also, the pthread_atfork handler calls Engine::Get()->Stop(), which would mean that all engine threads are suspended. It is still possible that setenv could be called from other multithreaded code, in MXNet iterators for example, but I couldn't find it, and it is unlikely that we are using something other than dmlc::SetEnv to set env vars for mxnet or dmlc-core code. I think it is more likely that the customer application spawned a thread which called SetEnv at the same time the pthread_atfork handler was called, which led to this behavior.
Good analysis @anirudh2290, I agree with it and it makes sense. I would suggest we don't use the environment for this and have a proper configuration object that we can pass when needed, or in the worst case a singleton. What do you think?
@larroy Can you elaborate more on the singleton solution? For example, even if it's a singleton, it's still possible that we fork during a setenv call inside a thread and run into the same issue. Removing env variables and changing them to config would be a breaking change; I don't think we can do it before 2.0.
Maybe we should understand more about the customer application and the customer issue, and for now come up with a set of guidelines for using multiprocessing with mxnet and document them.
@anirudh2290 good one!
I think this is a problem in general. For this particular case we can try to use fork handlers:
pthread_atfork(prepare, parent, child)
In the prepare method, we should try to set a bogus env variable; that call will wait for the lock and then release it, so the child gets the lock in an unlocked state.
Something like dmlc::SetEnv("Bogus", "Bogus") here: https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L54
Thanks for the suggestion @Vikas89. I think with your suggested change it may be less likely to be encountered, but it is still possible for the mutex to be in a locked state in the child process if a thread calls setenv after the prepare step and before the fork.
If there are only two places, both after the fork, where we call SetEnv (initialize.cc), why are we setting env vars at all? Can we just keep these values in memory, or in some file tied to the mxnet process ID?
I was able to run the user's test script and reproduce in GDB. It looks like it's failing when calling getenv("MXNET_EXEC_BULK_EXEC_INFERENCE"). This only seems to happen here:
This matches the original stack trace where we found it failing in GraphExecutor::InitOpSegs
So far, we've been able to verify that compiling mxnet without OpenMP does not trigger the segfault. This validates that the problem is between one thread in mxnet running OpenMP and another thread in mxnet calling getenv.
Thus the current workaround (to avoid the segfault) is to recompile mxnet without OpenMP. This will cause performance degradation for any operators that used to run using OpenMP (i.e. those that are not using BLAS/LAPACK routines or MKLDNN).
@Vikas89 and @anirudh2290 have discussed a couple of options for implementing a permanent solution:
1. Moving all getenv/setenv calls to the beginning of mxnet initialization, before OpenMP or any other dependent library is initialized. This would clearly prevent the problem by moving the getenv calls to before OpenMP runs, but it may be a breaking change for developers who currently change mxnet's behavior via environment variables at runtime: if we only sample env vars at init time, that will no longer work. (A rough sketch follows after this comment.)
2. We could write our own thread-safe getenv/setenv functions and use them across mxnet and OpenMP (since we're compiling it from source when we build mxnet). This would eliminate the problem, but would be harder to maintain as we would have to constantly patch OpenMP.
3. We could create APIs to enable the behavior that users currently have when setting env vars, and with the change described in option 1 provide the same functionality, albeit with a different use-model.
None of these seem like the best options; any other comments or thoughts?
I think to arrive at the right solution we need to understand more about the OpenMP behavior and why it calls the library initialization within an operator execution when an omp for loop is hit. We also need to confirm whether this happens only during forking or also outside of forking, which specific OpenMP version the customer has, and whether this can be reproduced on other OpenMP versions.
Good analysis. Has anyone checked whether this problem is reproducible with all versions of OMP? I'm wondering if another workaround that may not have a perf impact would be to use a different implementation than the one used by default.
Regarding option 1: I actually think this is a good design, but several tests currently rely on the fact that we can change behaviour via env var changes at runtime. However, we could probably fix this by re-initializing the library whenever we set an env var in a test. I think we should avoid reading env vars at runtime in any case for performance reasons, so it makes sense to read them once at startup and not again afterwards.
I would go for option 1.
Thanks @larroy @KellenSunderland @anirudh2290, just found out that the problem was due to using Intel OpenMP rather than the LLVM OpenMP submodule. Upon switching back to using LLVM OpenMP the problem did not occur. I think the problem is effectively avoided since we're not advocating for using Intel OpenMP.
@TaoLv @pengzhao-intel any interest in seeing MXNet switch to Intel OpenMP and investigating this issue? Are there any perceived benefits?
Thanks @samskalicky. It's a long thread; we need to warm up and understand the background :)
@TaoLv could you help take a look for the questions?
@samskalicky Can you explain more about how to use Intel OMP in MXNet and how to switch back to LLVM OMP? I think LLVM OMP is using the same runtime as Intel OMP.
@TaoLv according to this page: https://software.intel.com/en-us/forums/intel-c-compiler/topic/793552 they should be the same, but may differ slightly as they have different release schedules. So it is possible that the functionality differs between particular versions of LLVM/Intel OMP.
As for how to use it, I didn't do the building myself, but I imagine that those who did simply modified the makefile to link against the Intel OMP binary rather than building the LLVM OMP submodule. Unfortunately I don't have step-by-step instructions for you.
I was just hoping to get your thoughts on how MXNet is currently architected with regard to OMP and what you think could be improved (possibly by using Intel OMP), and whether you have any interest in seeing MXNet switch to Intel OMP and what the benefits might be. Of course we'd still need to maintain a build without Intel OMP for other architectures.
@samskalicky Yes. I would expect ICC + Intel OMP should have better performance and I know there is an effort ongoing for enabling them for MXNet. Currently MXNet cannot be successfully built with ICC.