When I import mxnet in 8 processes simultaneously, all CPU resources are consumed and the program stalls for almost 5 minutes.
This works fine with mxnet 1.1 but fails with mxnet 1.2 and mxnet 1.3.
Following is the sample code:
import multiprocessing

def mxnet_worker():
    print 'before import'
    import mxnet
    print 'after import'

read_process = [multiprocessing.Process(target=mxnet_worker) for i in range(8)]
for p in read_process:
    p.daemon = True
    p.start()
Any solution to this for mxnet 1.2 and mxnet 1.3? Thanks.
@fighting-liu : Just to understand your problem a little further, could you give details of your environment? This is covered in the issue template. Also, what exactly was the issue: was "after import" only being printed after 5 minutes? How long did it take for 1.1, and how long does it take for one process importing mxnet with 1.2? System/environment information would be useful to help you further.
For reference of the issue template: https://github.com/apache/incubator-mxnet/blob/master/.github/ISSUE_TEMPLATE.md
@mxnet-label-bot : [Python, Question]
@vdantu Thanks for your attention.
1. Environment info
----------System Info----------
('Platform :', 'Linux-3.10.0-693.17.1.el7.x86_64-x86_64-with-centos-7.4.1708-Core')
('system :', 'Linux')
('node :', '*****************')
('release :', '3.10.0-693.17.1.el7.x86_64')
('version :', '#1 SMP Thu Jan 25 20:13:58 UTC 2018')
----------Hardware Info----------
('machine :', 'x86_64')
('processor :', 'x86_64')
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Stepping: 1
CPU MHz: 2401.000
CPU max MHz: 2401.0000
CPU min MHz: 1200.0000
BogoMIPS: 4800.04
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.3104 sec, LOAD: 5.1832 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.5159 sec, LOAD: 6.2714 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.5885 sec, LOAD: 2.0516 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.5425 sec, LOAD: 1.0500 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 1.4049 sec, LOAD: 4.5601 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 2.1899 sec, LOAD: 2.9143 sec.
2. Sample code
import multiprocessing
import time

def mxnet_worker():
    b_time = time.time()
    import mxnet
    print 'time consumes: {}'.format(time.time()-b_time)

read_process = [multiprocessing.Process(target=mxnet_worker) for i in range(8)]
for p in read_process:
    p.daemon = True
    p.start()
3. Test results
3.1 pip install mxnet-cu90==1.1
(import-time screenshot)
3.2 pip install mxnet-cu90==1.2
(import-time screenshot)
3.3 pip install --pre mxnet-cu90 (mxnet_cu90-1.3.0b20180820)
(import-time screenshot)
4. Cpu status from htop monitor
(htop screenshot)
Importing mxnet takes far too long with mxnet 1.2 and mxnet 1.3.
Thanks for sharing this info. This is a good observation. I think it would be good to have a larger audience for this; could you also start a thread on discuss.mxnet.io?
@vdantu Can you solve this problem? I've started a thread on discuss.mxnet.io, but nobody answers.
import ctypes
import time
start = time.time()
ctypes.CDLL('libmxnet_1.1.so', ctypes.RTLD_LOCAL)
print 'load libmxnet1.1.so', time.time() - start
start = time.time()
ctypes.CDLL('libmxnet_1.3.so', ctypes.RTLD_LOCAL)
print 'load libmxnet1.3.so', time.time() - start
load libmxnet1.1.so 0.0203261375427
load libmxnet1.3.so 1.06584215164
FYI, this might be the reason why it takes so much time.
@vdantu
@kardoszc Using the example you provided, I was able to get the same result on run 1 on a Mac; on consecutive runs of the above, the speed was better or similar.
$python mxnet11.py
load libmxnet1.1.so 0.019122838973999023
load libmxnet1.3.so 0.035729169845581055
$python mxnet11.py
load libmxnet1.1.so 0.01851797103881836
load libmxnet1.3.so 0.034262895584106445
$ python mxnet11.py
load libmxnet1.1.so 0.019358158111572266
load libmxnet1.3.so 0.03675413131713867
$ python mxnet11.py
load libmxnet1.1.so 0.01902318000793457
load libmxnet1.3.so 0.03872489929199219
The files differ in size by about 10MB, so the loading time itself is slightly higher on 1.3.
ls -l libmxnet1.*.so
27494556 Oct 2 15:50 libmxnet1.1.so
37263368 Oct 2 15:51 libmxnet1.3.so
@fighting-liu
Similar results to the above were observed running your example:
>>> def mxnet_worker():
... b_time = time.time()
... import mxnet
... print 'time consumes: {}'.format(time.time()-b_time)
...
>>> read_process = [multiprocessing.Process(target=mxnet_worker) for i in range(8)]
>>> for p in read_process:
... p.daemon = True
... p.start()
...
>>> time consumes: 15.8834888935
time consumes: 15.884565115
time consumes: 15.8791670799
time consumes: 15.8853030205
time consumes: 15.8832161427
time consumes: 15.882764101
time consumes: 15.8819229603
time consumes: 15.8869299889
>>> read_process = [multiprocessing.Process(target=mxnet_worker) for i in range(8)]
>>> for p in read_process:
... p.daemon = True
... p.start()
...
>>> time consumes: 1.01575899124
time consumes: 1.02250099182
time consumes: 1.03319501877
time consumes: 1.03118515015
time consumes: 1.03451776505
time consumes: 1.03348302841
time consumes: 1.03426003456
time consumes: 1.03685307503
The second run gets the speed-up again; I observed a similar trend on 1.1 as well, although the gap is smaller.
I am investigating this issue further
More updates: this issue seems to be specific to Linux; I reproduced the original poster's issue on a Linux box.
mxnet 1.1
>>> import multiprocessing
>>> import time
>>>
>>> def mxnet_worker():
... b_time = time.time()
... import mxnet
... print 'time consumes: {}'.format(time.time()-b_time)
...
>>> read_process = [multiprocessing.Process(target=mxnet_worker) for i in range(8)]
>>> for p in read_process:
... p.daemon = True
... p.start()
...
>>> time consumes: 0.513573884964
time consumes: 0.518635988235
time consumes: 0.553323984146
time consumes: 0.549813985825
time consumes: 0.558361053467
time consumes: 0.556171894073
time consumes: 0.566649913788
time consumes: 0.569785118103
mxnet 1.3
python
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing
>>> import time
>>>
>>> def mxnet_worker():
... b_time = time.time()
... import mxnet
... print 'time consumes: {}'.format(time.time()-b_time)
...
>>> read_process = [multiprocessing.Process(target=mxnet_worker) for i in range(8)]
>>> for p in read_process:
... p.daemon = True
... p.start()
...
>>> time consumes: 8.75463604927
time consumes: 17.4193239212
time consumes: 17.656072855
time consumes: 18.1875190735
time consumes: 18.4937279224
time consumes: 18.5608999729
time consumes: 18.5980598927
time consumes: 18.6172778606
@fighting-liu is your MXNet built from source, or installed via pip? If installed via pip, which version of MXNet (mxnet/mxnet-mkl/mxnet-cu90mkl/...)? If built from source, can you provide the build flags?
@vrakesh the original report is that the program stagnates for 5 minutes... what is the total time you are seeing?
@lupesko I do not see times as high as 5 minutes; overall, the whole run is under 30s. But it should not take more than a second or so to complete a C library import.
When a single instance is imported it takes about half a second; as the number of parallel imports increases, the time taken increases drastically.
Hey guys, we need to start getting consistent in what and how we're testing this.
Here is what I tried:
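(Presumably this was a single-process import timed with the shell's time builtin; a reconstruction under that assumption, since the script itself is not shown:)
# reconstruction (assumption): a bare single-process import, run under the shell's time
import mxnet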
Here was the output:
real 0m24.955s
user 0m0.827s
sys 0m0.189s
Doing it a second time resulted in the following (and for every subsequent time too).
real 0m0.855s
user 0m0.798s
sys 0m0.123s
Spinning up a new instance, reinstalling the pip wheel, and running the script above with 8 processes in parallel resulted in:
$ time python load_time.py
time consumes: 24.740694046020508
time consumes: 24.74588418006897
time consumes: 24.746330499649048
time consumes: 24.759278059005737
time consumes: 24.75693655014038
time consumes: 24.76559042930603
time consumes: 24.780791997909546
time consumes: 24.783067226409912
real 0m24.829s
user 0m8.753s
sys 0m0.742s
And running it a 2nd (and every subsequent) time:
$ time python load_time.py
time consumes: 4.612242221832275
time consumes: 4.625608682632446
time consumes: 4.641973257064819
time consumes: 4.690966844558716
time consumes: 4.7061262130737305
time consumes: 4.699116945266724
time consumes: 4.703948259353638
time consumes: 4.718777418136597
real 0m4.770s
user 0m8.823s
sys 0m0.650s
I terminated that instance and spun up a new one, and after installing the pip wheel I modified the file
/home/ec2-user/.local/lib/python3.6/site-packages/mxnet/__init__.py
(where the packages are installed with --user) to get an idea of which imports were causing the most delay. Here's what I found were the biggest offenders:
from .context import Context, current_context, cpu, gpu, cpu_pinned
from . import engine
from .base import MXNetError
from . import base
from . import contrib
from . import ndarray
from . import ndarray as nd
from . import name
time = 20.388737678527832
from . import random as rnd
from . import random
from . import optimizer
from . import model
from . import metric
from . import notebook
from . import initializer
time = 0.5453829765319824
from . import image
from . import image as img
from . import test_utils
from . import rnn
from . import gluon
time = 0.4957273006439209
Running it a 2nd time resulted in 0.6350719928741455 for the first block and 0.01 for the other two.
Now that we know the biggest delay is coming from that first group, I terminated and spun up a new instance, reinstalled, and timed around those imports; these two were causing the most delay:
from .context import Context, current_context, cpu, gpu, cpu_pinned
from . import contrib
9.019979238510132
14.68500018119812
So those two make up the majority of the 24 seconds we're seeing. If someone else could jump in and follow a similar approach for context and contrib, we can zoom in on the culprit.
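For reference, a minimal sketch of the kind of instrumentation used above (wrapping suspect imports in mxnet/__init__.py with timers); the exact edits were not posted, so treat this as illustrative:
# inside site-packages/mxnet/__init__.py (illustrative only)
import time as _time

_t0 = _time.time()
from .context import Context, current_context, cpu, gpu, cpu_pinned
print('context import: {:.3f}s'.format(_time.time() - _t0))

_t0 = _time.time()
from . import contrib
print('contrib import: {:.3f}s'.format(_time.time() - _t0))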
I took the script from above and added another loop to try from 1 to 36 processes, testing a local build from source (not the pip package):
import multiprocessing
import time

def mxnet_worker():
    #t1 = time.time()
    import mxnet
    #t2 = time.time()
    #elapsed = t2-t1
    #print(times)

times = []
for i in range(37):
    t1 = time.time()
    read_process = [multiprocessing.Process(target=mxnet_worker) for i in range(i)]
    for p in read_process:
        p.daemon = True
        p.start()
    for p in read_process:
        p.join()
    t2 = time.time()
    times.append(t2-t1)

for i in times:
    print(i)
Here are the results when compiling with the following cmake flags:
cmake -DUSE_CUDA=OFF -DUSE_CUDNN=OFF -DUSE_MKLDNN=OFF -DBLAS=Open -DCMAKE_BUILD_TYPE=Debug ..
1: 5.77136611938
2: 7.65716195107
3: 13.9892320633
4: 16.6815569401
5: 22.9886288643
6: 27.6006569862
7: 30.7331540585
8: 33.8466141224
9: 34.18151021
10: 37.1062369347
11: 43.6272640228
12: 44.1143600941
13: 45.8406460285
14: 46.6692020893
15: 47.8332960606
16: 52.4621579647
17: 56.1070458889
18: 56.8046569824
19: 54.1124491692
20: 65.2930281162
21: 62.0744900703
22: 60.4670469761
23: 69.6229948997
24: 71.4172370434
25: 70.9572968483
26: 74.8509230614
27: 77.0419559479
28: 78.2489409447
29: 80.1934709549
30: 74.9342000484
31: 84.4639661312
32: 83.6565339565
33: 91.3137798309
34: 88.20520401
35: 96.2017951012
36: 96.4477438927
Then I tested against the pip wheel:
1: 1.86075401306
2: 2.52445602417
3: 27.84821105
4: 272.775645971
5: 532.317739964
6: 785.189717054
and I killed it after 6 processes. I think we get the picture.
Here's another set of results when compiling without OpenMP:
cmake -DUSE_CUDA=OFF -DUSE_CUDNN=OFF -DUSE_MKLDNN=OFF -DBLAS=Open -DCMAKE_BUILD_TYPE=Debug -DUSE_OPENMP=OFF ..
1: 0.827432
2: 0.859651
3: 0.858839
4: 0.833471
5: 0.884956
6: 0.883090
7: 0.862174
8: 0.888009
9: 0.891180
10: 0.917642
11: 0.894244
12: 0.947771
13: 0.944967
14: 0.956380
15: 0.932657
16: 0.991420
17: 0.956935
18: 0.924413
19: 0.935913
20: 0.944736
21: 0.996702
22: 0.934430
23: 0.966333
24: 1.022540
25: 1.038306
26: 1.175906
27: 1.056674
28: 1.022513
29: 1.083556
30: 1.151226
31: 1.078056
32: 1.046550
33: 1.220279
34: 1.256747
35: 1.334894
36: 1.377328
Clearly there's a problem with OpenMP, seeing as the load times are very reasonable when OpenMP is not used.
@szha, can you take a look at this? There's a huge discrepancy between building from source and the pip wheel. Is there something different that is done when building the pip wheel related to OpenMP?
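One way to narrow down the OpenMP difference is to check which OpenMP runtime the installed libmxnet.so is linked against; a sketch, assuming a Linux pip install and that the library sits next to the package's __init__.py:
# print the shared-library dependencies of libmxnet.so and look for
# libgomp / libomp / libiomp5 in the output (Linux only)
import os
import subprocess
import mxnet

libpath = os.path.join(os.path.dirname(mxnet.__file__), 'libmxnet.so')
print(subprocess.check_output(['ldd', libpath]).decode())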
Looks like the processes are stuck at gomp_team_start if I use multiprocessing
#0 0x00007f5a797a774a in do_spin (val=22256, addr=addr@entry=0x55f45a6451c4)
at /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:130
#1 do_wait (addr=addr@entry=0x55f45a6451c4, val=val@entry=22256) at /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/wait.h:66
#2 0x00007f5a797a7813 in gomp_barrier_wait_end (bar=0x55f45a6451c0, state=22256)
at /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/bar.c:48
#3 0x00007f5a797a6a1d in gomp_simple_barrier_wait (bar=<optimized out>)
at /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/posix/simple-bar.h:60
#4 gomp_team_start (fn=<optimized out>, data=<optimized out>, nthreads=7, flags=<optimized out>, team=0x55f45a646790)
at /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/team.c:829
#5 0x00007f5a4f6cd8a8 in ?? () from /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so
#6 0x00007f5a4f6dee3c in ?? () from /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so
#7 0x00007f5a4f6df9fd in ?? () from /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so
#8 0x00007f5a4f6dfb53 in ?? () from /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so
#9 0x00007f5a4cb5f794 in ?? () from /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so
#10 0x00007f5a7f5336ba in call_init (l=<optimized out>, argc=argc@entry=2, argv=argv@entry=0x7ffd800cd128, env=env@entry=0x55f45a0422b0) at dl-init.c:72
#11 0x00007f5a7f5337cb in call_init (env=0x55f45a0422b0, argv=0x7ffd800cd128, argc=2, l=<optimized out>) at dl-init.c:30
#12 _dl_init (main_map=main_map@entry=0x55f45a485c90, argc=2, argv=0x7ffd800cd128, env=0x55f45a0422b0) at dl-init.c:120
#13 0x00007f5a7f5388e2 in dl_open_worker (a=a@entry=0x7ffd800c8610) at dl-open.c:575
But if I instead use threads, like below, they don't get stuck.
import time
from os import getpid

def mxnet_worker():
    print("before import: pid:{}".format(getpid()))
    st_time = time.time()
    import mxnet
    end_time = time.time()
    print("after import: pid:{} time:{}".format(getpid(), end_time - st_time))

#read_process = [multiprocessing.Process(target=mxnet_worker) for i in range(8)]
from threading import Thread
i = 0
while i < 8:
    t1 = Thread(target=mxnet_worker)
    t1.start()
    i = i + 1
#for p in read_process:
#    p.daemon = True
#    time.sleep(3)
#    p.start()
#time.sleep(100000)
Looks like there is an issue with fork + OpenMP.
We should check whether this is related: https://bisqwit.iki.fi/story/howto/openmp/#OpenmpAndFork
I added a 2-second delay between launching each process in the script and found that most processes then complete the import in 1.7-1.9 seconds. This 2-second delay prevents processes from competing for resources at the same time; there appears to be some bad contention going on that's causing the large delay, and adding the delay prevented the large exponential increase in import time. Here's the data for 1-36 processes with this 2-second delay using the pip wheel:
1: 2.002468
2: 4.005214
3: 6.007431
4: 8.617790
5: 10.012702
6: 12.015214
7: 14.018058
8: 16.020579
9: 18.022354
10: 20.025505
11: 22.027436
12: 24.030987
13: 26.033214
14: 28.035729
15: 30.037884
16: 32.041215
17: 34.043598
18: 36.044175
19: 38.048719
20: 40.051221
21: 42.053396
22: 44.055702
23: 46.058514
24: 48.060241
25: 50.063950
26: 52.727346
27: 54.068652
28: 56.071813
29: 58.073548
30: 60.076997
31: 62.079071
32: 64.081306
33: 66.716197
34: 68.085373
35: 70.088400
36: 72.092169
So while the 12 seconds for 6 processes isn't ideal, it's much better than the 785 seconds found earlier without the 2-second delay.
So the short-term workaround is to measure roughly how long it takes to "import mxnet" in a single process and then add an appropriately long (~2 second) delay between launching each process to avoid contention. The delay length might need to be tuned for each use case.
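A sketch of that staggered-start workaround (the 2-second figure is just what worked in these tests and may need tuning per machine):
import multiprocessing
import time

def mxnet_worker():
    b_time = time.time()
    import mxnet
    print('time consumes: {}'.format(time.time() - b_time))

if __name__ == '__main__':
    read_process = [multiprocessing.Process(target=mxnet_worker) for i in range(8)]
    for p in read_process:
        p.daemon = True
        p.start()
        time.sleep(2)  # stagger the imports so they do not contend for CPU
    for p in read_process:
        p.join()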
We'll continue debugging and trace the OpenMP problem and try to resolve it.
If you fork and then try to use OMP in the forked process and are using libgomp, the process will hang, because libgomp does not support forking.
OMP has a very high overhead on EC2 instances with 36+ cores; this is a known problem with EC2 instances.
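A minimal sketch of the fork + libgomp anti-pattern being described, under the assumption that libmxnet is linked against libgomp: OpenMP work runs in the parent, and the forked child then tries to run OpenMP work again, which can hang.
# can hang with a libgomp-linked libmxnet: OMP is used in the parent, then again in a forked child
import multiprocessing
import mxnet as mx

def child():
    # OpenMP-backed CPU work inside the forked process
    print((mx.nd.ones((1000, 1000)) * 2).sum().asscalar())

if __name__ == '__main__':
    # OpenMP-backed CPU work in the parent before forking
    (mx.nd.ones((1000, 1000)) * 2).wait_to_read()
    p = multiprocessing.Process(target=child)
    p.start()
    p.join()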
There is a huge number of threads being created in operator_tune-inl.h, which causes a huge CPU load; the threads take a long time to determine whether each operator should use OpenMP or not.
This problem gets worse on machines with more cores.
And it multiplies with more processes.
As an experiment, if I change the OMP parallelism to a thread count of 1 and launch 32 processes, it takes just 6 seconds to load.
This is the code in https://github.com/apache/incubator-mxnet/blob/master/src/operator/operator_tune-inl.h#L359
const auto max_cores = static_cast<size_t>(omp_get_num_procs()) >> 1;
if (max_cores >= 2) {
  std::vector<duration_t> core_times;
  // Take care of any OMP lazy-init with a throwaway call
  for (size_t omp_threads = 2; omp_threads <= max_cores; ++omp_threads) {
    GetOMPLoopOverhead(omp_threads);
  }
If there are 36 cores, the total will be 2+3+4+...+18 = 170 threads.
If there are 32 mxnet Python processes, that is 170*32 = 5440 threads launched to determine tuning. Then there is resource contention + OMP badness.
We would need to reduce the number of threads used to determine tuning, making it bounded if possible.
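The arithmetic above, spelled out as a quick helper that mirrors the warm-up loop (the halving matches the max_cores computation in the snippet):
# number of distinct OMP thread counts the warm-up loop tries per process
def tuning_warmup_calls(num_procs):
    max_cores = num_procs // 2             # max_cores = omp_get_num_procs() >> 1
    return sum(range(2, max_cores + 1))    # one GetOMPLoopOverhead call per thread count

print(tuning_warmup_calls(36))        # 170 on a machine reporting 36 processors
print(tuning_warmup_calls(36) * 32)   # 5440 across 32 python processes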
@cjolivier01 Would running with MXNET_USE_OPERATOR_TUNING=0 mitigate this issue?
Hi @larroy, setting that env var does seem to avoid the issue. Here's the testing with and without it:
$ python parallel_test.py
1: 3.71833109856
2: 2.17086696625
3: 46.48721385
ubuntu@ip-172-31-83-54:~$ export MXNET_USE_OPERATOR_TUNING=0
ubuntu@ip-172-31-83-54:~$ python parallel_test.py
1: 0.798185110092
2: 0.792401075363
3: 0.818596124649
Notice that in the 2nd run, with the env var set, all runs take ~0.8 seconds.
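For completeness, the same switch can be set from Python, as long as it happens before the first import of mxnet (the tuning runs while the library is being loaded):
import os
os.environ['MXNET_USE_OPERATOR_TUNING'] = '0'  # must be set before the first import of mxnet
import mxnet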
@cjolivier01 what are the performance implications for setting this env var? Does it only affect non-MKL/MKLDNN operators executing on CPU context?
Closing this issue since the related PR is merged. Turning off OMP tuning for subprocesses will be tracked in a new issue.