Hi there, I am trying to build MXNet 2.0 (CPU only) on my laptop, which has 16 GB of memory. I found that it takes over 16 GB of memory to compile the single file src/operator/tensor/indexing_op.o, so I had to create an extra 8 GB of virtual memory just to build it.
Is it possible to split indexing_op into several smaller files to reduce the memory cost?
The latest code of MXNet 2.0
Arch Linux
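For reference, the extra virtual memory mentioned above can be set up with a swap file. A minimal sketch (assumes root access; the 8 GB size matches the figure above, and dd can replace fallocate on filesystems that do not support fallocate-backed swap):
```sh
# Create and enable an 8 GB swap file (sketch; adjust size and path as needed)
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile      # swap files must not be world-readable
sudo mkswap /swapfile
sudo swapon /swapfile
# Disable again after the build with: sudo swapoff /swapfile
```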
The issue has been solved.
The memory cost depends on the compiler and on the build method (ninja or make).
I built indexing_op.o with ninja using different versions of gcc:

Compiler | Memory cost (child high-water RSS)
-------------|---------------------------
g++ 6.4.1 | 1.95 GB
g++ 7.4.1 | 1.78 GB
g++ 10.1.0 | 11 GB

Besides, the compiler flags differ between build methods (for example, the Makefile enables -funroll-loops, which increases memory usage), so the memory cost differs as well.
The solution is to build MXNet with g++-6 or g++-7.
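A minimal sketch of selecting the older compiler for a fresh CMake/Ninja build (assumes gcc-7/g++-7 are installed and on the PATH; the environment-variable form matches the command used later in this thread):
```sh
# Configure a clean build directory with g++-7 instead of the system default compiler
mkdir -p build && cd build
CC=gcc-7 CXX=g++-7 cmake -GNinja -DUSE_CUDA=0 ..
# Equivalent: cmake -GNinja -DUSE_CUDA=0 -DCMAKE_C_COMPILER=gcc-7 -DCMAKE_CXX_COMPILER=g++-7 ..
ninja
```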
Fixing this would be a welcome improvement. Did you investigate whether the high memory consumption is consistent across gcc and clang, and whether it is still present on gcc 9 (or 10) / clang 10?
Hi @leezu , the compiler I used is the latest version of gcc, namely gcc 10.1.0.
indexing_op.o is the only file that takes a long time and more than 16 GB of memory to build.
I have not yet tried clang.
I believe I may be able to help, since we clearly have the same problem. I don't know much about cross-compiling, but I have access to a computer with more than 16 GB of memory.
I remember that it took less than 8 GB of memory to build older versions of MXNet.
In the latest version, most files still take less than 4 GB of memory, but a few files (e.g. indexing_op) take more than 16 GB (when building with g++ 10.1.0).
If we can reduce the memory cost, it will help with building MXNet on laptops and edge machines, which often have less than 8 GB or 16 GB of memory.
Only 4 files take more than 8GB
I think we should consider this a release-critical bug. @woreom @wkcn did you try if this affects the 1.7 / 1.x branches as well?
cc: @ciyongch
@leezu Sorry, I did not check the 1.7 and 1.x branches.
There seem to be some more issues. In certain build configurations with LLVM 7, many of the numpy object files blow up:
875M build/CMakeFiles/mxnet.dir/src/operator/tensor/broadcast_reduce_norm_value.cc.o
918M build/CMakeFiles/mxnet.dir/src/operator/numpy/np_elemwise_broadcast_logic_op.cc.o
1.2G build/CMakeFiles/mxnet.dir/src/operator/numpy/np_where_op.cc.o
1.9G build/CMakeFiles/mxnet.dir/src/operator/numpy/np_broadcast_reduce_op_value.cc.o
2.1G build/CMakeFiles/mxnet.dir/src/operator/numpy/linalg/np_norm_forward.cc.o
Hi @leezu, @wkcn , since this is only a build issue when building MXNet from source on certain machines (those with less than 16 GB of memory), I suggest not tagging it as a blocking issue for 1.7.0, and including the fix if it becomes available before the release.
Users can still install MXNet via the binary release/nightly image, or increase the virtual memory of their build machine as a workaround.
What do you think?
It would be great if you could provide a prebuilt package that works on a Raspberry Pi with ARMv7, because I tried to build every version from 1.2.1 to 1.6.0 and failed.
Hi @ciyongch , I agree that we don't need to tag it as a blocking issue, and it can be fixed after MXNet 1.7 is released.
Once the problem is addressed, we can backport the PR to the 1.7.x branch.
Hi @woreom , could you please create an issue requesting prebuilt MXNet packages for ARM?
MXNet has ARM build and test jobs (#18264 , #18058 ), but I don't know whether a prebuilt package will be released.
@wkcn I did (#18471), but @leezu closed it. I will open another one.
> I agree that we don't need to tag it as a blocking issue, and it can be fixed after MXNet 1.7 is released.
> Once the problem is addressed, we can backport the PR to the 1.7.x branch.

Thanks for your confirmation @wkcn :)
@woreom It seems that the pre-built MXNet 1.5 package will not be uploaded because of ASF licensing policy, but pre-built MXNet 1.7 and 2.0+ packages for ARM may be uploaded.
Before that, you can try a native build or cross-compiling, following the instructions: https://mxnet.apache.org/get_started?platform=devices&iot=raspberry-pi&
I disagree. Official MXNet releases are source releases. At this point in time, there exist 0 compliant binary releases.
It's very important that we don't introduce regressions that prevent users from building MXNet.
I didn't check if this is present in 1.7, but if it is, it certainly is a release blocker in my opinion. Note that this is probably a regression due to the work on MXNet 2. It's not acceptable to introduce such regressions in the 1.x series.
I measured the overall memory consumption during compilation using the Linux control group feature, via https://github.com/gsauthof/cgmemtime
Results are:
v1.7.x
Child user: 7658.352 s
Child sys : 263.657 s
Child wall: 199.661 s
Child high-water RSS : 1952024 KiB
Recursive and acc. high-water RSS+CACHE : 54680084 KiB
v1.6.x
Child user: 5758.186 s
Child sys : 222.487 s
Child wall: 131.241 s
Child high-water RSS : 2040712 KiB
Recursive and acc. high-water RSS+CACHE : 45344596 KiB
v1.5.x
Child user: 3800.705 s
Child sys : 143.353 s
Child wall: 112.121 s
Child high-water RSS : 1604820 KiB
Recursive and acc. high-water RSS+CACHE : 37374300 KiB
ccache is always cleaned between compilations. Results obtained with:
CC=gcc-7 CXX=g++-7 cmake -GNinja -DUSE_CUDA=0 ..
cgmemtime ninja
This is preliminary in that it measures parallel compilation, thus memory usage is very high. Overall there's a 44% increase from 1.5
Doing a single-process build of the 1.7.x branch (ninja -j1) just costs around 2 GB of memory at maximum.
Child user: 4167.479 s
Child sys : 159.497 s
Child wall: 4327.964 s
Child high-water RSS : 1952008 KiB
Recursive and acc. high-water RSS+CACHE : 2155568 KiB
I'm trying to use ninja to build MXNet 2.0 (the master branch) on my laptop (16 GB memory + 8 GB virtual memory). I will update the log later.
cmake -GNinja -DUSE_CUDA=0 ..
cgmemtime ninja
I ran ninja twice because the build was interrupted, and the second run continued the build.
(gcc 10.1.0, i7-7500u (2 cores 4 threads), MXNet(master, 1bf881f))
Child user: 3692.505 s
Child sys : 177.550 s
Child wall: 1017.096 s
Child high-water RSS : 1852208 KiB
Recursive and acc. high-water RSS+CACHE : 3877980 KiB
Child user: 13315.378 s
Child sys : 353.862 s
Child wall: 3847.364 s
Child high-water RSS : 11402844 KiB
Recursive and acc. high-water RSS+CACHE : 12226040 KiB
Thanks @wkcn. I'll report the same with gcc7. You are using gcc10 right?
@leezu
Yes, gcc 10.1.0, i7-7500u (2 cores 4 threads), MXNet(master, 1bf881f381f91b157a26d9beddcaa8f4960cc038)
Single process build of MXNet master with gcc7 gives the following results:
Child user: 5288.372 s
Child sys : 188.645 s
Child wall: 5481.062 s
Child high-water RSS : 2504976 KiB
Recursive and acc. high-water RSS+CACHE : 2674692 KiB
That's a 24% increase over 1.7, but less than 3 GB high-water. So I don't think we have any blocking issue here. @wkcn I suggest you reduce the number of parallel build jobs to stay under 16 GB. I also recommend using ccache to avoid rebuilding.
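A sketch of that suggestion (assumes ccache is installed; CMAKE_<LANG>_COMPILER_LAUNCHER is a standard CMake option, and the job count is only an example chosen to keep peak memory low):
```sh
# Route compilations through ccache and cap the number of parallel compile jobs
cmake -GNinja -DUSE_CUDA=0 \
      -DCMAKE_C_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CXX_COMPILER_LAUNCHER=ccache ..
ninja -j 2   # fewer simultaneous compiler processes -> lower peak RSS
```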
Hi @leezu , I am not sure that it costs less than 3 GB of memory to build indexing_op.o.
When building with ninja using multiple threads, it takes less than 12 GB of memory.
However, it takes more than 16 GB of memory to build indexing_op.o with `make` in a single thread. I need to confirm this.
I think the compiler flags may differ between the original `Makefile` and the `CMakeLists`.
I remember that the original Makefile will be deprecated. If so, this is not a blocking issue.
I modified indexing_op.cc and rebuilt it.
Ninja
`cgmemtime ninja -v -j 1`
/usr/bin/ccache /usr/lib/ccache/bin/c++ -DDMLC_CORE_USE_CMAKE -DDMLC_LOG_FATAL_THROW=1 -DDMLC_LOG_STACK_TRACE_SIZE=0 -DDMLC_MODERN_THREAD_LOCAL=0 -DDMLC_STRICT_CXX11 -DDMLC_USE_CXX11 -DDMLC_USE_CXX11=1 -DDMLC_USE_CXX14 -DMSHADOW_INT64_TENSOR_SIZE=0 -DMSHADOW_IN_CXX11 -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_CUDA=0 -DMSHADOW_USE_MKL=0 -DMSHADOW_USE_SSE -DMXNET_USE_BLAS_OPEN=1 -DMXNET_USE_LAPACK=1 -DMXNET_USE_LIBJPEG_TURBO=0 -DMXNET_USE_MKLDNN=1 -DMXNET_USE_OPENCV=1 -DMXNET_USE_OPENMP=1 -DMXNET_USE_OPERATOR_TUNING=1 -DMXNET_USE_SIGNAL_HANDLER=1 -DNDEBUG=1 -D__USE_XOPEN2K8 -Dmxnet_EXPORTS -I../3rdparty/mkldnn/include -I3rdparty/mkldnn/include -I../include -I../src -I../3rdparty/nvidia_cub -I../3rdparty/tvm/nnvm/include -I../3rdparty/tvm/include -I../3rdparty/dmlc-core/include -I../3rdparty/dlpack/include -I../3rdparty/mshadow -I../3rdparty/mkldnn/src/../include -I3rdparty/dmlc-core/include -isystem /usr/include/opencv4 -Wall -Wno-sign-compare -O3 -fopenmp -fPIC -Wno-unused-parameter -Wno-unknown-pragmas -Wno-unused-local-typedefs -msse3 -mf16c -std=gnu++17 -MD -MT CMakeFiles/mxnet.dir/src/operator/tensor/indexing_op.cc.o -MF CMakeFiles/mxnet.dir/src/operator/tensor/indexing_op.cc.o.d -o CMakeFiles/mxnet.dir/src/operator/tensor/indexing_op.cc.o -c ../src/operator/tensor/indexing_op.cc
Child user: 275.350 s
Child sys : 8.356 s
Child wall: 292.649 s
Child high-water RSS : 11403896 KiB
Recursive and acc. high-water RSS+CACHE : 11448656 KiB
Makefile
cgmemtime make -j1
g++ -std=c++17 -c -DMSHADOW_FORCE_STREAM -Wall -Wsign-compare -DDMLC_USE_CXX11=1 -DDMLC_USE_CXX11=1 -DDMLC_USE_CXX14=1 -DDMLC_MODERN_THREAD_LOCAL=0 -DDMLC_LOG_STACK_TRACE_SIZE=0 -DDMLC_LOG_FATAL_THROW=1 -O3 -DNDEBUG=1 -I/mnt/wkcn/proj/incubator-mxnet/3rdparty/mshadow/ -I/mnt/wkcn/proj/incubator-mxnet/3rdparty/dmlc-core/include -fPIC -I/mnt/wkcn/proj/incubator-mxnet/3rdparty/tvm/nnvm/include -I/mnt/wkcn/proj/incubator-mxnet/3rdparty/dlpack/include -I/mnt/wkcn/proj/incubator-mxnet/3rdparty/tvm/include -Iinclude -funroll-loops -Wno-unused-parameter -Wno-unknown-pragmas -Wno-unused-local-typedefs -msse3 -mf16c -DMSHADOW_USE_CUDA=0 -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_MKL=0 -DMSHADOW_RABIT_PS=0 -DMSHADOW_DIST_PS=0 -DMSHADOW_USE_PASCAL=0 -DMXNET_USE_OPENCV=1 -I/usr/include/opencv4 -fopenmp -DMXNET_USE_OPENMP=1 -DMXNET_USE_OPERATOR_TUNING=1 -DMSHADOW_INT64_TENSOR_SIZE=0 -DMXNET_USE_LAPACK -DMXNET_USE_BLAS_ATLAS=1 -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -DMXNET_USE_NVML=0 -DMXNET_USE_NCCL=0 -DMXNET_USE_LIBJPEG_TURBO=0 -MMD -c src/operator/tensor/indexing_op.cc -o build/src/operator/tensor/indexing_op.o
g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
make: *** [Makefile:579: build/src/operator/tensor/indexing_op.o] Error 1
Child user: 358.708 s
Child sys : 13.901 s
Child wall: 376.262 s
Child high-water RSS : 13741284 KiB
Recursive and acc. high-water RSS+CACHE : 13771840 KiB
After removing -funroll-loops and then building with make:
Child user: 260.011 s
Child sys : 4.400 s
Child wall: 265.816 s
Child high-water RSS : 11338948 KiB
Recursive and acc. high-water RSS+CACHE : 11392800 KiB
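For completeness, a hypothetical way to drop the flag for the measurement above (a sketch assuming -funroll-loops appears literally in the top-level Makefile; a backup copy is kept):
```sh
# Strip -funroll-loops from the Makefile (keeps a .bak copy), then rebuild serially
sed -i.bak 's/-funroll-loops//g' Makefile
cgmemtime make -j1
```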
The flags that differ between the ninja and Makefile builds (one way to extract the full commands for comparison is sketched after the list):
-DMSHADOW_USE_PASCAL=0
-DMXNET_USE_BLAS_ATLAS=1
-DDMLC_USE_CXX14=1
-DMSHADOW_RABIT_PS=0
-DMXNET_USE_LAPACK
-DMSHADOW_FORCE_STREAM
-DMXNET_USE_NCCL
-DMXNET_USE_NVML=0
-DMSHADOW_DIST_PS=0
-fno-builtin-free
-std=c++17
-funroll-loops
-fno-builtin-calloc
-fno-builtin-realloc
-fno-builtin-malloc
-Wsign-compare
-MMD
-DMSHADOW_USE_SSE
-DMXNET_USE_SIGNAL_HANDLER=1
-DMXNET_USE_LAPACK=1
-DMXNET_USE_MKLDNN=1
-DDMLC_USE_CXX14
-DMSHADOW_IN_CXX11
-DDMLC_USE_CXX11
-D__USE_XOPEN2K8
-DMXNET_USE_BLAS_OPEN=1
-DDMLC_STRICT_CXX11
-MD
-MF
-MT
-std=gnu++17
-Wno-sign-compare
-isystem
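One way to obtain such a comparison (a sketch; `ninja -t commands` and `make -n` print the commands without running them, and both target paths appear earlier in this thread; run ninja from the CMake build directory and make from the source root):
```sh
# Dump the exact compile command each build system uses for indexing_op, then diff them
ninja -t commands CMakeFiles/mxnet.dir/src/operator/tensor/indexing_op.cc.o > /tmp/ninja_cmd.txt
make -n build/src/operator/tensor/indexing_op.o > /tmp/make_cmd.txt
diff /tmp/ninja_cmd.txt /tmp/make_cmd.txt
```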
Hi @leezu , I found the cause.
The memory cost depends on the compiler and on the build method (ninja or make).
I built indexing_op.o with ninja using different versions of gcc:

Compiler | Memory cost (child high-water RSS)
-------------|---------------------------
g++ 6.4.1 | 1.95 GB
g++ 7.4.1 | 1.78 GB
g++ 10.1.0 | 11 GB

Besides, the compiler flags differ between build methods (for example, the Makefile enables -funroll-loops, which increases memory usage), so the memory cost differs as well.
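For reference, one way to reproduce the per-compiler numbers above for just this file (a sketch assuming the CMake/Ninja layout used elsewhere in this thread, with ccache cleared or disabled so the compiler actually runs):
```sh
# Re-measure only indexing_op.cc from inside the configured build directory
rm -f CMakeFiles/mxnet.dir/src/operator/tensor/indexing_op.cc.o
cgmemtime ninja -v -j 1 CMakeFiles/mxnet.dir/src/operator/tensor/indexing_op.cc.o
```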
Hi @leezu @woreom @ciyongch , I have found the cause.
The cause is related to the compiler: g++ 10 takes over 11 GB of memory to build indexing_op.o, but g++ 6 and g++ 7 take less than 2 GB.
The solution is to build MXNet with g++-6 or g++-7.
Thanks for your help!
@wkcn thank you for investigating this. The regression in gcc is quite serious. Would you check if there is a report at https://gcc.gnu.org/bugs/ and potentially open a new bug report? Eventually gcc10 will be shipped by default on many platforms and this issue may affect more users later.
@leezu Sorry, I do not know how to find the bug report at https://gcc.gnu.org/bugs/
@wkcn the bugtracker is linked on the page. It's https://gcc.gnu.org/bugzilla/
@leezu Thank you! I guess the bug is a memory leak in the gcc 10.1.0 compiler.
According to https://github.com/apache/incubator-mxnet/issues/15393#issuecomment-649127482 the leak already occurs with gcc8
@leezu I understand now that the main problem is gcc. I used the flags:
-DCMAKE_C_COMPILER=gcc-4.9 -DCMAKE_CXX_COMPILER=g++-4.9
to successfully build MXNet with OpenCV (both must be compiled with gcc-4.9), but when I import mxnet I get this error:
failed to map segment from shared object