Hi there, I am trying to build MXNet 2.0 (CPU only) on my laptop, which has 16 GB of memory. I found that it takes over 16 GB of memory to compile the single file src/operator/tensor/indexing_op.o, so I had to create an extra 8 GB of virtual memory just to build it.
Is it possible to split indexing_op into several smaller files to reduce the memory cost?
The latest code of MXNet 2.0
Arch Linux
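For reference, the extra virtual memory mentioned above can be set up with a swap file. A minimal sketch (assumes root access; the 8 GB size matches the figure above, and dd can replace fallocate on filesystems that do not support fallocate-backed swap):
```sh
# Create and enable an 8 GB swap file (sketch; adjust size and path as needed)
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile      # swap files must not be world-readable
sudo mkswap /swapfile
sudo swapon /swapfile
# Disable again after the build with: sudo swapoff /swapfile
```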
The issue has been solved.
The memory cost depends on the compiler and on the build method (ninja or make).
I built indexing_op.o with ninja using different versions of gcc:

Compiler | Memory cost (child high-water RSS)
-------------|---------------------------
g++ 6.4.1 | 1.95 GB
g++ 7.4.1 | 1.78 GB
g++ 10.1.0 | 11 GB

Besides, the compiler flags differ between build methods (for example, the Makefile enables -funroll-loops, which increases memory usage), so the memory cost differs as well.
The solution is to build MXNet with g++-6 or g++-7.
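A minimal sketch of selecting the older compiler for a fresh CMake/Ninja build (assumes gcc-7/g++-7 are installed and on the PATH; the environment-variable form matches the command used later in this thread):
```sh
# Configure a clean build directory with g++-7 instead of the system default compiler
mkdir -p build && cd build
CC=gcc-7 CXX=g++-7 cmake -GNinja -DUSE_CUDA=0 ..
# Equivalent: cmake -GNinja -DUSE_CUDA=0 -DCMAKE_C_COMPILER=gcc-7 -DCMAKE_CXX_COMPILER=g++-7 ..
ninja
```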
Fixing this would be a welcome improvement. Did you investigate whether the high memory consumption is consistent across gcc and clang, and whether it is still present on gcc 9 (or 10) / clang 10?
Hi @leezu , the compiler I used is the latest version of gcc, namely gcc 10.1.0.
indexing_op.o is the only file that takes a long time and more than 16 GB of memory to build.
I have not yet tried clang.
I believe I may be able to help, since we clearly have the same problem. I don't know much about cross-compiling, but I have access to a computer with more than 16 GB of memory.
I remember that it took less than 8 GB of memory to build older versions of MXNet.
In the latest version, most files still take less than 4 GB of memory, but a few files (e.g. indexing_op) take more than 16 GB (when building with g++ 10.1.0).
If we can reduce the memory cost, it will help with building MXNet on laptops and edge machines, which often have less than 8 GB or 16 GB of memory.
Only 4 files take more than 8GB
I think we should consider this a release-critical bug. @woreom @wkcn did you try if this affects the 1.7 / 1.x branches as well?
cc: @ciyongch
@leezu Sorry, I did not check the 1.7 and 1.x branches.
There seem to be some more issues. In certain build configurations with LLVM 7, many of the numpy object files blow up:
875M build/CMakeFiles/mxnet.dir/src/operator/tensor/broadcast_reduce_norm_value.cc.o
918M build/CMakeFiles/mxnet.dir/src/operator/numpy/np_elemwise_broadcast_logic_op.cc.o
1.2G build/CMakeFiles/mxnet.dir/src/operator/numpy/np_where_op.cc.o
1.9G build/CMakeFiles/mxnet.dir/src/operator/numpy/np_broadcast_reduce_op_value.cc.o
2.1G build/CMakeFiles/mxnet.dir/src/operator/numpy/linalg/np_norm_forward.cc.o
Hi @leezu, @wkcn , since this is only a build issue when building MXNet from source on certain machines (those with less than 16 GB of memory), I suggest not tagging it as a blocking issue for 1.7.0, and including the fix if it becomes available before the release.
Users can still install MXNet via the binary release/nightly image, or increase the virtual memory of their build machine as a workaround.
What do you think?
It would be great if you could provide a prebuilt package that works on a Raspberry Pi with ARMv7, because I tried to build every version from 1.2.1 to 1.6.0 and failed.
Hi @ciyongch , I agree that we don't need to tag it as a blocking issue, and it can be fixed after MXNet 1.7 is released.
Once the problem is addressed, we can backport the PR to the 1.7.x branch.
Hi @woreom , could you please create an issue requesting prebuilt MXNet packages for ARM?
MXNet has ARM build and test jobs (#18264 , #18058 ), but I don't know whether a prebuilt package will be released.
@wkcn I did (#18471), but @leezu closed it. I will open another one.
> I agree that we don't need to tag it as a blocking issue, and it can be fixed after MXNet 1.7 is released.
> Once the problem is addressed, we can backport the PR to the 1.7.x branch.

Thanks for your confirmation @wkcn :)
@woreom It seems that the pre-built MXNet 1.5 package will not be uploaded because of ASF licensing policy, but pre-built MXNet 1.7 and 2.0+ packages for ARM may be uploaded.
Before that, you can try a native build or cross-compiling, following the instructions: https://mxnet.apache.org/get_started?platform=devices&iot=raspberry-pi&
I disagree. Official MXNet releases are source releases. At this point in time, there exist 0 compliant binary releases.
It's very important that we don't introduce regressions that prevent users from building MXNet.
I didn't check if this is present in 1.7, but if it is, it certainly is a release blocker in my opinion. Note that this is probably a regression due to the work on MXNet 2. It's not acceptable to introduce such regressions in the 1.x series.
I measured the overall memory consumption during compilation using the Linux control group feature, via https://github.com/gsauthof/cgmemtime
Results are:
v1.7.x
Child user: 7658.352 s
Child sys : 263.657 s
Child wall: 199.661 s
Child high-water RSS : 1952024 KiB
Recursive and acc. high-water RSS+CACHE : 54680084 KiB
v1.6.x
Child user: 5758.186 s
Child sys : 222.487 s
Child wall: 131.241 s
Child high-water RSS : 2040712 KiB
Recursive and acc. high-water RSS+CACHE : 45344596 KiB
v1.5.x
Child user: 3800.705 s
Child sys : 143.353 s
Child wall: 112.121 s
Child high-water RSS : 1604820 KiB
Recursive and acc. high-water RSS+CACHE : 37374300 KiB
ccache is always cleaned between compilations. Results obtained with:
CC=gcc-7 CXX=g++-7 cmake -GNinja -DUSE_CUDA=0 ..
cgmemtime ninja
This is preliminary in that it measures parallel compilation, thus memory usage is very high. Overall there's a 44% increase from 1.5
Doing a single-process build of the 1.7.x branch (ninja -j1) just costs around 2 GB of memory at maximum.
Child user: 4167.479 s
Child sys : 159.497 s
Child wall: 4327.964 s
Child high-water RSS : 1952008 KiB
Recursive and acc. high-water RSS+CACHE : 2155568 KiB
I'm trying to use ninja to build MXNet 2.0 (the master branch) on my laptop (16 GB memory + 8 GB virtual memory). I will update the log later.
cmake -GNinja -DUSE_CUDA=0 ..
cgmemtime ninja
I ran ninja twice because the build was interrupted, and the second run continued the build.
(gcc 10.1.0, i7-7500u (2 cores 4 threads), MXNet(master, 1bf881f))
Child user: 3692.505 s
Child sys : 177.550 s
Child wall: 1017.096 s
Child high-water RSS : 1852208 KiB
Recursive and acc. high-water RSS+CACHE : 3877980 KiB
Child user: 13315.378 s
Child sys : 353.862 s
Child wall: 3847.364 s
Child high-water RSS : 11402844 KiB
Recursive and acc. high-water RSS+CACHE : 12226040 KiB
Thanks @wkcn. I'll report the same with gcc7. You are using gcc10 right?
@leezu
Yes, gcc 10.1.0, i7-7500u (2 cores 4 threads), MXNet(master, 1bf881f381f91b157a26d9beddcaa8f4960cc038)
Single process build of MXNet master with gcc7 gives the following results:
Child user: 5288.372 s
Child sys : 188.645 s
Child wall: 5481.062 s
Child high-water RSS : 2504976 KiB
Recursive and acc. high-water RSS+CACHE : 2674692 KiB
That's a 24% increase over 1.7, but less than 3 GB high-water. So I don't think we have any blocking issue here. @wkcn I suggest you reduce the number of parallel build jobs to stay under 16 GB. I also recommend using ccache to avoid rebuilding.
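A sketch of that suggestion (assumes ccache is installed; CMAKE_<LANG>_COMPILER_LAUNCHER is a standard CMake option, and the job count is only an example chosen to keep peak memory low):
```sh
# Route compilations through ccache and cap the number of parallel compile jobs
cmake -GNinja -DUSE_CUDA=0 \
      -DCMAKE_C_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CXX_COMPILER_LAUNCHER=ccache ..
ninja -j 2   # fewer simultaneous compiler processes -> lower peak RSS
```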
Hi @leezu , I am not sure that it costs less than 3 GB of memory to build indexing_op.o.
When building with ninja using multiple threads, it takes less than 12 GB of memory.
However, it takes more than 16 GB of memory to build indexing_op.o with `make` in a single thread. I need to confirm this.
I think the compiler flags may differ between the original `Makefile` and the `CMakeLists`.
I remember that the original Makefile will be deprecated. If so, this is not a blocking issue.
I modified indexing_op.cc and rebuilt it.
Ninja
`cgmemtime ninja -v -j 1`
/usr/bin/ccache /usr/lib/ccache/bin/c++ -DDMLC_CORE_USE_CMAKE -DDMLC_LOG_FATAL_THROW=1 -DDMLC_LOG_STACK_TRACE_SIZE=0 -DDMLC_MODERN_THREAD_LOCAL=0 -DDMLC_STRICT_CXX11 -DDMLC_USE_CXX11 -DDMLC_USE_CXX11=1 -DDMLC_USE_CXX14 -DMSHADOW_INT64_TENSOR_SIZE=0 -DMSHADOW_IN_CXX11 -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_CUDA=0 -DMSHADOW_USE_MKL=0 -DMSHADOW_USE_SSE -DMXNET_USE_BLAS_OPEN=1 -DMXNET_USE_LAPACK=1 -DMXNET_USE_LIBJPEG_TURBO=0 -DMXNET_USE_MKLDNN=1 -DMXNET_USE_OPENCV=1 -DMXNET_USE_OPENMP=1 -DMXNET_USE_OPERATOR_TUNING=1 -DMXNET_USE_SIGNAL_HANDLER=1 -DNDEBUG=1 -D__USE_XOPEN2K8 -Dmxnet_EXPORTS -I../3rdparty/mkldnn/include -I3rdparty/mkldnn/include -I../include -I../src -I../3rdparty/nvidia_cub -I../3rdparty/tvm/nnvm/include -I../3rdparty/tvm/include -I../3rdparty/dmlc-core/include -I../3rdparty/dlpack/include -I../3rdparty/mshadow -I../3rdparty/mkldnn/src/../include -I3rdparty/dmlc-core/include -isystem /usr/include/opencv4 -Wall -Wno-sign-compare -O3 -fopenmp -fPIC -Wno-unused-parameter -Wno-unknown-pragmas -Wno-unused-local-typedefs -msse3 -mf16c -std=gnu++17 -MD -MT CMakeFiles/mxnet.dir/src/operator/tensor/indexing_op.cc.o -MF CMakeFiles/mxnet.dir/src/operator/tensor/indexing_op.cc.o.d -o CMakeFiles/mxnet.dir/src/operator/tensor/indexing_op.cc.o -c ../src/operator/tensor/indexing_op.cc
Child user: 275.350 s
Child sys : 8.356 s
Child wall: 292.649 s
Child high-water RSS : 11403896 KiB
Recursive and acc. high-water RSS+CACHE : 11448656 KiB
Makefile
cgmemtime make -j1
g++ -std=c++17 -c -DMSHADOW_FORCE_STREAM -Wall -Wsign-compare -DDMLC_USE_CXX11=1 -DDMLC_USE_CXX11=1 -DDMLC_USE_CXX14=1 -DDMLC_MODERN_THREAD_LOCAL=0 -DDMLC_LOG_STACK_TRACE_SIZE=0 -DDMLC_LOG_FATAL_THROW=1 -O3 -DNDEBUG=1 -I/mnt/wkcn/proj/incubator-mxnet/3rdparty/mshadow/ -I/mnt/wkcn/proj/incubator-mxnet/3rdparty/dmlc-core/include -fPIC -I/mnt/wkcn/proj/incubator-mxnet/3rdparty/tvm/nnvm/include -I/mnt/wkcn/proj/incubator-mxnet/3rdparty/dlpack/include -I/mnt/wkcn/proj/incubator-mxnet/3rdparty/tvm/include -Iinclude -funroll-loops -Wno-unused-parameter -Wno-unknown-pragmas -Wno-unused-local-typedefs -msse3 -mf16c -DMSHADOW_USE_CUDA=0 -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_MKL=0 -DMSHADOW_RABIT_PS=0 -DMSHADOW_DIST_PS=0 -DMSHADOW_USE_PASCAL=0 -DMXNET_USE_OPENCV=1 -I/usr/include/opencv4 -fopenmp -DMXNET_USE_OPENMP=1 -DMXNET_USE_OPERATOR_TUNING=1 -DMSHADOW_INT64_TENSOR_SIZE=0 -DMXNET_USE_LAPACK -DMXNET_USE_BLAS_ATLAS=1 -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -DMXNET_USE_NVML=0 -DMXNET_USE_NCCL=0 -DMXNET_USE_LIBJPEG_TURBO=0 -MMD -c src/operator/tensor/indexing_op.cc -o build/src/operator/tensor/indexing_op.o
g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
make: *** [Makefile:579: build/src/operator/tensor/indexing_op.o] Error 1
Child user: 358.708 s
Child sys : 13.901 s
Child wall: 376.262 s
Child high-water RSS : 13741284 KiB
Recursive and acc. high-water RSS+CACHE : 13771840 KiB
After removing -funroll-loops and then building with make:
Child user: 260.011 s
Child sys : 4.400 s
Child wall: 265.816 s
Child high-water RSS : 11338948 KiB
Recursive and acc. high-water RSS+CACHE : 11392800 KiB
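For completeness, a hypothetical way to drop the flag for the measurement above (a sketch assuming -funroll-loops appears literally in the top-level Makefile; a backup copy is kept):
```sh
# Strip -funroll-loops from the Makefile (keeps a .bak copy), then rebuild serially
sed -i.bak 's/-funroll-loops//g' Makefile
cgmemtime make -j1
```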
The flags that differ between the ninja and Makefile builds (one way to extract the full commands for comparison is sketched after the list):
-DMSHADOW_USE_PASCAL=0
-DMXNET_USE_BLAS_ATLAS=1
-DDMLC_USE_CXX14=1
-DMSHADOW_RABIT_PS=0
-DMXNET_USE_LAPACK
-DMSHADOW_FORCE_STREAM
-DMXNET_USE_NCCL
-DMXNET_USE_NVML=0
-DMSHADOW_DIST_PS=0
-fno-builtin-free
-std=c++17
-funroll-loops
-fno-builtin-calloc
-fno-builtin-realloc
-fno-builtin-malloc
-Wsign-compare
-MMD
-DMSHADOW_USE_SSE
-DMXNET_USE_SIGNAL_HANDLER=1
-DMXNET_USE_LAPACK=1
-DMXNET_USE_MKLDNN=1
-DDMLC_USE_CXX14
-DMSHADOW_IN_CXX11
-DDMLC_USE_CXX11
-D__USE_XOPEN2K8
-DMXNET_USE_BLAS_OPEN=1
-DDMLC_STRICT_CXX11
-MD
-MF
-MT
-std=gnu++17
-Wno-sign-compare
-isystem
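One way to obtain such a comparison (a sketch; `ninja -t commands` and `make -n` print the commands without running them, and both target paths appear earlier in this thread; run ninja from the CMake build directory and make from the source root):
```sh
# Dump the exact compile command each build system uses for indexing_op, then diff them
ninja -t commands CMakeFiles/mxnet.dir/src/operator/tensor/indexing_op.cc.o > /tmp/ninja_cmd.txt
make -n build/src/operator/tensor/indexing_op.o > /tmp/make_cmd.txt
diff /tmp/ninja_cmd.txt /tmp/make_cmd.txt
```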
Hi @leezu , I found the cause.
The memory cost depends on the compiler and on the build method (ninja or make).
I built indexing_op.o with ninja using different versions of gcc:

Compiler | Memory cost (child high-water RSS)
-------------|---------------------------
g++ 6.4.1 | 1.95 GB
g++ 7.4.1 | 1.78 GB
g++ 10.1.0 | 11 GB

Besides, the compiler flags differ between build methods (for example, the Makefile enables -funroll-loops, which increases memory usage), so the memory cost differs as well.
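For reference, one way to reproduce the per-compiler numbers above for just this file (a sketch assuming the CMake/Ninja layout used elsewhere in this thread, with ccache cleared or disabled so the compiler actually runs):
```sh
# Re-measure only indexing_op.cc from inside the configured build directory
rm -f CMakeFiles/mxnet.dir/src/operator/tensor/indexing_op.cc.o
cgmemtime ninja -v -j 1 CMakeFiles/mxnet.dir/src/operator/tensor/indexing_op.cc.o
```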
Hi @leezu @woreom @ciyongch , I have found the cause.
The cause is related to the compiler: g++ 10 takes over 11 GB of memory to build indexing_op.o, but g++ 6 and g++ 7 take less than 2 GB.
The solution is to build MXNet with g++-6 or g++-7.
Thanks for your help!
@wkcn thank you for investigating this. The regression in gcc is quite serious. Would you check if there is a report at https://gcc.gnu.org/bugs/ and potentially open a new bug report? Eventually gcc10 will be shipped by default on many platforms and this issue may affect more users later.
@leezu Sorry, I do not know how to find the bug report at https://gcc.gnu.org/bugs/
@wkcn the bugtracker is linked on the page. It's https://gcc.gnu.org/bugzilla/
@leezu Thank you! I guess the bug is a memory leak in the gcc 10.1.0 compiler.
According to https://github.com/apache/incubator-mxnet/issues/15393#issuecomment-649127482 the leak already occurs with gcc8
@leezu I understand now that the main problem is gcc. I used the flags:
-DCMAKE_C_COMPILER=gcc-4.9 -DCMAKE_CXX_COMPILER=g++-4.9
to successfully build MXNet with OpenCV (both must be compiled with gcc-4.9), but when I import mxnet I get this error:
failed to map segment from shared object