MXNet shows very low CPU usage when running CPU operations in multi-process scenarios. Specifically, when MXNet computation runs in a subprocess, it uses only 1 or 2 CPUs. The behavior differs across MXNet variants (see below) and across machines.
This issue is critical because it slows down multi-process object-detection data loading in GluonCV very significantly, making Faster-RCNN training in GluonCV unusable.
This was tested on the 20181207 nightly build; other versions (e.g., 1.3.1) show similar problems.
Code to reproduce the issue
Filename: mxnet_cpu_test.py
import argparse
import sys
from concurrent import futures
import time
import numpy as np

mx = None  # filled in by a top-level import unless --late-import is given


def run(need_import):
    # Either reuse the mxnet module imported in the main process, or
    # import it inside the worker process (--late-import).
    global mx
    if need_import:
        import mxnet as mx
    # Keep the CPU busy with a large matrix product.
    A = mx.nd.random.uniform(low=0, high=1, shape=(5000, 5000))
    while True:
        A = mx.nd.dot(A, A)


def parse_args():
    parser = argparse.ArgumentParser("benchmark mxnet cpu")
    parser.add_argument('--num-workers', '-j', dest='num_workers', type=int, default=0)
    parser.add_argument('--late-import', action='store_true')
    return parser.parse_args()


def main(args):
    if args.num_workers == 0:
        print("Main process")
        try:
            run(need_import=args.late_import)
        except KeyboardInterrupt:
            pass
    else:
        print("Subprocesses")
        ex = futures.ProcessPoolExecutor(args.num_workers)
        for _ in range(args.num_workers):
            ex.submit(run, need_import=args.late_import)
        while True:
            try:
                time.sleep(10000)
            except KeyboardInterrupt:
                ex.shutdown(wait=False)
                break
    print("Stopped")


if __name__ == "__main__":
    args = parse_args()
    if not args.late_import:
        import mxnet as mx
    main(args)
Detailed experiments:
Run in the main process:
python3 mxnet_cpu_test.py --num-workers=0

Works fine for all MXNet variants (GPU or CPU-only).
Run in two subprocesses:
-- mxnet-cu90 on p3.16x:
python3 mxnet_cpu_test.py --num-workers=2

It uses only 2 CPUs per subprocess.
-- mxnet-mkl on p3.16x:
python3 mxnet_cpu_test.py --num-workers=2

Same here. It uses only 2 CPUs per subprocess.
-- mxnet-mkl on CPU-only machine c5.18x:
python3 mxnet_cpu_test.py --num-workers=2

Even worse. It uses only 1.5 CPUs per subprocess.
-- However, for vanilla CPU-version mxnet on c5.18x:
python3 mxnet_cpu_test.py --num-workers=2

This works better; at least it uses 5 CPUs per subprocess.
-- Weirdly, still vanilla CPU-version mxnet but on GPU machine p3.16x:
python3 mxnet_cpu_test.py --num-workers=2

It works worse, i.e., 2 CPUs per subprocess.
-- Do not import mxnet in the main process and instead import mxnet in each subprocess:
python3 mxnet_cpu_test.py --num-workers=2 --late-import
@TaoLv to help look at this issue
@YutingZhang Thanks for reporting this issue! @anirudh2290 @apeforest @azai91 @samskalicky please take a look here.
Hi @YutingZhang, please try setting OMP_NUM_THREADS = #cores / #workers. Please let me know if it works for you. Thanks.
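A minimal sketch of that suggestion, assuming Linux and an even cores-per-worker split; the helper name compute_omp_threads is hypothetical and not part of MXNet:

import os

def compute_omp_threads(num_workers):
    # Divide the logical CPUs evenly among the worker processes.
    # os.cpu_count() counts hyperthreads; halve it for physical cores only.
    return max(1, os.cpu_count() // max(1, num_workers))

if __name__ == "__main__":
    num_workers = 2
    # Must be set before mxnet (and its OpenMP runtime) is loaded.
    os.environ["OMP_NUM_THREADS"] = str(compute_omp_threads(num_workers))
    import mxnet as mx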
Related issue: https://github.com/apache/incubator-mxnet/issues/12255
The limitation of 1 thread per worker is deliberately set to avoid thread contention.
Per offline discussion, I think a good solution is to use an environment variable to control the number of threads each worker can use (which currently defaults to 1).
@zhreshold this would also require a rebuild with a modified initialize.cc; otherwise the env variable would get overwritten.
@anirudh2290 Yes, I mean a PR is required to address this issue.
Thanks everyone for discussing and solving the issue!
@zhreshold I tried the latest version of mxnet and ran export MXNET_MP_WORKER_NTHREADS=20. However, the example code I posted still shows the same CPU usage. Any ideas?
@YutingZhang MXNET_MP_WORKER_NTHREADS only controls how many mxnet operators run in parallel; for some transformations it may not be able to parallelize many ops. Due to an OpenMP bug, OpenMP is disabled in the workers, so unfortunately that is the situation.
You might want to enable OpenCV multithreading in each worker, since OpenCV is likely the most time-consuming part of the worker process.
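For reference, a minimal sketch of per-worker OpenCV multithreading; cv2.setNumThreads is a real OpenCV call, but worker_init, transform, and the thread count are illustrative and not taken from gluoncv:

import cv2
import numpy as np

def worker_init(num_cv_threads=4):
    # Size OpenCV's internal thread pool for this worker process.
    cv2.setNumThreads(num_cv_threads)

def transform(img):
    # Typical data-loading work that can benefit from OpenCV's threads.
    img = cv2.resize(img, (512, 512))
    return cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

if __name__ == "__main__":
    worker_init()
    out = transform(np.zeros((300, 300, 3), dtype=np.uint8))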
@pengzhao-intel @TaoLv @anirudh2290 @zhreshold Thank you all for your help, and happy new year! This problem seems more complicated than it appeared (there may have been multiple problems from the beginning). @zhreshold's fix solved the problem in most cases.
However, I found that if we call asnumpy in each worker, the processes interfere with one another. This does not seem to be a problem for the GPU version of MXNet running on a GPU machine; it seems to happen only on CPU-only machines (I tested on c5.18xlarge with mxnet-mkl).
Code (one-line difference):
import argparse
import sys
from concurrent import futures
import time
import numpy as np

mx = None  # filled in by a top-level import unless --late-import is given


def run(need_import):
    # Either reuse the mxnet module imported in the main process, or
    # import it inside the worker process (--late-import).
    global mx
    if need_import:
        import mxnet as mx
    A = mx.nd.random.uniform(low=0, high=1, shape=(5000, 5000))
    while True:
        A = mx.nd.dot(A, A)
        A.asnumpy()  # ******** only difference ***********


def parse_args():
    parser = argparse.ArgumentParser("benchmark mxnet cpu")
    parser.add_argument('--num-workers', '-j', dest='num_workers', type=int, default=0)
    parser.add_argument('--late-import', action='store_true')
    return parser.parse_args()


def main(args):
    if args.num_workers == 0:
        print("Main process")
        try:
            run(need_import=args.late_import)
        except KeyboardInterrupt:
            pass
    else:
        print("Subprocesses")
        ex = futures.ProcessPoolExecutor(args.num_workers)
        for _ in range(args.num_workers):
            ex.submit(run, need_import=args.late_import)
        while True:
            try:
                time.sleep(10000)
            except KeyboardInterrupt:
                ex.shutdown(wait=False)
                break
    print("Stopped")


if __name__ == "__main__":
    args = parse_args()
    if not args.late_import:
        import mxnet as mx
    main(args)
Launch 10 workers (python3 mxnet_cpu_test.py --num-workers=10). MXNET_MP_WORKER_NTHREADS does not affect the results.

But running it only in the main process is fine:

By the way, another issue I found with mxnet (CPU, non-MKL version): when you run MXNet in a subprocess, it interferes with many other non-mxnet functions (e.g., cv2.cvtColor); the subprocess gets stuck in those functions. This did not happen with mxnet==1.3.1 but started happening in some nightly builds. We should probably create a new ticket for this.
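A minimal sketch of what I mean, assuming a Linux fork start method and opencv-python installed; it only illustrates the reported pattern, not a confirmed reproduction:

import multiprocessing as mp
import numpy as np
import cv2
import mxnet as mx  # imported in the parent process before forking

def child():
    img = np.zeros((256, 256, 3), dtype=np.uint8)
    print("before cvtColor")
    cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # the subprocess reportedly gets stuck here
    print("after cvtColor")

if __name__ == "__main__":
    p = mp.Process(target=child)
    p.start()
    p.join()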
@YutingZhang thanks for the case, we will look into the issue.
@YutingZhang If you just want to utilize 100% CPU for each process, please try export KMP_AFFINITY=granularity=fine,noduplicates; it works in my environment.
If you want to enable OpenMP multi-threading to utilize more than 100% CPU per process, you need to apply the change below to MXNet:
https://github.com/ZhennanQin/incubator-mxnet/commit/48fe761f0268c316477fac23d005d26b29c65a47
Then you can use export OMP_NUM_THREADS=4 to allow 4x CPU usage per process.
If you don't want to change MXNet and just want to increase the efficiency of the MKL dot, you can try export MKL_NUM_THREADS=4. It only affects the MKL library.
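A small sketch combining these environment variables from Python, assuming they are set before mxnet and its MKL/OpenMP runtimes are loaded; the exact values are placeholders:

import os

# All of these must be set before `import mxnet`.
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,noduplicates")
os.environ.setdefault("MKL_NUM_THREADS", "4")   # speeds up MKL ops such as dot
os.environ.setdefault("OMP_NUM_THREADS", "4")   # needs the patched MXNet build linked above

import mxnet as mx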
@zhreshold do you know the background on why the thread number was fixed to 1 in the worker process, as shown in the line below?
ZhennanQin@48fe761
Got some info from @YutingZhang in #13449 and #12380, thanks a lot.
@anirudh2290
@pengzhao-intel The thread limit is set to 1 according to this comment: https://github.com/apache/incubator-mxnet/pull/13606#discussion_r240914759
If you have a better understanding of the problem, please let me know.
@YutingZhang Just tested the master version; the environment variable OMP_NUM_THREADS can now effectively control the number of OMP threads each worker is allowed to use.
For example, OMP_NUM_THREADS=32 python3 mxnet_cpu_test.py --num-workers=2 gives:
