MXNet shows very low CPU usage when running CPU operations in multi-process scenarios. Specifically, when MXNet computation runs in a subprocess, it uses only 1 or 2 CPUs. The behavior differs across MXNet variants (see below) and across machines.
This issue is critical because it slows down multi-process object-detection data loading in GluonCV very significantly, making Faster-RCNN training in GluonCV unusable.
This was tested on the 20181207 nightly build; other versions (e.g., 1.3.1) show similar problems.
Code to reproduce the issue
Filename: mxnet_cpu_test.py
import argparse
import sys
from concurrent import futures
import time
import numpy as np

mx = None  # filled in by a top-level import unless --late-import is given


def run(need_import):
    # Either reuse the mxnet module imported in the main process, or
    # import it inside the worker process (--late-import).
    global mx
    if need_import:
        import mxnet as mx
    # Keep the CPU busy with a large matrix product.
    A = mx.nd.random.uniform(low=0, high=1, shape=(5000, 5000))
    while True:
        A = mx.nd.dot(A, A)


def parse_args():
    parser = argparse.ArgumentParser("benchmark mxnet cpu")
    parser.add_argument('--num-workers', '-j', dest='num_workers', type=int, default=0)
    parser.add_argument('--late-import', action='store_true')
    return parser.parse_args()


def main(args):
    if args.num_workers == 0:
        print("Main process")
        try:
            run(need_import=args.late_import)
        except KeyboardInterrupt:
            pass
    else:
        print("Subprocesses")
        ex = futures.ProcessPoolExecutor(args.num_workers)
        for _ in range(args.num_workers):
            ex.submit(run, need_import=args.late_import)
        while True:
            try:
                time.sleep(10000)
            except KeyboardInterrupt:
                ex.shutdown(wait=False)
                break
    print("Stopped")


if __name__ == "__main__":
    args = parse_args()
    if not args.late_import:
        import mxnet as mx
    main(args)
Detailed experiments:
Run in the main process:
python3 mxnet_cpu_test.py --num-workers=0

Works fine for all MXNet variants (GPU or CPU-only).
Run in two subprocesses:
-- mxnet-cu90 on p3.16x:
python3 mxnet_cpu_test.py --num-workers=2

It uses only 2 CPUs per subprocess.
-- mxnet-mkl on p3.16x:
python3 mxnet_cpu_test.py --num-workers=2

Same here. It uses only 2 CPUs per subprocess.
-- mxnet-mkl on CPU-only machine c5.18x:
python3 mxnet_cpu_test.py --num-workers=2

Even worse. It uses only 1.5 CPUs per subprocess.
-- However, for vanilla CPU-version mxnet on c5.18x:
python3 mxnet_cpu_test.py --num-workers=2

This works better; at least it uses 5 CPUs per subprocess.
-- Weirdly, still vanilla CPU-version mxnet but on GPU machine p3.16x:
python3 mxnet_cpu_test.py --num-workers=2

It works worse, i.e., 2 CPUs per subprocess.
-- Do not import mxnet in the main process and instead import mxnet in each subprocess:
python3 mxnet_cpu_test.py --num-workers=2 --late-import
@TaoLv to help look at this issue
@YutingZhang Thanks for reporting this issue! @anirudh2290 @apeforest @azai91 @samskalicky please take a look here.
Hi @YutingZhang, please try setting OMP_NUM_THREADS = #cores / #workers. Please let me know if it works for you. Thanks.
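A minimal sketch of that suggestion, assuming Linux and an even cores-per-worker split; the helper name compute_omp_threads is hypothetical and not part of MXNet:

import os

def compute_omp_threads(num_workers):
    # Divide the logical CPUs evenly among the worker processes.
    # os.cpu_count() counts hyperthreads; halve it for physical cores only.
    return max(1, os.cpu_count() // max(1, num_workers))

if __name__ == "__main__":
    num_workers = 2
    # Must be set before mxnet (and its OpenMP runtime) is loaded.
    os.environ["OMP_NUM_THREADS"] = str(compute_omp_threads(num_workers))
    import mxnet as mx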
Related issue: https://github.com/apache/incubator-mxnet/issues/12255
The limitation of 1 thread per worker is deliberately set to avoid thread contention.
Per offline discussion, I think a good solution is to use an environment variable to control the number of threads each worker can use (which currently defaults to 1).
@zhreshold this would also require a rebuild with a modified initialize.cc; otherwise the env variable would get overwritten.
@anirudh2290 Yes, I mean a PR is required to address this issue.
Thanks everyone for discussing and solving the issue!
@zhreshold I tried the latest version of mxnet and ran export MXNET_MP_WORKER_NTHREADS=20. However, the example code I posted still shows the same CPU usage. Any ideas?
@YutingZhang MXNET_MP_WORKER_NTHREADS only controls how many mxnet operators run in parallel; for some transformations it may not be able to parallelize many ops. Due to an OpenMP bug, OpenMP is disabled in the workers, so unfortunately that is the situation.
You might want to enable OpenCV multithreading in each worker, since OpenCV is likely the most time-consuming part of the worker process.
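For reference, a minimal sketch of per-worker OpenCV multithreading; cv2.setNumThreads is a real OpenCV call, but worker_init, transform, and the thread count are illustrative and not taken from gluoncv:

import cv2
import numpy as np

def worker_init(num_cv_threads=4):
    # Size OpenCV's internal thread pool for this worker process.
    cv2.setNumThreads(num_cv_threads)

def transform(img):
    # Typical data-loading work that can benefit from OpenCV's threads.
    img = cv2.resize(img, (512, 512))
    return cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

if __name__ == "__main__":
    worker_init()
    out = transform(np.zeros((300, 300, 3), dtype=np.uint8))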
@pengzhao-intel @TaoLv @anirudh2290 @zhreshold Thank you all for your help, and happy new year! This problem seems more complicated than it appeared (there may have been multiple problems from the beginning). @zhreshold's fix solved the problem in most cases.
However, I found that if we call asnumpy in each worker, the processes interfere with one another. This does not seem to be a problem for the GPU version of MXNet running on a GPU machine; it seems to happen only on CPU-only machines (I tested on c5.18xlarge with mxnet-mkl).
Code (one-line difference):
import argparse
import sys
from concurrent import futures
import time
import numpy as np

mx = None  # filled in by a top-level import unless --late-import is given


def run(need_import):
    # Either reuse the mxnet module imported in the main process, or
    # import it inside the worker process (--late-import).
    global mx
    if need_import:
        import mxnet as mx
    A = mx.nd.random.uniform(low=0, high=1, shape=(5000, 5000))
    while True:
        A = mx.nd.dot(A, A)
        A.asnumpy()  # ******** only difference ***********


def parse_args():
    parser = argparse.ArgumentParser("benchmark mxnet cpu")
    parser.add_argument('--num-workers', '-j', dest='num_workers', type=int, default=0)
    parser.add_argument('--late-import', action='store_true')
    return parser.parse_args()


def main(args):
    if args.num_workers == 0:
        print("Main process")
        try:
            run(need_import=args.late_import)
        except KeyboardInterrupt:
            pass
    else:
        print("Subprocesses")
        ex = futures.ProcessPoolExecutor(args.num_workers)
        for _ in range(args.num_workers):
            ex.submit(run, need_import=args.late_import)
        while True:
            try:
                time.sleep(10000)
            except KeyboardInterrupt:
                ex.shutdown(wait=False)
                break
    print("Stopped")


if __name__ == "__main__":
    args = parse_args()
    if not args.late_import:
        import mxnet as mx
    main(args)
Launch 10 workers (python3 mxnet_cpu_test.py --num-workers=10). MXNET_MP_WORKER_NTHREADS does not affect the results.

But running it only in the main process is fine:

By the way, another issue I found with mxnet (CPU, non-MKL version): when you run MXNet in a subprocess, it interferes with many other non-mxnet functions (e.g., cv2.cvtColor); the subprocess gets stuck in those functions. This did not happen with mxnet==1.3.1 but started happening in some nightly builds. We should probably create a new ticket for this.
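A minimal sketch of what I mean, assuming a Linux fork start method and opencv-python installed; it only illustrates the reported pattern, not a confirmed reproduction:

import multiprocessing as mp
import numpy as np
import cv2
import mxnet as mx  # imported in the parent process before forking

def child():
    img = np.zeros((256, 256, 3), dtype=np.uint8)
    print("before cvtColor")
    cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # the subprocess reportedly gets stuck here
    print("after cvtColor")

if __name__ == "__main__":
    p = mp.Process(target=child)
    p.start()
    p.join()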
@YutingZhang thanks for the case, we will look into the issue.
@YutingZhang If you just want to utilize 100% CPU for each process, please try export KMP_AFFINITY=granularity=fine,noduplicates; it works in my environment.
If you want to enable OpenMP multi-threading to utilize more than 100% CPU per process, you need to apply the change below to MXNet:
https://github.com/ZhennanQin/incubator-mxnet/commit/48fe761f0268c316477fac23d005d26b29c65a47
Then you can use export OMP_NUM_THREADS=4 to allow 4x CPU usage per process.
If you don't want to change MXNet and just want to increase the efficiency of the MKL dot, you can try export MKL_NUM_THREADS=4. It only affects the MKL library.
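A small sketch combining these environment variables from Python, assuming they are set before mxnet and its MKL/OpenMP runtimes are loaded; the exact values are placeholders:

import os

# All of these must be set before `import mxnet`.
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,noduplicates")
os.environ.setdefault("MKL_NUM_THREADS", "4")   # speeds up MKL ops such as dot
os.environ.setdefault("OMP_NUM_THREADS", "4")   # needs the patched MXNet build linked above

import mxnet as mx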
@zhreshold do you know the background on why the thread number was fixed to 1 in the worker process, as shown in the line below?
ZhennanQin@48fe761
Got some info from @YutingZhang in #13449 and #12380, thanks a lot.
@anirudh2290
@pengzhao-intel The thread limit is set to 1 according to this comment: https://github.com/apache/incubator-mxnet/pull/13606#discussion_r240914759
If you have a better understanding of the problem, please let me know.
@YutingZhang Just tested the master version; the environment variable OMP_NUM_THREADS can now effectively control the number of OMP threads each worker is allowed to use.
For example, OMP_NUM_THREADS=32 python3 mxnet_cpu_test.py --num-workers=2 gives:
