Incubator-mxnet: [Discussion] Overhead in MXNet Execution

Created on 6 May 2019  路  12Comments  路  Source: apache/incubator-mxnet

Hi,

I'm starting this thread to collect issues that are likely caused by various overhead in MXNet execution so that we can focus the efforts in solving them. The focus is everything other than the operators and kvstores.

If you've seen performance issues that you found were due to the overhead in MXNet execution (e.g. ctypes/FFI call, asynchronous execution, graph construction overhead, memory release and clean-up), please either report it in comments or link relevant issue.

If you're unsure, please create a performance issue report in another github issue instead.

Be sure to include these details:

  • the performance issue.
  • what you tried to solve it.
  • timings you've collected if any.
Discussion Performance

Most helpful comment

The overhead is 33.87945038931711 after we used the default Engine (comment NaiveEngine line)

All 12 comments

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Feature

MXNet function call overhead is quite high
NumPy call overhead: 0.7 microsecond
TVM call overhead: 3.4 microsecond
MXNet call overhead: 28.3 microsecond

Code attached for benchmarking the function call overhead.

import numpy as np
from matplotlib import pyplot as plt
from IPython import display

def benchmark(func, n_start, n_end, n_stride=1):
    avg_times, sizes = [], (2**np.arange(n_start, n_end, n_stride))
    np.random.seed(0)
    for size in sizes:
        avg_times.append(func(size))
    return sizes, np.array(avg_times)

def np_copy(size):
    x = np.random.normal(size=size).astype('float32')
    y = np.empty_like(x)
    res = %timeit -o -q np.copyto(y, x)
    return res.average

_, times = benchmark(np_copy, 1, 8)
print('NumPy call overhead: %.1f microsecond'% (times.mean()*1e6,))

import tvm

def tvm_copy(size):
    x = np.random.normal(size=size).astype('float32')
    y = np.empty_like(x)
    x, y = tvm.nd.array(x), tvm.nd.array(y)
    res = %timeit -o -q x.copyto(y)
    return res.average

_, times = benchmark(tvm_copy, 1, 8)
print('TVM call overhead: %.1f microsecond'% (times.mean()*1e6,))

import mxnet as mx

def mx_copy(size):
    x = np.random.normal(size=size).astype('float32')
    y = np.empty_like(x)
    x, y = mx.nd.array(x), mx.nd.array(y)
    res = %timeit -o -q x.copyto(y)
    return res.average

_, times = benchmark(mx_copy, 1, 8)
print('MXNet call overhead: %.1f microsecond'% (times.mean()*1e6,))

Some profiling results:
https://gist.github.com/sxjscience/f069d49b250820be4483debfa9a8d4b5

Here I used NaiveEngine

MXNet: 25.77195848737444
JAX: 58.22208949497768
Numpy: 1.0632106236049108
PyTorch: 13.021196637834823
TensorFlow: 21.734169551304408

The overhead is 33.87945038931711 after we used the default Engine (comment NaiveEngine line)

In P3.2x:

With NaiveEngine:

MXNet: 49.990619931902195
JAX: 132.80538150242398
TVM: 4.161664417811803
Numpy: 1.8179416656494143
PyTorch: 20.977394921439036
TensorFlow: 47.540392194475444

With ThreadedEngine

MXNet: 61.272689274379196
JAX: 141.04713712419783
TVM: 4.231657300676618
Numpy: 1.918860844203404
PyTorch: 20.21517072405134
TensorFlow: 44.17456899370466

One idea that gained some popularity after discussion is to introduce an engine-less mode to MXNet, in which the operators are exposed in API and dispatched in a similar way as pytorch.

Given that Naive Engine option should be quite close to this already, we should verify the overhead in the Naive Engine mode and judge the necessity based on that result. Since the target is to measure the overhead, we will want to control the performance difference in operators.

@sxjscience what do the differences between Naive and ThreadedEngine reflect in the other frameworks?

i just would like to comment a bit about the tvm related benchmark. The copyto function was not really a good benchmark for TVM's FFI cost mainly because that function was implemented directly via ctypes because we didn't find it to be a bottleneck. This functionality could have been turned into cython(which is used to support the core functionality), which could speedup the overhead of the call.

The current best way test the FFI is through Packed function calls. PackedFunc is the place where most of the API functions are exposed, here I cross posted a quick benchmark on https://github.com/apache/incubator-mxnet/issues/17097

Note that it is important to compile TVM with Cython by typing make cython3 in the root folder. TVM by default uses cython if it is available, but we use the TVM_FFI env variable to make sure it is the case.

import timeit

setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop(x, y)')
timer.timeit(1)
num_repeat = 1000
print(timer.timeit(num_repeat) / num_repeat)

Results on my laptop(macbook pro 13inch)
```
$ TVM_FFI=cython python3 test.py
9.749500000000299e-08
$ TVM_FFI=ctypes python3 test.py
7.226789000000011e-06
````

To followup this again, the patch https://github.com/apache/incubator-tvm/pull/4549 fixed the copyto issue by cythonize it. Here is an updated script to add benchmark of copyto

import timeit

setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop(1, 2)')
timer.timeit(1)
num_repeat = 1000
print("TVM-nop:", timer.timeit(num_repeat) / num_repeat)


setup = """
import numpy as np
import tvm
x = np.random.normal(size=1).astype('float32')
y = np.empty_like(x)
"""

timer = timeit.Timer(setup=setup,
                     stmt='np.copyto(y, x)')
timer.timeit(1)
print("numpy.copyto:", timer.timeit(num_repeat) / num_repeat)


setup = """
import numpy as np
import tvm
x = np.random.normal(size=1).astype('float32')
y = np.empty_like(x)
x = tvm.nd.array(x)
y = tvm.nd.array(y)
"""

timer = timeit.Timer(setup=setup,
                     stmt='x.copyto(y)')
timer.timeit(1)
print("tvm.copyto:", timer.timeit(num_repeat) / num_repeat)

Results on my laptop(macbook pro 13inch)

$TVM_FFI=ctypes python test.py
TVM-nop: 5.864649e-06
numpy.copyto: 3.7303100000002364e-07
tvm.copyto: 8.198690000000008e-07

$ TVM_FFI=cython python test.py
TVM-nop: 1.1101499999999209e-07
numpy.copyto: 2.519029999999978e-07
tvm.copyto: 2.3881700000000226e-07
Was this page helpful?
0 / 5 - 0 ratings