Hi,
I'm starting this thread to collect issues that are likely caused by various overhead in MXNet execution so that we can focus the efforts in solving them. The focus is everything other than the operators and kvstores.
If you've seen performance issues that you found were due to the overhead in MXNet execution (e.g. ctypes/FFI call, asynchronous execution, graph construction overhead, memory release and clean-up), please either report it in comments or link relevant issue.
If you're unsure, please create a performance issue report in another github issue instead.
Be sure to include these details:
Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Feature
MXNet function call overhead is quite high
NumPy call overhead: 0.7 microsecond
TVM call overhead: 3.4 microsecond
MXNet call overhead: 28.3 microsecond
Code attached for benchmarking the function call overhead.
import numpy as np
from matplotlib import pyplot as plt
from IPython import display
def benchmark(func, n_start, n_end, n_stride=1):
avg_times, sizes = [], (2**np.arange(n_start, n_end, n_stride))
np.random.seed(0)
for size in sizes:
avg_times.append(func(size))
return sizes, np.array(avg_times)
def np_copy(size):
x = np.random.normal(size=size).astype('float32')
y = np.empty_like(x)
res = %timeit -o -q np.copyto(y, x)
return res.average
_, times = benchmark(np_copy, 1, 8)
print('NumPy call overhead: %.1f microsecond'% (times.mean()*1e6,))
import tvm
def tvm_copy(size):
x = np.random.normal(size=size).astype('float32')
y = np.empty_like(x)
x, y = tvm.nd.array(x), tvm.nd.array(y)
res = %timeit -o -q x.copyto(y)
return res.average
_, times = benchmark(tvm_copy, 1, 8)
print('TVM call overhead: %.1f microsecond'% (times.mean()*1e6,))
import mxnet as mx
def mx_copy(size):
x = np.random.normal(size=size).astype('float32')
y = np.empty_like(x)
x, y = mx.nd.array(x), mx.nd.array(y)
res = %timeit -o -q x.copyto(y)
return res.average
_, times = benchmark(mx_copy, 1, 8)
print('MXNet call overhead: %.1f microsecond'% (times.mean()*1e6,))
issue also reported here: https://github.com/apache/incubator-mxnet/issues/8112
Some profiling results:
https://gist.github.com/sxjscience/f069d49b250820be4483debfa9a8d4b5
Here I used NaiveEngine
MXNet: 25.77195848737444
JAX: 58.22208949497768
Numpy: 1.0632106236049108
PyTorch: 13.021196637834823
TensorFlow: 21.734169551304408
The overhead is 33.87945038931711
after we used the default Engine (comment NaiveEngine line)
In P3.2x:
With NaiveEngine:
MXNet: 49.990619931902195
JAX: 132.80538150242398
TVM: 4.161664417811803
Numpy: 1.8179416656494143
PyTorch: 20.977394921439036
TensorFlow: 47.540392194475444
With ThreadedEngine
MXNet: 61.272689274379196
JAX: 141.04713712419783
TVM: 4.231657300676618
Numpy: 1.918860844203404
PyTorch: 20.21517072405134
TensorFlow: 44.17456899370466
One idea that gained some popularity after discussion is to introduce an engine-less mode to MXNet, in which the operators are exposed in API and dispatched in a similar way as pytorch.
Given that Naive Engine option should be quite close to this already, we should verify the overhead in the Naive Engine mode and judge the necessity based on that result. Since the target is to measure the overhead, we will want to control the performance difference in operators.
@sxjscience what do the differences between Naive and ThreadedEngine reflect in the other frameworks?
i just would like to comment a bit about the tvm related benchmark. The copyto function was not really a good benchmark for TVM's FFI cost mainly because that function was implemented directly via ctypes because we didn't find it to be a bottleneck. This functionality could have been turned into cython(which is used to support the core functionality), which could speedup the overhead of the call.
The current best way test the FFI is through Packed function calls. PackedFunc is the place where most of the API functions are exposed, here I cross posted a quick benchmark on https://github.com/apache/incubator-mxnet/issues/17097
Note that it is important to compile TVM with Cython by typing make cython3 in the root folder. TVM by default uses cython if it is available, but we use the TVM_FFI env variable to make sure it is the case.
import timeit
setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
stmt='nop(x, y)')
timer.timeit(1)
num_repeat = 1000
print(timer.timeit(num_repeat) / num_repeat)
Results on my laptop(macbook pro 13inch)
```
$ TVM_FFI=cython python3 test.py
9.749500000000299e-08
$ TVM_FFI=ctypes python3 test.py
7.226789000000011e-06
````
To followup this again, the patch https://github.com/apache/incubator-tvm/pull/4549 fixed the copyto issue by cythonize it. Here is an updated script to add benchmark of copyto
import timeit
setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
stmt='nop(1, 2)')
timer.timeit(1)
num_repeat = 1000
print("TVM-nop:", timer.timeit(num_repeat) / num_repeat)
setup = """
import numpy as np
import tvm
x = np.random.normal(size=1).astype('float32')
y = np.empty_like(x)
"""
timer = timeit.Timer(setup=setup,
stmt='np.copyto(y, x)')
timer.timeit(1)
print("numpy.copyto:", timer.timeit(num_repeat) / num_repeat)
setup = """
import numpy as np
import tvm
x = np.random.normal(size=1).astype('float32')
y = np.empty_like(x)
x = tvm.nd.array(x)
y = tvm.nd.array(y)
"""
timer = timeit.Timer(setup=setup,
stmt='x.copyto(y)')
timer.timeit(1)
print("tvm.copyto:", timer.timeit(num_repeat) / num_repeat)
Results on my laptop(macbook pro 13inch)
$TVM_FFI=ctypes python test.py
TVM-nop: 5.864649e-06
numpy.copyto: 3.7303100000002364e-07
tvm.copyto: 8.198690000000008e-07
$ TVM_FFI=cython python test.py
TVM-nop: 1.1101499999999209e-07
numpy.copyto: 2.519029999999978e-07
tvm.copyto: 2.3881700000000226e-07
Most helpful comment
The overhead is
33.87945038931711
after we used the default Engine (comment NaiveEngine line)