I am experiencing a performance regression on both CPU (mxnet-mkl) and GPU (mxnet-cu100) when updating from version 1.4.1 to 1.5.0. The speed regression is about 8% on CPU; if you're interested, I can also provide a report for the GPU.
MXNet version 1.4.1
```
$ pip install mxnet-mkl==1.4.1
```
and MXNet version 1.5.0
```
$ pip install mxnet-mkl==1.5.0
```
The MKL-DNN subgraph backend was enabled:
```
export MXNET_SUBGRAPH_BACKEND=MKLDNN
```

```python
mx.__version__
'1.4.1'
print(profiler.dumps())
```
```
Profile Statistics.
Note that counter items are counter values and not time units.
Device Storage
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
Memory: cpu/0 120158 24832.1641 313.3440 24834.2129 12260.4336
MXNET_C_API
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
MXNDArrayWaitAll 1 90628.5391 90628.5391 90628.5391 90628.5391
MXExecutorForward 10000 199.3470 0.0150 0.0610 0.0199
MXImperativeInvokeEx 10000 58.7390 0.0030 0.0820 0.0059
MXNDArrayGetShape 40000 21.5920 0.0000 0.0150 0.0005
MXNet C API Concurrency 120002 0.0000 0.0000 0.0010 0.0005
MXNet C API Calls 60001 60.0010 0.0010 60.0010 30.0000
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
Reorder 40082 104.3530 0.0000 1.0880 0.0026
[_sg_mkldnn_conv,_sg_mkldnn_conv,_sg_mkldnn_conv,_sg_mkldnn_conv,broadcast_add,_sg_mkldnn_conv,_sg_mkldnn_conv,_sg_mkldnn_conv, 20000 182261.7500 8.3630 70.7680 9.1131
DeleteVariable 240000 636.3090 0.0000 0.0410 0.0027
CopyCPU2CPU 20000 103.6500 0.0020 0.0280 0.0052
```

```python
mx.__version__
'1.5.0'
print(profiler.dumps())
```
```
Profile Statistics:
Note the difference in units for different entries.
Device Storage
=================
Name Total Count Min Use (kB) Max Use (kB) Avg Use (kB)
---- ----------- ------------- ------------- -------------
Memory: cpu/0 240440 313.3440 43597.1289 21641.8926
MXNET_C_API
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
MXNDArrayWaitAll 2 196511.1406 97689.1719 98821.9688 98255.5703
MXNet C API Calls 220002 220.0020 0.0010 220.0020 110.0000
MXExecutorForward 20000 758.2010 0.0220 0.9980 0.0379
MXImperativeInvokeEx 20000 187.6520 0.0050 0.2250 0.0094
MXExecutorOutputs 20000 27.7760 0.0000 0.0230 0.0014
MXNDArrayGetShapeEx 80000 63.9810 0.0000 0.0480 0.0008
MXNDArrayFree 40000 25.8550 0.0000 0.0160 0.0006
MXNDArrayGetStorageType 40000 23.0740 0.0000 0.0150 0.0006
MXNet C API Concurrency 440004 0.0000 0.0000 0.0010 0.0000
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
[_sg_mkldnn_conv,_sg_mkldnn_conv,_sg_mkldnn_conv,_sg_mkldnn_conv,broadcast_add,_sg_mkldnn_conv,_sg_mkldnn_conv,_sg_mkldnn_conv, 40000 396250.1875 9.0350 81.8880 9.9063
CopyCPU2CPU 40000 239.0970 0.0040 0.0550 0.0060
DeleteVariable 480168 1348.4821 0.0000 0.1520 0.0028
Reorder 560000 869.6200 0.0000 11.9040 0.0016
```

Consequently, this results in a lower number of evaluated positions per second (nps) for the engine:

#### MXNet-mkl 1.4.1
```
go movetime 5000
info score cp 71 depth 30 nodes 697 time 4727 nps 147 pv e2e4 d7d5 e4d5 e7e6 f1b5 c7c6 d5c6 b7c6 b5c6
```
#### MXNet-mkl 1.5.0
```
go movetime 5000
info score cp 57 depth 25 nodes 673 time 4743 nps 141 pv e2e4 d7d5 e4d5 d8d5 b1c3 d5a5 g1f3 g8f6 f1c4 c8g4 e1g1 e7e6 d2d4 P@a3 a1b1 a3b2 c1b2 P@h3 c4e6
```
## Minimum reproducible example
Download the latest release **CrazyAra 0.5.1** which contains the model weights and architecture definition in `CrazyAra_0.5.1_RiseV2_mobile/model/`.
* https://github.com/QueensGambit/CrazyAra/releases
Run the following script on MXNet 1.4.1 and MXNet 1.5.0 respectively, and verify that the `.json` and `.params` files are in the same directory as the Python script:
```python
import mxnet as mx
import numpy as np
from mxnet import profiler
sym, arg_params, aux_params = mx.model.load_checkpoint('model-1.19246-0.603', 223)
sample = mx.ndarray.zeros((1, 34, 8, 8))
it = int(1e4)
ctx = mx.gpu() # mx.cpu()
executor = sym.simple_bind(ctx=ctx, data=sample.shape, grad_req='null', force_rebind=True)
executor.copy_params_from(arg_params, aux_params)
def run_inference():
    out = executor.forward(is_train=False, data=sample)

profiler.set_config(profile_all=True,
                    aggregate_stats=True,
                    filename='profile_output.json')

# warm-up
for i in range(100):
    run_inference()

# Ask the profiler to start recording
profiler.set_state('run')
for i in range(it):
    run_inference()

# Make sure all operations have completed
mx.nd.waitall()
# Ask the profiler to stop recording
profiler.set_state('stop')
# Dump all results to log file before download
profiler.dump()
```
In order to run the engine and check its nps, do the following:
```
python crazyara.py
uci
setoption name context value cpu
isready
go movetime 5000
```
More information about the UCI protocol is available here:
MXNet 1.5.0 introduced performance regressions for some operators, which might be the cause of this problem.
This issue might also be related to https://github.com/apache/incubator-mxnet/issues/15430.
Best,
~QueensGambit
Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance
@mxnet-label-bot add [Performance, MKL, Python]
Thanks for reporting the problem. @ElaineBao could you take a look for this?
cc @roywei
Hi @QueensGambit, thanks for the issue. What does `cp 71 depth 30 nodes 697` mean for your benchmark?
And could you share the model you used? What's the architecture? There is a known issue with ReLU: we made it slower for correctness. https://github.com/apache/incubator-mxnet/pull/14262
Hello @roywei. Thanks for your answer.
_RiseV2_mobile_ is an adapted convolutional network designed by myself. It is in principle similar to the AlphaZero network, but especially performant on CPUs (about ~3x faster) and GPUs (about ~1.4x faster) while maintaining similar predictive performance.
https://github.com/QueensGambit/CrazyAra/blob/master/DeepCrazyhouse/src/domain/neural_net/architectures/rise_v2_symbol.py#L149
It applies the concepts of multiple recent papers in one network:

* Separable depthwise convolutions with inverted residual blocks:
  _MobileNetV2: Inverted Residuals and Linear Bottlenecks_ - Sandler et al.
  https://arxiv.org/pdf/1801.04381.pdf
* Increasing filter channels per layer:
  _Deep Pyramidal Residual Networks_ - Han et al.
  https://arxiv.org/abs/1610.02915
* Enhancement of activation channels (winning entry of ImageNet 2017):
  _Squeeze-and-Excitation Networks_ - Hu et al.
  https://arxiv.org/abs/1709.01507
Our paper, which describes the network architecture in more detail, will be published as well.
The output `cp 71 depth 30 nodes 697` is related to the UCI protocol (UCI - Universal Chess Interface):
* `cp` - Centipawn metric for how good the engine evaluates the given board position from the view of the current player. `cp 100` means +1 pawn advantage.
* `nodes` - Number of evaluated board positions after running the search for X milliseconds. Higher values are better.
* `depth` - Maximum depth a rollout of the MCTS search has reached during the search.
* `go movetime 5000` - Defines for how long, in milliseconds, the engine should search the current position (by default the starting position).
* `pv` - Stands for principal variation and is the best line of play suggested by the engine for both players.
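For illustration, here is a minimal sketch (a hypothetical helper, not part of CrazyAra) that parses such an `info` line into a dict:

```python
def parse_uci_info(line):
    """Parse a UCI 'info' line into a dict of its fields (minimal sketch)."""
    tokens = line.split()
    info = {}
    i = 1  # skip the leading 'info' token
    while i < len(tokens):
        key = tokens[i]
        if key == 'score':   # e.g. 'score cp 71' -> info['score_cp'] = 71
            info['score_' + tokens[i + 1]] = int(tokens[i + 2])
            i += 3
        elif key == 'pv':    # principal variation: all remaining tokens
            info['pv'] = tokens[i + 1:]
            break
        else:                # depth / nodes / time / nps: single int value
            info[key] = int(tokens[i + 1])
            i += 2
    return info

line = "info score cp 71 depth 30 nodes 697 time 4727 nps 147 pv e2e4 d7d5"
print(parse_uci_info(line))  # {'score_cp': 71, 'depth': 30, 'nodes': 697, ...}
```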
I changed the profiling code by replacing the executor handle with a gluon network object. This version is slower, but provides more details on where the runtime is spent:
```python
import mxnet as mx
import numpy as np
from mxnet import profiler

sample = mx.ndarray.zeros((1, 34, 8, 8))
it = int(1e4)
ctx = mx.cpu()

model_arch_path = 'model-1.19246-0.603-symbol.json'
model_params_path = 'model-1.19246-0.603-0223.params'

symbol = mx.sym.load(model_arch_path)
inputs = mx.sym.var('data', dtype='float32')
value_out = symbol.get_internals()['value_tanh0_output']
policy_out = symbol.get_internals()['flatten0_output']
sym = mx.symbol.Group([value_out, policy_out])
net = mx.gluon.SymbolBlock(sym, inputs)
net.collect_params().load(model_params_path, ctx)

def run_inference():
    out = net(sample)

profiler.set_config(profile_all=True,
                    aggregate_stats=True,
                    filename='profile_output.json')

# warm-up
for i in range(100):
    run_inference()

# Ask the profiler to start recording
profiler.set_state('run')
for i in range(it):
    run_inference()

# Make sure all operations have completed
mx.nd.waitall()
# Ask the profiler to stop recording
profiler.set_state('stop')
# Dump all results to log file before download
profiler.dump()
```
#### MXNet-mkl 1.4.1
```python
print(profiler.dumps())
```
```
Profile Statistics.
Note that counter items are counter values and not time units.
Device Storage
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
Memory: cpu/0 1170861 13.5680 0.0000 905.4720 452.7360
MXNET_C_API
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
MXInvokeCachedOpEx 10000 6.5960 0.0000 0.0140 0.0007
MXNDArrayFree 20000 13.3320 0.0000 0.0140 0.0007
MXInvokeCachedOp 10000 8961.5879 0.6330 10.6770 0.8962
MXNDArrayWaitAll 1 128982.7891 128982.7891 128982.7891 128982.7891
MXNDArrayGetContext 10000 6.3600 0.0000 0.0140 0.0006
MXNet C API Concurrency 100002 0.0000 0.0000 0.0010 0.0005
MXNet C API Calls 50001 50.0010 0.0010 50.0010 25.0000
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
Reorder 42326 78.8710 0.0000 0.0980 0.0019
DeleteVariable 1312004 5104.0220 0.0000 5.2050 0.0039
Activation 820000 13786.3789 0.0060 2.6830 0.0168
Convolution 860000 87381.4688 0.0150 11.2850 0.1016
Pooling 100000 1611.6990 0.0100 0.7420 0.0161
Reshape 100000 296.2550 0.0010 0.0560 0.0030
BatchNorm 840000 21546.2461 0.0100 2.1610 0.0257
broadcast_mul 100000 13171.7432 0.1030 0.9910 0.1317
broadcast_add 260000 88638.4297 0.2000 13.5370 0.3409
Flatten 140000 10471.5117 0.0070 0.7100 0.0748
FullyConnected 240000 18911.3770 0.0160 0.7730 0.0788
```

#### MXNet-mkl 1.5.0
```python
print(profiler.dumps())
```
```
Profile Statistics:
Note the difference in units for different entries.
Device Storage
=================
Name Total Count Min Use (kB) Max Use (kB) Avg Use (kB)
---- ----------- ------------- ------------- -------------
Memory: cpu/0 1171335 0.0000 891.9040 445.9520
MXNET_C_API
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
MXNDArrayWaitAll 1 129400.2969 129400.2969 129400.2969 129400.2969
MXNet C API Calls 50001 50.0010 0.0010 50.0010 25.0000
MXInvokeCachedOp 10000 13172.8516 0.7930 15.1500 1.3173
MXNDArrayGetContext 10000 7.4100 0.0000 0.0180 0.0007
MXInvokeCachedOpEx 10000 7.2500 0.0000 0.0180 0.0007
MXNDArrayFree 20000 14.4550 0.0000 0.0690 0.0007
MXNet C API Concurrency 100002 0.0000 0.0000 0.0010 0.0000
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
broadcast_add 260000 91174.6172 0.2120 8.5150 0.3507
broadcast_mul 100000 13579.5371 0.1100 1.4710 0.1358
Convolution 860000 88691.5859 0.0160 9.0500 0.1031
FullyConnected 240000 20774.8105 0.0200 5.6790 0.0866
Flatten 140000 10516.3486 0.0060 0.7840 0.0751
BatchNorm 840000 23081.5391 0.0120 10.7910 0.0275
Pooling 100000 1740.1880 0.0110 0.4180 0.0174
Activation 820000 14118.7568 0.0070 5.6640 0.0172
Reshape 100000 537.9680 0.0020 0.0780 0.0054
DeleteVariable 1312516 5137.8071 0.0000 0.5320 0.0039
Reorder 286980 508.2870 0.0000 1.4380 0.0018
```

#### MXNet-mkl 1.4.1
```
info score cp 289 depth 20 nodes 873 time 9751 nps 89 pv e2e4 e7e5 b1c3 b8c6 g1f3 f8c5 f1c4 g8f6 d2d3 e8g8 e1g1 d7d6 c1g5 h7h6 g5h4 c8g4 c3d5 f6d5 h4d8 a8d8 c4a6
bestmove e2e4
```
#### MXNet-mkl 1.5.0
```
info score cp 290 depth 20 nodes 889 time 9755 nps 91 pv e2e4 e7e5 b1c3 b8c6 g1f3 f8c5 f1c4 g8f6 d2d3 e8g8 e1g1 d7d6 c1g5 h7h6 g5h4 c8g4 c3d5 f6d5 h4d8 a8d8 c4a6
```
Hi @QueensGambit, thanks for your detailed reply. Very cool project!
There was a lengthy discussion about operator performance regression here: https://github.com/apache/incubator-mxnet/issues/15429#issuecomment-508831150
It's still open, as we need a more comprehensive method to evaluate operator performance, and there could be regressions we didn't detect.
Summary is:

* Using the time module, we didn't see significant regression on batchnorm and broadcast_add (1-3% at most), and given the variance between runs, we decided to go with the release (a sketch of such a micro-benchmark is shown below).
* The ReLU activation being slower is expected, for correctness.
* We still need to look at the convolution regression you mentioned.

Also, if the cp / nodes / depth values are different, does it mean the time will be different?
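Regarding the time-module measurements above, this is the kind of micro-benchmark meant (a minimal sketch of my assumed setup, not the exact script we used; the shapes are placeholders):

```python
import time
import mxnet as mx

# Time a single operator with the time module (shapes are placeholders).
a = mx.nd.random.uniform(shape=(1, 256, 8, 8))
b = mx.nd.random.uniform(shape=(1, 256, 1, 1))

for _ in range(100):  # warm-up
    mx.nd.broadcast_add(a, b).wait_to_read()

start = time.time()
for _ in range(10000):
    out = mx.nd.broadcast_add(a, b)
mx.nd.waitall()  # operators run asynchronously; wait before stopping the clock
print('avg broadcast_add: %.4f ms' % ((time.time() - start) / 10000 * 1000))
```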
In addition, if you are using the mkl version, you can try the env vars mentioned here: https://github.com/apache/incubator-mxnet/issues/15429#issuecomment-507700562
```
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=18   # set to your number of cores
```
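The same settings can also be applied from Python (a small sketch; safest is to set them before mxnet, and thus the OpenMP runtime, is initialized):

```python
import os

# These should be set before mxnet is imported to reliably take effect.
os.environ['KMP_AFFINITY'] = 'granularity=fine,noduplicates,compact,1,0'
os.environ['OMP_NUM_THREADS'] = '18'  # set to your number of physical cores

import mxnet as mx
```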
> Also if the cp nodes depths values are different, does it mean the time will be different?
It's basically the other way around. Usually, the movetime is a fixed, given time, and nodes states how many nodes have been created within the search tree in this time. This makes the engine applicable to different hardware and time controls. Higher nodes and depth values are preferable.
cp is a dynamically changing evaluation value which in theory converges to the true value for a particular position, given infinite samples (here: nodes). The nodes are reused for future board positions if possible. Therefore, even in cases where a network with a slightly lower nps predicted the same bestmove, the higher-nps version has a better calibrated evaluation for possible future positions.
In a simplified view, the `.asnumpy()` method is called and the prediction results of the neural network are assigned to a newly created node.
https://github.com/QueensGambit/CrazyAra/blob/master/DeepCrazyhouse/src/domain/agent/player/util/net_pred_service.py#L80
Thus this speed measurement should be fairly independent of the profiler update changes.
Engines with a higher nps are able to explore the position more deeply in the same time and, as a result, have a higher playing strength.
> In addition, if you are using mkl version, you can try these env var mentioned here: #15429 (comment)
Thank you, I haven't tried optimizing all mkl hyperparameters so far, but only activated `MXNET_SUBGRAPH_BACKEND=MKLDNN`, which gives a speed-up of ~16% for this particular model.
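For reference, the backend can also be enabled from Python (a small sketch; the variable should be set early, before the model is created):

```python
import os
# Enable the MKL-DNN subgraph backend; set this before the model is built.
os.environ['MXNET_SUBGRAPH_BACKEND'] = 'MKLDNN'
import mxnet as mx
```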
I'm also planning to support low-precision inference (float16, int8) and TensorRT in the future.
A C++ version of the engine will also be released soon.
There, the tree traversal and tree management on the CPU are significantly faster, so the inference time of the neural network plays a much bigger role.
Thanks for the explanation. I think the total time spent should be independent of the profiler code change, but the results from profiler.dumps() are affected.
The environment variable setting should give you another speed up.
Meanwhile, we will look at the operator regressions. Right now only the time (ms) spent can be trusted.
Thank you for the discussion @QueensGambit @roywei, I will try to reproduce the case and profile it first :)
Hi all, I've reproduced the case and initially confirmed that there is a regression in the Activation operation for a specific input shape. Here is the detailed report:
(1) I ran the code provided by @QueensGambit with the MKLDNN backend (changing `it` (iterations) in the above script from 1e4 to 1e3, for efficiency), and got the following profile statistics:

mxnet 1.4.1
```
Profile Statistics.
Note that counter items are counter values and not time units.
Device Storage
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
Memory: cpu/0 126343 13.5680 0.0000 905.4720 452.7360
MXNET_C_API
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
MXInvokeCachedOpEx 1000 0.6340 0.0000 0.0020 0.0006
MXNDArrayFree 2000 1.2910 0.0000 0.0030 0.0006
MXInvokeCachedOp 1000 1432.7380 1.0730 3.3030 1.4327
MXNDArrayWaitAll 1 21019.3457 21019.3457 21019.3457 21019.3457
MXNDArrayGetContext 1000 0.7310 0.0000 0.0100 0.0007
MXNet C API Concurrency 10002 0.0000 0.0000 0.0010 0.0005
MXNet C API Calls 5001 5.0010 0.0010 5.0010 2.5000
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
FullyConnected 24000 4591.6479 0.0310 0.5470 0.1913
Flatten 14000 1809.8690 0.0090 0.2310 0.1293
broadcast_add 26000 15939.2910 0.4090 1.7990 0.6130
BatchNorm 84000 3556.9929 0.0190 0.9100 0.0423
broadcast_mul 10000 2011.7510 0.1820 0.8450 0.2012
Reshape 10000 38.4070 0.0020 0.0260 0.0038
Pooling 10000 344.1310 0.0180 0.6350 0.0344
Convolution 86000 7773.5220 0.0260 1.1880 0.0904
Reorder 4716 11.4720 0.0010 0.0150 0.0024
Activation 82000 2387.3359 0.0140 0.6090 0.0291
DeleteVariable 141422 598.9450 0.0000 0.5270 0.0042
```

mxnet 1.5.0
```
Profile Statistics:
Note the difference in units for different entries.
Device Storage
=================
Name Total Count Min Use (kB) Max Use (kB) Avg Use (kB)
---- ----------- ------------- ------------- -------------
Memory: cpu/0 127053 0.0000 891.9040 445.9520
MXNET_C_API
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
MXNDArrayWaitAll 1 23634.1523 23634.1523 23634.1523 23634.1523
MXNet C API Calls 5001 5.0010 0.0010 5.0010 2.5000
MXInvokeCachedOp 1000 1632.0590 1.0620 2.6950 1.6321
MXNDArrayGetContext 1000 0.7340 0.0000 0.0120 0.0007
MXNDArrayFree 2000 1.4440 0.0000 0.0100 0.0007
MXInvokeCachedOpEx 1000 0.7130 0.0000 0.0020 0.0007
MXNet C API Concurrency 10002 0.0000 0.0000 0.0010 0.0000
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
broadcast_add 26000 17322.6543 0.4180 1.9220 0.6663
broadcast_mul 10000 2184.6421 0.1850 0.7510 0.2185
FullyConnected 24000 4494.6572 0.0310 0.8050 0.1873
Flatten 14000 1890.4550 0.0090 0.2690 0.1350
Convolution 86000 7898.2749 0.0260 1.3600 0.0918
Activation 82000 4982.0112 0.0170 0.7210 0.0608
BatchNorm 84000 3896.8450 0.0210 0.7810 0.0464
Pooling 10000 315.3780 0.0170 0.6760 0.0315
Reshape 10000 68.2620 0.0040 0.0230 0.0068
DeleteVariable 142202 659.8880 0.0000 0.0680 0.0046
Reorder 32420 53.1210 0.0000 0.0180 0.0016
```

Notice from the table that the Activation operation has a performance gap between 1.4.1 (avg time: 0.0291 ms) and 1.5.0 (avg time: 0.0608 ms). The performance of the other operations is considered normal fluctuation, considering that the profiler code changed between 1.4.1 and 1.5.0.
(2) To further investigate the performance regression of the Activation operation, I enabled MKLDNN verbose mode via `export MKLDNN_VERBOSE=1`, which helps to collect basic statistics like the execution time and input/output shapes of each MKLDNN operation. Parsing the verbose output and computing the average time per operation and input shape gives:

mxnet 1.4.1
```
op: relu,
shape:all, total_count:38500, total_time:386.78921302000026, avg_time:0.010046473065454552
shape:mb1ic128, total_count:5500, total_time:43.13965790000107, avg_time:0.007843574163636558
shape:mb1ic256ih8iw8, total_count:4400, total_time:41.59277998999979, avg_time:0.00945290454318177
shape:mb1ic128ih8iw8, total_count:2200, total_time:19.7434104399999, avg_time:0.008974277472727226
......
op: sigmoid,
shape:all, total_count:5500, total_time:56.511490069998786, avg_time:0.010274816376363415
shape:mb1ic256, total_count:5500, total_time:56.511490069998786, avg_time:0.010274816376363415
```

mxnet 1.5.0
```
op: relu,
shape: all, total_count:38500, total_time:1070.7652004499996, avg_time:0.02781208312857142
shape:mb1ic128, total_count:5500, total_time:690.7533760000006, avg_time:0.125591522909091
shape:mb1ic256ih8iw8, total_count:4400, total_time:38.590089899999874, avg_time:0.008770474977272699
shape:mb1ic128ih8iw8, total_count:2200, total_time:18.308107539999845, avg_time:0.008321867063636293
......
op:sigmoid,
shape: all, total_count:5500, total_time:680.288773400003, avg_time:0.12368886789090965
shape:mb1ic256, total_count:5500, total_time:680.288773400003, avg_time:0.12368886789090965
```

Notice from the table that for relu, the shape mb1ic128 (which stands for batch size = 1, input channels = 128) has a performance regression (avg_time from 0.0078 ms to 0.1256 ms), while the other shapes are almost the same between 1.4.1 and 1.5.0. For sigmoid, the shape mb1ic256 has a performance regression (avg_time from 0.0103 ms to 0.1237 ms).
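For reference, a minimal sketch of the parsing step (the comma-separated `MKLDNN_VERBOSE` line layout assumed here matches the logs above; field positions can differ between MKL-DNN versions):

```python
import sys
from collections import defaultdict

# Aggregate MKLDNN_VERBOSE=1 output per (op, shape). Assumed line format:
# comma-separated, last field = execution time in ms, second-to-last =
# problem shape (e.g. 'mb1ic128'), and an 'alg:...' field naming the
# exact eltwise kind (relu, sigmoid, ...).
stats = defaultdict(lambda: [0, 0.0])  # (op, shape) -> [count, total_ms]

for line in sys.stdin:
    if not line.startswith('mkldnn_verbose,exec'):
        continue
    fields = line.strip().split(',')
    op = next((f.split(':', 1)[1] for f in fields if f.startswith('alg:')),
              fields[2])  # fall back to the primitive name
    shape, ms = fields[-2], float(fields[-1])
    stats[(op, shape)][0] += 1
    stats[(op, shape)][1] += ms

for (op, shape), (count, total) in sorted(stats.items()):
    print('op:%s, shape:%s, total_count:%d, total_time:%f, avg_time:%f'
          % (op, shape, count, total, total / count))
```

It could be run as, e.g., `MKLDNN_VERBOSE=1 python repro.py | python parse_verbose.py` (hypothetical file names).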
On the other hand, as @QueensGambit mentioned, there is not a regression for all models; I think the model in CrazyAraFish_0.5.0_RiseV1.zip may have fewer of these operations, so its performance can be explained as above.
In conclusion, MKLDNN relu with shape mb1ic128 and sigmoid with shape mb1ic256 have a performance regression from 1.4.1 to 1.5.0. We'll look into it.
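For anyone who wants to check just these two cases without the full model, here is a standalone micro-benchmark sketch (the shapes are taken from the verbose output above; everything else is my own assumption):

```python
import time
import mxnet as mx

def bench_activation(shape, act_type, runs=10000):
    """Return the average latency of one activation call in ms (sketch)."""
    x = mx.nd.ones(shape)
    for _ in range(100):  # warm-up
        mx.nd.Activation(x, act_type=act_type).wait_to_read()
    start = time.time()
    for _ in range(runs):
        out = mx.nd.Activation(x, act_type=act_type)
    mx.nd.waitall()  # execution is asynchronous, so wait before timing
    return (time.time() - start) / runs * 1000

print('relu    (1, 128): %.4f ms' % bench_activation((1, 128), 'relu'))
print('sigmoid (1, 256): %.4f ms' % bench_activation((1, 256), 'sigmoid'))
```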
Thanks @ElaineBao
FYI @QueensGambit, Elaine is working on a fix for these performance issues and will update the results soon.
Thank you for the notification @pengzhao-intel.
Wouldn't it be appropriate to disable the NaN-value check from https://github.com/apache/incubator-mxnet/pull/14262 by default whenever the model is loaded in full-inference mode (grad_req='null'), or to make it parametrizable?
ReLU is the most common building block, so the regression affects most current models.
Normally, when a model has been fully trained without NaNs, a higher priority is assigned to the final inference time.
A preprint of our paper is now available on arXiv:
The paper describes, among other things, the network architecture in more detail.
Moreover, the Monte-Carlo tree search was reimplemented in C++ using the MXNet C++ Package:
Hi @QueensGambit, this issue has been fixed in MXNet 1.6. As version 1.6 has not been released yet, you can install the nightly build: `pip install mxnet-mkl==1.6.0b20190903`.
I've tested the above scripts on this version:

```
op: relu
shape:all, total_count:38500, total_time:314.61432360999953, avg_time:0.008171800613246741
shape:mb1ic128, total_count:5500, total_time:60.115013090000865, avg_time:0.010930002380000158
shape:mb1ic256ih8iw8, total_count:4400, total_time:32.94727337000046, avg_time:0.0074880166750001045
shape:mb1ic128ih8iw8, total_count:2200, total_time:13.953863230000001, avg_time:0.0063426651045454556
...
op: sigmoid
shape:all, total_count:5500, total_time:62.94021099000075, avg_time:0.011443674725454682
shape:mb1ic256, total_count:5500, total_time:62.94021099000075, avg_time:0.011443674725454682
```

As you can see, relu with shape mb1ic128 and sigmoid with shape mb1ic256 are faster than in 1.5.0 and achieve the same performance as in 1.4.1.
Great news!
I'm happy to update to MXNet-1.6.0 or MXNet-1.6.0-Dev for the next _CrazyAra_ release.
Thank you for the awesome work.