Incubator-mxnet: Performance regression for MXNet 1.5.0

Created on 23 Jul 2019  ·  17 comments  ·  Source: apache/incubator-mxnet

Description

I experience a performance regression on both CPU (mxnet-mkl) and GPU (mxnet-cu100) when updating from version 1.4.1 to 1.5.0. The speed regression is about 8% on CPU; if you're interested, I can also provide a report for GPU.

Environment info

MXNet version 1.4.1

$ pip install mxnet-mkl==1.4.1

and MXNet version 1.5.0

$ pip install mxnet-mkl==1.5.0
  • Intel® Core™ i5-8250U CPU @ 1.60GHz × 8
  • Ubuntu 18.04.2 LTS
  • export MXNET_SUBGRAPH_BACKEND=MKLDNN

Profiling results

mx.__version__

'1.4.1'

print(profiler.dumps())
Profile Statistics.
    Note that counter items are counter values and not time units.
Device Storage
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Memory: cpu/0                      120158       24832.1641         313.3440       24834.2129       12260.4336

MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXNDArrayWaitAll                        1       90628.5391       90628.5391       90628.5391       90628.5391
MXExecutorForward                   10000         199.3470           0.0150           0.0610           0.0199
MXImperativeInvokeEx                10000          58.7390           0.0030           0.0820           0.0059
MXNDArrayGetShape                   40000          21.5920           0.0000           0.0150           0.0005
MXNet C API Concurrency            120002           0.0000           0.0000           0.0010           0.0005
MXNet C API Calls                   60001          60.0010           0.0010          60.0010          30.0000

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Reorder                             40082         104.3530           0.0000           1.0880           0.0026
[_sg_mkldnn_conv,_sg_mkldnn_conv,_sg_mkldnn_conv,_sg_mkldnn_conv,broadcast_add,_sg_mkldnn_conv,_sg_mkldnn_conv,_sg_mkldnn_conv,           20000      182261.7500           8.3630          70.7680           9.1131
DeleteVariable                     240000         636.3090           0.0000           0.0410           0.0027
CopyCPU2CPU                         20000         103.6500           0.0020           0.0280           0.0052
mx.__version__

'1.5.0'

print(profiler.dumps())
Profile Statistics:
    Note the difference in units for different entries.
Device Storage
=================
Name                          Total Count    Min Use  (kB)    Max Use  (kB)    Avg Use  (kB)
----                          -----------    -------------    -------------    -------------
Memory: cpu/0                      240440         313.3440       43597.1289       21641.8926

MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXNDArrayWaitAll                        2      196511.1406       97689.1719       98821.9688       98255.5703
MXNet C API Calls                  220002         220.0020           0.0010         220.0020         110.0000
MXExecutorForward                   20000         758.2010           0.0220           0.9980           0.0379
MXImperativeInvokeEx                20000         187.6520           0.0050           0.2250           0.0094
MXExecutorOutputs                   20000          27.7760           0.0000           0.0230           0.0014
MXNDArrayGetShapeEx                 80000          63.9810           0.0000           0.0480           0.0008
MXNDArrayFree                       40000          25.8550           0.0000           0.0160           0.0006
MXNDArrayGetStorageType             40000          23.0740           0.0000           0.0150           0.0006
MXNet C API Concurrency            440004           0.0000           0.0000           0.0010           0.0000

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
[_sg_mkldnn_conv,_sg_mkldnn_conv,_sg_mkldnn_conv,_sg_mkldnn_conv,broadcast_add,_sg_mkldnn_conv,_sg_mkldnn_conv,_sg_mkldnn_conv,           40000      396250.1875           9.0350          81.8880           9.9063
CopyCPU2CPU                         40000         239.0970           0.0040           0.0550           0.0060
DeleteVariable                     480168        1348.4821           0.0000           0.1520           0.0028
Reorder                            560000         869.6200           0.0000          11.9040           0.0016
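For a quick quantitative comparison of the two dumps, the per-operator average times can be diffed directly. This is only a sketch: the dict values below are transcribed by hand from the tables above, and the long `_sg_mkldnn_conv` fused-op name is abbreviated.

```python
# Avg times (ms) transcribed from the 1.4.1 and 1.5.0 profiler dumps above.
avg_141 = {"_sg_mkldnn_conv_block": 9.1131, "Reorder": 0.0026, "CopyCPU2CPU": 0.0052}
avg_150 = {"_sg_mkldnn_conv_block": 9.9063, "Reorder": 0.0016, "CopyCPU2CPU": 0.0060}

def regressions(old, new):
    """Return {op: percent change in avg time} for ops present in both dumps."""
    return {op: round(100.0 * (new[op] - old[op]) / old[op], 1)
            for op in old.keys() & new.keys()}

delta = regressions(avg_141, avg_150)
print(delta["_sg_mkldnn_conv_block"])  # about +8.7%, in line with the ~8% slowdown
```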

Consequently, this results in a lower number of evaluated positions per second (nps) for the engine:

MXNet-mkl 1.4.1

```
go movetime 5000
info score cp 71 depth 30 nodes 697 time 4727 nps 147 pv e2e4 d7d5 e4d5 e7e6 f1b5 c7c6 d5c6 b7c6 b5c6
```

MXNet-mkl 1.5.0

```
go movetime 5000
info score cp 57 depth 25 nodes 673 time 4743 nps 141 pv e2e4 d7d5 e4d5 d8d5 b1c3 d5a5 g1f3 g8f6 f1c4 c8g4 e1g1 e7e6 d2d4 P@a3 a1b1 a3b2 c1b2 P@h3 c4e6
```

Minimum reproducible example

Download the latest release **CrazyAra 0.5.1** which contains the model weights and architecture definition in `CrazyAra_0.5.1_RiseV2_mobile/model/`.
* https://github.com/QueensGambit/CrazyAra/releases

Run the following script on both MXNet 1.4.1 and MXNet 1.5.0, and verify that the `.json` and `.params` files are in the same directory as the Python script:

```python
import mxnet as mx
import numpy as np
from mxnet import profiler

sym, arg_params, aux_params = mx.model.load_checkpoint('model-1.19246-0.603', 223)
sample = mx.ndarray.zeros((1, 34, 8, 8))
it = int(1e4)
ctx = mx.gpu() # mx.cpu()
executor = sym.simple_bind(ctx=ctx, data=sample.shape, grad_req='null', force_rebind=True)
executor.copy_params_from(arg_params, aux_params)

def run_inference():
    out = executor.forward(is_train=False, data=sample)

profiler.set_config(profile_all=True,
                    aggregate_stats=True,
                    filename='profile_output.json')

# warm-up
for i in range(100):
    run_inference()

# Ask the profiler to start recording
profiler.set_state('run')

for i in range(it):
    run_inference()

# Make sure all operations have completed
mx.nd.waitall()
# Ask the profiler to stop recording
profiler.set_state('stop')
# Dump all results to log file before download
profiler.dump()
```

In order to run the engine and check its nps, do the following:

```
python crazyara.py
uci
setoption name context value cpu
isready
go movetime 5000
```

More information about the UCI protocol is available here:

MXNet 1.5.0 introduced performance regressions for some operators, which might be the cause of this problem.

This issue might also be related to https://github.com/apache/incubator-mxnet/issues/15430.

Best,
~QueensGambit

Labels: MKLDNN, Performance, Python

Most helpful comment

Hi all, I've reproduced the case and confirmed initially that there is a regression in the Activation operation for specific input shapes. Here is the detailed report:

(1) I ran the code provided by @QueensGambit with the MKLDNN backend (changing `it` (iterations) in the above script from 1e4 to 1e3 for efficiency), and got these profile statistics:

mxnet 1.4.1

Profile Statistics.
        Note that counter items are counter values and not time units.
Device Storage
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Memory: cpu/0                      126343          13.5680           0.0000         905.4720         452.7360

MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXInvokeCachedOpEx                   1000           0.6340           0.0000           0.0020           0.0006
MXNDArrayFree                        2000           1.2910           0.0000           0.0030           0.0006
MXInvokeCachedOp                     1000        1432.7380           1.0730           3.3030           1.4327
MXNDArrayWaitAll                        1       21019.3457       21019.3457       21019.3457       21019.3457
MXNDArrayGetContext                  1000           0.7310           0.0000           0.0100           0.0007
MXNet C API Concurrency             10002           0.0000           0.0000           0.0010           0.0005
MXNet C API Calls                    5001           5.0010           0.0010           5.0010           2.5000

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
FullyConnected                      24000        4591.6479           0.0310           0.5470           0.1913
Flatten                             14000        1809.8690           0.0090           0.2310           0.1293
broadcast_add                       26000       15939.2910           0.4090           1.7990           0.6130
BatchNorm                           84000        3556.9929           0.0190           0.9100           0.0423
broadcast_mul                       10000        2011.7510           0.1820           0.8450           0.2012
Reshape                             10000          38.4070           0.0020           0.0260           0.0038
Pooling                             10000         344.1310           0.0180           0.6350           0.0344
Convolution                         86000        7773.5220           0.0260           1.1880           0.0904
Reorder                              4716          11.4720           0.0010           0.0150           0.0024
Activation                          82000        2387.3359           0.0140           0.6090           0.0291
DeleteVariable                     141422         598.9450           0.0000           0.5270           0.0042

mxnet 1.5.0

Profile Statistics:
        Note the difference in units for different entries.
Device Storage
=================
Name                          Total Count    Min Use  (kB)    Max Use  (kB)    Avg Use  (kB)
----                          -----------    -------------    -------------    -------------
Memory: cpu/0                      127053           0.0000         891.9040         445.9520

MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXNDArrayWaitAll                        1       23634.1523       23634.1523       23634.1523       23634.1523
MXNet C API Calls                    5001           5.0010           0.0010           5.0010           2.5000
MXInvokeCachedOp                     1000        1632.0590           1.0620           2.6950           1.6321
MXNDArrayGetContext                  1000           0.7340           0.0000           0.0120           0.0007
MXNDArrayFree                        2000           1.4440           0.0000           0.0100           0.0007
MXInvokeCachedOpEx                   1000           0.7130           0.0000           0.0020           0.0007
MXNet C API Concurrency             10002           0.0000           0.0000           0.0010           0.0000

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
broadcast_add                       26000       17322.6543           0.4180           1.9220           0.6663
broadcast_mul                       10000        2184.6421           0.1850           0.7510           0.2185
FullyConnected                      24000        4494.6572           0.0310           0.8050           0.1873
Flatten                             14000        1890.4550           0.0090           0.2690           0.1350
Convolution                         86000        7898.2749           0.0260           1.3600           0.0918
Activation                          82000        4982.0112           0.0170           0.7210           0.0608
BatchNorm                           84000        3896.8450           0.0210           0.7810           0.0464
Pooling                             10000         315.3780           0.0170           0.6760           0.0315
Reshape                             10000          68.2620           0.0040           0.0230           0.0068
DeleteVariable                     142202         659.8880           0.0000           0.0680           0.0046
Reorder                             32420          53.1210           0.0000           0.0180           0.0016

Notice from the table that the Activation operation has a performance gap between 1.4.1 (avg time: 0.0291 ms) and 1.5.0 (avg time: 0.0608 ms). The performance of the other operations can be considered normal fluctuation, given that the profiler code changed between 1.4.1 and 1.5.0.

(2) To further investigate the performance regression of the Activation operation, I enabled MKLDNN verbose mode via `export MKLDNN_VERBOSE=1`, which collects basic statistics such as execution time and input/output shapes for each MKLDNN operation. I then parsed the verbose output and computed the average time per operation and input shape:
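The parsing step can be sketched as follows. This assumes the MKL-DNN v0.x verbose layout (`mkldnn_verbose,exec,...,alg:<op>,<shapes>,<ms>`), where the last field is the execution time in milliseconds; field positions may differ between versions, so treat this as a sketch rather than the exact script used.

```python
from collections import defaultdict

def aggregate_verbose(lines):
    """Aggregate (op, shape) -> (count, total_ms, avg_ms) from MKLDNN verbose lines."""
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        if not line.startswith("mkldnn_verbose,exec"):
            continue  # skip non-execution lines (e.g. primitive creation)
        fields = line.strip().split(",")
        shape, ms = fields[-2], float(fields[-1])
        # prefer the explicit alg:<name> field, fall back to the primitive kind
        op = next((f.split(":")[1] for f in fields if f.startswith("alg:")), fields[2])
        stats[(op, shape)][0] += 1
        stats[(op, shape)][1] += ms
    return {k: (c, t, t / c) for k, (c, t) in stats.items()}
```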

mxnet 1.4.1

op: relu,
shape:all, total_count:38500, total_time:386.78921302000026, avg_time:0.010046473065454552
shape:mb1ic128, total_count:5500, total_time:43.13965790000107, avg_time:0.007843574163636558
shape:mb1ic256ih8iw8, total_count:4400, total_time:41.59277998999979, avg_time:0.00945290454318177
shape:mb1ic128ih8iw8, total_count:2200, total_time:19.7434104399999, avg_time:0.008974277472727226
......

op: sigmoid, 
shape:all, total_count:5500, total_time:56.511490069998786, avg_time:0.010274816376363415
shape:mb1ic256, total_count:5500, total_time:56.511490069998786, avg_time:0.010274816376363415

mxnet 1.5.0

op: relu,
shape: all, total_count:38500, total_time:1070.7652004499996, avg_time:0.02781208312857142
shape:mb1ic128, total_count:5500, total_time:690.7533760000006, avg_time:0.125591522909091
shape:mb1ic256ih8iw8, total_count:4400, total_time:38.590089899999874, avg_time:0.008770474977272699
shape:mb1ic128ih8iw8, total_count:2200, total_time:18.308107539999845, avg_time:0.008321867063636293
......

op:sigmoid, 
shape: all, total_count:5500, total_time:680.288773400003, avg_time:0.12368886789090965
shape:mb1ic256, total_count:5500, total_time:680.288773400003, avg_time:0.12368886789090965

Notice from the table that for relu, the shape mb1ic128 (batch size = 1, input channels = 128) has a performance regression (avg_time from 0.0078 to 0.1256 ms), while the other shapes are almost identical between 1.4.1 and 1.5.0. For sigmoid, the shape mb1ic256 has a performance regression (avg_time from 0.0103 to 0.1237 ms).
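Expressed as slowdown factors (avg_time values transcribed from the verbose report above):

```python
# Slowdown factors for the two regressing primitives, 1.5.0 avg_time / 1.4.1 avg_time.
relu_slowdown = round(0.125591522909091 / 0.007843574163636558, 1)       # mb1ic128
sigmoid_slowdown = round(0.12368886789090965 / 0.010274816376363415, 1)  # mb1ic256
print(relu_slowdown, sigmoid_slowdown)  # roughly 16x and 12x slower
```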

On the other hand, as @QueensGambit mentioned, not all models show a regression; the model in CrazyAraFish_0.5.0_RiseV1.zip may contain fewer of these operations, which would explain why it is less affected.

In conclusion, MKLDNN relu with shape mb1ic128 and sigmoid with shape mb1ic256 have a performance regression from 1.4.1 to 1.5.0. We'll look into it.

All 17 comments

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance

@mxnet-label-bot add [Performance, MKL, Python]

Thanks for reporting the problem. @ElaineBao could you take a look at this?

cc @roywei

Hi @QueensGambit, thanks for the issue. What does cp 71 depth 30 nodes 697 mean for your benchmark?
And could you share the model you used? What's the architecture? There is a known issue with ReLU; we made it slower for correctness: https://github.com/apache/incubator-mxnet/pull/14262

Hello @roywei. Thanks for your answer.
_RiseV2_mobile_ is an adapted convolutional network I designed myself. It is in principle similar to the AlphaZero network, but particularly performant on CPUs (about 3x faster) and GPUs (about 1.4x faster) while maintaining similar predictive performance.
https://github.com/QueensGambit/CrazyAra/blob/master/DeepCrazyhouse/src/domain/neural_net/architectures/rise_v2_symbol.py#L149

It applies the concepts of multiple recent papers in one network:

Our paper which describes the network architecture in more detail will be published as well.


The output cp 71 depth 30 nodes 697 relates to the UCI protocol (Universal Chess Interface).

  • cp - Centipawn metric for how well the engine evaluates the given board position from the current player's perspective; cp 100 means a +1 pawn advantage.
  • nodes - Number of board positions evaluated after running the search for X milliseconds. Higher values are better.
  • depth - Maximum depth a rollout of the MCTS search has reached during the search.
  • go movetime 5000 - Defines for how long, in milliseconds, the engine should search the current position (by default the starting position).
  • pv - Stands for principal variation: the best line of play suggested by the engine for both players.
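As a cross-check, nps can be recomputed from nodes and time; `parse_info` below is a hypothetical helper for illustration, not part of CrazyAra:

```python
def parse_info(line):
    """Extract the numeric fields of a UCI 'info' line into a dict."""
    tokens = line.split()
    wanted = {"cp", "depth", "nodes", "time", "nps"}
    # each wanted keyword is followed by its integer value
    return {tok: int(tokens[i + 1])
            for i, tok in enumerate(tokens[:-1]) if tok in wanted}

info = parse_info("info score cp 71 depth 30 nodes 697 time 4727 nps 147 pv e2e4 d7d5")
computed_nps = int(info["nodes"] / (info["time"] / 1000.0))  # nodes per second
# computed_nps == 147, matching the engine's reported nps field
```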

I changed the profiling code by replacing the executor handle with a Gluon network object.
This version is slower, but provides more detail on where the runtime is spent:

```python
import mxnet as mx
import numpy as np
from mxnet import profiler

sample = mx.ndarray.zeros((1, 34, 8, 8))
it = int(1e4)
ctx = mx.cpu()

model_arch_path = 'model-1.19246-0.603-symbol.json'
model_params_path = 'model-1.19246-0.603-0223.params'
symbol = mx.sym.load(model_arch_path)
inputs = mx.sym.var('data', dtype='float32')
value_out = symbol.get_internals()['value_tanh0_output']
policy_out = symbol.get_internals()['flatten0_output']
sym = mx.symbol.Group([value_out, policy_out])
net = mx.gluon.SymbolBlock(sym, inputs)
net.collect_params().load(model_params_path, ctx)

def run_inference():
    out = net(sample)

profiler.set_config(profile_all=True,
                    aggregate_stats=True,
                    filename='profile_output.json')

# warm-up
for i in range(100):
    run_inference()

# Ask the profiler to start recording
profiler.set_state('run')

for i in range(it):
    run_inference()

# Make sure all operations have completed
mx.nd.waitall()
# Ask the profiler to stop recording
profiler.set_state('stop')
# Dump all results to the log file
profiler.dump()
```

MXNet 1.4.1

print(profiler.dumps())
Profile Statistics.
    Note that counter items are counter values and not time units.
Device Storage
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Memory: cpu/0                     1170861          13.5680           0.0000         905.4720         452.7360

MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXInvokeCachedOpEx                  10000           6.5960           0.0000           0.0140           0.0007
MXNDArrayFree                       20000          13.3320           0.0000           0.0140           0.0007
MXInvokeCachedOp                    10000        8961.5879           0.6330          10.6770           0.8962
MXNDArrayWaitAll                        1      128982.7891      128982.7891      128982.7891      128982.7891
MXNDArrayGetContext                 10000           6.3600           0.0000           0.0140           0.0006
MXNet C API Concurrency            100002           0.0000           0.0000           0.0010           0.0005
MXNet C API Calls                   50001          50.0010           0.0010          50.0010          25.0000

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Reorder                             42326          78.8710           0.0000           0.0980           0.0019
DeleteVariable                    1312004        5104.0220           0.0000           5.2050           0.0039
Activation                         820000       13786.3789           0.0060           2.6830           0.0168
Convolution                        860000       87381.4688           0.0150          11.2850           0.1016
Pooling                            100000        1611.6990           0.0100           0.7420           0.0161
Reshape                            100000         296.2550           0.0010           0.0560           0.0030
BatchNorm                          840000       21546.2461           0.0100           2.1610           0.0257
broadcast_mul                      100000       13171.7432           0.1030           0.9910           0.1317
broadcast_add                      260000       88638.4297           0.2000          13.5370           0.3409
Flatten                            140000       10471.5117           0.0070           0.7100           0.0748
FullyConnected                     240000       18911.3770           0.0160           0.7730           0.0788

MXNet 1.5.0

print(profiler.dumps())
Profile Statistics:
    Note the difference in units for different entries.
Device Storage
=================
Name                          Total Count    Min Use  (kB)    Max Use  (kB)    Avg Use  (kB)
----                          -----------    -------------    -------------    -------------
Memory: cpu/0                     1171335           0.0000         891.9040         445.9520

MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXNDArrayWaitAll                        1      129400.2969      129400.2969      129400.2969      129400.2969
MXNet C API Calls                   50001          50.0010           0.0010          50.0010          25.0000
MXInvokeCachedOp                    10000       13172.8516           0.7930          15.1500           1.3173
MXNDArrayGetContext                 10000           7.4100           0.0000           0.0180           0.0007
MXInvokeCachedOpEx                  10000           7.2500           0.0000           0.0180           0.0007
MXNDArrayFree                       20000          14.4550           0.0000           0.0690           0.0007
MXNet C API Concurrency            100002           0.0000           0.0000           0.0010           0.0000

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
broadcast_add                      260000       91174.6172           0.2120           8.5150           0.3507
broadcast_mul                      100000       13579.5371           0.1100           1.4710           0.1358
Convolution                        860000       88691.5859           0.0160           9.0500           0.1031
FullyConnected                     240000       20774.8105           0.0200           5.6790           0.0866
Flatten                            140000       10516.3486           0.0060           0.7840           0.0751
BatchNorm                          840000       23081.5391           0.0120          10.7910           0.0275
Pooling                            100000        1740.1880           0.0110           0.4180           0.0174
Activation                         820000       14118.7568           0.0070           5.6640           0.0172
Reshape                            100000         537.9680           0.0020           0.0780           0.0054
DeleteVariable                    1312516        5137.8071           0.0000           0.5320           0.0039
Reorder                            286980         508.2870           0.0000           1.4380           0.0018



```
info score cp 289 depth 20 nodes 873 time 9751 nps 89 pv e2e4 e7e5 b1c3 b8c6 g1f3 f8c5 f1c4 g8f6 d2d3 e8g8 e1g1 d7d6 c1g5 h7h6 g5h4 c8g4 c3d5 f6d5 h4d8 a8d8 c4a6
bestmove e2e4
```

```
info score cp 290 depth 20 nodes 889 time 9755 nps 91 pv e2e4 e7e5 b1c3 b8c6 g1f3 f8c5 f1c4 g8f6 d2d3 e8g8 e1g1 d7d6 c1g5 h7h6 g5h4 c8g4 c3d5 f6d5 h4d8 a8d8 c4a6
```

Hi @QueensGambit, thanks for your detailed reply. Very cool project!

There was a lengthy discussion about operator performance regression here: https://github.com/apache/incubator-mxnet/issues/15429#issuecomment-508831150
It's still open, as we need a more comprehensive method to evaluate operator performance, and there could be regressions we didn't detect.

Summary is:

  1. The profiler code changed between 1.4.1 and 1.5.0, so it's not an apples-to-apples comparison (even untouched operators appear slower due to the profiler change).
  2. Using the time module, we didn't see a significant regression on batchnorm and broadcast_add (1-3% at most), and given the variance between runs, we decided to go ahead with the release. The ReLU activation slowdown is expected for correctness.
  3. We can look into the convolution you mentioned.

Also, if the cp, nodes, and depth values are different, does that mean the time will be different?

In addition, if you are using the mkl version, you can try the env vars mentioned here: https://github.com/apache/incubator-mxnet/issues/15429#issuecomment-507700562

```
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=18  # set to your number of cores
```
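When setting these from Python instead of the shell, the variables typically need to be in place before `import mxnet` initializes the OpenMP runtime; a minimal sketch (using `os.cpu_count()` in place of the hardcoded 18):

```python
import os

# Thread-affinity settings are read when the OpenMP runtime initializes,
# so set them before importing mxnet.
os.environ["KMP_AFFINITY"] = "granularity=fine,noduplicates,compact,1,0"
os.environ["OMP_NUM_THREADS"] = str(os.cpu_count())  # set to your number of cores

# import mxnet as mx  # import only after the variables are set
```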

> Also if the cp nodes depths values are different, does it mean the time will be different?

It's basically the other way around. Usually, movetime is a fixed time budget, and nodes reports how many nodes were created in the search tree within that time. This makes the engine applicable to different hardware and time controls. Higher nodes and depth values are preferable.
cp is a dynamically changing evaluation which, in theory, converges to the true value of a particular position given infinite samples (here: nodes). Nodes are reused for future board positions when possible; therefore, even in cases where a network with a slightly lower nps predicts the same bestmove, the higher-nps version has a better-calibrated evaluation of possible future positions.

In a simplified view, the .asnumpy() method is called and the prediction results of the neural network are assigned to a newly created node.
https://github.com/QueensGambit/CrazyAra/blob/master/DeepCrazyhouse/src/domain/agent/player/util/net_pred_service.py#L80

Thus this speed measurement should be fairly independent of the profiler changes.
Engines with a higher nps can explore the position more deeply in the same time and consequently have a higher playing strength.

> In addition, if you are using mkl version, you can try these env var mentioned here: #15429 (comment)

Thank you, I haven't tried optimizing all the MKL hyperparameters so far; I only activated `MXNET_SUBGRAPH_BACKEND=MKLDNN`, which gives a speed-up of ~16% for this particular model.

I'm also planning to support low-precision inference (float16, int8) and TensorRT in the future.
A C++ version of the engine will also be released soon.
There, the tree traversal and tree management on the CPU are significantly faster, and the inference time of the neural network is a much bigger factor.

Thanks for the explanation. I think the total time spent should be independent of the profiler code change, but the results from profiler.dumps() are affected.
The environment variable setting should give you another speed up.

Meanwhile we will look at operator regressions. Right now, only the time (ms) spent can be trusted.
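Since only wall-clock time can be trusted across the profiler change, a profiler-independent timing harness is useful. Below is a minimal stdlib-only sketch; the MXNet-specific `mx.nd.waitall()` flush points are indicated as comments, since MXNet executes asynchronously:

```python
import time

def wall_clock_ms(fn, iters, warmup=100):
    """Average wall-clock milliseconds per call of fn, independent of any profiler."""
    for _ in range(warmup):
        fn()
    # with MXNet: mx.nd.waitall() here, so warm-up work is flushed first
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    # with MXNet: mx.nd.waitall() here too, so queued async work is included
    return (time.perf_counter() - start) * 1000.0 / iters

avg_ms = wall_clock_ms(lambda: sum(range(1000)), iters=50)
```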

Thank you for the discussion @QueensGambit @roywei, I will try to reproduce the case and profile it first :)

Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
broadcast_add                       26000       17322.6543           0.4180           1.9220           0.6663
broadcast_mul                       10000        2184.6421           0.1850           0.7510           0.2185
FullyConnected                      24000        4494.6572           0.0310           0.8050           0.1873
Flatten                             14000        1890.4550           0.0090           0.2690           0.1350
Convolution                         86000        7898.2749           0.0260           1.3600           0.0918
Activation                          82000        4982.0112           0.0170           0.7210           0.0608
BatchNorm                           84000        3896.8450           0.0210           0.7810           0.0464
Pooling                             10000         315.3780           0.0170           0.6760           0.0315
Reshape                             10000          68.2620           0.0040           0.0230           0.0068
DeleteVariable                     142202         659.8880           0.0000           0.0680           0.0046
Reorder                             32420          53.1210           0.0000           0.0180           0.0016

Notice from the tables that the Activation operator shows a clear performance gap between 1.4.1 (avg time: 0.0291 ms) and 1.5.0 (avg time: 0.0608 ms). The differences in the other operators can be regarded as normal fluctuation, especially since the profiler code changed between 1.4.1 and 1.5.0.
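The relative change per operator follows directly from the two tables; a quick sketch with the avg-time values copied from the profiler output above:

```python
# Average per-call times (ms) copied from the profiler tables above.
avg_141 = {"Activation": 0.0291, "BatchNorm": 0.0423, "Convolution": 0.0904,
           "broadcast_add": 0.6130, "FullyConnected": 0.1913}
avg_150 = {"Activation": 0.0608, "BatchNorm": 0.0464, "Convolution": 0.0918,
           "broadcast_add": 0.6663, "FullyConnected": 0.1873}

for op in avg_141:
    change = (avg_150[op] - avg_141[op]) / avg_141[op] * 100
    print(f"{op}: {change:+.1f}%")  # Activation stands out at roughly +109%
```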

(2) To further investigate the performance regression of the Activation operator, I enabled MKLDNN verbose mode with export MKLDNN_VERBOSE=1, which collects basic statistics such as the execution time and input/output shapes of each MKLDNN operation. Parsing the verbose output and computing the average time of each operation per input shape gives:
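The parsing step can be sketched roughly as below. This assumes the MKL-DNN v0.x verbose format, where each `mkldnn_verbose,exec,...` line is comma-separated and ends with `<shape>,<time_ms>`; field positions differ between MKL-DNN/oneDNN versions, so treat this as an illustration rather than the exact script used:

```python
from collections import defaultdict

def parse_mkldnn_verbose(lines):
    """Aggregate per-shape execution times from MKLDNN_VERBOSE=1 output."""
    stats = defaultdict(lambda: [0, 0.0])  # (op, shape) -> [count, total_ms]
    for line in lines:
        if not line.startswith("mkldnn_verbose,exec"):
            continue
        fields = line.strip().split(",")
        shape, time_ms = fields[-2], float(fields[-1])
        # The algorithm tag (e.g. alg:eltwise_relu) sits in a middle field.
        alg = next((f for f in fields if f.startswith("alg:")), "alg:unknown")
        op = alg.replace("alg:eltwise_", "").replace("alg:", "")
        stats[(op, shape)][0] += 1
        stats[(op, shape)][1] += time_ms
    # Return (total_count, total_time, avg_time) per (op, shape).
    return {k: (c, t, t / c) for k, (c, t) in stats.items()}
```

Feeding it the captured stderr of a profiled run yields per-shape averages like the tables below.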

mxnet 1.4.1

op: relu,
shape:all, total_count:38500, total_time:386.78921302000026, avg_time:0.010046473065454552
shape:mb1ic128, total_count:5500, total_time:43.13965790000107, avg_time:0.007843574163636558
shape:mb1ic256ih8iw8, total_count:4400, total_time:41.59277998999979, avg_time:0.00945290454318177
shape:mb1ic128ih8iw8, total_count:2200, total_time:19.7434104399999, avg_time:0.008974277472727226
......

op: sigmoid, 
shape:all, total_count:5500, total_time:56.511490069998786, avg_time:0.010274816376363415
shape:mb1ic256, total_count:5500, total_time:56.511490069998786, avg_time:0.010274816376363415

mxnet 1.5.0

op: relu,
shape: all, total_count:38500, total_time:1070.7652004499996, avg_time:0.02781208312857142
shape:mb1ic128, total_count:5500, total_time:690.7533760000006, avg_time:0.125591522909091
shape:mb1ic256ih8iw8, total_count:4400, total_time:38.590089899999874, avg_time:0.008770474977272699
shape:mb1ic128ih8iw8, total_count:2200, total_time:18.308107539999845, avg_time:0.008321867063636293
......

op:sigmoid, 
shape: all, total_count:5500, total_time:680.288773400003, avg_time:0.12368886789090965
shape:mb1ic256, total_count:5500, total_time:680.288773400003, avg_time:0.12368886789090965

Notice from the results that for relu, the shape mb1ic128 (batch size = 1, input channels = 128) shows a performance regression (avg_time from 0.0078 to 0.1256 ms), while the other shapes are almost unchanged between 1.4.1 and 1.5.0. Similarly, for sigmoid, the shape mb1ic256 shows a regression (avg_time from 0.0103 to 0.1237 ms).
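The MKL-DNN shape mnemonics used above (mb = minibatch, ic = input channels, ih/iw = input height/width) can be decoded with a small helper; this is just an illustrative sketch of the naming convention, not a utility from MKL-DNN itself:

```python
import re

def decode_mkldnn_shape(s):
    """Decode shape strings like 'mb1ic128ih8iw8' into a readable dict."""
    names = {"mb": "batch_size", "ic": "input_channels",
             "ih": "input_height", "iw": "input_width",
             "oc": "output_channels", "oh": "output_height", "ow": "output_width"}
    return {names.get(k, k): int(v) for k, v in re.findall(r"([a-z]+)(\d+)", s)}

print(decode_mkldnn_shape("mb1ic128"))
# {'batch_size': 1, 'input_channels': 128}
print(decode_mkldnn_shape("mb1ic256ih8iw8"))
# {'batch_size': 1, 'input_channels': 256, 'input_height': 8, 'input_width': 8}
```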

On the other hand, since @QueensGambit mentioned that not all models show the regression, I suspect the model in CrazyAraFish_0.5.0_RiseV1.zip contains fewer of these operations, which would explain the observations above.

In conclusion, the MKLDNN relu with shape mb1ic128 and the sigmoid with shape mb1ic256 regressed from 1.4.1 to 1.5.0. We'll look into it.

Thanks @ElaineBao
FYI @QueensGambit, Elaine is working on a fix for these performance issues and will update the results soon.

Thank you for the notification @pengzhao-intel.
Wouldn't it be appropriate to disable the NaN-value check from https://github.com/apache/incubator-mxnet/pull/14262 by default whenever the model is loaded in full-inference mode (grad_req='null'), or to make it parametrizable?
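Such a switch could look roughly like the sketch below. This is purely hypothetical pseudostructure, not the actual code from PR #14262; the environment variable MXNET_CHECK_NAN and the plain-Python relu are illustrative assumptions:

```python
import math
import os

# Hypothetical opt-out flag; the name MXNET_CHECK_NAN is an assumption,
# not an actual MXNet environment variable.
CHECK_NAN = os.environ.get("MXNET_CHECK_NAN", "1") == "1"

def relu_forward(data, grad_req="write"):
    """Toy relu showing where a parametrizable NaN check could be gated."""
    out = [max(x, 0.0) for x in data]
    # Skip the costly NaN scan in pure-inference mode (grad_req='null')
    # or when the check is disabled via the environment variable.
    if CHECK_NAN and grad_req != "null":
        if any(math.isnan(x) for x in out):
            raise ValueError("NaN detected in activation output")
    return out
```

With grad_req='null' the scan is skipped entirely, so inference pays no extra per-element cost.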

ReLU is one of the most common building blocks, so the regression affects most current models.
Normally, once a model has been fully trained without producing NaNs, the final inference time is assigned the higher priority.

A preprint of our paper is now available on arXiv:

The paper describes, among other things, the network architecture in more detail.
Moreover, the Monte-Carlo tree search has been reimplemented in C++ using the MXNet C++ package:

Hi @QueensGambit, this issue has been fixed in MXNet 1.6. Since version 1.6 has not been released yet, you can install a nightly build: pip install mxnet-mkl==1.6.0b20190903.

I've tested the above scripts on this version:

op: relu
shape:all, total_count:38500, total_time:314.61432360999953, avg_time:0.008171800613246741
shape:mb1ic128, total_count:5500, total_time:60.115013090000865, avg_time:0.010930002380000158
shape:mb1ic256ih8iw8, total_count:4400, total_time:32.94727337000046, avg_time:0.0074880166750001045
shape:mb1ic128ih8iw8, total_count:2200, total_time:13.953863230000001, avg_time:0.0063426651045454556
...

op: sigmoid
shape:all, total_count:5500, total_time:62.94021099000075, avg_time:0.011443674725454682
shape:mb1ic256, total_count:5500, total_time:62.94021099000075, avg_time:0.011443674725454682

As you can see, relu with shape mb1ic128 and sigmoid with shape mb1ic256 are much faster than in 1.5.0 and are back at the same level as 1.4.1.
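For reference, the speedups implied by the numbers above (avg_time values copied from the two verbose reports):

```python
# avg_time values (ms) copied from the MKLDNN verbose reports above:
# 1.5.0 vs. the 1.6.0 nightly build.
relu_150, relu_160 = 0.125591522909091, 0.010930002380000158
sig_150, sig_160 = 0.12368886789090965, 0.011443674725454682

print(f"relu mb1ic128 speedup:    {relu_150 / relu_160:.1f}x")  # ~11.5x
print(f"sigmoid mb1ic256 speedup: {sig_150 / sig_160:.1f}x")    # ~10.8x
```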

Great news!
I'm happy to update to MXNet-1.6.0 or MXNet-1.6.0-Dev for the next _CrazyAra_ release.
Thank you for the awesome work.

