Incubator-mxnet: CPU inference is very slow for some model checkpoints

Created on 24 Sep 2020  路  5Comments  路  Source: apache/incubator-mxnet

Description

I have trained a resnet50 model. When running inference using an early checkpoint (0003.params) on CPU, it runs on reasonable speed. However, it is much slower to run inference using a later checkpoint (0023.params) on CPU. Both models runs very fast at similar speed on GPU. I ran the profile and it turns out that the issue is at Conv Layer, where some Conv take up to 110ms.

Here are the profiling results:
fast model 0003.params on CPU:

Profile Statistics:
    Note the difference in units for different entries.
Device Storage
=================
Name                          Total Count    Min Use  (kB)    Max Use  (kB)    Avg Use  (kB)
----                          -----------    -------------    -------------    -------------
Memory: cpu/0                         672           0.0000      214184.3125      107092.1562
MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXNDArrayWaitAll                        1        5506.6421        5506.6421        5506.6421        5506.6421
MXExecutorForward                     100          13.2700           0.0760           0.3040           0.1327
MXNet C API Calls                    2301           2.3010           0.0010           2.3010           1.1500
MXImperativeInvokeEx                  100           1.5130           0.0090           0.0560           0.0151
MXNDArrayGetShapeEx                  1200           0.3880           0.0000           0.0090           0.0003
MXNDArrayReshape64                    100           0.1490           0.0010           0.0040           0.0015
MXExecutorOutputs                     100           0.0730           0.0000           0.0020           0.0007
MXNDArrayGetContext                   200           0.0390           0.0000           0.0010           0.0002
MXNDArrayFree                         200           0.0340           0.0000           0.0010           0.0002
MXNDArrayGetDType                     200           0.0250           0.0000           0.0010           0.0001
MXNDArrayGetStorageType               100           0.0150           0.0000           0.0010           0.0002
MXNet C API Concurrency              4602           0.0000           0.0000           0.0010           0.0000
operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
_sg_mkldnn_conv                      5300        3687.2700           0.0430           3.5340           0.6957
LeakyReLU                            2500         722.8160           0.1400           2.5450           0.2891
_sg_mkldnn_fully_connected             100         194.0460           1.8040           2.4230           1.9405
BatchNorm                            2600         102.9240           0.0120           0.2850           0.0396
Dropout                               100          14.9120           0.1140           0.2520           0.1491
DeleteVariable                        559           6.8570           0.0010           1.3360           0.0123
CopyCPU2CPU                           100           1.5180           0.0130           0.0240           0.0152
_minus_scalar                         100           1.3820           0.0110           0.0230           0.0138
_mul_scalar                           100           0.9680           0.0070           0.0260           0.0097

slow model 0023.params on CPU

Profile Statistics:
    Note the difference in units for different entries.
Device Storage
=================
Name                          Total Count    Min Use  (kB)    Max Use  (kB)    Avg Use  (kB)
----                          -----------    -------------    -------------    -------------
Memory: cpu/0                         672           0.0000      214184.3125      107092.1562
MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXNDArrayWaitAll                        1       78402.4453       78402.4453       78402.4453       78402.4453
MXExecutorForward                     100          15.5860           0.0890           0.2150           0.1559
MXNet C API Calls                    2301           2.3010           0.0010           2.3010           1.1500
MXImperativeInvokeEx                  100           1.5790           0.0090           0.0270           0.0158
MXNDArrayGetShapeEx                  1200           0.5500           0.0000           0.0100           0.0005
MXNDArrayReshape64                    100           0.1820           0.0010           0.0090           0.0018
MXExecutorOutputs                     100           0.0740           0.0000           0.0030           0.0007
MXNDArrayGetContext                   200           0.0540           0.0000           0.0010           0.0003
MXNDArrayGetDType                     200           0.0390           0.0000           0.0010           0.0002
MXNDArrayFree                         200           0.0360           0.0000           0.0010           0.0002
MXNDArrayGetStorageType               100           0.0260           0.0000           0.0010           0.0003
MXNet C API Concurrency              4602           0.0000           0.0000           0.0010           0.0000
operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
_sg_mkldnn_conv                      5300       68954.4688           0.0450         109.0230          13.0103
LeakyReLU                            2500        1831.9830           0.1510           4.5000           0.7328
_sg_mkldnn_fully_connected             100         186.8240           1.8090           2.2790           1.8682
BatchNorm                            2600         115.5410           0.0120           0.8330           0.0444
Dropout                               100          15.9720           0.1430           0.2250           0.1597
DeleteVariable                        559           6.3650           0.0010           1.4840           0.0114
_minus_scalar                         100           1.4650           0.0120           0.0220           0.0147
CopyCPU2CPU                           100           1.4390           0.0130           0.0190           0.0144
_mul_scalar                           100           0.8480           0.0070           0.0150           0.0085

Error Message

No error

To Reproduce

    def run_inference():
        out = executor.forward(is_train=False, data=sample)
        pass
    sym, arg_params, aux_params = mx.model.load_checkpoint('model', 23)
    sample = mx.ndarray.zeros((1, 3, 112, 112))
    it = 100
    ctx = mx.cpu()  # mx.cpu()
    executor = sym.simple_bind(ctx=ctx, data=sample.shape, grad_req='null', force_rebind=True)
    executor.copy_params_from(arg_params, aux_params)

    profiler.set_config(profile_all=True,
                        aggregate_stats=True,
                        filename='profile_output.json')

    # warm-up
    for i in range(10):
        run_inference()

    # Ask the profiler to start recording
    profiler.set_state('run')

    for i in range(it):
        run_inference()

    # Make sure all operations have completed
    mx.nd.waitall()
    # Ask the profiler to stop recording
    profiler.set_state('stop')
    # Dump all results to log file before download
    print(profiler.dumps())

Steps to reproduce

(Paste the commands you ran that produced the error.)

  1. run above code with attached 0003.params and 0023.params
  2. compare the output

What have you tried to solve it?

1.
2.

Environment

MXNet 1.4 cu100mkl, MXNet 1.6 cu100mkl, MXNet1.7 native. All versions could reproduce the issue.

0003.params, 0023.params and .json are in the following drive:
https://drive.google.com/drive/folders/1NE9wxNXmzPX35ImYBMLvbCpqLX8zq8Xq?usp=sharing

Bug MKLDNN needs triage

Most helpful comment

Hi again @buaalsy2003. Problem here is inside PReLu activation functions. During training your model, some gamma parameters reached very low, denormalized floating point values. It slows down the execution of computations on CPU. I need some time to prepare proper solution for that but if you need quick fix to run your trained model on CPU, just iterate through parameter dict and set low prelu params to 0. Try choosing the right threshold so you won't lose accuracy.

Example code:

def fix_denorm_params():
    global arg_params
    for key in arg_params.keys():
        if 'prelu' in key:
            gammas = arg_params[key]
            for index, gamma in enumerate(gammas):
                if abs(gamma) < 1e-20:
                    arg_params[key][index] = 0.

...

def run_inference():
    out = executor.forward(is_train=False, data=sample)
    pass
sym, arg_params, aux_params = mx.model.load_checkpoint('model-reduce', 23)
fix_denorm_params()
sample = mx.ndarray.zeros((1, 3, 112, 112))

All 5 comments

Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue.
Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly.
If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on contributing to MXNet and our development guides wiki.

Hi @buaalsy2003 thanks for reporting this bug, I've managed to reproduce it. I'll try to find the cause and fix it.

Hi again @buaalsy2003. Problem here is inside PReLu activation functions. During training your model, some gamma parameters reached very low, denormalized floating point values. It slows down the execution of computations on CPU. I need some time to prepare proper solution for that but if you need quick fix to run your trained model on CPU, just iterate through parameter dict and set low prelu params to 0. Try choosing the right threshold so you won't lose accuracy.

Example code:

def fix_denorm_params():
    global arg_params
    for key in arg_params.keys():
        if 'prelu' in key:
            gammas = arg_params[key]
            for index, gamma in enumerate(gammas):
                if abs(gamma) < 1e-20:
                    arg_params[key][index] = 0.

...

def run_inference():
    out = executor.forward(is_train=False, data=sample)
    pass
sym, arg_params, aux_params = mx.model.load_checkpoint('model-reduce', 23)
fix_denorm_params()
sample = mx.ndarray.zeros((1, 3, 112, 112))

@grygielski Thanks a lot for your quick action! Just to confirm the workaround works. Please keep us posted when better solution is available.
BTW, what is the approach or debug tool you used to figure this out? In case I am running into similar issue so I could try to debug it myself.

@buaalsy2003 Sorry for my late response but I somehow missed your question.
If it comes to how I figured out this problem, I had some experience with similar behavior from other frameworks so denormal values was my initial guess. I didn't use any sophisticated debugging tool, to confirm that, I just checked for denormals inside C++ code (more precisely with fpclassify function). I've added these checks on convolution input values and built MXNet from source.

However, the first step for me is always checking an output of running MXNet code with export MKLDNN_VERBOSE=1 environment variable. It outputs oneDNN (MKL-DNN) primitives executed in order with the execution time at the end. This way I can compare 2 runs like in this case and see if any of them differ significantly.

I hope it somehow shed a light on my thought process and can help you in the future.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

WangcsShuai picture WangcsShuai  路  3Comments

yuconglin picture yuconglin  路  3Comments

seongkyun picture seongkyun  路  3Comments

xzqjack picture xzqjack  路  3Comments

Zhaoyang-XU picture Zhaoyang-XU  路  3Comments