I have trained a resnet50 model. When running inference with an early checkpoint (0003.params) on CPU, it runs at a reasonable speed. However, inference with a later checkpoint (0023.params) on CPU is much slower. Both checkpoints run very fast, at similar speeds, on GPU. I ran the profiler and it turns out the issue is in the Conv layers, where some convolutions take up to 110 ms.
Here are the profiling results:
fast model 0003.params on CPU:
Profile Statistics:
Note the difference in units for different entries.
Device Storage
=================
Name Total Count Min Use (kB) Max Use (kB) Avg Use (kB)
---- ----------- ------------- ------------- -------------
Memory: cpu/0 672 0.0000 214184.3125 107092.1562
MXNET_C_API
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
MXNDArrayWaitAll 1 5506.6421 5506.6421 5506.6421 5506.6421
MXExecutorForward 100 13.2700 0.0760 0.3040 0.1327
MXNet C API Calls 2301 2.3010 0.0010 2.3010 1.1500
MXImperativeInvokeEx 100 1.5130 0.0090 0.0560 0.0151
MXNDArrayGetShapeEx 1200 0.3880 0.0000 0.0090 0.0003
MXNDArrayReshape64 100 0.1490 0.0010 0.0040 0.0015
MXExecutorOutputs 100 0.0730 0.0000 0.0020 0.0007
MXNDArrayGetContext 200 0.0390 0.0000 0.0010 0.0002
MXNDArrayFree 200 0.0340 0.0000 0.0010 0.0002
MXNDArrayGetDType 200 0.0250 0.0000 0.0010 0.0001
MXNDArrayGetStorageType 100 0.0150 0.0000 0.0010 0.0002
MXNet C API Concurrency 4602 0.0000 0.0000 0.0010 0.0000
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
_sg_mkldnn_conv 5300 3687.2700 0.0430 3.5340 0.6957
LeakyReLU 2500 722.8160 0.1400 2.5450 0.2891
_sg_mkldnn_fully_connected 100 194.0460 1.8040 2.4230 1.9405
BatchNorm 2600 102.9240 0.0120 0.2850 0.0396
Dropout 100 14.9120 0.1140 0.2520 0.1491
DeleteVariable 559 6.8570 0.0010 1.3360 0.0123
CopyCPU2CPU 100 1.5180 0.0130 0.0240 0.0152
_minus_scalar 100 1.3820 0.0110 0.0230 0.0138
_mul_scalar 100 0.9680 0.0070 0.0260 0.0097
slow model 0023.params on CPU:
Profile Statistics:
Note the difference in units for different entries.
Device Storage
=================
Name Total Count Min Use (kB) Max Use (kB) Avg Use (kB)
---- ----------- ------------- ------------- -------------
Memory: cpu/0 672 0.0000 214184.3125 107092.1562
MXNET_C_API
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
MXNDArrayWaitAll 1 78402.4453 78402.4453 78402.4453 78402.4453
MXExecutorForward 100 15.5860 0.0890 0.2150 0.1559
MXNet C API Calls 2301 2.3010 0.0010 2.3010 1.1500
MXImperativeInvokeEx 100 1.5790 0.0090 0.0270 0.0158
MXNDArrayGetShapeEx 1200 0.5500 0.0000 0.0100 0.0005
MXNDArrayReshape64 100 0.1820 0.0010 0.0090 0.0018
MXExecutorOutputs 100 0.0740 0.0000 0.0030 0.0007
MXNDArrayGetContext 200 0.0540 0.0000 0.0010 0.0003
MXNDArrayGetDType 200 0.0390 0.0000 0.0010 0.0002
MXNDArrayFree 200 0.0360 0.0000 0.0010 0.0002
MXNDArrayGetStorageType 100 0.0260 0.0000 0.0010 0.0003
MXNet C API Concurrency 4602 0.0000 0.0000 0.0010 0.0000
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
_sg_mkldnn_conv 5300 68954.4688 0.0450 109.0230 13.0103
LeakyReLU 2500 1831.9830 0.1510 4.5000 0.7328
_sg_mkldnn_fully_connected 100 186.8240 1.8090 2.2790 1.8682
BatchNorm 2600 115.5410 0.0120 0.8330 0.0444
Dropout 100 15.9720 0.1430 0.2250 0.1597
DeleteVariable 559 6.3650 0.0010 1.4840 0.0114
_minus_scalar 100 1.4650 0.0120 0.0220 0.0147
CopyCPU2CPU 100 1.4390 0.0130 0.0190 0.0144
_mul_scalar 100 0.8480 0.0070 0.0150 0.0085
Error message: no error; inference completes but is slow.
import mxnet as mx
from mxnet import profiler

def run_inference():
    out = executor.forward(is_train=False, data=sample)

sym, arg_params, aux_params = mx.model.load_checkpoint('model', 23)
sample = mx.ndarray.zeros((1, 3, 112, 112))
it = 100
ctx = mx.cpu()
executor = sym.simple_bind(ctx=ctx, data=sample.shape, grad_req='null', force_rebind=True)
executor.copy_params_from(arg_params, aux_params)

profiler.set_config(profile_all=True,
                    aggregate_stats=True,
                    filename='profile_output.json')

# warm-up
for i in range(10):
    run_inference()

# Ask the profiler to start recording
profiler.set_state('run')
for i in range(it):
    run_inference()

# Make sure all operations have completed
mx.nd.waitall()

# Ask the profiler to stop recording
profiler.set_state('stop')

# Dump all results to log file before download
print(profiler.dumps())
MXNet 1.4 cu100mkl, MXNet 1.6 cu100mkl, and MXNet 1.7 native: all versions reproduce the issue.
0003.params, 0023.params and .json are in the following drive:
https://drive.google.com/drive/folders/1NE9wxNXmzPX35ImYBMLvbCpqLX8zq8Xq?usp=sharing
Hi @buaalsy2003, thanks for reporting this bug. I've managed to reproduce it and will try to find the cause and fix it.
Hi again @buaalsy2003. The problem here is inside the PReLU activation functions. During training, some of your model's gamma parameters reached very small, denormalized floating-point values, which slows down computation on the CPU. I need some time to prepare a proper solution, but if you need a quick fix to run your trained model on CPU, just iterate through the parameter dict and set the very small PReLU params to 0. Choose the threshold carefully so you don't lose accuracy.
Example code:
def fix_denorm_params():
    global arg_params
    for key in arg_params.keys():
        if 'prelu' in key:
            gammas = arg_params[key]
            for index, gamma in enumerate(gammas):
                if abs(gamma) < 1e-20:
                    arg_params[key][index] = 0.

...

def run_inference():
    out = executor.forward(is_train=False, data=sample)

sym, arg_params, aux_params = mx.model.load_checkpoint('model-reduce', 23)
fix_denorm_params()
sample = mx.ndarray.zeros((1, 3, 112, 112))
@grygielski Thanks a lot for your quick action! Just to confirm, the workaround works. Please keep us posted when a better solution is available.
BTW, what approach or debugging tool did you use to figure this out? In case I run into a similar issue, I'd like to be able to debug it myself.
@buaalsy2003 Sorry for my late response; I somehow missed your question.
As for how I figured out this problem: I had seen similar behavior in other frameworks, so denormal values were my initial guess. I didn't use any sophisticated debugging tool. To confirm the guess, I checked for denormals inside the C++ code (more precisely, with the fpclassify function): I added these checks on the convolution input values and built MXNet from source.
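A rough equivalent of that check can also be done from Python on the loaded checkpoint, without rebuilding MXNet. Here is a minimal sketch, assuming the checkpoint from this issue; np.finfo(np.float32).tiny is the smallest normal float32, so any non-zero value below it in magnitude is denormal:

import numpy as np
import mxnet as mx

def count_denormals(params, smallest_normal=np.finfo(np.float32).tiny):
    # Count parameter values that are non-zero but smaller in magnitude
    # than the smallest normal float32, i.e. denormal values.
    for name, nd in params.items():
        arr = nd.asnumpy()
        denormal = np.count_nonzero((arr != 0) & (np.abs(arr) < smallest_normal))
        if denormal:
            print('{}: {} denormal values out of {}'.format(name, denormal, arr.size))

sym, arg_params, aux_params = mx.model.load_checkpoint('model', 23)
count_denormals(arg_params)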
However, the first step for me is always to check the output of the MXNet run with the MKLDNN_VERBOSE=1 environment variable exported. It prints the oneDNN (MKL-DNN) primitives in execution order, with the execution time at the end of each line. That way I can compare two runs, like in this case, and see whether any of the primitives differ significantly.
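For example, a minimal sketch of enabling it from within the reproduction script (exporting the variable in the shell before launching Python works the same way):

import os
# Ask oneDNN (MKL-DNN) to print each executed primitive and its execution
# time; set the variable before any oneDNN primitive runs, i.e. before
# importing and using mxnet.
os.environ['MKLDNN_VERBOSE'] = '1'

import mxnet as mx
# ... run the same inference script for both checkpoints and compare the
# per-primitive timings printed to stdout.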
I hope this sheds some light on my thought process and helps you in the future.