Problem Statement
Proposal
This topic is open for discussion. Please comment with your suggestions and feedback.
CC: @apeforest @ChaiBapchya @access2rohit @samskalicky @PatricZhao @TaoLv @ptrendx @marcoabreu
We can't use the CI system for performance measurements since it does not provide a consistent environment, for various reasons (efficiency, maintainability, etc.). Thus, we need a separate system whose sole purpose is to be entirely consistent.
Also, I'm afraid that using tests to also measure performance could be misleading, since tests might get extended or altered. I'd propose having dedicated benchmarks instead.
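To make the distinction concrete, here is a minimal sketch of what such a dedicated benchmark could look like, kept separate from the test suite. The operator, shapes, and warmup/run counts are illustrative choices, not part of any existing MXNet script.

```python
# Minimal sketch of a dedicated benchmark (not a unit test). The op, shapes and
# warmup/run counts below are illustrative only.
import time
import mxnet as mx

def benchmark_op(op, warmup=10, runs=100, **input_shapes):
    """Time an imperative operator call, separating warmup from timed runs."""
    args = {name: mx.nd.random.uniform(shape=shape)
            for name, shape in input_shapes.items()}
    for _ in range(warmup):                 # warm up caches / MKL-DNN primitives
        op(**args).wait_to_read()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        op(**args).wait_to_read()           # force sync; MXNet ops are asynchronous
        timings.append(time.perf_counter() - start)
    return timings

if __name__ == "__main__":
    times = benchmark_op(mx.nd.dot, lhs=(1024, 1024), rhs=(1024, 1024))
    print("median runtime: %.6f s" % sorted(times)[len(times) // 2])
```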
+1
It's a nice proposal: a single, unified dashboard would save a lot of maintenance effort across different organizations and make it very easy to track performance regressions.
Meanwhile, everyone can check and cite the latest performance numbers from the official repo.
There are actually lots of tasks to complete before achieving this goal. @juliusshufan can share some of our local experience first, and then we can go into the details of this proposal, including SW, HW, database, metrics, etc.
Thanks @PatricZhao - This requires both hardware and software setup. Let us start small with whatever is available and incrementally expand it. Looking forward to learning more from your experience.
@ptrendx - Any inputs on the performance-related tests / benchmarks / CI you maintain that could be upstreamed here?
We can certainly push some of our benchmarks to that common repo, although I'm not sure how to handle the differences between our container version of MXNet and upstream.
As for performance testing insights - having a dedicated machine is important (so probably a p3.16xlarge instance), as other tenants may skew the results, especially for cases that are more CPU- or IO-intensive.
Updating with some benchmark and accuracy tests from the Intel side.
Currently, we track the performance, accuracy and convergence of the MXNet GitHub repo nightly, covering different models and MXNet ops. Kernel-level performance is also measured with each MKLDNN upgrade. The performance measurements run on Xeon platforms, covering both "top-bin" and "main-stream" SKUs. The scripts include our internal ones and also leverage the public MXNet examples.
The performance report is normally compared and presented by:
The detailed HW specs we use for performance tracking are in the table below; we use CentOS 7.5 on bare-metal machines with the following configurations.
| SKU | Sockets | Physical Cores | HT | Turbo | RAM | RAM Slots | Memory Bandwidth |
| -- | -- | -- | -- | -- | -- | -- | -- |
| SKX-8180 | 2 | 28 | On | On | DDR4 2666 | 2*6 | 255 GB/s |
| SKX-6148 | 2 | 20 | On | On | DDR4 2666 | 2*6 | 255 GB/s |
| CLX-8280 | 2 | 28 | On | On | DDR4 2933 | 2*6 | 281 GB/s |
| CLX-8260 | 2 | 24 | On | On | DDR4 2933 | 2*6 | 281 GB/s |
| CLX-6248 | 2 | 20 | On | On | DDR4 2666 | 2*6 | 255 GB/s |
To reflect real production scenarios in the SW configurations we use for performance tracking, the benchmark measurements are executed with different socket/core/instance configurations (a sketch of how one such configuration can be fixed is shown below).
Other details we can discuss offline. Thanks.
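For reference, here is a rough sketch of how a per-run socket/core configuration can be controlled from Python via the standard OpenMP/KMP environment variables. The exact values are illustrative and are not the Intel settings described above.

```python
# Illustrative only: fixing the thread/core configuration for one benchmark run.
# These environment variables must be set before mxnet is imported; the values
# (28 threads, compact pinning) are examples, not the exact setup described above.
import os

os.environ["OMP_NUM_THREADS"] = "28"                          # threads per instance
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"   # pin threads to cores

import mxnet as mx  # import only after the environment is configured

a = mx.nd.random.uniform(shape=(1024, 1024))
b = mx.nd.random.uniform(shape=(1024, 1024))
mx.nd.dot(a, b).wait_to_read()
```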
@juliusshufan Thanks for providing the benchmark setup. Recently we have been running operator-level runtime comparisons between int32 and int64 data types for tensor indexing, using the MXNet opperf profiler contributed by @sandeep-krishnamurthy et al. However, we did notice large variations when measuring the runtime with MXNet's built-in profiler, and also a mismatch with the runtime we measured directly using Python's time module. @ChaiBapchya can provide more detailed performance results. We need a universal way to measure runtime in order to track performance results. Any advice would be appreciated.
Here are the links for Large Tensor Operator benchmarks I ran.
Python's Time module -
https://docs.google.com/spreadsheets/d/1GpdNquQb71Is5B-li99JDuiLeEZd-eSjHIIowzGrwxc/edit?usp=sharing
MXNet Profiler (built-in CPP profiler) - https://docs.google.com/spreadsheets/d/1VkZoBFacZo8NGNcdFU5P9gFs3dm7D_ykOkPUzUD-Yu4/edit?usp=sharing
Tested on - p3.16xl instance
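For anyone comparing the two spreadsheets, here is a small sketch of the two measurement paths (Python's time module vs. the built-in profiler); the operator and shapes are illustrative.

```python
# Sketch of the two measurement paths compared above; the op and shapes are illustrative.
import time
import mxnet as mx

a = mx.nd.random.uniform(shape=(1024, 1024))
b = mx.nd.random.uniform(shape=(1024, 1024))

# 1) Python's time module: wall-clock time around a synchronized call
start = time.perf_counter()
mx.nd.dot(a, b).wait_to_read()
print("python time: %.6f s" % (time.perf_counter() - start))

# 2) Built-in profiler: per-operator timings collected by the MXNet engine
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename="profile_output.json")
mx.profiler.set_state("run")
mx.nd.dot(a, b).wait_to_read()
mx.profiler.set_state("stop")
print(mx.profiler.dumps())  # aggregate per-operator statistics
```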
Thanks @apeforest @ChaiBapchya, we are testing the large tensor operators now and will come back with the results soon.
@pengzhao-intel There was some mistake in the earlier results due to CPU sharing. Chai has re-run profiling and collected the updated results here:
https://docs.google.com/spreadsheets/d/1GpdNquQb71Is5B-li99JDuiLeEZd-eSjHIIowzGrwxc/edit?usp=sharing
Please check the three sheets: Shape (1024, 1024), Shape (10000, 1) and Shape (10000, 100), corresponding to three different input shapes. The runtime numbers are the 50th percentile (p50) over 100 runs. There are comparisons between int64/int32 and int64mkl/int32mkl. Please feel free to ping @ChaiBapchya or me should you have any questions.
Thanks!
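As a side note on the p50 convention above, here is a tiny illustration of why the median is preferred over the mean when a single run glitches; the timing values are made up.

```python
# Illustration of the p50 convention above: one slow outlier skews the mean but
# barely moves the percentiles. The timing values are made up.
import numpy as np

runtimes = np.concatenate([np.full(99, 0.010), [0.500]])  # 99 normal runs, 1 glitch

print("mean: %.4f s" % runtimes.mean())             # pulled up by the outlier
print("p50 : %.4f s" % np.percentile(runtimes, 50))
print("p90 : %.4f s" % np.percentile(runtimes, 90))
print("p99 : %.4f s" % np.percentile(runtimes, 99))
```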
Erm, why are we running CPU-only benchmarks on a p3.16xlarge?
@marcoabreu You are right. We should be more frugal :) @ChaiBapchya a c5.18xlarge might be sufficient.
It's not necessarily only about frugality: the c5.18xlarge also contains different processors than the p3.16xlarge, as far as I know. So the results don't really reflect reality, but I also don't think it will make a big difference. In the future, though, we should let apples stay apples and pears be pears :)
It didn't occur to me that the instance type mattered; apologies for that. @marcoabreu Thanks for bringing it to our notice.
Having said that, I wanted some clarification:
"So the results don't really reflect the reality" - Why does running a CPU-only benchmark on a p3.16xl not reflect reality? All four configs (int32, int32+mkl, int64, int64+mkl) were run on the same instance. Moreover, I was planning to run GPU benchmarks as well. In that sense, wouldn't it make sense to run all of this on an instance that provides both CPU and GPU support?
"apples stay apples and pears be pears :)" - meaning CPU benchmarks on c5.18xl and GPU benchmarks on p3.16xl?
Thanks
We have just collected performance numbers for some operators (like FullyConnected, softmax, etc.) with the MKL-DNN implementation. We also compared the results between MKL-DNN v0.20 and v1.0. Currently one local CLX-8280 with 28 physical cores is used to run the benchmarks; later we may switch to an AWS EC2 C5 instance.
Since I don't have edit access to Chai's Google doc, I listed the results in another doc below (please check the sheet "Large Tensor Test (MKL-DNN)"):
https://docs.google.com/spreadsheets/d/10rhQEzDqnCjSKq27QlT04qNHegmAZjOoVqT_q287_ZU/edit?usp=sharing
It doesn't reflect reality in so far as that users would not run a cpu only build on a p3.16xlarge but on a c5 instead.
Right, they were run on the same instance, but I'm not sure (Intel, please confirm) whether the CPUs in a c5 might perform differently. In general I would doubt it and say that the relative results are still relevant, just not accurate.
I don't think it would make sense, to be honest. A user looks at throughput/$ (or latency, or whatever metric they optimize for). CPU instances are way cheaper, but might underperform in a direct comparison. If you normalize these results by cost, though, you will get a picture that's much closer to the reality of how a real user will use MXNet. In the end, we're optimizing for real use cases, so we should make the benchmarks and environment as close to reality as possible.
Correct, that's what I meant :)
I didn't check in detail, and sorry if my proposal introduces too much complexity, but what do you think about measuring not just one sequential execution, but instead the performance a fully utilized system is capable of handling (think of a service)? For example, a high batch size with one process (throughput optimized) vs. batch size one with many processes (latency optimized). A rough sketch of the two regimes follows below.
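The sketch below uses a single FullyConnected op as a stand-in for a model; the shapes, batch sizes, and iteration counts are illustrative, and a real latency-optimized setup would additionally pin each of the many worker processes to its own cores.

```python
# Rough sketch of the two regimes: throughput-optimized (large batch, one process)
# vs. latency-optimized (batch size one). Shapes and counts are illustrative.
import time
import mxnet as mx

weight = mx.nd.random.uniform(shape=(512, 1024))
bias = mx.nd.zeros(512)

def run(batch_size, iters):
    data = mx.nd.random.uniform(shape=(batch_size, 1024))
    start = time.perf_counter()
    for _ in range(iters):
        mx.nd.FullyConnected(data, weight, bias, num_hidden=512).wait_to_read()
    return time.perf_counter() - start

# Throughput-optimized: one process, large batch
elapsed = run(batch_size=256, iters=100)
print("batch=256: %.1f samples/s" % (256 * 100 / elapsed))

# Latency-optimized: batch size one; per-request latency is the metric of interest
elapsed = run(batch_size=1, iters=100)
print("batch=1: %.3f ms/request" % (1000 * elapsed / 100))
```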
Hi @wuxun-zhang, thanks for running the test and sharing the data. Are the performance numbers generated from your in-house profiling tool at Intel? We also noticed that using the average can sometimes be misleading due to glitches (one very large number), so we used the p50 number in the table instead.
@apeforest I used Chai's large tensor benchmark scripts with the latest MXNet master, so the data should be the average, not the p50 number. Later I will update the data using the p50 metric to ensure consistency with your data.
@wuxun-zhang For p50, p90 and p99 numbers, I have this PR: https://github.com/apache/incubator-mxnet/pull/15953
Once that's merged you will be able to get those numbers using Python's time module.
With the profiler flag, you can choose between python and native.
Hi @ChaiBapchya, are there any updates to this large tensor benchmark script? I tried to run the script at this commit and got the error below. It looks like the error is caused by incomplete input arguments (missing num_hidden for FC). BTW, the script works well for all operators except FC on my side. Thanks for your help in advance.
(mxnet_p36) ubuntu@ip-172-31-18-141:~/github/incubator-mxnet/benchmark/opperf$ python opperf_large_tensor.py --ctx=cpu -p python
Large tensor support : OFF
INFO:root:Running Large tensor benchmarks with the following options: Namespace(ctx='cpu', dtype='float32', mkldnn_option='mkldnn', output_file='./mxnet_operator_benchmarks.json', output_format='json', profiler='python')
[{'data': (1024, 1024), 'weight': (1024, 1024)}, {'data': (10000, 1), 'weight': (10000, 1)}, {'data': (10000, 100), 'weight': (10000, 100)}]
Traceback (most recent call last):
File "opperf_large_tensor.py", line 114, in <module>
sys.exit(main())
File "opperf_large_tensor.py", line 103, in main
final_benchmark_results = run_large_test_benchmarks(args.profiler, ctx=ctx, dtype=dtype)
File "opperf_large_tensor.py", line 46, in run_large_test_benchmarks
mx_large_tensor_results = run_op_benchmarks(mx_large_tensor_ops, dtype, ctx, profiler, warmup=10, runs=100)
File "/home/ubuntu/github/incubator-mxnet/benchmark/opperf/utils/benchmark_utils.py", line 157, in run_op_benchmarks
warmup=warmup, runs=runs)
File "/home/ubuntu/github/incubator-mxnet/benchmark/opperf/utils/benchmark_utils.py", line 137, in run_performance_test
benchmark_result = _run_nd_operator_performance_test(op, inputs, run_backward, warmup, runs, args_list, kwargs_list, profiler)
File "/home/ubuntu/github/incubator-mxnet/benchmark/opperf/utils/benchmark_utils.py", line 69, in _run_nd_operator_performance_test
_, _ = benchmark_helper_func(op, warmup, [], **kwargs_list[0])
File "/home/ubuntu/github/incubator-mxnet/benchmark/opperf/utils/profiler_utils.py", line 241, in python_profile_it
res = func(*modified_args, **kwargs)
File "/home/ubuntu/github/incubator-mxnet/benchmark/opperf/utils/ndarray_utils.py", line 48, in nd_forward_backward_and_profile
res = op(**kwargs)
File "<string>", line 86, in FullyConnected
File "/home/ubuntu/github/incubator-mxnet/python/mxnet/_ctypes/ndarray.py", line 100, in _imperative_invoke
ctypes.byref(out_stypes)))
File "/home/ubuntu/github/incubator-mxnet/python/mxnet/base.py", line 254, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Required parameter num_hidden of int is not presented, in operator FullyConnected(name="")
Yes. (This error is probably caused by using an out-of-date file; opperf_large_tensor.py was previously used for testing on my branch, but with the latest master, opperf.py is the file to use.)
A few pointers:
The functionality of opperf_large_tensor.py has been merged into the original opperf.py file. On the current master branch, all you have to do to run the opperf utility is:
python opperf.py with your desired flags, e.g. --ctx=cpu -p python
It will run all supported ops without error.
Let me know if that helps.
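For completeness, the opperf utility can also be called programmatically. The sketch below is a hypothetical usage; the exact run_performance_test signature is an assumption inferred from the opperf utilities referenced in the traceback above, so verify it against benchmark/opperf on your branch.

```python
# Hypothetical programmatic use of opperf, run from the MXNet source root. The
# run_performance_test signature is an assumption inferred from the opperf utils
# referenced above -- verify against benchmark/opperf/utils/benchmark_utils.py.
import mxnet as mx
from benchmark.opperf.utils.benchmark_utils import run_performance_test

results = run_performance_test(
    mx.nd.add,
    run_backward=True,
    dtype='float32',
    ctx=mx.cpu(),
    inputs=[{"lhs": (1024, 1024), "rhs": (1024, 1024)}],
    warmup=10,
    runs=100,
    profiler='python',   # 'python' uses the time module, 'native' the built-in profiler
)
print(results)
```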