Problem Statement
Proposal
This topic is open for discussion. Please comment with your suggestions and feedback.
CC: @apeforest @ChaiBapchya @access2rohit @samskalicky @PatricZhao @TaoLv @ptrendx @marcoabreu
We can't use the CI system for performance measurements since it does not provide a consistent environment, for various reasons (efficiency, maintainability, etc.). Thus, we need a separate system whose sole purpose is to be entirely consistent.
Also, I'm afraid that using tests to also measure performance could be misleading, since tests might get extended or altered. I'd propose having dedicated benchmarks instead.
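To make the distinction concrete, here is a minimal sketch of what such a dedicated benchmark could look like, kept separate from the test suite. The operator, shapes, and warmup/run counts are illustrative choices, not part of any existing MXNet script.

```python
# Minimal sketch of a dedicated benchmark (not a unit test). The op, shapes and
# warmup/run counts below are illustrative only.
import time
import mxnet as mx

def benchmark_op(op, warmup=10, runs=100, **input_shapes):
    """Time an imperative operator call, separating warmup from timed runs."""
    args = {name: mx.nd.random.uniform(shape=shape)
            for name, shape in input_shapes.items()}
    for _ in range(warmup):                 # warm up caches / MKL-DNN primitives
        op(**args).wait_to_read()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        op(**args).wait_to_read()           # force sync; MXNet ops are asynchronous
        timings.append(time.perf_counter() - start)
    return timings

if __name__ == "__main__":
    times = benchmark_op(mx.nd.dot, lhs=(1024, 1024), rhs=(1024, 1024))
    print("median runtime: %.6f s" % sorted(times)[len(times) // 2])
```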
+1
It's a nice proposal: a single, unified dashboard would save a lot of maintenance effort across different organizations and make it very easy to track performance regressions.
Meanwhile, everyone can check and cite the latest performance numbers from the official repo.
There are actually lots of tasks to complete before achieving this goal. @juliusshufan can share some of our local experience first, and then we can go into the details of this proposal, including SW, HW, database, metrics, etc.
Thanks @PatricZhao - This requires both hardware and software setup. Let us start small with whatever is available and incrementally expand it. Looking forward to learning more from your experience.
@ptrendx - Any inputs on the performance-related tests / benchmarks / CI you maintain that could be upstreamed here?
We can certainly push some of our benchmarks to that common repo, although I'm not sure how to handle the differences between our container version of MXNet and upstream.
As for performance testing insights - having a dedicated machine is important (so probably a p3.16xlarge instance), as other tenants may skew the results, especially for cases that are more CPU- or IO-intensive.
Updating with some benchmark and accuracy tests from the Intel side.
Currently, we track the performance, accuracy and convergence of the MXNet GitHub repo nightly, covering different models and MXNet ops. Kernel-level performance is also measured with each MKLDNN upgrade. The performance measurements run on Xeon platforms, covering both "top-bin" and "main-stream" SKUs. The scripts include our internal ones and also leverage the public MXNet examples.
The performance report is normally compared and presented by:
The detailed HW specs we use for performance tracking are in the table below; we use CentOS 7.5 on bare-metal machines with the following configurations.
| SKU | Sockets | Physical Cores | HT | Turbo | RAM | RAM Slots | Memory Bandwidth |
| -- | -- | -- | -- | -- | -- | -- | -- |
| SKX-8180 | 2 | 28 | On | On | DDR4 2666 | 2*6 | 255 GB/s |
| SKX-6148 | 2 | 20 | On | On | DDR4 2666 | 2*6 | 255 GB/s |
| CLX-8280 | 2 | 28 | On | On | DDR4 2933 | 2*6 | 281 GB/s |
| CLX-8260 | 2 | 24 | On | On | DDR4 2933 | 2*6 | 281 GB/s |
| CLX-6248 | 2 | 20 | On | On | DDR4 2666 | 2*6 | 255 GB/s |
To reflect real production scenarios in the SW configurations we use for performance tracking, the benchmark measurements are executed with different socket/core/instance configurations (a sketch of how one such configuration can be fixed is shown below).
Other details we can discuss offline. Thanks.
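For reference, here is a rough sketch of how a per-run socket/core configuration can be controlled from Python via the standard OpenMP/KMP environment variables. The exact values are illustrative and are not the Intel settings described above.

```python
# Illustrative only: fixing the thread/core configuration for one benchmark run.
# These environment variables must be set before mxnet is imported; the values
# (28 threads, compact pinning) are examples, not the exact setup described above.
import os

os.environ["OMP_NUM_THREADS"] = "28"                          # threads per instance
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"   # pin threads to cores

import mxnet as mx  # import only after the environment is configured

a = mx.nd.random.uniform(shape=(1024, 1024))
b = mx.nd.random.uniform(shape=(1024, 1024))
mx.nd.dot(a, b).wait_to_read()
```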
@juliusshufan Thanks for providing the benchmark setup. Recently we have been running operator-level runtime comparisons between int32 and int64 data types for tensor indexing, using the MXNet opperf profiler contributed by @sandeep-krishnamurthy et al. However, we did notice large variations when measuring the runtime with MXNet's built-in profiler, and also a mismatch with the runtime we measured directly using Python's time module. @ChaiBapchya can provide more detailed performance results. We need a universal way to measure runtime in order to track performance results. Any advice would be appreciated.
Here are the links for Large Tensor Operator benchmarks I ran.
Python's Time module -
https://docs.google.com/spreadsheets/d/1GpdNquQb71Is5B-li99JDuiLeEZd-eSjHIIowzGrwxc/edit?usp=sharing
MXNet Profiler (built-in CPP profiler) - https://docs.google.com/spreadsheets/d/1VkZoBFacZo8NGNcdFU5P9gFs3dm7D_ykOkPUzUD-Yu4/edit?usp=sharing
Tested on - p3.16xl instance
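For anyone comparing the two spreadsheets, here is a small sketch of the two measurement paths (Python's time module vs. the built-in profiler); the operator and shapes are illustrative.

```python
# Sketch of the two measurement paths compared above; the op and shapes are illustrative.
import time
import mxnet as mx

a = mx.nd.random.uniform(shape=(1024, 1024))
b = mx.nd.random.uniform(shape=(1024, 1024))

# 1) Python's time module: wall-clock time around a synchronized call
start = time.perf_counter()
mx.nd.dot(a, b).wait_to_read()
print("python time: %.6f s" % (time.perf_counter() - start))

# 2) Built-in profiler: per-operator timings collected by the MXNet engine
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename="profile_output.json")
mx.profiler.set_state("run")
mx.nd.dot(a, b).wait_to_read()
mx.profiler.set_state("stop")
print(mx.profiler.dumps())  # aggregate per-operator statistics
```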
Thanks @apeforest @ChaiBapchya, we are testing the large tensor operators now and will come back with the results soon.
@pengzhao-intel There was some mistake in the earlier results due to CPU sharing. Chai has re-run profiling and collected the updated results here:
https://docs.google.com/spreadsheets/d/1GpdNquQb71Is5B-li99JDuiLeEZd-eSjHIIowzGrwxc/edit?usp=sharing
Please check the three sheets: Shape (1024, 1024), Shape (10000, 1) and Shape (10000, 100), corresponding to three different input shapes. The runtime numbers are the 50th percentile (p50) over 100 runs. There are comparisons between int64/int32 and int64mkl/int32mkl. Please feel free to ping @ChaiBapchya or me should you have any questions.
Thanks!
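As a side note on the p50 convention above, here is a tiny illustration of why the median is preferred over the mean when a single run glitches; the timing values are made up.

```python
# Illustration of the p50 convention above: one slow outlier skews the mean but
# barely moves the percentiles. The timing values are made up.
import numpy as np

runtimes = np.concatenate([np.full(99, 0.010), [0.500]])  # 99 normal runs, 1 glitch

print("mean: %.4f s" % runtimes.mean())             # pulled up by the outlier
print("p50 : %.4f s" % np.percentile(runtimes, 50))
print("p90 : %.4f s" % np.percentile(runtimes, 90))
print("p99 : %.4f s" % np.percentile(runtimes, 99))
```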
Erm, why are we running CPU-only benchmarks on a p3.16xlarge?
@marcoabreu You are right. We should be more frugal :) @ChaiBapchya a c5.18xlarge might be sufficient.
It's not necessarily only about frugality: the c5.18xlarge also contains different processors than the p3.16xlarge, as far as I know. So the results don't really reflect reality, but I also don't think it will make a big difference. In the future, though, we should let apples stay apples and pears be pears :)
It didn't occur to me that the instance type mattered; apologies for that. @marcoabreu Thanks for bringing it to our notice.
Having said that, I wanted some clarification:
"So the results don't really reflect the reality" - Why does running a CPU-only benchmark on a p3.16xl not reflect reality? All four configs (int32, int32+mkl, int64, int64+mkl) were run on the same instance. Moreover, I was planning to run GPU benchmarks as well. In that sense, wouldn't it make sense to run all of this on an instance that provides both CPU and GPU support?
"apples stay apples and pears be pears :)" - meaning CPU benchmarks on c5.18xl and GPU benchmarks on p3.16xl?
Thanks
We have just collected performance numbers for some operators (like FullyConnected, softmax, etc.) with the MKL-DNN implementation. We also compared the results between MKL-DNN v0.20 and v1.0. Currently one local CLX-8280 with 28 physical cores is used to run the benchmarks; later we may switch to an AWS EC2 C5 instance.
Since I don't have edit access to Chai's Google doc, I listed the results in another doc below (please check the sheet "Large Tensor Test (MKL-DNN)"):
https://docs.google.com/spreadsheets/d/10rhQEzDqnCjSKq27QlT04qNHegmAZjOoVqT_q287_ZU/edit?usp=sharing
It doesn't reflect reality in so far as that users would not run a cpu only build on a p3.16xlarge but on a c5 instead.
Right, they were run on the same instance, but I'm not sure (Intel, please confirm) whether the CPUs in a c5 might perform differently. In general I would doubt it and say that the relative results are still relevant, just not accurate.
I don't think it would make sense, to be honest. A user looks at throughput/$ (or latency, or whatever metric they optimize for). CPU instances are way cheaper, but might underperform in a direct comparison. If you normalize these results by cost, though, you will get a picture that's much closer to the reality of how a real user will use MXNet. In the end, we're optimizing for real use cases, so we should make the benchmarks and environment as close to reality as possible.
Correct, that's what I meant :)
I didn't check in detail, and sorry if my proposal introduces too much complexity, but what do you think about measuring not just one sequential execution, but instead the performance a fully utilized system is capable of handling (think of a service)? For example, a high batch size with one process (throughput optimized) vs. batch size one with many processes (latency optimized). A rough sketch of the two regimes follows below.
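The sketch below uses a single FullyConnected op as a stand-in for a model; the shapes, batch sizes, and iteration counts are illustrative, and a real latency-optimized setup would additionally pin each of the many worker processes to its own cores.

```python
# Rough sketch of the two regimes: throughput-optimized (large batch, one process)
# vs. latency-optimized (batch size one). Shapes and counts are illustrative.
import time
import mxnet as mx

weight = mx.nd.random.uniform(shape=(512, 1024))
bias = mx.nd.zeros(512)

def run(batch_size, iters):
    data = mx.nd.random.uniform(shape=(batch_size, 1024))
    start = time.perf_counter()
    for _ in range(iters):
        mx.nd.FullyConnected(data, weight, bias, num_hidden=512).wait_to_read()
    return time.perf_counter() - start

# Throughput-optimized: one process, large batch
elapsed = run(batch_size=256, iters=100)
print("batch=256: %.1f samples/s" % (256 * 100 / elapsed))

# Latency-optimized: batch size one; per-request latency is the metric of interest
elapsed = run(batch_size=1, iters=100)
print("batch=1: %.3f ms/request" % (1000 * elapsed / 100))
```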
Hi @wuxun-zhang, thanks for running the test and sharing the data. Are the performance numbers generated from your in-house profiling tool at Intel? We also noticed that using the average can sometimes be misleading due to glitches (one very large number), so we used the p50 number in the table instead.
@apeforest I used Chai's large tensor benchmark scripts with the latest MXNet master, so the data should be the average, not the p50 number. Later I will update the data using the p50 metric to ensure consistency with your data.
@wuxun-zhang For p50, p90 and p99 numbers, I have this PR: https://github.com/apache/incubator-mxnet/pull/15953
Once that's merged you will be able to get those numbers using Python's time module.
With the profiler flag, you can choose between python and native.
Hi @ChaiBapchya, are there any updates to this large tensor benchmark script? I tried to run the script at this commit and got the error below. It looks like the error is caused by incomplete input arguments (missing num_hidden for FC). BTW, the script works well for all operators except FC on my side. Thanks for your help in advance.
(mxnet_p36) ubuntu@ip-172-31-18-141:~/github/incubator-mxnet/benchmark/opperf$ python opperf_large_tensor.py --ctx=cpu -p python
Large tensor support : OFF
INFO:root:Running Large tensor benchmarks with the following options: Namespace(ctx='cpu', dtype='float32', mkldnn_option='mkldnn', output_file='./mxnet_operator_benchmarks.json', output_format='json', profiler='python')
[{'data': (1024, 1024), 'weight': (1024, 1024)}, {'data': (10000, 1), 'weight': (10000, 1)}, {'data': (10000, 100), 'weight': (10000, 100)}]
Traceback (most recent call last):
File "opperf_large_tensor.py", line 114, in <module>
sys.exit(main())
File "opperf_large_tensor.py", line 103, in main
final_benchmark_results = run_large_test_benchmarks(args.profiler, ctx=ctx, dtype=dtype)
File "opperf_large_tensor.py", line 46, in run_large_test_benchmarks
mx_large_tensor_results = run_op_benchmarks(mx_large_tensor_ops, dtype, ctx, profiler, warmup=10, runs=100)
File "/home/ubuntu/github/incubator-mxnet/benchmark/opperf/utils/benchmark_utils.py", line 157, in run_op_benchmarks
warmup=warmup, runs=runs)
File "/home/ubuntu/github/incubator-mxnet/benchmark/opperf/utils/benchmark_utils.py", line 137, in run_performance_test
benchmark_result = _run_nd_operator_performance_test(op, inputs, run_backward, warmup, runs, args_list, kwargs_list, profiler)
File "/home/ubuntu/github/incubator-mxnet/benchmark/opperf/utils/benchmark_utils.py", line 69, in _run_nd_operator_performance_test
_, _ = benchmark_helper_func(op, warmup, [], **kwargs_list[0])
File "/home/ubuntu/github/incubator-mxnet/benchmark/opperf/utils/profiler_utils.py", line 241, in python_profile_it
res = func(*modified_args, **kwargs)
File "/home/ubuntu/github/incubator-mxnet/benchmark/opperf/utils/ndarray_utils.py", line 48, in nd_forward_backward_and_profile
res = op(**kwargs)
File "<string>", line 86, in FullyConnected
File "/home/ubuntu/github/incubator-mxnet/python/mxnet/_ctypes/ndarray.py", line 100, in _imperative_invoke
ctypes.byref(out_stypes)))
File "/home/ubuntu/github/incubator-mxnet/python/mxnet/base.py", line 254, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Required parameter num_hidden of int is not presented, in operator FullyConnected(name="")
Yes. (This error is probably caused by using an out-of-date file; opperf_large_tensor.py was previously used for testing on my branch, but with the latest master, opperf.py is the file to use.)
A few pointers:
The functionality of opperf_large_tensor.py has been merged into the original opperf.py file. On the current master branch, all you have to do to run the opperf utility is:
python opperf.py with your desired flags, e.g. --ctx=cpu -p python
It will run all supported ops without error.
Let me know if that helps.
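For completeness, the opperf utility can also be called programmatically. The sketch below is a hypothetical usage; the exact run_performance_test signature is an assumption inferred from the opperf utilities referenced in the traceback above, so verify it against benchmark/opperf on your branch.

```python
# Hypothetical programmatic use of opperf, run from the MXNet source root. The
# run_performance_test signature is an assumption inferred from the opperf utils
# referenced above -- verify against benchmark/opperf/utils/benchmark_utils.py.
import mxnet as mx
from benchmark.opperf.utils.benchmark_utils import run_performance_test

results = run_performance_test(
    mx.nd.add,
    run_backward=True,
    dtype='float32',
    ctx=mx.cpu(),
    inputs=[{"lhs": (1024, 1024), "rhs": (1024, 1024)}],
    warmup=10,
    runs=100,
    profiler='python',   # 'python' uses the time module, 'native' the built-in profiler
)
print(results)
```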