Taichi: [test] Continuous Performance Regression Tests

Created on 11 May 2020 · 14 Comments · Source: taichi-dev/taichi

Concisely describe the proposed feature

I think it would be great to have a CI pipeline that runs some benchmarks as regression tests. That way we could easily detect problems like https://github.com/taichi-dev/taichi/pull/937#issuecomment-626282263.

enhancement feature request

k-ye

👍1

Most helpful comment

Great idea. Thanks for proposing this.

I think it's better to be funded, rather than paying out of our own pockets, even if we are very enthusiastic about this...

We can find a computer in our lab for benchmark purposes. We need a machine with consistent hardware; otherwise the performance comparisons won't make much sense. I guess Travis will just randomly pick an available VM slot, whose hardware capability fluctuates. Our group also has some free Google Cloud accounts. I'll think about this.

Anyway, before all this fancy stuff, we must set up ti benchmark first. @xumingkuan do you have any idea how to implement this? Many thanks :)

We actually already have some basic benchmarks: https://github.com/taichi-dev/taichi/tree/master/benchmarks. https://github.com/taichi-dev/taichi/blob/master/benchmarks/run.py can trigger these benchmarks:

fill_dense:
 * flat_range                        x64        8.852 ms       cuda       0.402 ms
 * flat_struct                       x64        5.688 ms       cuda       0.398 ms
 * nested_range                      x64        8.549 ms       cuda       0.826 ms
 * nested_range_blocked              x64        4.323 ms       cuda       6.724 ms
 * nested_struct                     x64        5.694 ms       cuda       0.324 ms
 * nested_struct_listgen_16x16       x64        5.693 ms       cuda       0.317 ms
 * nested_struct_listgen_8x8         x64        5.763 ms       cuda       0.316 ms
 * root_listgen                      x64        5.685 ms       cuda       0.402 ms
fill_sparse:
 * nested_struct                     x64       11.053 ms       cuda       0.674 ms
 * nested_struct_fill_and_clear      x64       43.212 ms       cuda      22.951 ms
memory_bound:
 * memcpy                            x64       78.917 ms       cuda       8.072 ms
 * memset                            x64       92.055 ms       cuda       5.042 ms
 * saxpy                             x64       97.547 ms       cuda      11.460 ms
 * sscal                             x64       98.836 ms       cuda       7.809 ms
minimal:
 * fill_scalar                       x64        0.002 ms       cuda       0.007 ms
mpm2d:
 * range                             x64        0.793 ms       cuda       0.027 ms
 * struct                            x64        0.773 ms       cuda       0.028 ms

These can be reused. For example, https://github.com/taichi-dev/taichi/blob/master/benchmarks/mpm2d.py should be able to detect the performance issue introduced in #937. How to automatically summarize the benchmark results and display them on GitHub is worth discussing.

I haven't had a chance to work systematically on performance issues, though.
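As a starting point for the summarizing step, here is a rough sketch of turning the printed table into data. It assumes the output keeps the layout shown above (a "suite:" line followed by " * case  arch  time ms ..." lines). The name parse_benchmark_output and the nested-dict layout are made up for illustration and are not part of benchmarks/run.py.

```python
import re

# A result line looks like:
# " * flat_range      x64   8.852 ms   cuda   0.402 ms"
RESULT_LINE = re.compile(r"^\s*\*\s+(\S+)\s+(.*)$")
# Each "<arch> <number> ms" pair on a result line.
ARCH_TIME = re.compile(r"(\S+)\s+([0-9.]+)\s*ms")


def parse_benchmark_output(text):
    """Return {suite: {case: {arch: milliseconds}}} (hypothetical helper)."""
    results = {}
    suite = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        match = RESULT_LINE.match(line)
        if match and suite is not None:
            case, rest = match.groups()
            results[suite][case] = {
                arch: float(ms) for arch, ms in ARCH_TIME.findall(rest)
            }
        elif stripped.endswith(":"):
            # A line such as "mpm2d:" starts a new suite.
            suite = stripped[:-1]
            results[suite] = {}
    return results


SAMPLE = """\
minimal:
 * fill_scalar                       x64        0.002 ms       cuda       0.007 ms
mpm2d:
 * range                             x64        0.793 ms       cuda       0.027 ms
 * struct                            x64        0.773 ms       cuda       0.028 ms
"""

parsed = parse_benchmark_output(SAMPLE)
print(parsed["mpm2d"]["range"])
```

Once the numbers are in a dict like this, they can be written to a file, diffed against an earlier run, or formatted for display.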

yuanming-hu on 11 May 2020

👍2

All 14 comments

https://www.cnblogs.com/younggun/articles/1814989.html
I thought that's exactly what we do in tests/python?

archibate on 11 May 2020

By "regression" I mean to detect performance regression (e.g. a new change caused the performance, as measured by our benchmark tests, to drop by 50%).

In contrast, what we currently have in the CI are just unit tests. They verify that the system is not fundamentally broken.
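To make "drop by 50%" concrete: if performance is read as throughput, a 50% drop means the same benchmark takes twice as long. A minimal sketch of such a check follows. The function names and the 0.5 default are illustrative assumptions, not an agreed policy, and the timings are made up.

```python
def performance_drop(baseline_ms, current_ms):
    """Fractional throughput loss; 0.5 means the run now takes twice as long."""
    return 1.0 - baseline_ms / current_ms


def is_regression(baseline_ms, current_ms, max_drop=0.5):
    """Flag a benchmark whose throughput fell by at least `max_drop`."""
    return performance_drop(baseline_ms, current_ms) >= max_drop


# Hypothetical timings: the second run takes twice as long as the first.
print(is_regression(10.0, 20.0))
# Hypothetical timings: only slightly slower, below the threshold.
print(is_regression(10.0, 11.0))
```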

k-ye on 11 May 2020

👍2

Thanks for clarifying. So we want to verify that functionality is not broken, and also that performance is not broken? I'm not sure how Travis CI could do this; currently we can only do it by switching back and forth with git and running the benchmarks by hand.

archibate on 11 May 2020

and also that performance is not broken?

Yep

I'm not sure how Travis CI could do this;

I'm no expert on this either. But this kind of regression test is actually quite common, so I guess Travis must have a way to run some command and then produce a few timing numbers.

I suggest we don't worry too much about this issue. We may prioritize this when Taichi is more mature. For now I'm simply creating an issue so that we don't forget :)

k-ye on 11 May 2020

I'm no expert on this either. But this kind of regression test is actually quite common, so I guess Travis must have a way to run some command and then produce a few timing numbers.

I searched the web and found no information relating Travis to CPRT...

A straightforward approach could be:
Add a file called last_benchmark.txt containing the numbers generated for each commit.
Then let the CI or a human reviewer check whether the values in last_benchmark.txt increased or decreased, and report the change.
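A minimal sketch of that idea, assuming last_benchmark.txt holds one "name milliseconds" pair per line. The file format, the helper names, and the "new run" numbers are all made up for illustration.

```python
import os
import tempfile


def save_benchmarks(path, results):
    """Write one `name milliseconds` pair per line, sorted for stable diffs."""
    with open(path, "w") as f:
        for name in sorted(results):
            f.write(f"{name} {results[name]:.3f}\n")


def load_benchmarks(path):
    """Read the file back into a {name: milliseconds} dict."""
    results = {}
    with open(path) as f:
        for line in f:
            if line.strip():
                name, value = line.split()
                results[name] = float(value)
    return results


def report_changes(old, new):
    """Describe how each benchmark present in both runs changed."""
    lines = []
    for name in sorted(old.keys() & new.keys()):
        change = (new[name] - old[name]) / old[name] * 100.0
        lines.append(
            f"{name}: {old[name]:.3f} ms -> {new[name]:.3f} ms ({change:+.1f}%)"
        )
    return lines


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "last_benchmark.txt")
    save_benchmarks(path, {"mpm2d.range": 0.793, "mpm2d.struct": 0.773})
    previous = load_benchmarks(path)

current = {"mpm2d.range": 1.586, "mpm2d.struct": 0.773}  # made-up new run
for line in report_changes(previous, current):
    print(line)
```

Keeping the file sorted and fixed-precision means a commit that only reruns the benchmarks produces a small, readable diff.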

archibate on 11 May 2020

I searched the web and found no information relating Travis to CPRT...

Ah, the naming could be performance tests, benchmark (BM) tests... I think the terms are pretty confusing here.

Yeah, I think having a file to store the historical BM data is a good way to get things going. (Usually this would be stored in a database for ease of querying, but obviously we'd then have to pay for that...) I think this can be even simpler: configure the bot so that it posts the BM data on each PR. For example: https://github.com/pingcap/tidb/pull/17101#issuecomment-626607152

k-ye on 11 May 2020

pingcap/tidb#17101 (comment)

Cool! But I guess we would have to pay for that. Not sure if @yuanming-hu would like this...
It occurs to me that we could extend our format server with a [Click to update benchmark] link:
https://github.com/taichi-dev/taichi/blob/471392bcda9ad204337591559355920cfc7736a4/.github/pull_request_template.md#L7
When clicked, it runs ti benchmark and updates misc/benchmark.txt in a commit titled [skip ci] update benchmark, just like the existing [skip ci] enforce code format commit.
Or we could trigger this when a user pushes a commit such as [benchmark] do benchmark for me, like [format] currently does.
Then the reviewers can check the Files changed page to see whether the performance increased or decreased.
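The commit-message trigger could look roughly like this. It is only a sketch: the helper names are hypothetical, and how the existing [format] handling is actually implemented on the server is not shown in this thread.

```python
import re

# Matches bracketed markers such as "[benchmark]" or "[skip ci]".
TAG = re.compile(r"\[([^\]]+)\]")


def commit_tags(message):
    """Return the lowercase [tag] markers on the first line of a commit message."""
    first_line = message.splitlines()[0] if message else ""
    return [tag.lower() for tag in TAG.findall(first_line)]


def should_run_benchmark(message):
    """Run benchmarks only when asked, and never for bot commits marked [skip ci]."""
    tags = commit_tags(message)
    return "benchmark" in tags and "skip ci" not in tags


print(should_run_benchmark("[benchmark] do benchmark for me"))
print(should_run_benchmark("[skip ci] update benchmark"))
```

Excluding [skip ci] keeps the bot's own "update benchmark" commit from triggering another run.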

archibate on 11 May 2020

But I guess we would have to pay for that. Not sure if @yuanming-hu would like this...

I think it's better to be funded, rather than paying out of our own pockets, even if we are very enthusiastic about this...

When clicked, it runs ti benchmark and updates misc/benchmark.txt in a commit titled [skip ci] update benchmark, just like the existing [skip ci] enforce code format commit.

Yeah, I think this can be a good start. The good thing about having a report on the PR, though, is that people can actively look into it. But again, this is all fancy stuff that we don't need urgently.

k-ye on 11 May 2020

Anyway, before all this fancy stuff, we must set up ti benchmark first. @xumingkuan do you have any idea how to implement this? Many thanks :)

archibate on 11 May 2020

Anyway, before all this fancy stuff, we must set up ti benchmark first. @xumingkuan do you have any idea how to implement this? Many thanks :)

Currently, I'm just setting TI_PRINT_BENCHMARK_STAT=1, which generates a log file when running each unit test, for my benchmark charts.

it runs ti benchmark and updates misc/benchmark.txt

If you want something like this, we can just let this command set print_benchmark_stat = true, run tests, and then read all log files and collect the data (the number of statements).

xumingkuan on 11 May 2020

👍1

If you want something like this, we can just let this command set print_benchmark_stat = true, run tests, and then read all log files and collect the data (the number of statements).

So currently print_benchmark_stat only shows the number of statements? If so, that's not enough. For example:

$1 = const [8]
$2 = pow $0, $1

versus:

$1 = mul $0, $0
$2 = mul $1, $1
$3 = mul $2, $2

Although the second has more statements, it's actually more efficient than the first.

Also consider vector division:

v.x /= k;
v.y /= k;
v.z /= k;

versus:

tmp = 1 / k;
v.x *= tmp;
v.y *= tmp;
v.z *= tmp;

Not to mention loop unrolling.

So what we want is time performance (TP) rather than size performance (SP). I think it's good to add SP, but TP is more important for regression tests, since sometimes we want to sacrifice SP for TP, as in #944.
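To illustrate measuring time instead of counting statements, here is a small sketch that times the two ways of computing x^8 from the example above, using Python's timeit. This is plain CPython rather than a Taichi kernel, so interpreter overhead dominates and the numbers say nothing about the generated code. The function names are made up for illustration.

```python
import timeit


def pow8_builtin(x):
    # One "statement": a general-purpose power call.
    return pow(x, 8.0)


def pow8_squaring(x):
    # Three multiplications: x^2, then x^4, then x^8.
    x2 = x * x
    x4 = x2 * x2
    return x4 * x4


def best_time(func, arg, repeat=5, number=100_000):
    """Best-of-`repeat` wall time in seconds for `number` calls."""
    return min(timeit.repeat(lambda: func(arg), repeat=repeat, number=number))


x = 1.5
# Both variants must agree before their timings are worth comparing.
assert abs(pow8_builtin(x) - pow8_squaring(x)) < 1e-9
print(f"pow:      {best_time(pow8_builtin, x):.4f} s")
print(f"squaring: {best_time(pow8_squaring, x):.4f} s")
```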

archibate on 12 May 2020

I think it's good to add SP, but TP is more important for regression tests, since sometimes we want to sacrifice SP for TP, as in #944.

Yes, but we may need to solve this issue first, before adding regression tests for time performance:

We need a machine with consistent hardware; otherwise the performance comparisons won't make much sense. I guess Travis will just randomly pick an available VM slot, whose hardware capability fluctuates.
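One partial mitigation, not proposed in this thread and offered only as a sketch: measure the baseline and the candidate in the same job on the same machine, take the median of several runs, and only flag differences that exceed the observed spread. This does not make timings from different VMs comparable. All names and the 3-sigma default are illustrative assumptions.

```python
import statistics
import time


def time_once(func):
    """Wall time of a single call, in milliseconds."""
    start = time.perf_counter()
    func()
    return (time.perf_counter() - start) * 1000.0


def measure(func, warmup=2, repeat=9):
    """Median and spread (both in ms) over several runs, after warm-up calls."""
    for _ in range(warmup):
        func()
    samples = [time_once(func) for _ in range(repeat)]
    return statistics.median(samples), statistics.pstdev(samples)


def noticeably_slower(baseline, candidate, sigmas=3.0):
    """True if the candidate median exceeds the baseline by more than the noise."""
    base_median, base_spread = baseline
    cand_median, cand_spread = candidate
    return cand_median - base_median > sigmas * max(base_spread, cand_spread)


# Stand-in workloads; a real job would run the same benchmark on two commits.
baseline = measure(lambda: sum(range(10_000)))
candidate = measure(lambda: sum(range(10_000)))
print(baseline, candidate, noticeably_slower(baseline, candidate))
```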

xumingkuan on 12 May 2020