Rust: Performance regressions of compiled code over the last year

Created on 19 Jan 2018 · 15Comments · Source: rust-lang/rust

Started on an update to https://github.com/Marwes/combine after it being dormant for a while. When I ran the benchmarks to check that my changes hadn't regressed the performance I noticed that performance had regressed by ~28% (~116% with incremental compilation!) since the last time ran benchmarks (somewhere around September 2016).

I ran the benchmarks again against an old version of the library to be able to compile it with older rustc's but the regression is the same in the master branch as well.

`cargo bench --bench http` against https://github.com/Marwes/combine/tree/v2.3.2

cargo-0.18.0-nightly (a73a665 2017-02-14)

test http_requests_large ... bench:     439,961 ns/iter (+/- 30,684)
test http_requests_small ... bench:      87,508 ns/iter (+/- 5,173)

rustc 1.19.0-nightly (554c685b0 2017-06-14)

test http_requests_large ... bench:     475,989 ns/iter (+/- 10,477)
test http_requests_small ... bench:      95,175 ns/iter (+/- 23,751)

rustc 1.22.0-nightly (368122087 2017-09-06)

test http_requests_large ... bench:     494,088 ns/iter (+/- 27,462)
test http_requests_small ... bench:     102,798 ns/iter (+/- 67,446)

rustc 1.25.0-nightly (6828cf901 2018-01-06) (CARGO_INCREMENTAL=0)

test http_requests_large ... bench:     551,065 ns/iter (+/- 420,621)
test http_requests_small ... bench:     112,375 ns/iter (+/- 2,098)

rustc 1.25.0-nightly (6828cf901 2018-01-06) (CARGO_INCREMENTAL=1)

test http_requests_large ... bench:   1,001,847 ns/iter (+/- 40,639)
test http_requests_small ... bench:     188,091 ns/iter (+/- 1,958)

I'd like to bisect this further but the two tools I found for this do not appear to work in this case, is there any other tool that can be used for this?

https://github.com/kamalmarhubi/rust-bisect (Outdated)
https://github.com/Mark-Simulacrum/bisect-rust/tree/master/src (Only last 90 days)

A-libtest C-enhancement I-slow P-medium T-compiler regression-from-stable-to-stable

Source

Marwes

Most helpful comment

I've checked this from 1.13.0 to the current nightly and it is obvious that the regression is introduced in 1.19.0 and 1.20.0. The speed is recovering in more recent version (with CGU=1 and ThinLTO=off) but still haven't come back to the 1.19.0 level.

Graphs

Large result

Small result

Raw data

Version | Large | LRange | Small | SRange
-- | -- | -- | -- | --
25-δ | 298793 | 26302 | 61109 | 8759
24-β.6 | 320784 | 33707 | 63668 | 6996
23 | 312017 | 38840 | 63081 | 6354
22 | 330144 | 30876 | 65170 | 6320
21 | 328676 | 41291 | 65799 | 8213
20 | 336373 | 37812 | 65845 | 5612
19 | 287879 | 24456 | 56188 | 9432
18 | 242170 | 25084 | 49207 | 5695
17 | 250096 | 22134 | 50041 | 5387
16 | #N/A | #N/A | #N/A | #N/A
15 | 266197 | 29355 | 52775 | 6734
14 | 264196 | 24070 | 52792 | 5694
13 | 263274 | 26703 | 51758 | 5650

kennytm on 22 Jan 2018

❤6

All 15 comments

cc @michaelwoerister @arielb1 @nagisa

eddyb on 19 Jan 2018

It is expected that incrementally compiled code performs worse at runtime. That's the trade-off you make. I would not use it for benchmarks.

The recent addition of ThinLTO is likely to also affect runtime performance of specific benchmarks.
What happens if you do RUSTFLAGS=-Ccodegen-units=1 CARGO_INCREMENTAL=0 cargo bench?

cc @alexcrichton

michaelwoerister on 19 Jan 2018

(These are on a different machine so numbers won't be comparable to the ones above)

env RUSTFLAGS=-Ccodegen-units=1 CARGO_INCREMENTAL=0

rustc 1.17.0-nightly (62eb6056d 2017-02-15)

test http_requests_large ... bench:     247,219 ns/iter (+/- 68,700)
test http_requests_small ... bench:      51,941 ns/iter (+/- 9,127)

rustc 1.25.0-nightly (0f9c78475 2018-01-17)

test http_requests_large ... bench:     325,950 ns/iter (+/- 48,330)
test http_requests_small ... bench:      62,812 ns/iter (+/- 12,001)

Without specifying Ccodegen-units=1

rustc 1.25.0-nightly (0f9c78475 2018-01-17)

test http_requests_large ... bench:     420,282 ns/iter (+/- 75,396)
test http_requests_small ... bench:      85,946 ns/iter (+/- 13,067)

rustc 1.25.0-nightly (0f9c78475 2018-01-17) (codegen-units=1 -Z thintlto=no lto=true)

test http_requests_large ... bench:     373,739 ns/iter (+/- 71,143)
test http_requests_small ... bench:      78,019 ns/iter (+/- 14,368)

So a single codegen unit seems to help a bit but it is still slower than it could be. I thought about codegen units briefly when testing but never tested it as the cargo docs seem to imply that they are already set to 1 :/ https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections .

It is expected that incrementally compiled code performs worse at runtime. That's the trade-off you make. I would not use it for benchmarks.

Yeah that was just my bad, forgot I had it enabled when running the benchmarks... (would be nice if it wasn't a 2x slowdown still but I get that might not be possible to improve much)

Marwes on 19 Jan 2018

The recent addition of ThinLTO is likely to also affect runtime performance of specific benchmarks.

Yeah I figured ThinLTO might be one of the culprits but I don't believe there is a way to use full LTO atm (https://github.com/rust-lang/rust/pull/47521). Depending on how much ThinLTO affects the runtimes though that might still be a problem however.

Marwes on 19 Jan 2018

247 vs 325 microseconds is indeed quite the slowdown. I would be interesting to see where all the additional time goes. Since this is just one data point, it's hard to even guess. Running the code in a profiler would be most insightful, I think.

Also, @eddyb, hasn't there been some kind of regression in the benchmarking framework that could also play into this?

So a single codegen unit seems to help a bit but it is still slower than it could be. I thought about codegen units briefly when testing but never tested it as the cargo docs seem to imply that they are already set to 1 :/ https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections .

Yes, that is out of date since a few weeks ago, I think. cc @alexcrichton & @rust-lang/cargo

Yeah that was just my bad, forgot I had it enabled when running the benchmarks... (would be nice if it wasn't a 2x slowdown still but I get that might not be possible to improve much)

At some point we'll probably have incremental ThinLTO, at which point runtime performance should be quite a bit better. However, I'm not sure if that will ever be the default since it will cause longer compile times than regular incremental compilation. Incremental compilation strives to give you code that is "fast enough" and otherwise clearly prioritizes short compile times.

michaelwoerister on 22 Jan 2018

Since this is just one data point, it's hard to even guess. Running the code in a profiler would be most insightful, I think.

Tried gleaning something from profiler output I didn't spot anything as the runtime is spread out in a lot of different places and inlining tears through the code completely. I may give it another shot but I don't expect much.

Also, @eddyb, hasn't there been some kind of regression in the benchmarking framework that could also play into this?

It is unlikely to do with the benchmark framework. I am comparing these numbers to https://github.com/Geal/nom_benchmarks/blob/master/http/nom-http/src/main.rs which also uses it and that benchmark hasn't regressed. (I actually use https://github.com/bluss/bencher/ but switching to the builtin one shows no change in runtime).

Added another result to https://github.com/rust-lang/rust/issues/47561#issuecomment-358991665 . Forcing full lto helps a bit but is not the sole problem.

rustc 1.25.0-nightly (0f9c78475 2018-01-17) (codegen-units=1 -Z thintlto=no lto=true)

test http_requests_large ... bench:     373,739 ns/iter (+/- 71,143)
test http_requests_small ... bench:      78,019 ns/iter (+/- 14,368)

Marwes on 22 Jan 2018

Graphs

Large result

Small result

Raw data

kennytm on 22 Jan 2018

❤6

@kennytm

Cool. Could you try to bisect to the specific commit? I might then take a go at investigating it.

arielb1 on 22 Jan 2018

@arielb1 Probably not to a specific commit, but I'm now doing a date-based bisection which should point to a narrow-enough commit range.

kennytm on 22 Jan 2018

❤1

From the timing of nightlies between 2017-04-24 to 2017-07-17:

Regression at nightly-2017-04-26 (+43µs): 63c77214c1d38789652b465694b254205d1886e0...2b4c911581099e247a68b3a8adc782d778f5190b (likely #40123 (LLVM 4.0 Upgrade))
Regression at nightly-2017-05-14 (+26µs): e17a1227ac779a181b2839998e26a7e4e434c2a0...826d8f3850b37a23481dfcf4a899b5dfc82d22e3 (likely #41920 (remove the #[inline] attribute from drop_in_place))
Regression at nightly-2017-06-21 (+36µs): 04145943a25c3b8c7e7d7fe8c2efb04f259c25fb...445077963c55297ef1e196a3525723090fe80b22 (???)
Regression at nightly-2017-06-23 (+22µs): 622e7e6487b6fb7fdbb901720cd4214f9179ed67...ab5bec25530aac43dfd64384b405c909b6e405e3 (likely #42771 (mark calls in the unwind path as !noinline))
Improvement at nightly-2017-07-07 (-48µs): 3610a70ce488953c5b0379fece70f2baad30a825...696412de7e4e119f8536686c643621115b90c775 (likely #42899 (Switch to rust-lang-nursery/compiler-builtins) or #42816 (rustc: Implement stack probes for x86))

Graphs

Large tests
Small tests

Raw data

Nightly | Large | LRange | Small | SRange | Commit
-- | -- | -- | -- | -- | --
2017-04-24 | 237376 | 29677 | 47239 | 6068 |
2017-04-25 | 241729 | 22569 | 48152 | 4449 | 63c77214c1d38789652b465694b254205d1886e0
2017-04-26 | 283963 | 35674 | 57046 | 6098 | 2b4c911581099e247a68b3a8adc782d778f5190b
2017-04-27 | 283284 | 31492 | 56515 | 7331 |
2017-04-29 | 280613 | 38855 | 57309 | 5711 |
2017-05-05 | 288286 | 22754 | 57488 | 5544 |
2017-05-06 | 280634 | 44019 | 56483 | 6311 | f4209651ec4d4455dab4fc3f3a3456a897d9da7f
2017-05-09 | 271223 | 36183 | 53666 | 7654 | f1140a33176a5fb2e91e26ea3ae42a834dd9bfdf
2017-05-10 | 271546 | 29414 | 54456 | 4247 |
2017-05-13 | 278337 | 29537 | 55530 | 6867 | e17a1227ac779a181b2839998e26a7e4e434c2a0
2017-05-14 | 304265 | 44141 | 60732 | 8017 | 826d8f3850b37a23481dfcf4a899b5dfc82d22e3
2017-05-16 | 300772 | 34152 | 60028 | 7134 |
2017-06-07 | 302343 | 34978 | 61109 | 7028 |
2017-06-17 | 295258 | 29874 | 58697 | 5658 |
2017-06-20 | 281801 | 24845 | 58202 | 5233 | 04145943a25c3b8c7e7d7fe8c2efb04f259c25fb
2017-06-21 | 325467 | 42166 | 63920 | 7960 | 445077963c55297ef1e196a3525723090fe80b22
2017-06-22 | 323663 | 35073 | 64315 | 6130 | 622e7e6487b6fb7fdbb901720cd4214f9179ed67
2017-06-23 | 343497 | 43136 | 69235 | 6408 | ab5bec25530aac43dfd64384b405c909b6e405e3
2017-06-24 | 342053 | 36216 | 67890 | 8065 | 229d0d326
2017-06-25 | 352200 | 41641 | 71437 | 8251 | c9bb93576
2017-06-27 | 352002 | 44692 | 69970 | 6756 |
2017-06-29 | 344796 | 30291 | 69683 | 7940 | 69c65d29615c391c958ebf75dd65258ec23e175c
2017-06-30 | 365836 | 43236 | 72866 | 7919 | 3bfc18a9619a5151ff4f11618db9cd882996ba6f
2017-07-01 | 360115 | 42362 | 73310 | 7862 |
2017-07-02 | 357479 | 35203 | 72631 | 8314 |
2017-07-03 | 368134 | 47205 | 73680 | 7208 |
2017-07-04 | 377648 | 45819 | 75442 | 8086 |
2017-07-05 | 383190 | 46242 | 75982 | 8814 |
2017-07-06 | 382141 | 45204 | 74689 | 8027 | 3610a70ce488953c5b0379fece70f2baad30a825
2017-07-07 | 329873 | 35233 | 65957 | 6922 | 696412de7e4e119f8536686c643621115b90c775
2017-07-17 | 329498 | 39955 | 66395 | 5884 |

kennytm on 23 Jan 2018

❤4

Regression at nightly-2017-04-26 (+43µs): 63c7721...2b4c911 (likely #40123 (LLVM 4.0 Upgrade))

Not much we can do about that without a fairly deep investigation. We're about to update to LLVM 5.0 or 6.0, so let's hope that fixes the regression

Regression at nightly-2017-06-21 (+36µs): 0414594...4450779 (???)
Improvement at nightly-2017-07-07 (-48µs): 3610a70...696412d (likely #42899 (Switch to rust-lang-nursery/compiler-builtins) or #42816 (rustc: Implement stack probes for x86))

This looks like #42313 and #42727 - the first try at allocator integration had some perf issues - missing inline attribute, which were fixed by #42727.

Regression at nightly-2017-05-14 (+26µs): e17a122...826d8f3 (likely #41920 (remove the #[inline] attribute from drop_in_place))
Regression at nightly-2017-06-23 (+22µs): 622e7e6...ab5bec2 (likely #42771 (mark calls in the unwind path as !noinline))

This sort of makes sense, but is weird. I might look into it

arielb1 on 23 Jan 2018

👍1

@arielb1 A regression from #41920 seems consistent with my observations. I have another variant of the large benchmark above where the error type does not need drop (the normal error contains a Vec). The version that contains that benchmarks requires a recent version of rust so I am not able to compare the results but it does appear to have regressed less than one that needs drop.

Marwes on 7 Feb 2018

triage: P-medium

nikomatsakis on 22 Feb 2018

Triage; It's been a while. @Marwes lots has changed since you've filed this issue; do you still track this stuff on combine?

steveklabnik on 22 Sep 2019

Reran on the old branch so I could use 1.19

 name                 1.19 ns/iter  1.39 ns/iter  diff ns/iter  diff %  speedup
 http_requests_large  525,200       629,625            104,425  19.88%   x 0.83
 http_requests_small  104,955       123,803             18,848  17.96%   x 0.85

Regression still seems to exist.

Marwes on 24 Sep 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Cannot compile compiler docs

mcarton · 3Comments

Cannot impl Drop for a generic type with a Higher-Rank Trait Bound

dwrensha · 3Comments

core::marker::Send and raw pointers

drewcrawford · 3Comments

Internal compiler error: cannot relate bound region (likely caused by `conservative_impl_trait`)

jmegaffin · 3Comments

Inherited trait bounds are not completely checked

behnam · 3Comments

Rust: Performance regressions of compiled code over the last year

cargo bench --bench http against https://github.com/Marwes/combine/tree/v2.3.2

cargo-0.18.0-nightly (a73a665 2017-02-14)

rustc 1.19.0-nightly (554c685b0 2017-06-14)

rustc 1.22.0-nightly (368122087 2017-09-06)

rustc 1.25.0-nightly (6828cf901 2018-01-06) (CARGO_INCREMENTAL=0)

rustc 1.25.0-nightly (6828cf901 2018-01-06) (CARGO_INCREMENTAL=1)

Most helpful comment

All 15 comments

env RUSTFLAGS=-Ccodegen-units=1 CARGO_INCREMENTAL=0

rustc 1.17.0-nightly (62eb6056d 2017-02-15)

rustc 1.25.0-nightly (0f9c78475 2018-01-17)

rustc 1.25.0-nightly (0f9c78475 2018-01-17)

rustc 1.25.0-nightly (0f9c78475 2018-01-17) (codegen-units=1 -Z thintlto=no lto=true)

rustc 1.25.0-nightly (0f9c78475 2018-01-17) (codegen-units=1 -Z thintlto=no lto=true)

Related issues

`cargo bench --bench http` against https://github.com/Marwes/combine/tree/v2.3.2