If you look at the trans-stats for some of our modules, some "big functions" really stand out:
rustc:
16838 insns, 67 ms, middle::trans::foreign::trans_intrinsic
15988 insns, 145 ms, middle::typeck::check::check_expr_with_unifier
11759 insns, 87 ms, middle::privacy::check_crate
libextra:
5244 insns, 31 ms, terminfo::parm::expand
5073 insns, 21 ms, time::do_strptime::parse_type
4068 insns, 23 ms, time::do_strftime::parse_type
3610 insns, 73 ms, terminfo::parser::compiled::parse
2169 insns, 89 ms, net_tcp::listen_common
1952 insns, 28 ms, getopts::getopts
1937 insns, 193 ms, net_tcp::connect
If you look at the code generating them, they don't immediately look like they should be taking 10x as long as other functions; but they do all seem to use match somewhat heavily. This suggests to me that we're still compiling matches in some way that causes a combinatorial explosion of basic blocks, or something.
Investigate!
/cc @toddaaro
/cc @msullivan
I don't think it was actually toddaaro I was talking to about this...
I have whined a lot about match implementations that suffer from combinatorial explosion, so maybe that is why you thought of me.
This Scala issue is probably unrelated, but maybe reading about the experience will prove useful to a Rust developer someday.
Triage: did anyone determine if match was still the issue here? I'd imagine codegen has gotten better since 2013...
@steveklabnik I don't know of any changes to match codegen that would have alleviated this, re-running the measurements...
Yes, match is still generating more code than anything else, by a significant margin.
@cmr if you have a moment, can you explain how you checked this out? Future triage-rs will be thankful :smile:
make RUSTFLAGS="-Z count-llvm-insns" | tee output, followed by sort -n < output > output2, then look at the bottom of output2 (ctrl-f for "trans_match")
trans-ing a match appears to be quadratic in the number of arms (at least for integers and strings):
$ for N in 500 1000 2000 4000; do
echo $N
python -c "print('fn main() { match 0 { %s _ => {} } }' % ''.join(' %s => {}' % n for n in range(0, $N)))" |
rustc - -Z time-passes |
grep translation
done
500
time: 0.030; rss: 65MB translation
1000
time: 0.107; rss: 67MB translation
2000
time: 0.427; rss: 70MB translation
4000
time: 1.620; rss: 75MB translation
(The python invocation prints code like fn main() { match 0 { 0 => {} 1 => {} 2 => {} 3 => {} 4 => {} _ => {} } } with $N + 1 arms.)
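As a quick back-of-the-envelope check (mine, not part of the original report): if trans really is quadratic in the number of arms, the translation times above should roughly quadruple each time N doubles.

```python
# Reported translation times from the measurements above (seconds).
times = {500: 0.030, 1000: 0.107, 2000: 0.427, 4000: 1.620}
ns = sorted(times)

# For quadratic scaling, doubling N should roughly quadruple the time.
ratios = [times[b] / times[a] for a, b in zip(ns, ns[1:])]
print([round(r, 2) for r in ratios])  # each ratio is close to 4.0
assert all(3.0 < r < 5.0 for r in ratios)
```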
Results of @huonw's script above with rustc 1.11.0-nightly (34505e222 2016-06-08):
500
time: 0.088; rss: 97MB translation
1000
time: 0.308; rss: 100MB translation
2000
time: 1.247; rss: 101MB translation
4000
time: 4.723; rss: 106MB translation
With -Z orbit:
500
time: 0.009; rss: 95MB translation
1000
time: 0.015; rss: 97MB translation
2000
time: 0.017; rss: 101MB translation
4000
time: 0.030; rss: 104MB translation
MIR trans looks to be a lot better at this case.
@brson
MIR trans's algorithm is indeed very careful to generate a linear number of basic blocks.
Match checking currently takes the most time, which is rather unsurprising since I think it's O(n^2) code. I'm going to leave this open for the time being; realistically, matches with this many arms are excessive, but I plan to take a look at the code soon and see whether it can be optimized a little.
$ for N in {10..100}; do echo -n "$((2**$N)): "; python -c "print('fn main() { match 0 { %s _ => {} } }' % ''.join(' %s => {}' % n for n in xrange(0, 2 ** $N)))" | rustc - -Z time-passes | rg --color never 'match'; done
1024: time: 0.021; rss: 93MB match checking
2048: time: 0.083; rss: 96MB match checking
4096: time: 0.335; rss: 99MB match checking
8192: time: 1.340; rss: 107MB match checking
16384: time: 5.360; rss: 117MB match checking
32768: time: 21.446; rss: 142MB match checking
65536: time: 88.756; rss: 193MB match checking
131072: time: 380.206; rss: 300MB match checking
Actually, the current match checking code seems to be wrongly/poorly implemented. The original paper had linear performance on simple sequences like this.
EDIT: I interpreted the figures wrongly:
the X-axis shows the size of the matches squared (expressed as source file size squared)
I'd like to make it clear that match checking is an NP-complete problem, and quadratic average time is not very bad for such a complex algorithm.
However, it's true that it's taking up a significant portion of compile time, and we probably want to special-case trivial (C/switch-like) matches so that they use a loglinear algorithm instead.
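For illustration, the switch-like special case suggested here could look something like this (my own Python sketch, not rustc code, assuming all arms are integer literals): sort the arm values once, then find duplicate (unreachable) arms in a single scan, for O(n log n) total instead of comparing every arm against all earlier ones.

```python
def check_int_match(arm_values, has_wildcard):
    """Toy reachability/exhaustiveness check for a match whose arms are
    all integer literals. Sorting dominates: O(n log n) overall."""
    # Sort arm indices by value; Python's sort is stable, so arms with
    # equal values keep their source order.
    order = sorted(range(len(arm_values)), key=lambda i: arm_values[i])
    unreachable = []
    for prev, cur in zip(order, order[1:]):
        if arm_values[prev] == arm_values[cur]:
            # A later arm with the same value is shadowed by the earlier one.
            unreachable.append(cur)
    # In this toy model an integer match is only exhaustive via a wildcard.
    exhaustive = has_wildcard
    return sorted(unreachable), exhaustive
```

For example, `check_int_match([0, 1, 2, 1], True)` reports arm 3 (the second `1`) as unreachable without ever doing the quadratic all-pairs comparison.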
Triage:
~$ for N in 500 1000 2000 4000; do echo $N; python -c "print('fn main() { match 0 { %s _ => {} } }' % ''.join(' %s => {}' % n for n in range(0, $N)))" | rustc +devel - -Z time-passes | grep "checking 2"; done
500
time: 0.028 misc checking 2
1000
time: 0.087 misc checking 2
2000
time: 0.312 misc checking 2
4000
time: 1.220 misc checking 2
~$ for N in 500 1000 2000 4000; do echo $N; python -c "print('fn main() { match 0 { %s _ => {} } }' % ''.join(' %s => {}' % n for n in range(0, $N)))" | rustc +devel - -Z time-passes | grep "match"; done
500
time: 0.029 rvalue promotion + match checking
1000
time: 0.087 rvalue promotion + match checking
2000
time: 0.313 rvalue promotion + match checking
4000
time: 1.233 rvalue promotion + match checking
triage: P-medium.
This is not believed to be a high priority work-item; in practice these match checking times, even for large matches, are minuscule compared to other passes in the compiler.
Still, it might be a fun thing for a volunteer to attack.
However, it's true that it's taking up a significant portion of compile time, and we probably want to special-case trivial (C/switch-like) matches so that they use a loglinear algorithm instead.
(Even this simpler step would probably be both easy and beneficial!)
It's 2020 and
$ for K in {8..16}; do
N=$((1 << K))
echo -n "$((N)): "
python -c "print('fn main() { match 0 { %s _ => {} } }' % ''.join(' %s => {}' % n for n in xrange(0, $N)))" |
rustc - -Z time-passes |
rg --color never 'match'
done
256: time: 0.001; rss: 88MB match_checking
512: time: 0.004; rss: 89MB match_checking
1024: time: 0.013; rss: 89MB match_checking # huh, rss is almost constant up to here
2048: time: 0.046; rss: 93MB match_checking
4096: time: 0.179; rss: 96MB match_checking
8192: time: 0.715; rss: 103MB match_checking
16384: time: 3.152; rss: 118MB match_checking
32768: time: 11.660; rss: 148MB match_checking
65536: time: 55.046; rss: 206MB match_checking
The match exhaustiveness algorithm is indeed completely quadratic: for every branch, we check whether it is redundant with any of the branches above. We could hack workarounds for some special cases like what #76918 did, but those usually stop being useful for non-trivial patterns, and I'm not sure they're worth it since most matches have only a few branches. (I would love some data to confirm that; if you want to help, see https://github.com/rust-lang/rustc-perf/issues/792)
A real solution would involve a pretty drastic redesign of the algorithm. I have been thinking about that in my spare time, but as @ishitatsuyuki mentioned, it's not easy.
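For intuition, here is a toy Python model (mine, not the real rustc algorithm, which works over pattern matrices) of why that check is quadratic: with n mutually distinct arms, comparing each arm against every arm above it performs exactly n(n-1)/2 comparisons.

```python
def count_redundancy_comparisons(patterns):
    """Toy model of the quadratic usefulness pass: each arm is compared
    against every arm above it; an arm is redundant if an earlier arm is
    identical or is the wildcard "_"."""
    comparisons = 0
    redundant = []
    for i, pat in enumerate(patterns):
        for earlier in patterns[:i]:
            comparisons += 1
            if earlier == pat or earlier == "_":
                redundant.append(i)  # covered by an earlier arm
                break
    return comparisons, redundant
```

With all-distinct arms this does the full n(n-1)/2 comparisons; at 131072 arms that is about 8.6 billion, consistent with the growth of the times measured above.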