Rust: NLL compile-time performance regression

Created on 5 Feb 2019  Â·  21Comments  Â·  Source: rust-lang/rust

NLL seems to cause a severe regression in compile time for code generated by the gll crate.

  1. Check out https://github.com/rust-lang-nursery/wg-grammar.git (current hash is e1d072a158a5d1f3d811bbd48bba3608bff1c99f).
  2. cargo build. For me, takes about 1 minute.
  3. Edit src/lib.rs and add #![feature(nll)]
  4. cargo build. For me, takes 77 minutes.

gll generates a very large parse.rs file, about 46,000 lines long. (Located in target/debug/build/rust-grammar-*/out/parse.rs)

From what little debugging I did, I see it spending most of its time in a function called RegionInferenceContext::find_outlives_blame_span.

Tested: 1.32.0 (stable) and rustc 1.34.0-nightly (f6fac4225 2019-02-03).

Please let me know if there's any other information I can provide.

cc @qmx

A-NLL NLL-performant P-medium T-compiler

Most helpful comment

58210 is really simple but reduces instruction counts by 22% and wall-time by 8% for compilation of my cut-down version of foo.lyg.

All 21 comments

Cc @nnethercote @pnkfelix @nikomatsakis

Thank you for the detailed steps to reproduce. I had to use cargo +nightly build and then I can confirm the results: 55 seconds for a normal build, 61 minutes for an NLL build.

perf top also agrees that find_outlives_blame_span (or something inlined into it) is the problem.

The generated parse.rs file contains one function (parse) that is 16,000 lines. I wonder if that's the root of the problem.

@ehuss: Using normal profiling tools on a test case that is slow long-running will be painful. Do you have any suggestions for generating test case that is roughly 10--100x smaller? I guess I could try just removing things from grammar/*.lyg but perhaps you can give me more targeted ideas?

I did some ad hoc profiling on a partial run and found that find_constraint_paths_between_regions is the critical function. It has an outer loop that is hot and an inner loop that is even hotter.

Here is a partial histogram showing the distribution of loop counts for the inner loop:

260467026 counts:
(  1) 78981836 (30.3%, 30.3%): n2 = 2
(  2) 78236479 (30.0%, 60.4%): n2 = 1
(  3) 24332897 ( 9.3%, 69.7%): n2 = 4
(  4) 22356356 ( 8.6%, 78.3%): n2 = 3
(  5) 13480529 ( 5.2%, 83.5%): n2 = 5
(  6) 10596430 ( 4.1%, 87.5%): n2 = 8
(  7)  8426562 ( 3.2%, 90.8%): n2 = 6
(  8)  5604005 ( 2.2%, 92.9%): n2 = 19
(  9)  5547737 ( 2.1%, 95.0%): n2 = 9
( 10)  2602505 ( 1.0%, 96.0%): n2 = 12

Here is the weighted version of the same histogram:

1139707263 counts:
(  1) 157963672 (13.9%, 13.9%): n2 = 2
(  2) 106476095 ( 9.3%, 23.2%): n2 = 19
(  3) 97331588 ( 8.5%, 31.7%): n2 = 4
(  4) 84771440 ( 7.4%, 39.2%): n2 = 8
(  5) 78844077 ( 6.9%, 46.1%): n2 = 37
(  6) 78236479 ( 6.9%, 53.0%): n2 = 1
(  7) 67402645 ( 5.9%, 58.9%): n2 = 5
(  8) 67069068 ( 5.9%, 64.8%): n2 = 3
(  9) 50559372 ( 4.4%, 69.2%): n2 = 6
( 10) 49929633 ( 4.4%, 73.6%): n2 = 9

The ratios of the "counts" line in each one is 4.4, i.e. the average inner loop count is 4.4, and although it sometimes (not shown in the above data) gets above 300, generally it loops a handful of times.

I also found that the Trace::NotVisited = context[sub_region] condition within the inner loop is only true about 20% of the time.

In contrast, here is a partial histogram showing the distribution of loop counts for the outer loop:

30929 counts:
(  1)      287 ( 0.9%,  0.9%): n1a = 7
(  2)      273 ( 0.9%,  1.8%): n1a = 2
(  3)      239 ( 0.8%,  2.6%): n1a = 5
(  4)      210 ( 0.7%,  3.3%): n1a = 14
(  5)      209 ( 0.7%,  3.9%): n1a = 9
(  6)      198 ( 0.6%,  4.6%): n1a = 8
(  7)      197 ( 0.6%,  5.2%): n1a = 10
(  8)      189 ( 0.6%,  5.8%): n1a = 12
(  9)      180 ( 0.6%,  6.4%): n1a = 11
( 10)      171 ( 0.6%,  7.0%): n1a = 6
( 11)      158 ( 0.5%,  7.5%): n1a = 13
( 12)      127 ( 0.4%,  7.9%): n1a = 660
( 13)      125 ( 0.4%,  8.3%): n1a = 642
( 14)      124 ( 0.4%,  8.7%): n1a = 624
( 15)      124 ( 0.4%,  9.1%): n1a = 152
( 16)      120 ( 0.4%,  9.5%): n1a = 46
( 17)      118 ( 0.4%,  9.9%): n1a = 141
( 18)      113 ( 0.4%, 10.2%): n1a = 674
( 19)      110 ( 0.4%, 10.6%): n1a = 48
( 20)      108 ( 0.3%, 10.9%): n1a = 44

Here is the weighted version of the same histogram:

260475188 counts:
(  1)   567226 ( 0.2%,  0.2%): n1a = 51566
(  2)   567028 ( 0.2%,  0.4%): n1a = 51548
(  3)   566830 ( 0.2%,  0.7%): n1a = 51530
(  4)   566632 ( 0.2%,  0.9%): n1a = 51512
(  5)   566434 ( 0.2%,  1.1%): n1a = 51494
(  6)   566236 ( 0.2%,  1.3%): n1a = 51476
(  7)   565598 ( 0.2%,  1.5%): n1a = 51418
(  8)   565400 ( 0.2%,  1.7%): n1a = 51400
(  9)   565202 ( 0.2%,  2.0%): n1a = 51382
( 10)   565004 ( 0.2%,  2.2%): n1a = 51364
( 11)   564806 ( 0.2%,  2.4%): n1a = 51346
( 12)   564608 ( 0.2%,  2.6%): n1a = 51328
( 13)   563761 ( 0.2%,  2.8%): n1a = 51251
( 14)   563563 ( 0.2%,  3.0%): n1a = 51233
( 15)   563365 ( 0.2%,  3.3%): n1a = 51215
( 16)   562188 ( 0.2%,  3.5%): n1a = 51108
( 17)   561990 ( 0.2%,  3.7%): n1a = 51090
( 18)   561792 ( 0.2%,  3.9%): n1a = 51072
( 19)   561462 ( 0.2%,  4.1%): n1a = 51042
( 20)   561264 ( 0.2%,  4.3%): n1a = 51024

The ratios of the "counts" line in each one is 8421, i.e. the average outer loop count is 8421, and it gets as high as 51566.

Another data point: the length of context in find_constraint_paths_between_regions (i.e. the number of definitions) is 63577 in the dominant case. I guess this is probably coming from the huge parse function. And then we're calling find_constraint_paths_between_regions lots of times for parse, so it's basically quadratic.

Thanks for taking a look. It's tricky to create an isolated reproduction. One approach as you mentioned is to just use a smaller grammar. Delete all the files in grammar/ and create a new one (name it foo.lyg or whatever) with these contents: foo.lyg. (You can delete more lines if that is still too slow.) For me, the non-nll build is 2s and the nll build is ~70s.

I did a Cachegrind run using a cut-down version of foo.lyg. It basically confirmed everything I found above. The cost of traversing the constraint graph is a bit higher than I expected, as these lines show:

            .           impl<'s, D: ConstraintGraphDirecton> Iterator for Edges<'s, D> {
            .               type Item = OutlivesConstraint;
            .               
            .               fn next(&mut self) -> Option<Self::Item> {
1,771,873,465 ( 2.50%)          if let Some(p) = self.pointer {
  708,321,207 ( 1.00%)              self.pointer = self.graph.next_constraints[p];
            .            
9,208,629,436 (13.00%)              Some(self.constraints[p])
  354,512,802 ( 0.50%)          } else if let Some(next_static_idx) = self.next_static_idx {
            .                       self.next_static_idx =
      983,892 ( 0.00%)                  if next_static_idx == (self.graph.first_constraints.len() - 1) {
            .                               None 
            .                           } else {
            .                               Some(next_static_idx + 1)
            .                           };

and
354,140,082 ( 0.50%) let fr_static = self.universal_regions.fr_static; 17,710,114,719 (25.00%) for constraint in self.constraint_graph . .outgoing_edges(r, &self.constraints, fr_static) . { 3,542,079,795 ( 5.00%) assert_eq!(constraint.sup, r); . let sub_region = constraint.sub; 2,125,247,877 ( 3.00%) if let Trace::NotVisited = context[sub_region] { 1,302,130,110 ( 1.84%) context[sub_region] = Trace::FromOutlivesConstraint(constraint); . deque.push_back(sub_region); . } . }
(It's extremely rare to see 25% or 13% of all instructions attributed to a single line of code!)

I can see one or two opportunities to improve things, but they'll only be constant-time shavings. An algorithmic change will be necessary to make things reasonable, and I don't understand this code well enough to come up with anything myself.

58210 is really simple but reduces instruction counts by 22% and wall-time by 8% for compilation of my cut-down version of foo.lyg.

@matthewjasper and I have also been looking at this, and removing the call to find_outlives_blame_span here shows it's indeed too costly of a function to be called thousands of time (but _that_ should IIUC only happen rarely: when checking huge closures like these here; I've seen one in the logs with 70860 region definitions).

For example, stubbing the call out makes the NLL overhead closer to 10% here (but of course simply removing the call worsens the diagnostics).

Matthew mentioned they'll be looking at this very soon.

@lqd could find_outlives_blame_span be called optionally if we could verify ahead of time what the search space is? That way for the common case we still get the nicer error messages with a reasonable level of slowness, while keeping the running time constrained in the worst cases.

@estebank seems possible to have such a threshold if need be yeah. But Matthew (who, unlike me, has worked on this part of the code before — and which Niko wrote IIRC) also mentioned on Zulip there are other ways to get this diagnostics information.

(Oh and in the little experiment I mentioned earlier, the "worse diagnostics" weren't terrible or anything)

I've opened #58347 which gets the performance to a much better level (less than 50% slower for a full clean debug build). Once that has landed we probably want to add the test to perf.rlo and profile what's causing the remaining slow down.

With #58347 merged, this should be much better now. Could @Mark-Simulacrum add this to perf.rlo? And does anyone want to investigate the remaining perf gap?

I probably won't have a chance for a few days but it should be fairly straightforward to do so - copy files into collector/benchmarks/grammar and file a PR.

The input size might need tweaking so that the runtime is reasonable.

NLL triage. Seems to me like we still want to investigate the remaining performance gap here. Tagging as P-high.

Also nominating for discussion at NLL meeting (mostly to see if I can find someone to continue the investigation of the remaining performance gap).

assigning to @csmoe after they pinged me on zulip

un-nominating based on assumption that @csmoe is going to continue investigating.

I did some investigation on this, it appears that we are still spending too much time on the closure bounds. Lazily determining the best constraint seems like the best way forwards here.

Downgrading from P-high to P-medium; the bulk of the truly awful performance regression has been addressed.

Some issues may remain, but its not clear if they are worth tracking in this specific issue. So I may close this in the future.

Reassigning from @csmoe to self so that I keep track of its status.

After further discussion with @Mark-Simulacrum and @nnethercote , closing as essentially fixed (at least to the point of not being worth tracking in this issue).

Was this page helpful?
0 / 5 - 0 ratings