Rust: html5ever in the rustc-perf repository is memory-intensive

Created on 3 Jul 2018 · 33 comments · Source: rust-lang/rust

I see OOMs locally on a machine with 16 GB of memory.

Source code: https://github.com/rust-lang-nursery/rustc-perf/tree/master/collector/benchmarks/html5ever

Labels: A-NLL, NLL-performant, T-compiler

Most helpful comment

@lqd just pointed me at a crate (unic-ucd-name-0.7.0) with high memory usage under NLL. My HybridBitSet change reduces its max-rss by 96%, from 29.1GB to 1.2GB! (Compared to non-NLL, that improves the increase in max-rss from 42.5x to 1.7x.)

All 33 comments

I did a run with Massif:
[Massif graph]
Yep, that's a 14 GiB peak, with 12 GiB of it coming from a single stack trace involving visit_ty/visit_with. I'm not surprised it OOM'd on a 16 GiB machine.

Massif doesn't get great stack traces due to inlining, but DHAT does a better job by using debuginfo. Here's the important info:

-------------------- 2 of 500 --------------------
max-live:    12,884,901,888 in 1 blocks
tot-alloc:   12,884,901,888 in 1 blocks (avg size 12884901888.00)
deaths:      1, at avg age 219,437,305,943 (40.34% of prog lifetime) 
acc-ratios:  0.60 rd, 0.60 wr  (7,768,302,580 b-read, 7,768,302,580 b-written)
   at 0x4C2DECF: malloc (in /usr/lib/valgrind/vgpreload_exp-dhat-amd64-linux.so)
   by 0x7A673EC: alloc (alloc.rs:62)
   by 0x7A673EC: alloc (alloc.rs:123)
   by 0x7A673EC: reserve_internal<(&rustc::ty::sty::RegionKind, rustc::mir::Location),alloc::alloc::Global> (raw_vec.rs:679)
   by 0x7A673EC: reserve<(&rustc::ty::sty::RegionKind, rustc::mir::Location),alloc::alloc::Global> (raw_vec.rs:502)
   by 0x7A673EC: reserve<(&rustc::ty::sty::RegionKind, rustc::mir::Location)> (vec.rs:464)
   by 0x7A673EC: push<(&rustc::ty::sty::RegionKind, rustc::mir::Location)> (vec.rs:1060)
   by 0x7A673EC: {{closure}}<&rustc::ty::TyS> (liveness.rs:171)
   by 0x7A673EC: visit_region<closure> (fold.rs:305)
   by 0x7A673EC: rustc::ty::structural_impls::<impl rustc::ty::fold::TypeFoldable<'tcx> for &'tcx rustc::ty::sty::RegionKind>::visit_with (structural_impls.rs:956)
   by 0x7A68568: super_visit_with<rustc::ty::fold::{{impl}}::for_each_free_region::RegionVisitor<closure>> (structural_impls.rs:888)
   by 0x7A68568: rustc::ty::fold::TypeVisitor::visit_ty (fold.rs:188)
   by 0x7A69AD0: visit_with<rustc::ty::fold::{{impl}}::for_each_free_region::RegionVisitor<closure>> (structural_impls.rs:903)
   by 0x7A69AD0: {{closure}}<rustc::ty::fold::{{impl}}::for_each_free_region::RegionVisitor<closure>> (structural_impls.rs:762)
   by 0x7A69AD0: {{closure}}<core::slice::Iter<&rustc::ty::TyS>,closure> (iterator.rs:1706)
   by 0x7A69AD0: {{closure}}<core::slice::Iter<&rustc::ty::TyS>,closure,core::iter::LoopState<(), ()>> (iterator.rs:1536)
   by 0x7A69AD0: try_fold<&rustc::ty::TyS,(),closure,core::iter::LoopState<(), ()>> (mod.rs:2431)
   by 0x7A69AD0: try_for_each<core::slice::Iter<&rustc::ty::TyS>,closure,core::iter::LoopState<(), ()>> (iterator.rs:1536)
   by 0x7A69AD0: any<core::slice::Iter<&rustc::ty::TyS>,closure> (iterator.rs:1705)
   by 0x7A69AD0: super_visit_with<rustc::ty::fold::{{impl}}::for_each_free_region::RegionVisitor<closure>> (structural_impls.rs:762)
   by 0x7A69AD0: rustc::ty::fold::TypeFoldable::visit_with (fold.rs:63)

Holy cow! That's a single 12 GiB allocation happening within push_type_live_constraint:
https://github.com/rust-lang/rust/blob/master/src/librustc_mir/borrow_check/nll/type_check/liveness.rs#L170-L172
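(Back-of-envelope: assuming each (&RegionKind, Location) entry is 24 bytes -- an 8-byte reference plus a 16-byte Location -- that 12,884,901,888-byte block is exactly a Vec with capacity 2^29 = 536,870,912 entries, i.e. the doubling step the vector hits once it grows past 2^28 = 268 million entries; the Callgrind data below shows the finished liveness_set holding ~388.5 million entries.)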

(Once again, add_liveness_constraints and the code beneath it are causing difficulties.)

Because liveness_set is so enormous, add_region_liveness_constraints_from_type_check() is slow. From Callgrind:

            .          for (region, location) in liveness_set {
            .              debug!("generate: {:#?} is live at {:#?}", region, location);
  777,018,978              let region_vid = regioncx.to_region_vid(region);
1,942,548,420              regioncx.add_live_point(region_vid, *location);
59,657,319,186  => /home/njn/moz/rust0/src/librustc_mir/borrow_check/nll/region_infer/mod.rs:rustc_mir::borrow_check::nll::region_infer::RegionInferenceContext::add_live_point (388507821x)
            .          }

add_live_point() then does some SparseBitMatrix insertions.

There are also some other causes of slowness for html5ever.

First, precompute_borrows_out_of_scope is super hot. Callgrind output:

            .      mir: &'a Mir<'tcx>,
            .      regioncx: &Rc<RegionInferenceContext<'tcx>>, 
            .      borrows_out_of_scope_at_location: &mut FxHashMap<Location, Vec<BorrowIndex>>,
            .      borrow_index: BorrowIndex,
            .      borrow_region: RegionVid,
            .      location: Location,
            .  ) {     
            .      // Keep track of locations we still have to check and locations that we have already checked.
       46,668      let mut stack = vec![ location ];
            .      let mut visited = FxHashSet();
       62,224      visited.insert(location);
    4,413,266  => /home/njn/moz/rust0/src/libstd/collections/hash/set.rs:<std::collections::hash::set::HashSet<T, S>>::insert (15431x)
            .                  
            .      debug!(     
            .          "borrow {:?} has region {:?} with value {:?}",
            .          borrow_index,
            .          borrow_region,
            .          regioncx.region_value_str(borrow_region),
            .      );      
            .      debug!("borrow {:?} starts at {:?}", borrow_index, location);
  971,769,410      while let Some(location) = stack.pop() {
            .          // If region does not contain a point at the location, then add to list and skip
            .          // successor locations.
  971,769,410          if !regioncx.region_contains_point(borrow_region, location) {
            .              debug!("borrow {:?} gets killed at {:?}", borrow_index, location);
       99,190              borrows_out_of_scope_at_location
    1,290,825  => /home/njn/moz/rust0/src/libstd/collections/hash/map.rs:<std::collections::hash::map::Entry<'a, K, V>>::or_insert (13827x)
    3,298,140  => /home/njn/moz/rust0/src/libstd/collections/hash/map.rs:<std::collections::hash::map::HashMap<K, V, S>>::entry (13827x)
       28,340                  .entry(location)
            .                  .or_insert(vec![])
            .                  .push(borrow_index);
            .              continue;
            .          }
            .
            .          let bb_data = &mir[location.block];
            .          // If this is the last statement in the block, then add the
            .          // terminator successors next.
1,457,611,605          if location.statement_index == bb_data.statements.len() {
            .              // Add successors to locations to visit, if not visited before.
       59,584              if let Some(ref terminator) = bb_data.terminator {
       89,376                  for block in terminator.successors() {
      799,865  => /home/njn/moz/rust0/src/librustc/mir/mod.rs:rustc::mir::Terminator::successors (29527x)
      190,405                      let loc = block.start_location();
      112,488  => /home/njn/moz/rust0/src/librustc/mir/mod.rs:rustc::mir::BasicBlock::start_location (37496x)
      228,486                      if visited.insert(loc) {
    5,583,147  => /home/njn/moz/rust0/src/libstd/collections/hash/set.rs:<std::collections::hash::set::HashSet<T, S>>::insert (37496x)
            .                          stack.push(loc);
            .                      }
            .                  }
            .              }
            .          } else {
            .              // Visit next statement in block.
2,429,203,715              let loc = location.successor_within_block();
1,943,360,660  => /home/njn/moz/rust0/src/librustc/mir/mod.rs:rustc::mir::Location::successor_within_block (485840165x)
2,915,044,458              if visited.insert(loc) {
68,678,149,511  => /home/njn/moz/rust0/src/libstd/collections/hash/set.rs:<std::collections::hash::set::HashSet<T, S>>::insert (485840165x)
            .                  stack.push(loc);
            .              }
            .          }
            .      }
            .  }

It's the bit at the end that's the hottest, executing 485 million times. Because of this function, the following things are also super hot:

  • region_contains_point/RegionValues::contains
  • HashSet::insert. (Note that the visited hash table gets big: a total of 38 GB is allocated/reallocated in 109,000 allocations with an average size of 347 KB!)
  • successor_within_block

Second, unroll_place is also very hot:
            .  fn unroll_place<'tcx, R>(
            .      place: &Place<'tcx>,
            .      next: Option<&PlaceComponents<'_, 'tcx>>,
            .      op: impl FnOnce(PlaceComponentsIter<'_, 'tcx>) -> R,
            .  ) -> R {
            .      match place {
 8,892,046,598          Place::Projection(interior) => unroll_place(
42,799,560,024  => /home/njn/moz/rust0/src/librustc_mir/borrow_check/places_conflict.rs:rustc_mir::borrow_check::places_conflict::unroll_place'2 (485791408x)
   534,414,126              &interior.base,
 1,068,828,252              Some(&PlaceComponents {
            .                  component: place,
            .                  next,
            .              }),
 2,672,070,630              op,
            .          ),
            .
            .          Place::Local(_) | Place::Static(_) => {
 1,943,739,512              let list = PlaceComponents {
            .                  component: place,
            .                  next,
            .              };
 1,943,739,512              op(list.iter())
            .          }
            .      }
 6,754,274,663  }

I always wondered if that super-naive liveness_set would cause us problems =) seems clear the answer is yes.

Thanks @nnethercote for the data, super helpful.

I suspect we can remove the liveness_set vector entirely.

But it will require some refactoring. I think I would want to build this atop https://github.com/rust-lang/rust/pull/51987. Let me summarize the setup in that PR:

  • Type check generates outlives constraints ('a: 'b) as well as the set of liveness facts

    • liveness facts are presently stored as a set of (R, P) points

  • Once type check completes, we know the full set of region variables
  • We create the region inference context

    • we compute the SCC amongst region variables

    • we create a value per SCC which contains the points where any of the regions in that SCC are live

    • at present, we also create a liveness set for each individual region, which is used only in error reporting

We used to have a fairly hard requirement that we know the number of region variables before we could create the region inference context. This is because we represented the values as a huge matrix of bits. But now that we use sparse bit sets that is not true; we could grow the number of region variables fairly easily.

This implies to me that -- to start -- we could refactor so that type check generates the initial liveness results (storing in a sparse bit set per variable) directly. This would effectively replace the liveness_set vector that we have today. We would then hand that off to the region inference context.
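For concreteness, here is a minimal sketch of that shape -- a liveness table keyed by region variable instead of a flat Vec<(Region, Location)>. The names are hypothetical and a BTreeSet stands in for a real sparse bit set; this is not the rustc code:

type RegionVid = usize;   // hypothetical stand-in for rustc's RegionVid
type PointIndex = usize;  // hypothetical stand-in for a (block, statement) point

/// One liveness set per region variable, grown on demand, instead of a flat
/// Vec<(Region, Location)> that repeats the same region millions of times.
#[derive(Default)]
struct LivenessSets {
    rows: Vec<Option<std::collections::BTreeSet<PointIndex>>>,
}

impl LivenessSets {
    /// Record "region `r` is live at point `p`"; duplicate insertions are free.
    fn add(&mut self, r: RegionVid, p: PointIndex) -> bool {
        if r >= self.rows.len() {
            // The full set of region variables need not be known up front;
            // just grow the table when a new one shows up.
            self.rows.resize_with(r + 1, || None);
        }
        self.rows[r]
            .get_or_insert_with(Default::default)
            .insert(p)
    }
}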

The reason we'd have to be able to grow the number of region variables is that I think that computing liveness constraints can sometimes create fresh region variables and possibly constraints between those variables. (This is due to the "dropck" computation.)

It'd be sort of nice if that were not true, though, and it may not be true -- if it weren't, we could compute the SCC ahead of time and then never generate the per-region liveness, instead focusing on the liveness per SCC. This may affect diagnostics, which we could recover by doing more refined computations lazily, perhaps (or maybe it wouldn't affect diagnostics, I'm not sure; I'd have to think about it).

But it'd probably still be a big improvement to move to the sparse bit set instead of the vector of points.

Another option that I have been wondering about is moving from "set of points" to a more compact representation -- for example, the SEME region abstraction implemented in https://github.com/rust-lang-nursery/rustc-seme-regions may be more compact. However, presently there can be multiple disjoint liveness regions, and SEME regions can't express that, so we'd have to figure out what to do there.

More about precompute_borrows_out_of_scope: there's something quadratic going on. I added this line to the very end of the function:

eprintln!("visited {}", visited.len());

The resulting output when compiling html5ever is dominated by this sequence of length 9858:

visited 88704
visited 88703
visited 88691
visited 88683
visited 88675
visited 88667
...
visited 9899
visited 9891
visited 9883
visited 9875
visited 9867
visited 9862
visited 9861 

It definitely feels like we're traversing a long sequence from A to B, then moving forward A slightly (by 8 steps, to be precise), then traversing from A to B again, over and over.
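(Rough cross-check: 9,858 traversals whose visited counts fall roughly linearly from 88,704 down to 9,861 add up to about 9,858 × (88,704 + 9,861) / 2 ≈ 486 million visited locations, which matches the ~485.8 million successor_within_block/HashSet::insert calls in the Callgrind output above -- so this one pattern accounts for essentially all of the hot loop's work.)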

Yeah, precompute_borrows_out_of_scope is potentially quadratic -- it itself is an optimization over an earlier approach -- but I've not thought of a better way for it to work. It basically walks the scope of each borrow -- the number of borrows is sort of "O(n)" in the size of the code, and the potential extent of each borrow is similarly "O(n)".

Hmm, as an aside, I wonder if the premise of finding the borrows-in-scope via a dataflow is a bit silly, given that precompute_borrows_out_of_scope exists. That is, it seems like we are duplicating our work here with some of the dataflow propagation code. Maybe it would be worth killing that code.

(This is another place where changing to an alternative representation of regions, one that optimizes continuous stretches of things, might be a win.)

I made several attempts to speed up unroll_place -- using SmallVec, using Vec, moving the borrow_components computation outside the inner loop in each_borrow_involving_path -- but they all made things slower or at best made no difference :(

@nnethercote yeah, that code was not naive -- in fact, you had previously optimized it to use SmallVec, and unroll_place was itself an optimization on that =) I have some thoughts about how to do better but it would require deeper refactoring. I will try to write some notes down at least.

@Mark-Simulacrum: #52250 reduces html5ever's max-rss from 11,380,868.00 down to 10,285,700.00. (I assume the unit is KB or KiB.) I wonder if that's enough to avoid the OOMs on machines with 16 GiB of RAM.

@nnethercote I'll try and enable it -- the perf collector has, I think, 32 GB of RAM (though free reports 31G, that seems odd) so I think that it should be "good enough" in that regard. Local benchmark runners can exclude it if necessary, I think...

I looked into unroll_place a bit more. The problem is not that unroll_place is slow, but rather that it's called so many times.

Specifically, we hit the for i in candidates loop within each_borrow_involving_path() 150,000 times when compiling html5ever:
https://github.com/rust-lang/rust/blob/31263f320419f21a909a04b92305ed66364f63c0/src/librustc_mir/borrow_check/path_utils.rs#L61-L74

Here's the candidates length distribution for those 150,000 hits:

  • len=20: 8 times
  • len=21: 8 times
  • len=22: 8 times
  • ...
  • len=9854: 8 times
  • len=9855: 8 times
  • len=9856: 9862 times

There's something quadratic going on, as shown by the equal number of occurrences for lengths 1..9855. There is also the over-representation of len=9856. I haven't yet worked out exactly where this is coming from.
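(Back-of-envelope, taking the distribution at face value -- roughly 8 hits at each length up to 9,855 plus 9,862 hits at length 9,856: the loop examines about 8 × (9,855 × 9,856 / 2) + 9,862 × 9,856 ≈ 486 million candidates in total, which lines up with the ~485.8 million unroll_place calls in the earlier Callgrind profile, i.e. roughly one unroll_place call per candidate examined.)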

#52190 reduces the max-rss to 2 GB -- still too big, but a lot better. To do much better than that, I think we have to start changing from a simple bitset to something more compressed, so we can capture patterns across variables. Something like BDDs comes to mind.

The performance remains ungreat, though, and @nnethercote's profiles are very useful in that regard. It seems clear (also from the fact that liveness dominates) that in this case a large number of borrows accumulates somewhere -- each access is then checked against that large number of borrows. I suspect we could do a lot better via some kind of hashing scheme, where we hash paths to figure out whether they might possibly overlap. I would imagine we would walk down the place to find prefixes.
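To make that idea concrete, here is a toy sketch of a prefix-based index (entirely hypothetical -- a Vec<u32> stands in for a place path, and the real place-conflict rules are much subtler than "one path is a prefix of the other"):

use std::collections::HashSet;

/// Toy stand-in for a place: a path of projection ids, e.g. x.f.0 -> [f, 0].
type PlacePath = Vec<u32>;

/// Index every prefix of every borrowed place, so an access only looks up its
/// own prefixes instead of scanning the whole list of candidate borrows.
struct BorrowIndex {
    prefixes: HashSet<PlacePath>,
}

impl BorrowIndex {
    fn new(borrowed: &[PlacePath]) -> Self {
        let mut prefixes = HashSet::new();
        for place in borrowed {
            for end in 0..=place.len() {
                prefixes.insert(place[..end].to_vec());
            }
        }
        BorrowIndex { prefixes }
    }

    /// Two such paths can only overlap if one is a prefix of the other, so an
    /// access *might* conflict only if one of its prefixes was indexed above.
    fn may_conflict(&self, access: &PlacePath) -> bool {
        (0..=access.len()).any(|end| self.prefixes.contains(&access[..end]))
    }
}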

We are deferring further work that targets max-rss specifically to the release candidate, since it doesn't seem pressing for Edition Preview 2 right now. At this point, it seems like the next step is to look at ways to represent liveness regions that can share between region values -- e.g., something like a BDD (correct ordering will be key here) or perhaps SEME regions.

I re-profiled. The high number of calls to unroll_place is still a major problem, as is the behaviour of precompute_borrows_out_of_scope. Both appear to involve quadratic behaviour.

For posterity's sake, 95% of the time spent in html5ever is due to this big static

Just linking the couple issues Niko filed, related to @nnethercote's and others' profiles (but not strictly about this current issue of memory usage):

Update: With https://github.com/rust-lang/rust/pull/53168, I think we can expect html5ever's memory usage to be reduced to 1.2GB. (At least, the previous PR — which I reverted — had that effect, and this one does the same basic thing, just more soundly.)

@Mark-Simulacrum: are you happy to close this now? With NLL, html5ever's memory usage and speed are now both reasonable (and #53383 will help both a bit more).

html5ever still uses 6x more memory (over a gigabyte) for check builds if I'm reading the dashboard right, which I believe is unacceptably high long-term. However, I do agree that we can discontinue prioritizing this for the edition (the performance regression is no longer a huge outlier for html5ever specifically).

Note that https://github.com/rust-lang/rust/pull/53327 seems to drop memory use to 600MB here (although not the latest rev, which needs to be investigated).

Pushing this out to RC 2 — it's a perf issue and not a critical one (plus pending PRs ought to help)

Memory use is currently at a 2.28x ratio, weighing in at 501MB.

Still a lot, but not so much as to be an RC2 blocker.

Would be good to know what's going on though.

Here's a Massif graph showing a big spike near the end of a check build:
[Massif graph]

228MB of that is from SparseBitMatrix instances within ConstraintGeneration::add_regular_live_constraint. Here's a DHAT record for half of it (and there's a second record for the other half that is extremely similar):

max-live:    114,195,072 in 9,858 blocks
tot-alloc:   115,981,392 in 15,556 blocks (avg size 7455.73) 
deaths:      15,556, at avg age 186,718,791 (1.80% of prog lifetime)
acc-ratios:  1.00 rd, 0.00 wr  (116,230,288 b-read, 248,896 b-written)
   at 0x4C2FEE5: calloc (in /usr/lib/valgrind/vgpreload_exp-dhat-amd64-linux.so)
   by 0x7AE3D45: alloc_zeroed (alloc.rs:147)
   by 0x7AE3D45: alloc_zeroed (alloc.rs:174)
   by 0x7AE3D45: allocate_in<u128,alloc::alloc::Global> (raw_vec.rs:104)
   by 0x7AE3D45: with_capacity_zeroed<u128> (raw_vec.rs:156)
   by 0x7AE3D45: from_elem<u128> (vec.rs:1620)
   by 0x7AE3D45: from_elem<u128> (vec.rs:1581)
   by 0x7AE3D45: new<rustc_mir::borrow_check::nll::region_infer::values::PointIndex> (<vec macros>:2)
   by 0x7AE3D45: {{closure}}<rustc_mir::borrow_check::nll::constraints::ConstraintSccIndex,rustc_mir::borrow_check::nll::region_infer::values::PointIndex> (bitvec.rs:355)
   by 0x7AE3D45: get_or_insert_with<rustc_data_structures::bitvec::BitArray<rustc_mir::borrow_check::nll::region_infer::values::PointIndex>,closure> (option.rs:814)
   by 0x7AE3D45: <rustc_data_structures::bitvec::SparseBitMatrix<R, C>>::ensure_row (bitvec.rs:355)
   by 0x7B654AC: add<rustc::ty::sty::RegionVid,rustc_mir::borrow_check::nll::region_infer::values::PointIndex> (bitvec.rs:363)
   by 0x7B654AC: <rustc_mir::borrow_check::nll::region_infer::values::LivenessValues<N>>::add_element (values.rs:182)
   by 0x7B5690D: {{closure}}<&rustc::ty::sty::RegionKind> (constraint_generation.rs:205)
   by 0x7B5690D: {{closure}}<&rustc::ty::sty::RegionKind,closure> (fold.rs:271)
   by 0x7B5690D: visit_region<closure> (fold.rs:333)
   by 0x7B5690D: visit_with<rustc::ty::fold::{{impl}}::any_free_region_meets::RegionVisitor<closure>> (structural_impls.rs:964)
   by 0x7B5690D: any_free_region_meets<&rustc::ty::sty::RegionKind,closure> (fold.rs:291)
   by 0x7B5690D: for_each_free_region<&rustc::ty::sty::RegionKind,closure> (fold.rs:270)
   by 0x7B5690D: add_regular_live_constraint<&rustc::ty::sty::RegionKind> (constraint_generation.rs:201)
   by 0x7B5690D: visit_region (constraint_generation.rs:70)
   by 0x7B5690D: super_rvalue<rustc_mir::borrow_check::nll::constraint_generation::ConstraintGeneration> (visit.rs:540)
   by 0x7B5690D: visit_rvalue<rustc_mir::borrow_check::nll::constraint_generation::ConstraintGeneration> (visit.rs:138)
   by 0x7B5690D: super_assign<rustc_mir::borrow_check::nll::constraint_generation::ConstraintGeneration> (visit.rs:403)
   by 0x7B5690D: <rustc_mir::borrow_check::nll::constraint_generation::ConstraintGeneration<'cg, 'cx, 'gcx, 'tcx> as rustc::mir::visit::Visitor<'tcx>>::visit_assign (constraint_generation.rs:151)

I added some println! statements. Looks like we have a big SparseBitMatrix that ends up with about 30,000 rows, and 92,652 columns in each row, but only a single bit is set in each row. SparseBitMatrix is only moderately sparse: rows are allocated on demand, but each allocated row is a full dense bit array.
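(Back-of-envelope, assuming each row is a dense array of u128 words as the from_elem<u128> frames above suggest: ceil(92,652 / 128) = 724 words = 11,584 bytes per row, and the DHAT record shows 9,858 rows live at peak, so 9,858 × 11,584 = 114,195,072 bytes -- exactly the max-live figure -- is spent holding 9,858 single set bits.)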

Perhaps a sparser representation could help. In fact, SparseBitMatrix used to be sparser, but in #52250 I made it denser because it was a significant speed win on some benchmarks. Perhaps there's a middle ground to be found.

I have a plan to fix this, by making SparseBitMatrix sparser in the case where many rows have only 1 column (or a small number of columns) set. It will involve using HybridIdxSet for the rows. But that will first require cleaning up and combining the various bitset structures we currently have, including BitSlice, BitArray, and IdxSet. #54177 is the first step.

(The old SparseBitMatrix used BTreeMap for the rows, which gave terrible performance and memory usage when the rows became fairly dense. HybridIdxSet should provide a good compromise: compact when sparse, but space-efficient when dense, and fast in both cases.)
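Roughly, the hybrid idea looks like this -- a simplified sketch with a made-up threshold, not the actual rustc_data_structures code:

// Illustrative threshold; the real one differs.
const SPARSE_MAX: usize = 8;

enum HybridBitSet {
    /// Few elements: store them directly, costing only a handful of words.
    Sparse(Vec<usize>),
    /// Many elements: one dense bit vector over the whole domain.
    Dense { words: Vec<u64> },
}

impl HybridBitSet {
    /// Insert `elem`, switching to the dense form once the set grows past
    /// SPARSE_MAX. Returns true if the element was newly inserted.
    fn insert(&mut self, elem: usize, domain_size: usize) -> bool {
        match self {
            HybridBitSet::Sparse(elems) => {
                if elems.contains(&elem) {
                    return false;
                }
                if elems.len() < SPARSE_MAX {
                    elems.push(elem);
                    return true;
                }
                // Crossed the threshold: densify, then insert.
                let mut words = vec![0u64; (domain_size + 63) / 64];
                for &e in elems.iter().chain(std::iter::once(&elem)) {
                    words[e / 64] |= 1 << (e % 64);
                }
                *self = HybridBitSet::Dense { words };
                true
            }
            HybridBitSet::Dense { words } => {
                let (word, bit) = (elem / 64, 1u64 << (elem % 64));
                let was_set = (words[word] & bit) != 0;
                words[word] |= bit;
                !was_set
            }
        }
    }
}

With per-row sets like this, a 92,652-column row holding a single bit costs a few words instead of 11,584 bytes.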


#54286 is the next step. Once that lands, I will file a PR for using HybridBitSet within SparseBitMatrix. I have it working locally and the memory improvements are sizeable:

html5ever-check  avg: -46.7%  min: -46.7%  max: -46.7%
ucd-check        avg: -45.2%  min: -45.2%  max: -45.2%
clap-rs-check    avg: -23.3%  min: -23.3%  max: -23.3%
inflate-check    avg: -14.9%  min: -14.9%  max: -14.9%

@lqd just pointed me at a crate (unic-ucd-name-0.7.0) with high memory usage under NLL. My HybridBitSet change reduces its max-rss by 96%, from 29.1GB to 1.2GB! (Compared to non-NLL, that improves the increase in max-rss from 42.5x to 1.7x.)

The NLL dashboard has updated. The top 5 now looks like this:

Benchmark    | clean-check (KB) | nll-check (KB) | nll/clean (%)
keccak       | 636,612.00       | 912,740.00     | 143.37
html5ever    | 221,892.00       | 266,248.00     | 119.99
coercions    | 181,744.00       | 210,120.00     | 115.61
tuple-stress | 395,444.00       | 442,804.00     | 111.98
script-servo | 2,463,224.00     | 2,617,760.00   | 106.27

A 20% increase in max-rss for html5ever with NLL isn't ideal, but it's a whole lot better than the 14 GB+ we were getting when this issue was opened.
