I see OOMs locally on a machine with 16 GB of memory.
Source code: https://github.com/rust-lang-nursery/rustc-perf/tree/master/collector/benchmarks/html5ever
I did a run with Massif:
Yep, that's a 14 GiB peak, with 12 GiB of it coming from a single stack trace involving `visit_ty`/`visit_with`. I'm not surprised it OOM'd on a 16 GiB machine.
Massif doesn't get great stack traces due to inlining, but DHAT does a better job by using debuginfo. Here's the important info:
-------------------- 2 of 500 --------------------
max-live: 12,884,901,888 in 1 blocks
tot-alloc: 12,884,901,888 in 1 blocks (avg size 12884901888.00)
deaths: 1, at avg age 219,437,305,943 (40.34% of prog lifetime)
acc-ratios: 0.60 rd, 0.60 wr (7,768,302,580 b-read, 7,768,302,580 b-written)
at 0x4C2DECF: malloc (in /usr/lib/valgrind/vgpreload_exp-dhat-amd64-linux.so)
by 0x7A673EC: alloc (alloc.rs:62)
by 0x7A673EC: alloc (alloc.rs:123)
by 0x7A673EC: reserve_internal<(&rustc::ty::sty::RegionKind, rustc::mir::Location),alloc::alloc::Global> (raw_vec.rs:679)
by 0x7A673EC: reserve<(&rustc::ty::sty::RegionKind, rustc::mir::Location),alloc::alloc::Global> (raw_vec.rs:502)
by 0x7A673EC: reserve<(&rustc::ty::sty::RegionKind, rustc::mir::Location)> (vec.rs:464)
by 0x7A673EC: push<(&rustc::ty::sty::RegionKind, rustc::mir::Location)> (vec.rs:1060)
by 0x7A673EC: {{closure}}<&rustc::ty::TyS> (liveness.rs:171)
by 0x7A673EC: visit_region<closure> (fold.rs:305)
by 0x7A673EC: rustc::ty::structural_impls::<impl rustc::ty::fold::TypeFoldable<'tcx> for &'tcx rustc::ty::sty::RegionKind>::visit_with (structural_impls.rs:956)
by 0x7A68568: super_visit_with<rustc::ty::fold::{{impl}}::for_each_free_region::RegionVisitor<closure>> (structural_impls.rs:888)
by 0x7A68568: rustc::ty::fold::TypeVisitor::visit_ty (fold.rs:188)
by 0x7A69AD0: visit_with<rustc::ty::fold::{{impl}}::for_each_free_region::RegionVisitor<closure>> (structural_impls.rs:903)
by 0x7A69AD0: {{closure}}<rustc::ty::fold::{{impl}}::for_each_free_region::RegionVisitor<closure>> (structural_impls.rs:762)
by 0x7A69AD0: {{closure}}<core::slice::Iter<&rustc::ty::TyS>,closure> (iterator.rs:1706)
by 0x7A69AD0: {{closure}}<core::slice::Iter<&rustc::ty::TyS>,closure,core::iter::LoopState<(), ()>> (iterator.rs:1536)
by 0x7A69AD0: try_fold<&rustc::ty::TyS,(),closure,core::iter::LoopState<(), ()>> (mod.rs:2431)
by 0x7A69AD0: try_for_each<core::slice::Iter<&rustc::ty::TyS>,closure,core::iter::LoopState<(), ()>> (iterator.rs:1536)
by 0x7A69AD0: any<core::slice::Iter<&rustc::ty::TyS>,closure> (iterator.rs:1705)
by 0x7A69AD0: super_visit_with<rustc::ty::fold::{{impl}}::for_each_free_region::RegionVisitor<closure>> (structural_impls.rs:762)
by 0x7A69AD0: rustc::ty::fold::TypeFoldable::visit_with (fold.rs:63)
Holy cow! That's a single 12 GiB allocation happening within `push_type_live_constraint`:
https://github.com/rust-lang/rust/blob/master/src/librustc_mir/borrow_check/nll/type_check/liveness.rs#L170-L172
(Once again, `add_liveness_constraints` and the code beneath it are causing difficulties.)
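For scale, the 12 GiB block in the DHAT record above is consistent with `liveness_set` being a single `Vec<(&RegionKind, Location)>` that has just doubled to its final capacity. Here's a rough cross-check, assuming a 24-byte tuple layout on a 64-bit target (my arithmetic, not something taken from the profile):

```rust
// Back-of-the-envelope check on the 12 GiB block in the DHAT record above.
// Assumption (not from the profile): on a 64-bit target a
// (&RegionKind, Location) tuple is 24 bytes -- an 8-byte reference plus a
// Location padded to 16 bytes.
fn main() {
    let block_bytes: u64 = 12_884_901_888;
    let elem_bytes: u64 = 24;
    let capacity = block_bytes / elem_bytes;
    assert_eq!(capacity, 536_870_912);   // exactly 2^29 entries
    assert!(capacity.is_power_of_two()); // i.e. the result of one final Vec doubling
    println!("liveness_set capacity: {capacity}");
}
```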
Because `liveness_set` is so enormous, `add_region_liveness_constraints_from_type_check()` is slow. From Callgrind:
. for (region, location) in liveness_set {
. debug!("generate: {:#?} is live at {:#?}", region, location);
777,018,978 let region_vid = regioncx.to_region_vid(region);
1,942,548,420 regioncx.add_live_point(region_vid, *location);
59,657,319,186 => /home/njn/moz/rust0/src/librustc_mir/borrow_check/nll/region_infer/mod.rs:rustc_mir::borrow_check::nll::region_infer::RegionInferenceContext::add_live_point (388507821x)
. }
`add_live_point()` then does some `SparseBitMatrix` insertions.
There are also some other causes of slowness for `html5ever`.
First, `precompute_borrows_out_of_scope` is super hot. Callgrind output:
. mir: &'a Mir<'tcx>,
. regioncx: &Rc<RegionInferenceContext<'tcx>>,
. borrows_out_of_scope_at_location: &mut FxHashMap<Location, Vec<BorrowIndex>>,
. borrow_index: BorrowIndex,
. borrow_region: RegionVid,
. location: Location,
. ) {
. // Keep track of places we've locations to check and locations that we have checked.
46,668 let mut stack = vec![ location ];
. let mut visited = FxHashSet();
62,224 visited.insert(location);
4,413,266 => /home/njn/moz/rust0/src/libstd/collections/hash/set.rs:<std::collections::hash::set::HashSet<T, S>>::insert (15431x)
.
. debug!(
. "borrow {:?} has region {:?} with value {:?}",
. borrow_index,
. borrow_region,
. regioncx.region_value_str(borrow_region),
. );
. debug!("borrow {:?} starts at {:?}", borrow_index, location);
971,769,410 while let Some(location) = stack.pop() {
. // If region does not contain a point at the location, then add to list and skip
. // successor locations.
971,769,410 if !regioncx.region_contains_point(borrow_region, location) {
. debug!("borrow {:?} gets killed at {:?}", borrow_index, location);
99,190 borrows_out_of_scope_at_location
1,290,825 => /home/njn/moz/rust0/src/libstd/collections/hash/map.rs:<std::collections::hash::map::Entry<'a, K, V>>::or_insert (13827x)
3,298,140 => /home/njn/moz/rust0/src/libstd/collections/hash/map.rs:<std::collections::hash::map::HashMap<K, V, S>>::entry (13827x)
28,340 .entry(location)
. .or_insert(vec![])
. .push(borrow_index);
. continue;
. }
.
. let bb_data = &mir[location.block];
. // If this is the last statement in the block, then add the
. // terminator successors next.
1,457,611,605 if location.statement_index == bb_data.statements.len() {
. // Add successors to locations to visit, if not visited before.
59,584 if let Some(ref terminator) = bb_data.terminator {
89,376 for block in terminator.successors() {
799,865 => /home/njn/moz/rust0/src/librustc/mir/mod.rs:rustc::mir::Terminator::successors (29527x)
190,405 let loc = block.start_location();
112,488 => /home/njn/moz/rust0/src/librustc/mir/mod.rs:rustc::mir::BasicBlock::start_location (37496x)
228,486 if visited.insert(loc) {
5,583,147 => /home/njn/moz/rust0/src/libstd/collections/hash/set.rs:<std::collections::hash::set::HashSet<T, S>>::insert (37496x)
. stack.push(loc);
. }
. }
. }
. } else {
. // Visit next statement in block.
2,429,203,715 let loc = location.successor_within_block();
1,943,360,660 => /home/njn/moz/rust0/src/librustc/mir/mod.rs:rustc::mir::Location::successor_within_block (485840165x)
2,915,044,458 if visited.insert(loc) {
68,678,149,511 => /home/njn/moz/rust0/src/libstd/collections/hash/set.rs:<std::collections::hash::set::HashSet<T, S>>::insert (485840165x)
. stack.push(loc);
. }
. }
. }
. }
It's the bit at the end that's the hottest, executing 485 million times. Because of this function, the following things are also super hot:

- `region_contains_point`/`RegionValues::contains`
- `HashSet::insert` (note that the `visited` hash table gets big: a total of 38 GB is allocated/reallocated in 109,000 allocations with an average size of 347 KB!)
- `successor_within_block`
Second, `unroll_place` is also very hot:
. fn unroll_place<'tcx, R>(
. place: &Place<'tcx>,
. next: Option<&PlaceComponents<'_, 'tcx>>,
. op: impl FnOnce(PlaceComponentsIter<'_, 'tcx>) -> R,
. ) -> R {
. match place {
8,892,046,598 Place::Projection(interior) => unroll_place(
42,799,560,024 => /home/njn/moz/rust0/src/librustc_mir/borrow_check/places_conflict.rs:rustc_mir::borrow_check::places_conflict::unroll_place'2 (485791408x)
534,414,126 &interior.base,
1,068,828,252 Some(&PlaceComponents {
. component: place,
. next,
. }),
2,672,070,630 op,
. ),
.
. Place::Local(_) | Place::Static(_) => {
1,943,739,512 let list = PlaceComponents {
. component: place,
. next,
. };
1,943,739,512 op(list.iter())
. }
. }
6,754,274,663 }
I always wondered if that super naive `liveness_set` would cause us problems =) seems clear the answer is yes.
Thanks @nnethercote for the data, super helpful.
I suspect we can remove the `liveness_set` vector entirely.
But it will require some refactoring. I think I would want to build this atop https://github.com/rust-lang/rust/pull/51987. Let me summarize the setup in that PR:

- The type check produces a set of outlives constraints (e.g., `'a: 'b`) as well as the set of liveness facts `(R, P)` -- i.e., the region `R` is live at those points `P`.
- We used to have a fairly hard requirement that we know the number of region variables before we could create the region inference context. This is because we represented the values as a huge matrix of bits. But now that we use sparse bit sets that is not true; we could grow the number of region variables fairly easily.
This implies to me that -- to start -- we could refactor so that the type check generates the initial liveness results directly (storing them in a sparse bit set per variable). This would effectively replace the `liveness_set` vector that we have today. We would then hand that off to the region inference context.
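To make the shape of that refactoring concrete, here is a minimal sketch of the direction. The names (`RegionVid`, `PointIndex`, `LivenessValues`) are stand-ins using plain std collections, not the actual rustc definitions: the idea is one sparse set of points per region variable instead of a flat `Vec<(Region, Location)>`.

```rust
use std::collections::{BTreeSet, HashMap};

// Hypothetical stand-ins for rustc's region variable and per-body point indices.
type RegionVid = u32;
type PointIndex = u32;

/// One sparse set of live points per region variable, instead of a flat
/// Vec<(Region, Location)> that repeats the region for every point.
#[derive(Default)]
struct LivenessValues {
    points: HashMap<RegionVid, BTreeSet<PointIndex>>,
}

impl LivenessValues {
    /// Record that `region` is live at `point` (what push_type_live_constraint
    /// effectively wants to express), returning true if this is a new fact.
    fn add_live_point(&mut self, region: RegionVid, point: PointIndex) -> bool {
        self.points.entry(region).or_default().insert(point)
    }

    /// Hand the per-region sets off to region inference later.
    fn live_points(&self, region: RegionVid) -> impl Iterator<Item = PointIndex> + '_ {
        self.points.get(&region).into_iter().flatten().copied()
    }
}

fn main() {
    let mut liveness = LivenessValues::default();
    liveness.add_live_point(0, 17);
    liveness.add_live_point(0, 17); // duplicates are absorbed, unlike a Vec
    liveness.add_live_point(3, 42);
    assert_eq!(liveness.live_points(0).count(), 1);
}
```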
The reason we'd have to be able to grow the number of region variables is that, I think, computing liveness constraints can sometimes create fresh region variables, and possibly constraints between those variables. (This is due to the "dropck" computation.)
It'd be sort of nice if that were not true, though, and it may not be true -- if it weren't, we could compute the SCCs ahead of time and then never generate the per-region liveness, instead focusing on the liveness per SCC. This may affect diagnostics, which we could perhaps recover by doing more refined computations lazily (or maybe it wouldn't affect diagnostics, I'm not sure; I'd have to think about it).
But it'd probably still be a big improvement to move to the sparse bit set instead of the vector of points.
Another option that I have been wondering about is moving from "set of points" to a more compact representation -- for example, the SEME region abstraction implemented in https://github.com/rust-lang-nursery/rustc-seme-regions may be more compact. However, presently there can be multiple disjoint liveness regions, and SEME regions can't express that, so we'd have to figure out what to do there.
More about `precompute_borrows_out_of_scope`: there's something quadratic going on. I added this line to the very end of the function:
eprintln!("visited {}", visited.len());
The resulting output when compiling `html5ever` is dominated by this sequence of length 9858:
visited 88704
visited 88703
visited 88691
visited 88683
visited 88675
visited 88667
...
visited 9899
visited 9891
visited 9883
visited 9875
visited 9867
visited 9862
visited 9861
It definitely feels like we're traversing a long sequence from A to B, then moving A forward slightly (by 8 steps, to be precise), then traversing from A to B again, over and over.
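Those per-call `visited` sizes also line up with the call counts in the Callgrind listing above. A rough cross-check (my arithmetic, not additional profile data):

```rust
// The visited-set sizes form a roughly arithmetic sequence from ~9,861 up to
// ~88,704 across 9,858 calls, so the total number of locations visited is
// about the sum of that sequence -- which lands right on the ~485.8 million
// successor_within_block / HashSet::insert calls in the Callgrind output.
fn main() {
    let calls: u64 = 9_858;
    let total_visits = calls * (9_861 + 88_704) / 2;
    println!("{total_visits}"); // 485_826_885, vs. 485_840_165 in the profile
}
```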
Yeah, `precompute_borrows_out_of_scope` is potentially quadratic -- it is itself an optimization over an earlier approach -- but I've not thought of a better way for it to work. It basically walks the scope of each borrow: the number of borrows is sort of "O(n)" in the size of the code, and the potential extent of each borrow is similarly "O(n)".
Hmm, as an aside, I wonder if the premise of finding the borrows-in-scope via dataflow is a bit silly, given that `precompute_borrows_out_of_scope` exists. That is, it seems like we are duplicating work with some of the dataflow propagation code. Maybe it would be worth killing that code.
(This is another place where changing to an alternative representation of regions, one that optimizes continuous stretches of things, might be a win.)
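To illustrate what "optimizing continuous stretches" could mean, here is a purely hypothetical sketch (not an existing rustc type) of a run-length set of points, where a long live range costs one entry rather than one bit or one hash-set element per point:

```rust
/// Hypothetical run-length set of points: sorted, disjoint half-open ranges.
/// A borrow that stays in scope across a long straight-line stretch of MIR
/// costs a single entry here, instead of one bit or hash-set entry per point.
#[derive(Default, Debug)]
struct IntervalSet {
    ranges: Vec<(u32, u32)>, // (start, end), end exclusive
}

impl IntervalSet {
    fn contains(&self, point: u32) -> bool {
        // Binary search on the range starts, then check the candidate range.
        match self.ranges.binary_search_by_key(&point, |&(start, _)| start) {
            Ok(_) => true,
            Err(0) => false,
            Err(i) => point < self.ranges[i - 1].1,
        }
    }

    /// Append-only insert for points arriving in increasing order; a real
    /// implementation would also have to merge out-of-order insertions.
    fn push(&mut self, point: u32) {
        if let Some(last) = self.ranges.last_mut() {
            if last.1 == point {
                last.1 += 1; // extend the current run
                return;
            }
        }
        self.ranges.push((point, point + 1)); // start a new run
    }
}

fn main() {
    let mut set = IntervalSet::default();
    for point in 0..10_000 {
        set.push(point); // 10,000 consecutive points...
    }
    assert_eq!(set.ranges.len(), 1); // ...stored as a single range
    assert!(set.contains(9_999) && !set.contains(10_000));
}
```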
I made several attempts to speed up `unroll_place` -- using `SmallVec`, using `Vec`, moving the `borrow_components` computation outside the inner loop in `each_borrow_involving_path` -- but they all made things slower or at best made no difference :(
@nnethercote yeah, that code was not naive -- in fact, you had previously optimized it to use `SmallVec`, and `unroll_place` was itself an optimization on that =) I have some thoughts about how to do better but it would require deeper refactoring. I will try to write some notes down at least.
JFYI, I created this Zulip thread to talk about this issue
@Mark-Simulacrum: #52250 reduces `html5ever`'s max-rss from 11,380,868.00 down to 10,285,700.00. (I assume the unit is KB or KiB.) I wonder if that's enough to avoid the OOMs on machines with 16 GiB of RAM.
@nnethercote I'll try to enable it -- the perf collector has, I think, 32 GB of RAM (though `free` reports 31 GB, which seems odd), so it should be "good enough" in that regard. Local benchmark runners can exclude it if necessary, I think...
I looked into `unroll_place` a bit more. The problem is not that `unroll_place` is slow, but rather that it's called so many times. Specifically, we hit the `for i in candidates` loop within `each_borrow_involving_path()` 150,000 times when compiling `html5ever`:
https://github.com/rust-lang/rust/blob/31263f320419f21a909a04b92305ed66364f63c0/src/librustc_mir/borrow_check/path_utils.rs#L61-L74
Here's the `candidates` length distribution for those 150,000 hits:
There's something quadratic going on, as shown by the equal number of occurrences for lengths 1..9855. There is also an over-representation of len=9856. I haven't yet worked out exactly where this is coming from.
The performance remains ungreat, though, and @nnethercote's profiles are very useful in that regard. It seems clear (also from the fact that liveness dominates) that in this case a large number of borrows accumulates somewhere, and each access is then checked against that large number of borrows. I suspect we could do a lot better via some kind of hashing scheme, where we hash paths to figure out if they might possibly overlap. I would imagine we would walk down the place to find prefixes.
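To illustrate the kind of pre-filtering that could give, here is a toy sketch (`Place` and `BorrowIndexByLocal` are hypothetical simplifications, not rustc's types or any existing API): bucket the in-scope borrows by the base local of their place, so each access is only compared against borrows that could plausibly overlap it, instead of walking every candidate with `unroll_place`.

```rust
use std::collections::HashMap;

/// Hypothetical flattened place: a base local plus a projection path.
#[derive(Debug)]
struct Place {
    base_local: u32,
    projections: Vec<u32>, // e.g. field indices
}

/// Index the active borrows by base local; places rooted at different locals
/// can never conflict, so most candidates are skipped without any path walk.
#[derive(Default)]
struct BorrowIndexByLocal {
    by_local: HashMap<u32, Vec<(usize, Place)>>, // local -> (borrow index, borrowed place)
}

impl BorrowIndexByLocal {
    fn insert(&mut self, borrow_index: usize, place: Place) {
        self.by_local.entry(place.base_local).or_default().push((borrow_index, place));
    }

    /// Yield only the borrows whose place shares a projection prefix with
    /// `access` -- a cheap over-approximation of the real conflict check.
    fn candidates<'a>(&'a self, access: &'a Place) -> impl Iterator<Item = usize> + 'a {
        self.by_local
            .get(&access.base_local)
            .into_iter()
            .flatten()
            .filter(move |(_, borrowed)| {
                let n = access.projections.len().min(borrowed.projections.len());
                access.projections[..n] == borrowed.projections[..n]
            })
            .map(|&(idx, _)| idx)
    }
}

fn main() {
    let mut index = BorrowIndexByLocal::default();
    index.insert(0, Place { base_local: 1, projections: vec![0] }); // _1.0
    index.insert(1, Place { base_local: 1, projections: vec![1] }); // _1.1
    index.insert(2, Place { base_local: 2, projections: vec![] });  // _2
    let access = Place { base_local: 1, projections: vec![0, 3] };  // _1.0.3
    // Only borrow 0 survives the prefix filter; borrows 1 and 2 are skipped.
    assert_eq!(index.candidates(&access).collect::<Vec<_>>(), vec![0]);
}
```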
We are deferring further work that targets max-rss specifically to the release candidate, since it doesn't seem pressing for Edition Preview 2 right now. At this point, it seems like the next step is to look at ways to represent liveness regions that can share between region values -- e.g., something like a BDD (correct ordering will be key here) or perhaps SEME regions.
I re-profiled. The high number of calls to `unroll_place` is still a major problem, as is the behaviour of `precompute_borrows_out_of_scope`. Both appear to involve quadratic behaviour.
For posterity's sake, 95% of the time spent in `html5ever` is due to this big static.
Just linking the couple of issues Niko filed, related to @nnethercote's and others' profiles (but not strictly about this current issue of memory usage):
Update: With https://github.com/rust-lang/rust/pull/53168, I think we can expect html5ever's memory usage to be reduced to 1.2GB. (At least, the previous PR — which I reverted — had that effect, and this one does the same basic thing, just more soundly.)
@Mark-Simulacrum: are you happy to close this now? With NLL, `html5ever`'s memory usage and speed are now both reasonable (and #53383 will help both a bit more).
html5ever still uses 6x more memory (over a gigabyte) for check builds if I'm reading the dashboard right, which I believe is unacceptably high long-term. However, I do agree that we can discontinue prioritizing this for the edition (performance regression is no longer a huge outlier for html5ever specifically).
Note that https://github.com/rust-lang/rust/pull/53327 seems to drop memory use to 600MB here (although not the latest rev, which needs to be investigated).
Pushing this out to RC 2 — it's a perf issue and not a critical one (plus pending PRs ought to help)
Memory use is currently at a 2.28x ratio, weighing in at 501MB. Still a lot, but not so much as to be an RC2 blocker. Would be good to know what's going on, though.
Here's a Massif graph showing a big spike near the end of a `check` build:
228MB of that is from `SparseBitMatrix` instances within `ConstraintGeneration::add_regular_live_constraint`. Here's a DHAT record for half of it (there's a second record for the other half that is extremely similar):
max-live: 114,195,072 in 9,858 blocks
tot-alloc: 115,981,392 in 15,556 blocks (avg size 7455.73)
deaths: 15,556, at avg age 186,718,791 (1.80% of prog lifetime)
acc-ratios: 1.00 rd, 0.00 wr (116,230,288 b-read, 248,896 b-written)
at 0x4C2FEE5: calloc (in /usr/lib/valgrind/vgpreload_exp-dhat-amd64-linux.so)
by 0x7AE3D45: alloc_zeroed (alloc.rs:147)
by 0x7AE3D45: alloc_zeroed (alloc.rs:174)
by 0x7AE3D45: allocate_in<u128,alloc::alloc::Global> (raw_vec.rs:104)
by 0x7AE3D45: with_capacity_zeroed<u128> (raw_vec.rs:156)
by 0x7AE3D45: from_elem<u128> (vec.rs:1620)
by 0x7AE3D45: from_elem<u128> (vec.rs:1581)
by 0x7AE3D45: new<rustc_mir::borrow_check::nll::region_infer::values::PointIndex> (<vec macros>:2)
by 0x7AE3D45: {{closure}}<rustc_mir::borrow_check::nll::constraints::ConstraintSccIndex,rustc_mir::borrow_check::nll::region_infer::values::PointIndex> (bitvec.rs:355)
by 0x7AE3D45: get_or_insert_with<rustc_data_structures::bitvec::BitArray<rustc_mir::borrow_check::nll::region_infer::values::PointIndex>,closure> (option.rs:814)
by 0x7AE3D45: <rustc_data_structures::bitvec::SparseBitMatrix<R, C>>::ensure_row (bitvec.rs:355)
by 0x7B654AC: add<rustc::ty::sty::RegionVid,rustc_mir::borrow_check::nll::region_infer::values::PointIndex> (bitvec.rs:363)
by 0x7B654AC: <rustc_mir::borrow_check::nll::region_infer::values::LivenessValues<N>>::add_element (values.rs:182)
by 0x7B5690D: {{closure}}<&rustc::ty::sty::RegionKind> (constraint_generation.rs:205)
by 0x7B5690D: {{closure}}<&rustc::ty::sty::RegionKind,closure> (fold.rs:271)
by 0x7B5690D: visit_region<closure> (fold.rs:333)
by 0x7B5690D: visit_with<rustc::ty::fold::{{impl}}::any_free_region_meets::RegionVisitor<closure>> (structural_impls.rs:964)
by 0x7B5690D: any_free_region_meets<&rustc::ty::sty::RegionKind,closure> (fold.rs:291)
by 0x7B5690D: for_each_free_region<&rustc::ty::sty::RegionKind,closure> (fold.rs:270)
by 0x7B5690D: add_regular_live_constraint<&rustc::ty::sty::RegionKind> (constraint_generation.rs:201)
by 0x7B5690D: visit_region (constraint_generation.rs:70)
by 0x7B5690D: super_rvalue<rustc_mir::borrow_check::nll::constraint_generation::ConstraintGeneration> (visit.rs:540)
by 0x7B5690D: visit_rvalue<rustc_mir::borrow_check::nll::constraint_generation::ConstraintGeneration> (visit.rs:138)
by 0x7B5690D: super_assign<rustc_mir::borrow_check::nll::constraint_generation::ConstraintGeneration> (visit.rs:403)
by 0x7B5690D: <rustc_mir::borrow_check::nll::constraint_generation::ConstraintGeneration<'cg, 'cx, 'gcx, 'tcx> as rustc::mir::visit::Visitor<'tcx>>::visit_assign (constraint_generation.rs:151)
I added some `println!` statements. Looks like we have a big `SparseBitMatrix` that ends up with about 30,000 rows and 92,652 columns per row, but only a single bit set in each row. `SparseBitMatrix` is only moderately sparse: rows are instantiated on demand, but each instantiated row is a full, dense bit vector.
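The DHAT record above is consistent with that dense per-row storage. A quick check (my arithmetic, assuming rows are stored as 128-bit words):

```rust
// Each instantiated SparseBitMatrix row holds all 92,652 columns as u128
// words, even though only one bit per row is set here.
fn main() {
    let columns: u64 = 92_652;
    let bytes_per_row = (columns + 127) / 128 * 16; // 724 u128 words = 11,584 bytes
    let rows: u64 = 9_858;                          // blocks in the DHAT record
    assert_eq!(bytes_per_row * rows, 114_195_072);  // exactly the max-live figure
    println!("{} bytes for ~{} set bits", bytes_per_row * rows, rows);
}
```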
Perhaps a sparser representation could help. In fact, `SparseBitMatrix` used to be sparser, but in #52250 I made it denser because that was a significant speed win on some benchmarks. Perhaps there's a middle ground to be found.
I have a plan to fix this, by making `SparseBitMatrix` sparser in the case where many rows have only one column (or a small number of columns) set. It will involve using `HybridIdxSet` for the rows. But that will first require cleaning up and combining the various bitset structures we currently have, including `BitSlice`, `BitArray`, and `IdxSet`. #54177 is the first step.

(The old `SparseBitMatrix` used `BTreeMap` for the rows, which gave terrible performance and memory usage when the rows became fairly dense. `HybridIdxSet` should provide a good compromise: compact when sparse, space-efficient when dense, and fast in both cases.)
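Here's a minimal sketch of that hybrid idea, under made-up names and a made-up threshold (the real `HybridIdxSet`/`HybridBitSet` has a different API and tuning): keep a short inline list of indices while a row is sparse, and spill to a dense bit vector only once it grows.

```rust
/// Toy hybrid row: a short list of set indices while sparse, a dense bit
/// vector (one u64 per 64 columns) once it grows past a threshold.
enum HybridRow {
    Sparse(Vec<u32>),
    Dense(Vec<u64>),
}

const SPARSE_MAX: usize = 8; // made-up threshold

impl HybridRow {
    fn new() -> Self {
        HybridRow::Sparse(Vec::new())
    }

    /// Insert `index`, returning true if it was newly added.
    fn insert(&mut self, index: u32, num_columns: usize) -> bool {
        // Spill to the dense form once the sparse list is full.
        if let HybridRow::Sparse(elems) = self {
            if !elems.contains(&index) && elems.len() >= SPARSE_MAX {
                let mut words = vec![0u64; (num_columns + 63) / 64];
                for &e in elems.iter() {
                    words[e as usize / 64] |= 1u64 << (e % 64);
                }
                *self = HybridRow::Dense(words);
            }
        }
        match self {
            HybridRow::Sparse(elems) => {
                if elems.contains(&index) {
                    false
                } else {
                    elems.push(index); // a one-bit row costs a few bytes, not num_columns/8
                    true
                }
            }
            HybridRow::Dense(words) => {
                let (word, bit) = (index as usize / 64, index % 64);
                let newly_set = words[word] & (1u64 << bit) == 0;
                words[word] |= 1u64 << bit;
                newly_set
            }
        }
    }
}

fn main() {
    let mut row = HybridRow::new();
    assert!(row.insert(17, 92_652)); // a single live point: a few bytes, not ~11.6 KB
    for i in 0..20 {
        row.insert(i, 92_652); // ...but a dense-ish row spills to a full bit vector
    }
    assert!(matches!(row, HybridRow::Dense(_)));
}
```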
> I have a plan to fix this, by making `SparseBitMatrix` sparser in the case where many rows have only one column (or a small number of columns) set. It will involve using `HybridIdxSet` for the rows. But that will first require cleaning up and combining the various bitset structures we currently have, including `BitSlice`, `BitArray`, and `IdxSet`. #54177 is the first step.

I have now implemented this, using `HybridBitSet` within `SparseBitMatrix`. I have it working locally and the memory improvements are sizeable:
| Benchmark | avg | min | max |
| --- | --- | --- | --- |
| html5ever-check | -46.7% | -46.7% | -46.7% |
| ucd-check | -45.2% | -45.2% | -45.2% |
| clap-rs-check | -23.3% | -23.3% | -23.3% |
| inflate-check | -14.9% | -14.9% | -14.9% |
@lqd just pointed me at a crate (`unic-ucd-name-0.7.0`) with high memory usage under NLL. My `HybridBitSet` change reduces its max-rss by 96%, from 29.1GB to 1.2GB! (Compared to non-NLL, that improves the increase in max-rss from 42.5x to 1.7x.)
The NLL dashboard has updated. The top 5 now looks like this:

| Benchmark | clean-check | nll-check | % |
| --- | --- | --- | --- |
| keccak | 636,612.00 | 912,740.00 | 143.37 |
| html5ever | 221,892.00 | 266,248.00 | 119.99 |
| coercions | 181,744.00 | 210,120.00 | 115.61 |
| tuple-stress | 395,444.00 | 442,804.00 | 111.98 |
| script-servo | 2,463,224.00 | 2,617,760.00 | 106.27 |
A 20% increase in max-rss for `html5ever` with NLL isn't ideal, but it's a whole lot better than the 14GB+ we were getting when this issue was opened.