Rust: Investigate memory usage of compiling the packed_simd crate

Created on 22 Jan 2019 · 14 comments · Source: rust-lang/rust

Steps to reproduce

  1. Create a new crate with cargo.
  2. Add packed_simd = "0.3.1" to the Cargo.toml of the new crate.
  3. Build the new crate.
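For reference, step 2 amounts to a one-line dependency in the manifest (the package name "repro" below is just a placeholder):

```toml
# Cargo.toml of the freshly created crate ("repro" is a placeholder name)
[package]
name = "repro"
version = "0.1.0"
edition = "2018"

[dependencies]
packed_simd = "0.3.1"
```

Peak memory during `cargo build` can then be observed externally, e.g. with `/usr/bin/time -v` on Linux, which reports the maximum resident set size.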

Actual results

While compiling packed_simd, rustc takes more than 2 GB of RAM.

Expected results

Lower RAM usage.

Additional info

Maybe it's just the nature of packed_simd that it takes a lot of RAM to compile, and there's no bug. However, if RAM usage reached 3 GB in the future, the crate would become unbuildable on 32-bit systems. It might be worthwhile to investigate if building packed_simd _has to_ take this much RAM or if there is an opportunity to use less RAM without adversely affecting compilation speed on systems that have plenty of RAM.

A-simd I-compilemem

Most helpful comment

Can you try again with today's nightly?

All 14 comments

cc @mw @nnethercote

Looks like NLL needs a lot of memory here:
 Compiling packed_simd v0.3.1
time: 0.054; rss: 57MB   parsing
time: 0.000; rss: 58MB   attributes injection
time: 0.000; rss: 58MB   recursion limit
time: 0.000; rss: 58MB   crate injection
time: 0.000; rss: 58MB   plugin loading
time: 0.000; rss: 58MB   plugin registration
time: 0.005; rss: 58MB   pre ast expansion lint checks
time: 2.550; rss: 369MB  expand crate
time: 0.000; rss: 369MB  check unused macros
time: 2.550; rss: 369MB  expansion
time: 0.000; rss: 369MB  maybe building test harness
time: 0.012; rss: 369MB  maybe creating a macro crate
time: 0.048; rss: 370MB  creating allocators
time: 0.036; rss: 370MB  AST validation
time: 0.497; rss: 412MB  name resolution
time: 0.075; rss: 412MB  complete gated feature checking
time: 0.321; rss: 481MB  lowering ast -> hir
time: 0.081; rss: 482MB  early lint checks
time: 0.052; rss: 504MB  validate hir map
time: 0.353; rss: 504MB  indexing hir
time: 0.000; rss: 504MB  load query result cache
time: 0.000; rss: 504MB  looking for entry point
time: 0.000; rss: 504MB  dep graph tcx init
time: 0.001; rss: 504MB  looking for plugin registrar
time: 0.001; rss: 504MB  looking for derive registrar
time: 0.019; rss: 504MB  loop checking
time: 0.024; rss: 504MB  attribute checking
time: 0.000; rss: 515MB  solve_nll_region_constraints(DefId(0/1:2171 ~ packed_simd[a932]::v64[0]::f32x2[0]::{{constant}}[0]))
*snip*
time: 0.000; rss: 527MB  solve_nll_region_constraints(DefId(0/1:4611 ~ packed_simd[a932]::vSize[0]::{{impl}}[587]::from[0]::U[0]::array[0]::{{constant}}[0]))
time: 0.636; rss: 527MB  stability checking
time: 0.124; rss: 527MB  type collecting
time: 0.003; rss: 527MB  outlives testing
time: 0.019; rss: 527MB  impl wf inference
time: 0.000; rss: 1113MB solve_nll_region_constraints(DefId(0/1:224 ~ packed_simd[a932]::codegen[0]::shuffle[0]::{{impl}}[0]::{{constant}}[0]))
*snip*
time: 0.000; rss: 1246MB solve_nll_region_constraints(DefId(0/1:4867 ~ packed_simd[a932]::vPtr[0]::{{impl}}[104]::{{constant}}[0]))
time: 9.972; rss: 1408MB coherence checking
time: 0.002; rss: 1408MB variance testing
time: 0.000; rss: 1605MB solve_nll_region_constraints(DefId(0/1:366 ~ packed_simd[a932]::codegen[0]::v16[0]::{{impl}}[0]::NT[0]::{{constant}}[0]))
*snip*
time: 0.000; rss: 2013MB solve_nll_region_constraints(DefId(0/0:4027 ~ packed_simd[a932]::codegen[0]::reductions[0]::mask[0]::{{impl}}[7]::any[0]))
time: 0.000; rss: 2013MB solve_nll_region_constraints(DefId(0/0:4053 ~ packed_simd[a932]::codegen[0]::reductions[0]::mask[0]::{{impl}}[17]::any[0]))
time: 5.040; rss: 2013MB MIR borrow checking
time: 0.000; rss: 2013MB dumping chalk-like clauses
time: 0.005; rss: 2013MB MIR effect checking
time: 0.072; rss: 2018MB death checking
time: 0.021; rss: 2018MB unused lib feature checking
time: 0.176; rss: 2019MB lint checking
time: 0.000; rss: 2019MB resolving dependency formats
time: 0.890; rss: 2055MB write metadata
time: 0.010; rss: 2055MB collecting roots
time: 0.186; rss: 2056MB collecting mono items
time: 0.196; rss: 2056MB monomorphization collection
time: 0.001; rss: 2056MB codegen unit partitioning
time: 0.122; rss: 2060MB codegen to LLVM IR
time: 0.000; rss: 2060MB assert dep graph
time: 0.000; rss: 2060MB serialize dep graph
time: 1.215; rss: 2060MB codegen
time: 0.056; rss: 2063MB llvm function passes [packed_simd.smey8184-cgu.0]
time: 0.777; rss: 2071MB llvm module passes [packed_simd.smey8184-cgu.0]
time: 0.798; rss: 2079MB codegen passes [packed_simd.smey8184-cgu.0]
time: 1.703; rss: 1539MB LLVM passes
time: 0.000; rss: 1540MB serialize work products
time: 0.017; rss: 1540MB linking

Coherence checking also takes a good chunk of memory:

time: 0.000; rss: 1246MB solve_nll_region_constraints(DefId(0/1:4867 ~ packed_simd[a932]::vPtr[0]::{{impl}}[104]::{{constant}}[0]))
time: 9.972; rss: 1408MB coherence checking

although NLL is the first suspect here. I wonder why NLL uses this much memory: packed_simd is full of methods, but the great majority of them are essentially one-liners.

I reported the following spike in memory usage in #57432; it occurred after #56723 landed:

(image: packed-simd-memory usage graph)

This one could be closed as duplicate of https://github.com/rust-lang/rust/issues/57432 I guess.

EDIT: @mati865 you are right, these are duplicates. I thought that was a different issue that apparently never got filed, so forget this.


original comment:

@mati865 while they are related, they are two different issues:

  • this issue is about compiling packed_simd itself, which recently started using much more memory, resulting in failed builds for consumers (e.g. encoding_rs)

  • #57432 is about increased compile times and memory usage when compiling other crates while packed_simd is part of libcore (e.g. via core::simd)

I did a DHAT run. The "At t-gmax" measurement is the relevant one; it's short for "time of global max". It shows that the interning of constants within TypeFolder accounts for over 54% of the global peak:

AP 1.1.1.1.1/2 (2 children) {
  Total:     912,261,120 bytes (12.02%, 7,312.63/Minstr) in 6 blocks (0%, 0/Minstr), avg size 152,043,520 bytes, avg lifetime 103,155,024,513.33 instrs (82.69% of program duration)
  At t-gmax: 912,261,120 bytes (54.74%) in 6 blocks (0%), avg size 152,043,520 bytes
  At t-end:  0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
  Reads:     1,827,458,569 bytes (4.97%, 14,648.81/Minstr), 2/byte
  Writes:    844,260,160 bytes (9.59%, 6,767.54/Minstr), 0.93/byte
  Allocated at {
    #1: 0xB66BCCB: alloc (alloc.rs:72)
    #2: 0xB66BCCB: alloc (alloc.rs:148)
    #3: 0xB66BCCB: allocate_in<u8,alloc::alloc::Global> (raw_vec.rs:96)
    #4: 0xB66BCCB: with_capacity<u8> (raw_vec.rs:140)
    #5: 0xB66BCCB: new<u8> (lib.rs:66)
    #6: 0xB66BCCB: arena::DroplessArena::grow (lib.rs:346)
    #7: 0x8C1BB25: alloc_raw (lib.rs:362)
    #8: 0x8C1BB25: alloc<rustc::ty::sty::LazyConst> (lib.rs:378)
    #9: 0x8C1BB25: alloc<rustc::ty::sty::LazyConst> (lib.rs:465)
    #10: 0x8C1BB25: intern_lazy_const (context.rs:1123)
    #11: 0x8C1BB25: <rustc::traits::project::AssociatedTypeNormalizer<'a, 'b, 'gcx, 'tcx> as rustc::ty::fold::TypeFolder<'gcx, 'tcx>>::fold_const (project.rs:423)
    #12: 0x8C1B235: fold_with<rustc::traits::project::AssociatedTypeNormalizer> (structural_impls.rs:1049)
    #13: 0x8C1B235: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (structural_impls.rs:719)
    #14: 0x8C1B235: <rustc::traits::project::AssociatedTypeNormalizer<'a, 'b, 'gcx, 'tcx> as rustc::ty::fold::TypeFolder<'gcx, 'tcx>>::fold_ty (project.rs:337)
    #15: 0x890C0D0: fold_with<rustc::traits::project::AssociatedTypeNormalizer> (structural_impls.rs:769)
    #16: 0x890C0D0: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (subst.rs:135)
    #17: 0x890C0D0: fold_with<rustc::ty::subst::Kind,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #18: 0x890C0D0: {{closure}}<rustc::traits::project::AssociatedTypeNormalizer> (subst.rs:328)
    #19: 0x890C0D0: call_once<(&rustc::ty::subst::Kind),closure> (function.rs:279)
    #20: 0x890C0D0: map<&rustc::ty::subst::Kind,rustc::ty::subst::Kind,&mut closure> (option.rs:414)
    #21: 0x890C0D0: next<rustc::ty::subst::Kind,core::slice::Iter<rustc::ty::subst::Kind>,closure> (mod.rs:567)
    #22: 0x890C0D0: <smallvec::SmallVec<A> as core::iter::traits::collect::Extend<<A as smallvec::Array>::Item>>::extend (lib.rs:1349)
    #23: 0x8EF9787: from_iter<[rustc::ty::subst::Kind; 8],core::iter::adapters::Map<core::slice::Iter<rustc::ty::subst::Kind>, closure>> (lib.rs:1333)
    #24: 0x8EF9787: collect<core::iter::adapters::Map<core::slice::Iter<rustc::ty::subst::Kind>, closure>,smallvec::SmallVec<[rustc::ty::subst::Kind; 8]>> (iterator.rs:1466)
    #25: 0x8EF9787: rustc::ty::subst::<impl rustc::ty::fold::TypeFoldable<'tcx> for &'tcx rustc::ty::List<rustc::ty::subst::Kind<'tcx>>>::super_fold_with (subst.rs:328)
    #26: 0x8C1B183: fold_with<&rustc::ty::List<rustc::ty::subst::Kind>,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #27: 0x8C1B183: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (structural_impls.rs:721)
    #28: 0x8C1B183: <rustc::traits::project::AssociatedTypeNormalizer<'a, 'b, 'gcx, 'tcx> as rustc::ty::fold::TypeFolder<'gcx, 'tcx>>::fold_ty (project.rs:337)
    #29: 0x890C0D0: fold_with<rustc::traits::project::AssociatedTypeNormalizer> (structural_impls.rs:769)
    #30: 0x890C0D0: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (subst.rs:135)
    #31: 0x890C0D0: fold_with<rustc::ty::subst::Kind,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #32: 0x890C0D0: {{closure}}<rustc::traits::project::AssociatedTypeNormalizer> (subst.rs:328)
    #33: 0x890C0D0: call_once<(&rustc::ty::subst::Kind),closure> (function.rs:279)
    #34: 0x890C0D0: map<&rustc::ty::subst::Kind,rustc::ty::subst::Kind,&mut closure> (option.rs:414)
    #35: 0x890C0D0: next<rustc::ty::subst::Kind,core::slice::Iter<rustc::ty::subst::Kind>,closure> (mod.rs:567)
    #36: 0x890C0D0: <smallvec::SmallVec<A> as core::iter::traits::collect::Extend<<A as smallvec::Array>::Item>>::extend (lib.rs:1349)
    #37: 0x8EF9787: from_iter<[rustc::ty::subst::Kind; 8],core::iter::adapters::Map<core::slice::Iter<rustc::ty::subst::Kind>, closure>> (lib.rs:1333)
    #38: 0x8EF9787: collect<core::iter::adapters::Map<core::slice::Iter<rustc::ty::subst::Kind>, closure>,smallvec::SmallVec<[rustc::ty::subst::Kind; 8]>> (iterator.rs:1466)
    #39: 0x8EF9787: rustc::ty::subst::<impl rustc::ty::fold::TypeFoldable<'tcx> for &'tcx rustc::ty::List<rustc::ty::subst::Kind<'tcx>>>::super_fold_with (subst.rs:328)
    #40: 0x8BFE173: fold_with<&rustc::ty::List<rustc::ty::subst::Kind>,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #41: 0x8BFE173: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (macros.rs:344)
    #42: 0x8BFE173: fold_with<rustc::ty::sty::TraitRef,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #43: 0x8BFE173: super_fold_with<rustc::ty::sty::TraitRef,rustc::traits::project::AssociatedTypeNormalizer> (macros.rs:397)
    #44: 0x8BFE173: fold_with<core::option::Option<rustc::ty::sty::TraitRef>,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #45: 0x8BFE173: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (macros.rs:344)
    #46: 0x8BFE173: fold_with<rustc::ty::ImplHeader,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #47: 0x8BFE173: fold<rustc::ty::ImplHeader> (project.rs:315)
    #48: 0x8BFE173: normalize_with_depth<rustc::ty::ImplHeader> (project.rs:274)
    #49: 0x8BFE173: normalize<rustc::ty::ImplHeader> (project.rs:258)
    #50: 0x8BFE173: rustc::traits::coherence::with_fresh_ty_vars (coherence.rs:107)

@eddby @oli-obk @RalfJung Any thoughts on how to improve intern_lazy_const?

Cc @eddyb

Any thoughts on how to improve intern_lazy_const?

There is an obvious problem: intern_lazy_const doesn't intern the value! And the values passed are exceedingly repetitive. Here's a histogram of the top 10, which account for 97.6% of the calls:

17886042 counts:
(  1)  5253160 (29.4%, 29.4%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 2 }) })
(  2)  5192895 (29.0%, 58.4%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 4 }) })
(  3)  3928986 (22.0%, 80.4%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 8 }) })
(  4)  1600916 ( 9.0%, 89.3%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 16 }) })
(  5)   719785 ( 4.0%, 93.3%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 32 }) })
(  6)   299507 ( 1.7%, 95.0%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 1 }) })
(  7)   271847 ( 1.5%, 96.5%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 64 }) })
(  8)    61636 ( 0.3%, 96.9%): Unevaluated(DefId(0/1:4735 ~ packed_simd[3c0f]::vPtr[0]::mptrx4[0]::{{constant}}[0]), [])
(  9)    61636 ( 0.3%, 97.2%): Unevaluated(DefId(0/1:4823 ~ packed_simd[3c0f]::vPtr[0]::mptrx8[0]::{{constant}}[0]), [])
( 10)    61636 ( 0.3%, 97.6%): Unevaluated(DefId(0/1:4653 ~ packed_simd[3c0f]::vPtr[0]::mptrx2[0]::{{constant}}[0]), [])

Fixing this should drastically reduce the memory usage.

I tried doing the obvious thing by introducing GlobalCtxt::lazy_const_interner, heavily inspired by GlobalCtxt::layout_interner, but I couldn't get the lifetimes to work. I will try again tomorrow if nobody else beats me to it.

FWIW, without the in-flight fix here, a relatively small tweak to packed_simd made packed_simd uncompilable on an ARMv7 system whose /proc/meminfo says there's 3624684 kB of RAM plus some swap. (And a Chrome OS kernel; I don't know what kind of swap use policy Chrome OS applies.)

I'll test again once the fix for this issue is in nightly.

This just brought down my whole system -- 16GB of RAM used to be enough to compile two rustc in parallel (with 8 jobs each), but with the current RAM consumption that does not seem to be the case any more.

Can you try again with today's nightly?


Much better memory usage now. Thank you!

It seems it would be worthwhile to nominate this for uplift to beta, but I'm not permitted to add the tag myself.
