Rust: Metadata is too big for its own good

Created on 21 Jan 2015 · 15 Comments · Source: rust-lang/rust

I couldn't find a previous issue on this, so I'd like to open a tracking issue. We've known for a long time that the compiler's metadata format is _far_ too large, and there are surely ways to shrink its size and impact. Today when I compile librustc, I get the following numbers:

  • librustc.rlib - 64MB

    • rustc.o - 12MB

    • rust.metadata.bin - 32MB

    • rustc.0.bytecode.deflate - 21MB

This means that the metadata is nearly three times as large as the code we're generating (32MB versus 12MB). Another statistic is that 36% of the binary data of the nightly is metadata.
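As a rough check of these figures, the ratios can be recomputed from the sizes listed above. This is only a sketch: the variable names are made up for illustration, and the inputs are the rounded MB values from the list, so the results are approximate.

```python
# Component sizes in MB, copied from the list above (rounded values).
sizes_mb = {
    "rustc.o": 12,
    "rust.metadata.bin": 32,
    "rustc.0.bytecode.deflate": 21,
}
rlib_mb = 64  # total size of librustc.rlib as reported above

# How much larger the metadata is than the generated object code.
metadata_to_code = sizes_mb["rust.metadata.bin"] / sizes_mb["rustc.o"]

# Fraction of the whole rlib that is metadata.
metadata_share = sizes_mb["rust.metadata.bin"] / rlib_mb

print(f"metadata / object code: {metadata_to_code:.2f}x")
print(f"metadata share of rlib: {metadata_share:.0%}")
```

Note that the three components sum to 65MB rather than 64MB, which is consistent with each figure having been rounded independently.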

There are, however, a number of competing concerns around metadata:

  • Reading metadata needs to be _fast_. Rustc reads a lot of metadata for upstream crates, and it needs to quickly be able to read the minimum set of metadata for crates. This is currently achieved by storing metadata _uncompressed_ in rlibs to allow LLVM to mmap it directly into the address space and page it in for reading.
  • Metadata needs to be fairly free-form to allow encoding various types of data into it. Ideally it's also extensible into the future so at some point we can use newer compilers against older libraries (this is currently not possible for other reasons).
  • All libraries need metadata available in them (currently). This means that if a library produces a dylib/rlib pair (the stdlib is one of the few that does this) then the metadata is duplicated among artifacts. It also means that metadata must be suitable to place inside of a dynamic library.
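To illustrate why uncompressed storage matters for the first point, here is a minimal sketch of reading a small record out of a memory-mapped file. It uses Python's standard `mmap` module rather than anything from rustc or LLVM; the file layout, `RECORD_OFFSET`, `write_fake_archive`, and `read_record` are all hypothetical and exist only for this example.

```python
import mmap
import os
import tempfile

# Hypothetical layout: a 1 MiB file with one small record at a known offset.
FILE_SIZE = 1024 * 1024
RECORD_OFFSET = 512 * 1024
RECORD = b"crate-name:example"


def write_fake_archive(path):
    """Write a zero-filled file with RECORD stored uncompressed at RECORD_OFFSET."""
    with open(path, "wb") as f:
        f.write(b"\0" * RECORD_OFFSET)
        f.write(RECORD)
        f.write(b"\0" * (FILE_SIZE - RECORD_OFFSET - len(RECORD)))


def read_record(path, offset, length):
    """Map the file read-only and copy out only the requested byte range."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages in only what is touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return bytes(m[offset:offset + length])


fd, path = tempfile.mkstemp()
os.close(fd)
try:
    write_fake_archive(path)
    record = read_record(path, RECORD_OFFSET, len(RECORD))
    print(record)
finally:
    os.remove(path)
```

If the file were compressed as a whole, the reader would generally have to decompress everything up to the record before it could slice it out, which is the trade-off the first bullet describes.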

There are a few open issues on this already, but none of them are necessarily a silver bullet. Here's a smattering of wishlist ideas or various strategies.

  • Changing metadata formats may gain us a win. Currently we're using a "variant" of EBML. #9303
  • Move metadata to a separate crate for faster iteration, allowing external tools to inspect it, and generally improving the overall quality of the code. #2213
  • Being able to actually inspect metadata would probably help identify what can be eliminated. #2326

More will likely be added to this over time as it's a metabug.

A-metadata metabug


All 15 comments

For your information, I wrote a custom metadata decoder in Python for debugging #15309 (the table is a bit out of date, but still works), and out of curiosity I've also made some basic efforts to reduce the metadata overhead. The rewrite code does the job and prints the following output for the 2015-01-20 nightly's lib/libstd-4e7c5e5c.so:

# original compressed size, the total size of given binary (*.so).
2117863 5081718
# uncompressed size of various encoding strategies.
# - orig: the original (unaltered)
# - relax: optimized size fields. the original metadata uses lots of
#   4-byte-long sizes even when fewer bytes are sufficient;
#   reclaiming them will require some work.
# - no-label: relax + no `Label` node. used for debugging purpose but
#   never disabled afaik.
# - size-elide: relax + one-byte tag. all tags are <0x100, so ignoring EBML
#   (requires 2-byte encoding for tags >=0x80) gives some gains.
#   also do not add sizes for known fixed-size tags (e.g. `U64`).
# - size-elide-2-or-4: one-byte tag + another relaxation strategy.
#   uses different size encoding algorithm: 2 bytes (big endian)
#   for sizes <0x8000, 4 bytes with MSB set otherwise.
#   trade-off between size and performance.
# - size-elide-4: one-byte tag + fixed 4-byte-long size.
orig               16084126
relax              13577526
no-label            8654851
size-elide         13019868
size-elide-2-or-4  14418657
size-elide-4       17563943
# recompressed (zlib -9) size of above.
# note that the original compressed size is *not* optimal.
orig               2087004
relax              1991335
no-label           1747192
size-elide         1966400
size-elide-2-or-4  2014455
size-elide-4       2123731
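The size encodings named in the comments above can be sketched as follows. This is an illustration under assumptions, not the decoder mentioned in this comment and not rustc's actual code: `shortest_width`, `encode_size_vint`, `encode_size_2_or_4`, and `decode_size_2_or_4` are hypothetical names, and the variable-length form follows the general EBML convention of a marker bit in the first byte, which may differ in detail from rustc's EBML variant.

```python
import struct


def shortest_width(n):
    """Smallest EBML-style width (1-4 bytes) whose 7*width payload bits hold n."""
    for width in (1, 2, 3, 4):
        if 0 <= n < (1 << (7 * width)):
            return width
    raise ValueError("size does not fit in a 4-byte EBML-style size")


def encode_size_vint(n, width=None):
    """EBML-style size: a marker bit in the first byte encodes the total width.

    width=4 mimics always writing 4-byte sizes; width=None picks the
    shortest form, which is the idea behind the "relax" strategy.
    """
    if width is None:
        width = shortest_width(n)
    if not 1 <= width <= 4 or not 0 <= n < (1 << (7 * width)):
        raise ValueError("size does not fit in the requested width")
    return (n | (1 << (7 * width))).to_bytes(width, "big")


def encode_size_2_or_4(n):
    """2 bytes big endian for sizes < 0x8000, else 4 bytes with the MSB set."""
    if 0 <= n < 0x8000:
        return struct.pack(">H", n)
    if 0x8000 <= n < 0x80000000:
        return struct.pack(">I", n | 0x80000000)
    raise ValueError("size out of range for the 2-or-4 encoding")


def decode_size_2_or_4(buf, pos=0):
    """Return (size, next_pos); the top bit of the first byte selects the width."""
    if buf[pos] & 0x80:
        (raw,) = struct.unpack_from(">I", buf, pos)
        return raw & 0x7FFFFFFF, pos + 4
    (raw,) = struct.unpack_from(">H", buf, pos)
    return raw, pos + 2


for n in (5, 300, 40000):
    print(
        n,
        encode_size_vint(n, width=4).hex(),  # fixed 4-byte size
        encode_size_vint(n).hex(),           # shortest ("relax") size
        encode_size_2_or_4(n).hex(),         # 2-or-4 size
    )
```

The 2-or-4 form needs only one bit test to decode, at the cost of a second byte for very small sizes, which matches the "trade-off between size and performance" noted above.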

Wow, those are some nice wins! I had no idea that existed! If we could implement some of those optimizations today that would be awesome.

@alexcrichton Does a breaking modification to metadata need a new snapshot? I'm worried that such modifications cannot easily be done incrementally.

Thankfully you shouldn't need a snapshot. The stage N compiler conveniently only ever reads metadata generated by itself, so there are no bootstrapping issues.

> storing metadata uncompressed in rlibs to allow LLVM to mmap it directly into the address space and page it in for reading.

Is this true? I was under the impression that:

  1. We zip compress LLVM bitcode before storing it
  2. We do not compress other metadata (ASTs mostly) which isn't used by LLVM.

@michaelwoerister Rustc uses LLVM's memory map abstraction to mmap the executable. LLVM itself does not use the metadata.

@lifthrasiir Oh, that refers to this: https://github.com/rust-lang/rust/blob/master/src/librustc/metadata/loader.rs#L270. All clear now :)

I'm currently working on two temporary but public branches:

  • One branch that does not change the metadata format itself (compact-metadata)
  • One branch that actively changes the metadata format (metadata-reform), which is intended to be rebased after the former

Any suggestions or patches would be appreciated.

Given https://github.com/rust-lang/rust/pull/22971 was merged, is this fixed? It's hard to tell from

> Fixes #2743, #9303 (partially) and #21482.

which implies the first and last were total, and the middle one was partial?

@steveklabnik #2743 is fully fixed. I think I said #9303 was only partially fixed because the PR does not really address the naming issue ("Rename it from ebml to atom_trees, change all internal naming"), but in retrospect you can safely close that. I guess this metabug needs to stay open, since metadata reduction is ongoing work (my PR was a collection of low-hanging fruit) and we probably need a central place to discuss it.

Some updates pertaining to this issue:

#35764 has significantly reduced the size of metadata, since #[inline]d functions are no longer stored as ASTs. The metadata of libcore was more than halved!

rustup update saw quite an improvement:

info: downloading component 'rustc'
 38.2 MiB /  38.2 MiB (100 %)   1.8 MiB/s ETA:   0 s                
info: downloading component 'rust-std'
 46.1 MiB /  46.1 MiB (100 %)   1.8 MiB/s ETA:   0 s                

Before, it looked like this:

info: downloading component 'rustc'
 49.0 MiB /  49.0 MiB (100 %)   1.8 MiB/s ETA:   0 s                
info: downloading component 'rust-std'
 61.9 MiB /  61.9 MiB (100 %)   1.8 MiB/s ETA:   0 s                

So, at what point is metadata small enough that this bug can be considered fixed?

1 byte, tops

This is a super old and much less relevant issue now, so closing.
