Rust: Switch the default global allocator to System, remove alloc_jemalloc, use jemallocator in rustc

Created on 4 Oct 2016 · 35 comments · Source: rust-lang/rust

Updated description

A long time coming, this issue is that we should implement these changes simultaneously:

  • Remove the alloc_jemalloc crate
  • Default allocations for all crate types to std::alloc::System. While this is currently the default for cdylib/staticlib, it's not the default for dylib/executable crate types
  • Add the jemallocator crate to rustc, but only rustc
  • Long-term, deprecate and remove the alloc_system crate

For the longest time we have defaulted to jemalloc as the allocator for Rust programs. This has been in place since pre-1.0, and the vision was that we'd give programs a faster-by-default allocator than what's on the system. Over time, this has not fared well:

  • Jemalloc has been disabled on a wide variety of architectures for various reasons; the system allocator seems more reliable.
  • Jemalloc, at least as we ship it, is incompatible with valgrind.
  • Jemalloc bloats the size of executables by default.
  • Not all Rust programs are bottlenecked on allocation, and those that are can use #[global_allocator] to opt in to a jemalloc-based global allocator (through the jemallocator crate or any other allocator crate).

The compiler, however, still receives a good deal of benefit from using jemalloc (measured in https://github.com/rust-lang/rust/pull/55202#issuecomment-431514148). If that link is broken: switching away is basically a blanket across-the-board 8-10% regression in compile time for many benchmarks (apparently the max RSS also regressed on many benchmarks!). For this reason, we don't want to remove jemalloc from rustc itself.

The rest of this issue is now going to be technical details about how we can probably get rid of alloc_jemalloc while preserving jemalloc in rustc itself. The tier 1 platforms that use alloc_jemalloc which this issue will be focused on are:

  • x86_64-unknown-linux-gnu
  • i686-unknown-linux-gnu
  • x86_64-apple-darwin
  • i686-apple-darwin

Jemalloc is notably disabled on all Windows platforms (I believe due to our inability to ever get it building there). Furthermore, jemalloc is enabled on some other Linux platforms but I think ended up basically disabled on all but the above. This, I believe, narrows the targets we need to design for: we basically need to keep the above working.

Note that we also have two modes of using jemalloc. In one mode we can use jemalloc-specific API functions, as alloc_jemalloc does today. Alternatively, we can use its standard API and its support for hooking into the standard allocator on these two platforms. The tradeoff between these two strategies hasn't been measured (AFAIK) at this time. Note that in any case we want to route LLVM's allocations to jemalloc, so we want to be sure to hook into the default allocator somehow.

I believe that this default allocator hooking on Linux works by basically relying on its own symbol malloc overriding that in libc, routing all memory allocation to jemalloc. I'm personally quite fuzzy on the details for OSX, but I think it has something to do with "zone allocators" and not much to do with symbol names. I think this means we can build jemalloc without symbol prefixes on Linux, and with symbol prefixes on OSX, and we should be able to, using that build, override the default allocator in both situations.

I would propose, first, a "hopefully easy" route to solve this:

  • Let's link the compiler to the "system allocator". Then, on the four platforms above, let's link to jemalloc_sys, pulling in all of jemalloc itself. With the right build configuration, this should mean that we're now using jemalloc everywhere in the compiler (just as we're rerouting LLVM, we're rerouting the compiler).
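The linking trick above can be sketched in Cargo terms. This is a hypothetical dependency stanza for the rustc binary crate, not the actual build configuration; the version number is an assumption, and the feature name is the one the jemallocator project published for building jemalloc with unprefixed symbols only where that's supported (Linux), leaving prefixed symbols plus zone hooking on macOS:

```toml
# Sketch: pull jemalloc into the final binary purely by linking
# jemalloc-sys on the platforms where we want it to take over malloc.
[target.'cfg(any(target_os = "linux", target_os = "macos"))'.dependencies]
jemalloc-sys = { version = "0.3", optional = true, features = [
    "unprefixed_malloc_on_supported_platforms",
] }
```

Because the override works at the symbol/zone level, LLVM's C++ allocations get routed through jemalloc too, without any Rust-side #[global_allocator] change.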

I'm testing out the performance of this in https://github.com/rust-lang/rust/pull/55217 and will report back with results. Results are in: this is almost universally positive! @alexcrichton will make a PR.

Failing this, @alexcrichton has ideas for a more invasive solution using jemalloc-specific API calls in rustc itself, but hopefully that won't be necessary...

Original Description

@alexcrichton and I have increasingly come to think that Rust should not maintain jemalloc bindings in tree and link it by default. The primary reasons being:

  • Being opinionated about the default allocator is against Rust's general philosophy of getting as close to the underlying system as possible. We've removed almost all runtime baggage from Rust except jemalloc.
  • Due to breakage we've had to disable jemalloc support on some windows configurations, changing our default allocation characteristics there, and offering different implicit "service levels" on different tier 1 platforms.
  • Keeping jemalloc working imposes increased maintenance burden. We support a lot of platforms and jemalloc upgrades sometimes do not work across all of them.
  • The build system is complicated by supporting jemalloc on some platforms but not all.

For the sake of consistency and maintenance we'd prefer to just always use the system allocator, and make jemalloc an easy option to enable via the global allocator and a jemalloc crate on crates.io.

A-allocators C-enhancement T-libs relnotes

Most helpful comment

jemalloc also makes Rust look bad to newcomers, because it makes "Hello World" executables much larger (I know it's not a fair way to judge a language, but people do, and I can't stop myself from caring about size of redistributable executables, too)

All 35 comments

Depends on having stable global allocators.

Since this will result in immediate performance regressions on platforms using jemalloc today, we'll need to be sensitive about how the transition is done and make sure it's clear how to regain that allocator performance. It might be a good idea to simultaneously publish other allocator crates to demonstrate the value of choice and for benchmark comparisons.

An alternative I've heard @sfackler advocate from time to time is:

  • We make no guarantees about the default allocator, allowing us to choose a fast one like jemalloc
  • We expose _in stable Rust_ the ability to request the system allocator as the global allocator

That would allow us to _optionally_ include jemalloc, but if you want the system allocator for heap profiling, valgrind, or other use cases, you can choose it.

I would specifically like to jettison jemalloc entirely and use the system allocator. It breaks way too often, it dropped valgrind support, it adds a couple hundred kb to binaries, etc.

Some historical speed bumps we've had with jemalloc:

I'll try to keep this updated as we run into more issues.

jemalloc also makes Rust look bad to newcomers, because it makes "Hello World" executables much larger (I know it's not a fair way to judge a language, but people do, and I can't stop myself from caring about size of redistributable executables, too)

Another observation - jemalloc seems to add a significant amount of overhead to thread creation, on both Linux and macOS. This hasn't been a major issue for me as we plan to use the system allocator on Fuchsia, but probably something worth looking into.

On the glibc side, we would be interested in workloads where jemalloc shows significant benefits. (@djdelorie is working on improving glibc malloc performance.)

PR implementing this: #38820

I'd like to see some light benchmarks to get an idea of the magnitude of the default performance regression we can expect.

I'm not sure if this is active, but wanted to voice a recent pain-point:

I am using stable Rust. I wrote an executable. I wrote a dylib. I called one from the other. It explodes because they have different default allocators and I cannot change either on stable.

Independent of which allocator is fasterest, or hardest to maintain, etc., that there is a difference between the default allocators makes the shared library FFI story on stable Rust pretty bad.

Edit: Also, this issue was opened on my birthday so you should just make it happen. <3

I believe we've also encountered a deadlock on OSX with recent versions of jemalloc - https://github.com/jemalloc/jemalloc/issues/895

I'm going to close this in favor of https://github.com/rust-lang/rust/issues/27389. It's highly likely that all programs will start to link to jemalloc by default once we stabilize that feature, but there's not really much we can do until that issue lands.

Reopening because https://github.com/rust-lang/rust/issues/27389 is about to be closed with the stabilization of the #[global_allocator] attribute (https://github.com/rust-lang/rust/pull/51241) without changing the default.

This may be blocked on https://github.com/rust-lang/rust/issues/51038, assuming we want rustc to keep using jemalloc.

assuming we want rustc to keep using jemalloc.

It's worth testing again. On Fedora 28 x86_64, a clean build of one of my projects takes 142.20s to compile using Fedora's rustc (with system malloc -- glibc 2.27), and 142.54s using upstream rustc (with jemalloc). That's effectively a tie. YMMV with other versions of libc, of course.

Another data point: one of my apps is much faster with the GLIBC 2.27 allocator, at the expense of a 5x larger working set. So it's really worth testing.

Is this done?

Edit: seems no. I've been confused about the state of the global allocator in beta/nightly. Docs seem to indicate this is done.

Sorry if docs are unclear. In 1.28+ you can now change the global allocator from its default, but the default is still jemalloc (on some/most platforms).

There is now no hard blocker for changing the default, but only doing that might (maybe?) regress rustc performance so we may want to do https://github.com/rust-lang/rust/issues/51038 at the same time. How to make https://github.com/rust-lang/rust/issues/51038 work is unfortunately non-obvious though, see discussion there.

I've now done a few triage items for this:

If it is helpful, or at least informative, we (timely eval) have several cases where you want system alloc on Linux for good perf, unless you want to hack around jemalloc's heuristics (e.g. pre-allocate 256GB of virtual memory to prevent jemalloc from thrashing by madvising back to the kernel). This shows up on multicore computations mostly, with 16-32 workers, where there is more allocation churn than working set. If you add more threads to these computations, they go slower because of jemalloc, by factors of up to 2x-3x. I believe if we had to pick one allocator and wanted to ensure scaling out to many cores, system alloc would be the one.

Now, that is not at all evidence that other people don't need jemalloc, which I'm sure they do (we get worse numbers with single and few cores using system alloc; it just scales out better). But when you say "we" still need jemalloc, you probably have some use cases in mind, and to the extent that they are single-threaded low allocation churn you may be seeing different results.

(( edit: only point being, the "we" you mentioned doesn't include me; it may include everyone else who uses Rust though. ))

@frankmcsherry The GLIBC version might also be at play here. Its allocator got some improvements recently that made it competitive with jemalloc.

@frankmcsherry oh of course! There's definitely use cases that jemalloc is much better for performance, which is why we never even considered implementing this issue until you could opt-in on stable Rust to jemalloc. Now that we're there though this is just a question of defaults :)

@alexcrichton What is the glibc version used in the perf benchmarks you posted (where rustc with jemalloc turns out to be much faster than rustc with glibc malloc)? The glibc malloc improvements @lnicola mentions were introduced in glibc v2.26.

@stjepang ah yes I forgot to clarify, but that's using glibc 2.23, it may indeed very well be that glibc 2.26 is faster! For now I'm hoping that we can largely preserve parity with our current system today before removing jemalloc from rustc, and consider that as a separate change.

If others are enterprising enough to run the benchmark suite locally on glibc 2.26 vs jemalloc, that'd be awesome! If we wanted to just decide to jettison jemalloc outright, that would also be helpful :)

Some excellent results came in from https://github.com/rust-lang/rust/pull/55217, so I'm going to send an official PR for that.

I've opened https://github.com/rust-lang/rust/pull/55238 to close out this issue

@alexcrichton what about the memory usage? It regressed, apparently.

Long-term, deprecate and remove the alloc_system crate

Do we not want to provide System for no_std applications? Maybe that could be done on crates.io, with a setup similar to the libc crate?

@SimonSapin yeah, I sort of see std as core + libc taken to the limit, and System sort of implies/requires libc, so I don't think there's much worth or benefit in providing System in a separate crate that's not std. But it could certainly be done on crates.io by depending on libc!

@alexcrichton Is there a tracking issue to track when defaulting to system allocator lands on stable? rustc 1.31.0 (abe02cefd 2018-12-04) on macOS still links in jemalloc. Thanks!

@johnthagen It just has to ride the normal release train. The PR that closed this issue is currently on the beta branch, on track for 1.32.

@johnthagen We generally close tracking issues when something is done/implemented in the master branch. We don't track features individually after that, since the release schedule is predictable.

In this case, you can see that this issue was closed by #55238 on 2018-11-03, so it likely reached the Nightly channel the next day. Every 6 weeks, Beta becomes Stable and Nightly is forked as the new Beta. So it takes 6 to 12 weeks for a PR merge to reach the Stable channel. https://github.com/rust-lang/rust/blob/master/RELEASES.md shows the dates of past releases and https://forge.rust-lang.org/ the expected date of the next release.

Should this be tagged with relnotes?

Good point! Done.

I am quite saddened by this. PL-scale memory throughput regressions like this will use a lot more energy, cost most users (who are unlikely to learn about GlobalAlloc) more on their server bills, and blunt the surprising bliss experienced by so many newcomers whose uncertain first steps blow their previous implementations out of the water.

Binary size is a vanity metric for computing at scale, and for those who require it to be smaller, they have the flexibility to change.

This has real ethical implications, as our DCs are set to consume 20% of the world's electricity by 2025, and the decisions made by those shaping the foundational layers have massive implications.

Overriding GlobalAlloc is not a realistic option for authors of allocation intensive libraries, as it prevents users from using tools like the llvm sanitizers etc...

As engineers building foundational infrastructure, we have an ethical obligation to the planet to minimize the costs we impose on it. This decision was made in direct contradiction of this responsibility to our shared home. Amazing efficiency by default on the platform that is the main driver of world-wide datacenter power consumption is a precious metric for a language with as bright a future for massive scale adoption as rust.

@spacejam I don't think it's quite fair to characterize this as that grand of a problem. It's not as though jemalloc exclusively makes things faster, and thus not as though this is universally a regression. Quite to the contrary: there are some workloads that are made much better by this. This change also means that, as system allocators improve, so will Rust programs. This would not be the case for a compiled-in memory allocator. If you want to go down the life-cycle analysis path, I think it could also be argued that we are saving countless person-hours by allowing the use of standardized tools by people who previously had to waste time trying to figure out why valgrind or the like didn't just work. Along those same lines, one could argue that every change to the standard library has wide-reaching implications on global energy use, but a) that impact is _minute_; b) that impact is basically _impossible to predict_; and c) it is infeasible to perform that kind of analysis on any kind of representative scale for every (if any) change.

