Rust: TokenStream manipulations are 1000x too slow

Created on 4 Oct 2019 · 15 comments · Source: rust-lang/rust

Context: https://github.com/illicitonion/num_enum/pull/14
Switching a proc macro from being token-based to operating on strings with just a final conversion from string to TokenStream can be a 100x improvement in compile time. If we care that people continue to write macros using tokens, the performance needs to be better.

I minimized the slow part of the num_enum macro to this benchmark:
https://github.com/alexcrichton/proc-macro2/tree/12bac84dd8d090d2987a57b747c7ae7bbeb8a3d0/benches/bench-libproc-macro
On my machine the string implementation takes 8ms and the token implementation takes 25721ms.

I know that there is a proc macro server that these calls end up talking to, but I wouldn't expect this huge of a factor from that. If the server calls are the only thing making this slow, is there maybe a way we could buffer operations in memory to defer and batch the server work?

I will file issues in proc-macro2 and quote as well to see if anything can be improved on their end.

FYI @eddyb @petrochenkov @alexcrichton

A-macros C-enhancement I-compiletime T-compiler

All 15 comments

From a brief analysis, this appears (AFAICT) to be because the code allocates a fresh vector for the entire tokenstream each time we append one token. I believe that is because our Extend<TokenTree> for TokenStream implementation maps the token tree to a token stream with two elements: tree, and NonJoint, presumably called from this From<TokenTree> impl.
Ultimately, the root cause boils down to this piece of code: https://github.com/rust-lang/rust/blob/032a53a06ce293571e51bbe621a5c480e8a28e95/src/libproc_macro/lib.rs#L170-L184.

That has a FIXME from eddyb that this is super inefficient, and rightfully so. We are essentially always creating a TokenStreamBuilder, pushing on ~two streams, one with lots of elements and the other with two (token from original extend, as well as non joint). Having done that, we call TokenStreamBuilder::build, which calls TokenStream::from_streams with two streams, which allocates the vector for the contents of both streams, and we have a new tokenstream.

I believe this means that in order to append N tokens we will create N vectors of 1, 2, 3, 4, .. N elements (or so, possibly modulo a constant factor of some kind). If N is large, this is really slow.
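The effect can be modelled with a toy sketch (the names and types here are illustrative, not the real rustc ones): every append rebuilds the backing vector, so N appends copy 0 + 1 + ... + (N-1) = N*(N-1)/2 elements in total.

```rust
// Toy model of the current behaviour: every append allocates a fresh
// vector holding all previous elements plus the new one, just as
// from_streams re-allocates for the contents of both input streams.
fn append_rebuilding(stream: Vec<u32>, token: u32) -> Vec<u32> {
    let mut new = Vec::with_capacity(stream.len() + 1);
    new.extend(stream); // copies the entire existing stream
    new.push(token);
    new
}

fn main() {
    let mut copies = 0u64;
    let mut stream = Vec::new();
    for t in 0..1000u32 {
        copies += stream.len() as u64; // elements copied by this append
        stream = append_rebuilding(stream, t);
    }
    // 1000 appends copy 999 * 1000 / 2 = 499500 elements in total.
    println!("total elements copied: {}", copies);
}
```

For N = 1000 tokens that is already half a million element copies; a macro emitting tens of thousands of tokens pays a quadratic price.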

cc @pnkfelix as well, as they did some investigating into a similar issue in https://github.com/rust-lang/rust/issues/57735#issuecomment-458903904.


I am not sure what we can do about this -- ideally, we'd not be re-creating new tokenstreams on every appended token, but TokenStream seems to have a structure that isn't really amenable to us appending tokens into it (despite the name). Maybe we can have proc macro keep a TokenStreamBuilder instead of a TokenStream around, which would let us append a bunch of tiny tokenstreams together before collating them all, but that also seems non-ideal...

cc @nnethercote as well

I remember the FIXME but I also remember this not being that big of a problem for some reason. I guess if the first impl wasn't ending up using the second impl, it would be faster?

TokenStreamBuilder could totally take individual TokenTrees as well, AFAIK, and you could at least keep one builder for the entire extend (but if it's called with iter::once(tt) that would require even more optimization work).

Maybe the main problem here is that there is no API for creating a builder out of a TokenStream in CoW fashion?

Sadly the original vision of using ropes falls apart when you can't even slice a TokenStream.

Maybe we can have proc macro keep a TokenStreamBuilder instead of a TokenStream around, which would let us append a bunch of tiny tokenstreams together before collating them all, but that also seems non-ideal.

One nice way of doing this (if all else fails) is representing TokenStream on the server side (for proc macros not rustc in general) as TokenStream | TokenStreamBuilder and finishing the builder when any of the TokenStream methods are used which we can't easily implement on the builder itself.
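For illustration, here is a minimal sketch of that lazy representation, with toy types standing in for TokenStream and TokenStreamBuilder: appended fragments are buffered cheaply and only flattened into one allocation when the contents are actually inspected.

```rust
// Sketch of the "TokenStream | TokenStreamBuilder" idea with toy types.
// Names are illustrative, not the real proc-macro-server types.
enum Lazy<T> {
    Built(Vec<T>),         // finished stream
    Building(Vec<Vec<T>>), // pending fragments, concatenated on demand
}

impl<T> Lazy<T> {
    // Appending while building is O(1) per fragment: no flattening yet.
    fn push_fragment(&mut self, frag: Vec<T>) {
        match self {
            Lazy::Building(parts) => parts.push(frag),
            Lazy::Built(v) => {
                let done = std::mem::take(v);
                *self = Lazy::Building(vec![done, frag]);
            }
        }
    }

    // "Finish the builder": one allocation for all buffered fragments,
    // performed only when a TokenStream-like method needs the contents.
    fn force(&mut self) -> &[T] {
        if let Lazy::Building(parts) = self {
            let total: usize = parts.iter().map(Vec::len).sum();
            let mut flat = Vec::with_capacity(total);
            for p in parts.drain(..) {
                flat.extend(p);
            }
            *self = Lazy::Built(flat);
        }
        match self {
            Lazy::Built(v) => v,
            Lazy::Building(_) => unreachable!(),
        }
    }
}

fn main() {
    let mut s = Lazy::Built(vec![1u32]);
    s.push_fragment(vec![2, 3]);
    s.push_fragment(vec![4]);
    assert_eq!(s.force(), &[1, 2, 3, 4]);
}
```

With this shape, N appends cost O(N) total instead of O(N²), at the price of one flattening pass whenever the stream is read.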

@matklad: this might affect your approach in #64782.

our Extend<TokenTree> for TokenStream implementation maps the token tree to a token stream with two elements: tree, and NonJoint

I think you misread that code. It's a token stream with a single element; that element is a 2-tuple.

we call TokenStreamBuilder::build, which calls TokenStream::from_streams with two streams, which allocates the vector for the contents of both streams, and we have a new tokenstream.

I have confirmed (via profiling and println! insertion) that this part is true. In the relevant part of execution, the second stream passed to TokenStream::from_streams always has a single element, while the first stream is one larger than the previous call.

I tried changing TokenStream from this:

pub struct TokenStream(Option<Lrc<Vec<TreeAndJoint>>>);

to this

pub struct TokenStream(Vec<TreeAndJoint>);

This drastically sped up the microbenchmark under discussion, because new elements can be trivially appended to an existing TokenStream, rather than having to duplicate before appending.

However, it was a big slowdown for the rustc-perf benchmarks. Most of them slowed down 2-20%, but script-servo and style-servo were both almost 5x worse. This is because the second representation is much more expensive to clone, something which happens frequently and from many places.
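The clone-cost asymmetry between the two representations can be seen in a small sketch, using std's Rc as a stand-in for rustc's Lrc: cloning the shared form only bumps a reference count, while cloning a plain Vec copies every element.

```rust
use std::rc::Rc;

fn main() {
    let big: Vec<u64> = (0..1_000_000).collect();

    // Shared representation (roughly Lrc<Vec<TreeAndJoint>>):
    // cloning is O(1), it only increments a reference count.
    let shared = Rc::new(big.clone());
    let shared_clone = Rc::clone(&shared);
    assert!(Rc::ptr_eq(&shared, &shared_clone)); // same allocation

    // Plain Vec representation: cloning copies every element, O(n).
    let owned_clone = big.clone();
    assert_eq!(owned_clone.len(), 1_000_000);
}
```

This is why the plain-Vec version was fast for append-heavy proc macros but much slower for clone-heavy workloads like script-servo.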

@nnethercote I'm not entirely sure where this bottoms out so this may be a distracting comment, but would Rc::make_mut and such make those two representations "effectively equivalent" for the purposes of the benchmark here? That should still give us cheap clones and pushing a bunch of stuff onto one clone should still be quite cheap since it doesn't have to reallocate.

@alexcrichton: Indeed, I had the same thought late last night! I have used make_mut successfully in TokenStream::from_streams to fix half of the quadratic behaviour. I am now working at using it in TokenStreamBuilder::push.
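For illustration, a minimal sketch of the make_mut approach, again with Rc standing in for Lrc and a bare Vec<u32> standing in for the real token-stream contents: appends mutate in place when the stream is uniquely owned, and the vector is cloned only when it is actually shared.

```rust
use std::rc::Rc;

// Copy-on-write append: Rc::make_mut mutates in place when the Rc is
// uniquely owned, and clones the vector only when it is shared.
fn push_token(stream: &mut Rc<Vec<u32>>, token: u32) {
    Rc::make_mut(stream).push(token);
}

fn main() {
    let mut stream = Rc::new(vec![1, 2]);

    // Unique owner: a long run of pushes reuses the existing
    // allocation, amortized O(1) each instead of O(n).
    push_token(&mut stream, 3);
    assert_eq!(*stream, vec![1, 2, 3]);

    // Shared: make_mut clones once, leaving the other handle intact.
    let snapshot = Rc::clone(&stream);
    push_token(&mut stream, 4);
    assert_eq!(*snapshot, vec![1, 2, 3]);
    assert_eq!(*stream, vec![1, 2, 3, 4]);
}
```

This keeps clones cheap (the objection to the plain-Vec representation) while eliminating the per-append reallocation that made the original code quadratic.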

#65198 fixes this so that the TokenStream code is roughly 10x slower than the string code, and the slowdown is consistent for different values of N (i.e. no more quadratic behaviour). There is room for additional improvement, but this is a good start.

It would make sense to add a benchmark to rustc-perf for this kind of code. The given microbenchmark is a possibility, but I'd prefer to use real-world code that demonstrated the problem, if possible.

I'd personally prefer the microbenchmark as it's less likely to stop building in the future in some sense :)

But that PR looks excellent. Do we know if the remaining 10x cost is for some particular reason, or are strings just that much faster than the token code? (Lrc, etc.)

The 10x is because there's a lot of faffing around: a couple of once iterators, a chain iterator, a conversion from a TokenTree to a TokenStream, creating a TokenStreamBuilder, pushing onto it, Lrc, etc.

Oh I thought the implementation must be using Lrc<[TreeAndJoint]>, I'm really glad to be wrong and that this is getting fixed :tada:

There's likely no benefit in representing a proc_macro::TokenStream as a builder server-side but we could probably add a way to extend a proc_macro::TokenStream with a proc_macro::TokenTree.

@dtolnay: is the improved performance good enough for your purposes?

I confirmed that the performance is massively improved in nightly-2019-10-10. Thanks!

FYI @illicitonion @anlumo
