First implemented in https://github.com/rust-lang/rust/pull/43230, the compiler can now, when necessary, losslessly convert a few AST nodes back into tokens from the original token stream. Currently, however, this is somewhat buggy:
The "real bug" here is that we haven't actually implemented converting an AST to a token tree. The initial implementation in https://github.com/rust-lang/rust/pull/43230 was effectively just a heuristic, and not even a great one! As a result, we still basically need the ability to losslessly tokenize an AST node back to its original set of tokens.
Some bugs that arise from this are:
- `#[cfg]` modifies the AST but doesn't invalidate the cache -- https://github.com/rust-lang/rust/issues/48644
- `macro_rules!` and procedural macros don't play well together -- https://github.com/rust-lang/rust/issues/49846

There's an associated FIXME in the code right now, and to fix this we'll need to implement tokenization of an AST node. The current thinking on how to implement this is to save all `TokenStream` instances adjacent to an AST node, and use those instead of converting back into a token stream.
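As a rough illustration of that idea, caching the parsed tokens alongside the node and falling back to pretty-printing only for synthetic nodes might look like the sketch below. All type and field names here (`AstItem`, `cached_tokens`) are invented for illustration, not rustc's actual internals:

```rust
// Hypothetical sketch: each parsed AST node keeps the exact tokens it was
// parsed from, so the compiler can hand a proc macro the original token
// stream instead of re-tokenizing pretty-printed output.

#[derive(Debug)]
struct AstItem {
    name: String,
    // `None` for synthetic nodes created during expansion; those would
    // still fall back to pretty-printing under this scheme.
    cached_tokens: Option<String>,
}

impl AstItem {
    fn to_tokens(&self) -> String {
        match &self.cached_tokens {
            // Lossless path: return the tokens the parser saw.
            Some(tokens) => tokens.clone(),
            // Lossy fallback: pretty-print the node (spans are lost here).
            None => format!("struct {} {{ }}", self.name),
        }
    }
}

fn main() {
    let parsed = AstItem {
        name: "Bad".to_string(),
        cached_tokens: Some("struct Bad { inner : usize }".to_string()),
    };
    let synthetic = AstItem { name: "Generated".to_string(), cached_tokens: None };
    assert_eq!(parsed.to_tokens(), "struct Bad { inner : usize }");
    assert_eq!(synthetic.to_tokens(), "struct Generated { }");
    println!("ok");
}
```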
cc @dtolnay, @nrc, @jseyfried
As a note: I'm getting _zero_ span information on a proc-macro derive, making `Span`s useless. Every `Span` in the `TokenStream` is `0`. Here's the (unsurprising) code:
```rust
extern crate proc_macro;

use proc_macro::TokenStream;

#[proc_macro_derive(Testing)]
pub fn derive_uri_display(input: TokenStream) -> TokenStream {
    println!("Tokens: {:?}", input);
    TokenStream::new()
}
```
@SergioBenitez what does the code look like that you've attached the macro to? You can use `RUST_LOG=syntax` and you should get a warning about a "cached token stream not being used", and similarly if you run `rustc --pretty normal` you should get a different set of tokens than you see in the file itself (and that's the bug to fix)
@alexcrichton Just a structure with one field:
```rust
#[derive(Testing)]
pub struct Bad {
    inner: usize
}
```
Here's the log message:
```
INFO 2018-07-31T02:52:43Z: syntax::parse::token: cached tokens found, but they're not "probably equal", going with stringified version
```
And `rustc --pretty normal` just spits back the code with a `,` added to the end of the field:
```rust
#[macro_use]
extern crate proc_macro_bug;

#[derive(Testing)]
pub struct Bad {
    inner: usize,
}

pub fn main() { }
```
Indeed, if I manually add the comma to the end of the field in the source file, things work as expected. This is... very surprising.
Yeah, there are multiple weird quirks that can cause span information to be dropped. Another example is `extern fn a() {}` vs `extern "C" fn a() {}`. It would be nice if span loss could be detected and an appropriate message emitted to the user (as a temporary notice until this issue is fixed and Rust stops dropping spans).
@SergioBenitez ah yeah, that'd do it!
Fixing that would be done with an updated `probably_equal_for_proc_macro` implementation to handle trailing commas (and ignore them and/or skip over them).

@mjbshaw your issue is similar in the sense that we'd need to either ignore those tokens for equality purposes or we'd need to update the AST to store whether a string was specified or not.
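To make the trailing-comma idea concrete, here is a hedged toy sketch. rustc's real `probably_equal_for_proc_macro` operates on `TokenTree`s rather than strings, so this only illustrates the comparison strategy, not the actual implementation:

```rust
// Toy version of a "probably equal" check that tolerates a trailing comma
// inside a delimited group, so cached tokens are not discarded just because
// the parser normalized `{ inner: usize }` to `{ inner: usize, }`.

fn strip_trailing_comma(tokens: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = tokens.iter().map(|t| t.to_string()).collect();
    // Drop any comma that appears immediately before a closing delimiter.
    let mut i = 1;
    while i < out.len() {
        if out[i - 1] == "," && matches!(out[i].as_str(), ")" | "]" | "}") {
            out.remove(i - 1);
        } else {
            i += 1;
        }
    }
    out
}

fn probably_equal(a: &[&str], b: &[&str]) -> bool {
    strip_trailing_comma(a) == strip_trailing_comma(b)
}

fn main() {
    let source = ["struct", "Bad", "{", "inner", ":", "usize", "}"];
    let reparsed = ["struct", "Bad", "{", "inner", ":", "usize", ",", "}"];
    // The trailing comma added by pretty-printing no longer breaks equality.
    assert!(probably_equal(&source, &reparsed));
    println!("ok");
}
```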
Unfortunately no matter how we slice it this is a very difficult issue :(
> Unfortunately no matter how we slice it this is a very difficult issue :(
@alexcrichton or @petrochenkov would it be possible to make an order of magnitude estimate of effort for a real fix? Multiple person-weeks, multiple months, most of a year?
Does most of the difficulty come from performance implications? I imagine the naive fix of making the AST look more like Syn's AST where every token is individually represented would be pretty bad for compiler performance.
I've been hitting this issue over and over and over since it was filed. It wasn't so bad with only derive macros where the symptom was usually just error messages being printed in the wrong place, but now with attribute macros and function-like macros I often find it totally impossible to build composable macros (as seen in #57207) and the consequence of losing spans is no longer cosmetic, it is that code that should compile doesn't compile.
@dtolnay
I think we should start with attaching associated token streams to all AST fragments that can be used as interpolated tokens in macros (the list can be found in `enum Nonterminal`), like it's already done for `ast::Item`. This shouldn't require much time.
(Note that a significant part of `enum Nonterminal` is used only for legacy quasi-quoting, which is being removed right now.)
This may regress performance of expansion-heavy compilation by a few percent, but correctness seems more important, and this can be optimized later.
We need to start with attaching tokens to `ast::Expr` and running benchmarks.
Tokens attached to non-terminals will fix the decl macro issues, but not `#[cfg]` or inner attributes.
Longer term we need some dual AST/tokens representation with cfgs, built-in derives and other compiler transformations correctly transforming both parts.
Alternatively, AST as it exists now would only be created after expansion, and only tokens + things necessary for expansion, like attribute targets, would exist before that (unfortunately, too much is needed to resolve and expand an attribute).
Or some hybrid using AST only for expanded parts and being gradually folded from token-only to AST-only during expansion.
That's some design work that requires a focused person who knows things to sit and work through the design and proof-of-concept implementation.
I have no idea how much time it would take for me personally, I never did this kind of high-level work.
Thanks! That's very helpful.
> That's some design work that requires a focused person who knows things to sit and work through the design and proof-of-concept implementation.
> I have no idea how much time it would take for me personally, I never did this kind of high-level work.
Let's try to find the right person to at least be thinking passively about this problem, or else it will keep sitting here racking up mentions as every procedural macro library hits it. Would @eddyb be the most appropriate person?
cc @matklad who was also interested in an alternative AST.
Thanks for the ping @petrochenkov! It's interesting that I am experimenting with macro expansion in rust-analyzer in this very moment :)
I don't have much expertise here: I don't really know how rust macros actually work, and have taken only a cursory look at the spans/syntax context infrastructure. Below are my random thoughts about ast/token tree conversion in the context of IDEs:
In rust-analyzer, the syntax trees are not based on token trees (in the IDE context, braces are not always balanced, you need to account for trivia, etc). My current plan is to make token trees strictly a macro-specific interface and materialize them only when expanding macro invocations. And, because rust-analyzer syntax trees are pretty heavyweight, I also want (https://github.com/rust-analyzer/rust-analyzer/issues/386) to introduce a "post-expansion" AST based on a more traditional hierarchy-of-enums representation.
> Longer term we need some dual AST/tokens representation with cfgs, built-in derives and other compiler transformations correctly transforming both parts.
One way to achieve that would be libsyntax2 style tree, whose "real" representation is a sequence of token trees, and where AST nodes don't store data themselves besides pointers into the token tree.
I think we probably just haven't put enough weight on this issue yet because it wasn't really all that pressing earlier, but I definitely agree with @dtolnay that it's far more pressing now!
I've always thought that the hard part of this is that we'd basically have to make libsyntax's AST look like syn's, which I figured was a monumental task given the size and usage of libsyntax. I never really considered the performance implications, but I wouldn't be surprised if it had more than a few regressions.
Thinking on this for a bit though, we may be able to get to a more short-term solution. Up to now `#[cfg]` is the only (AFAIK) extension which actually modifies the AST pre-expansion. If that's true, then I think we can take a hybrid libsyntax/syn-like approach. We could perhaps say that all composite structures, those which can be internally `#[cfg]`'d, have the ability to generate their token stream. They do this by iterating over their components and recursively generating token streams. All terminal nodes have cached token streams from the parser (like `ast::Item` does today).

A strategy like that, which is specific to `#[cfg]` and falls apart if something else modifies the AST, may be an order of magnitude less work than "make the entire AST syn-like". I haven't thought this through too much though, and the numerous locations we can place `#[cfg]` may effectively mean that we're just implementing a `ToTokens` trait for the whole AST anyway.
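A hedged sketch of that hybrid idea, with invented names (`Node`, `to_token_string`) standing in for the real AST types: composite nodes rebuild their tokens by concatenating their children, while leaves return what the parser cached. If a child was removed by `#[cfg]`, it is simply skipped and everything else stays lossless:

```rust
// Toy model of the hybrid libsyntax/syn approach described above.
enum Node {
    // Leaf with its original tokens cached by the parser.
    Leaf { tokens: String },
    // Composite node whose tokens are derived from its (possibly
    // cfg-stripped) children.
    Composite { children: Vec<Node> },
}

impl Node {
    fn to_token_string(&self) -> String {
        match self {
            Node::Leaf { tokens } => tokens.clone(),
            Node::Composite { children } => children
                .iter()
                .map(|c| c.to_token_string())
                .collect::<Vec<_>>()
                .join(" "),
        }
    }
}

fn main() {
    // A module whose second item was stripped by `#[cfg]`.
    let module = Node::Composite {
        children: vec![
            Node::Leaf { tokens: "fn a() {}".to_string() },
            // `#[cfg(windows)] fn b() {}` was removed here.
            Node::Leaf { tokens: "fn c() {}".to_string() },
        ],
    };
    assert_eq!(module.to_token_string(), "fn a() {} fn c() {}");
    println!("ok");
}
```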
In any case the best way forward with this I think is to try to brainstorm a bit to figure out a few possibilities of what a "reasonable" solution might look like, and then bring this up with the broader compiler team because it has a lot of implications on the compiler!
How feasible is it to implement ToTokens for the existing AST without data structure changes? We care most about preserving spans of identifiers as those are what lead to compilation failures. As I understand it, for every identifier represented in the AST there should be a way to form a proc_macro::Span corresponding to its hygiene context. As a first pass, we care less about spans of other tokens like keywords and punctuation since those would typically only affect error message placement.
I'm not sure I've actually got a great idea of how feasible that would be. I feel like it would be difficult because we don't track all spans everywhere, but that's based on a mental snapshot of the AST from like 3 years ago. Diagnostics have come a long way since then which often comes with annotating more parts of the AST with more spans.
I'm pretty sure, however, that most punctuation in the AST doesn't have span information. For example, nothing in `#[foo]` or `foo::bar` or `foo + bar` has span information on just the punctuation itself. That may be the hardest part, because all identifiers may have spans at this point.
(oh right, also keywords aren't really tracked at all)
It may just not actually be that hard to implement a raw `ToTokens` for the AST. The pretty-printing module is basically already that, and we'd just switch it all to pretty-printing to a `TokenStream` and then pretty-printing that later on.
> all identifiers may have spans at this point.
That's what I mean. We can implement conversions from AST to token stream (similar to `ToTokens`) in a way that preserves spans of all identifiers while inventing spans (as currently done) for keywords and punctuation.
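That compromise could look something like the following sketch. The types here (`Span`, `AstToken`, `lower`) are stand-ins invented for illustration, not rustc's:

```rust
// When lowering the AST back to tokens, identifiers carry their recorded
// spans (so hygiene and error placement survive), while keywords and
// punctuation get an invented dummy span.

#[derive(Debug, PartialEq)]
struct Span(usize, usize);

const DUMMY_SPAN: Span = Span(0, 0);

enum AstToken {
    // The parser recorded a span for every identifier.
    Ident(&'static str, Span),
    // Keywords/punctuation have no span stored in the AST.
    Raw(&'static str),
}

fn lower(tokens: Vec<AstToken>) -> Vec<(&'static str, Span)> {
    tokens
        .into_iter()
        .map(|t| match t {
            AstToken::Ident(text, span) => (text, span),
            AstToken::Raw(text) => (text, DUMMY_SPAN),
        })
        .collect()
}

fn main() {
    let out = lower(vec![
        AstToken::Raw("struct"),
        AstToken::Ident("Bad", Span(7, 10)),
    ]);
    // The keyword gets a dummy span; the identifier keeps its real one.
    assert_eq!(out[0], ("struct", Span(0, 0)));
    assert_eq!(out[1], ("Bad", Span(7, 10)));
    println!("ok");
}
```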
> It may just not actually be that hard to implement a raw `ToTokens` for the AST. The pretty printing module is basically already that, and we'd just switch it all to pretty printing to a `TokenStream` and then pretty printing that later on
@alexcrichton or @petrochenkov would one of you be able to give this a shot for one of the simplest AST types? Then we could see whether other people can fill in the work for the rest of the AST.
Before we give this a stab I'd want to game out the viability of this strategy. I feel like the punctuation/keyword spans are basically unused for `TokenStream` but are critical for setting span information on all the rustc AST nodes (as the hi/lo is used all over the place in the parser). I'd be wary that if identifiers were correct and nothing else, we'd still be in a pretty bad place.
@eddyb had some ideas at the Rust All Hands for how to fix this. As I understand it, we want to treat `cfg` eliminations and other manipulations of the syntax tree as overlays, with each overlay also carrying a token range that identifies the contiguous group of source tokens involved. This overlay mechanism already exists to some extent but is collapsed at some point in a way that is not information-preserving. Based on the token range information captured in overlays, it should be possible to reconstruct accurate spans for every token in the syntax tree. And the performance will possibly even be better than the current stringify-reparse workaround.
This will be impactful not just for procedural macros. We identified that IDEs also care about span information captured accurately in macro-expanded code.
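A minimal sketch of what such overlays might look like. `Overlay` and `visible_tokens` are invented names for illustration; the key property is that the original tokens (and thus their spans) are never rewritten, consumers just skip the masked ranges:

```rust
use std::ops::Range;

// An overlay records a transformation (e.g. `#[cfg]` elimination) as a
// token range over the unmodified source token stream.
struct Overlay {
    // Half-open range of token indices eliminated by the transformation.
    removed: Range<usize>,
}

fn visible_tokens<'a>(tokens: &'a [&'a str], overlays: &[Overlay]) -> Vec<&'a str> {
    tokens
        .iter()
        .enumerate()
        .filter(|(i, _)| !overlays.iter().any(|o| o.removed.contains(i)))
        .map(|(_, t)| *t)
        .collect()
}

fn main() {
    let tokens = ["fn", "a", "(", ")", "{", "}", "fn", "b", "(", ")", "{", "}"];
    // `#[cfg]` eliminated the second function: tokens 6..12.
    let overlays = vec![Overlay { removed: 6..12 }];
    assert_eq!(visible_tokens(&tokens, &overlays), ["fn", "a", "(", ")", "{", "}"]);
    println!("ok");
}
```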
Mentioning @alexreg who may be interested in tackling this issue.
Just so I understand: this issue is about the `Span`? If so, is it anywhere close to being solved? This gives me a lot of problems in my attribute macro; every error in the function is shown as if it's on the macro.
i.e.
```
83 | #[logfn(INFO)]
   | ^^^^^^^^^^^^^^ expected struct `std::string::String`, found array of 20 elements
```
A concrete case arose while trying to use a raw identifier in an attribute:
`alpha/src/main.rs`:

```rust
use beta::Beta;

#[derive(Beta)]
enum Dummy {
    #[foo(bar(r#value))]
    Variant
}

fn main() {}
```
`alpha/beta/src/lib.rs`:

```rust
extern crate proc_macro;

use proc_macro::TokenStream;

#[proc_macro_derive(Beta, attributes(foo))]
pub fn beta_derive(input: TokenStream) -> TokenStream {
    println!("{:#?}", input);
    TokenStream::new()
}
```
This produces the ident without the raw identifier marking (`r#`):
```
Ident {
    ident: "value",
    span: #0 bytes(0..0)
}
```
Complete `TokenStream`:
```
TokenStream [
    Ident {
        ident: "enum",
        span: #0 bytes(0..0)
    },
    Ident {
        ident: "Dummy",
        span: #0 bytes(0..0)
    },
    Group {
        delimiter: Brace,
        stream: TokenStream [
            Punct {
                ch: '#',
                spacing: Alone,
                span: #0 bytes(0..0)
            },
            Group {
                delimiter: Bracket,
                stream: TokenStream [
                    Ident {
                        ident: "foo",
                        span: #0 bytes(0..0)
                    },
                    Group {
                        delimiter: Parenthesis,
                        stream: TokenStream [
                            Ident {
                                ident: "bar",
                                span: #0 bytes(0..0)
                            },
                            Group {
                                delimiter: Parenthesis,
                                stream: TokenStream [
                                    Ident {
                                        ident: "value",
                                        span: #0 bytes(0..0)
                                    }
                                ],
                                span: #0 bytes(0..0)
                            }
                        ],
                        span: #0 bytes(0..0)
                    }
                ],
                span: #0 bytes(0..0)
            },
            Ident {
                ident: "Variant",
                span: #0 bytes(0..0)
            },
            Punct {
                ch: ',',
                spacing: Alone,
                span: #0 bytes(0..0)
            }
        ],
        span: #0 bytes(0..0)
    }
]
```
Nominating for possible prioritization (sorry, the past few months have been pretty hectic).
cc @aturon @nikomatsakis wrt me focusing on a large-scale solution for this
@eddyb
(I've read the "AST(“CST”)/Parsing Evolution" paper.)
Both the paper and this issue talk about converting the AST to tokens, but it seems like we don't actually need the AST before macro expansion; we rather need to build the AST as a result of macro expansion (with a partial AST for already expanded code being available during expansion).
For input to macro expansion, something like "token streams on steroids" would be enough, I think:
```rust
struct PreexpTokenStream {
    preexp_tts: Vec<PreexpTokenTree>,
}

enum PreexpTokenTree {
    TokenStream(TokenStream),
    PreexpGroup(PreexpGroup),
    AttributeInvocation(Attribute, AttributeInvocationKind),
}

struct PreexpGroup {
    delimiter: Delimiter,
    preexp_ts: PreexpTokenStream,
}

enum AttributeInvocationKind {
    // #[attr] TOKENS
    Outer(PreexpTokenStream),
    // HEADER_TOKENS { #![attr] TOKENS }
    Inner(PreexpTokenStream, PreexpGroup),
}

struct Attribute {
    path: PreexpTokenStream,
    tts: TokenStream,
}
```
Attributes and their targets need to be available early due to eager expansion (`cfg` now and https://github.com/rust-lang/rfcs/pull/2320 in the future).
Unfortunately, we still need a full (or nearly full) Rust language parser to build such a `PreexpTokenStream`, because we need to interpret `#[attr] a b c d` as `(#[attr] a b /* part of the attribute target */) (c d /* not a part of the attribute target */)`, and we cannot do that without knowing the syntactic details of attribute targets.
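To illustrate why a real parser is needed here: given `#[attr] a b; c d`, the attribute's target is everything up to the end of the item, and only grammar knowledge tells us where the item ends. The toy version below (all names invented) simply treats `;` as the item terminator, which is exactly the kind of knowledge a full parser would have to supply for every attribute-target form:

```rust
// Split a token list into (attribute + its target item, remaining tokens).
fn split_attribute_target<'a>(tokens: &'a [&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    // Assume tokens[0] is the attribute, already lexed as one unit here.
    // Toy grammar: the target item ends at the first `;`.
    let end = tokens
        .iter()
        .position(|t| *t == ";")
        .map(|i| i + 1)
        .unwrap_or(tokens.len());
    (tokens[..end].to_vec(), tokens[end..].to_vec())
}

fn main() {
    let tokens = ["#[attr]", "a", "b", ";", "c", "d"];
    let (target, rest) = split_attribute_target(&tokens);
    // `a b;` belongs to the attribute target, `c d` does not.
    assert_eq!(target, ["#[attr]", "a", "b", ";"]);
    assert_eq!(rest, ["c", "d"]);
    println!("ok");
}
```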
@petrochenkov Yeah, I think some weaker parsing (preferably to some sort of Parse Tree) would be good here - however, I think that's longer term, unless you expect there to be a low-effort version?
Also, to be clear, I don't want to convert the AST (via "pretty"-printing), but rather keep track of the equivalent TokenStream, effectively having both the AST and what you're talking about, at the same time.
> I think that's longer term, unless you expect there to be a low-effort version
No, certainly not low-effort, we'd at least need to convert built-in derives to the token model and who knows what else.
(On the other hand, not many things in this area are low-effort).
> keep track of the equivalent TokenStream, effectively having both the AST and what you're talking about, at the same time.
Yes, even if the AST doesn't need to be produced before expansion, it still can be produced before expansion.
Building the AST and `PreexpTokenStream`s simultaneously seems pretty natural, given that they need to use the same parser.
I'm not sure how that can be done in detail, though.
(Hmm, perhaps the user-facing `TokenStream` could use `PreexpTokenStream` as a compiler counterpart; it's inspectable only through iteration anyway.)
@petrochenkov I see, yes, I have no problem with moving to that kind of a representation, as long as it's still embedded into today's AST (until we can get rid of that).
My plan with detecting mutation is to eventually force all synthetic AST nodes to come with their (`Preexp`)`TokenStream` representation already present in them, and stop pretty-printing altogether.
At that point, it might be easier to have some sort of limited quasi-quoting (kind of like the `quote` crate today - and it doesn't need to support literals, so it should be efficient) in built-in derives, and force them to be parsed - in fact, we could do that even sooner, if you think it's a good idea in the first place.
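As a toy illustration of such "limited quasi-quoting": built-in derives could assemble their output from fixed token fragments instead of pretty-printing. This stand-in just splices stringified tokens; the real mechanism (like the `quote` crate) would splice spanned tokens. The macro name `mini_quote!` is invented for the sketch:

```rust
// A macro_rules! stand-in for limited quasi-quoting: turn a literal token
// sequence into a list of token strings, one per input token tree.
macro_rules! mini_quote {
    ($($tok:tt)*) => {
        vec![$(stringify!($tok).to_string()),*]
    };
}

fn main() {
    // "Quoting" the skeleton of a derived impl.
    let tokens = mini_quote!(impl Clone for Foo);
    assert_eq!(tokens, ["impl", "Clone", "for", "Foo"]);
    println!("ok");
}
```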
Removing the `I-nominated` label -- but it seems like this would make a good design meeting topic, since the fix is non-trivial? I see that there is lots of active discussion.
https://github.com/rust-lang/rust/pull/60679 addresses the issue for literals (all literal tokens are passed to proc macros precisely).
Hi, how is this issue progressing? The core foundational crates (e.g. `cortex-m`) for the embedded ecosystem are heavily dependent on proc macros, and not getting sane error messages (with lost span information) really hinders development and adoption for beginners.
I'm not sure how to "rattle the tree" to get this issue more attention; if you have tips, I'm very open to them and happy to help improve awareness of this issue.
cc @therealprof @jamesmunns @japaric
For many cases, I think we can trivially get incremental wins here -- e.g. you'll note that https://github.com/rust-lang/rust/issues/64561 is E-easy and E-mentor -- but solving this problem entirely I think is somewhat up in the air as to how to do so, though we do have some ideas. @matklad in particular has been doing some work to move us over to using more TokenStream rather than ASTs as input to proc macros, I believe, which can help here.
If the embedded ecosystem is hitting specific pain points frequently (i.e., you have examples of doing X that causes breakage) it's quite possible we can selectively patch those and help y'all out without the more complete fix.
Are things at a point where it would be possible to start working on this (beyond the small improvements that @Mark-Simulacrum mentioned)? Or is there still design work that needs to be done?
I've been meaning to work on this for a while and made a few plans, but I didn't get the time to spend on it - feel free to talk to me elsewhere, maybe a few things can be done.
E.g. I wanted to add a `TokenStream` to `Nonterminal` to allow recovering the exact original input, and perhaps even get rid of AST nodes embedded in tokens altogether (at the cost of reparsing).
@petrochenkov: I'd like to work on this issue - is there anything I can do to help?
@Aaron1011
I still think https://github.com/rust-lang/rust/issues/43081#issuecomment-450641481 would be a good first step.
Attaching a `TokenStream` to `ast::Expr`, like it's already attached to `ast::Item`, and measuring the effect on performance.
If tokens are kept for all AST fragments in `token::Nonterminal`, it will eliminate at least one major source of pretty-printing.
With https://github.com/rust-lang/rust/pull/72287, https://github.com/rust-lang/rust/pull/72393, https://github.com/rust-lang/rust/pull/72388, and https://github.com/rust-lang/rust/pull/72306 merged, we should be hitting this issue much less frequently.
@Aaron1011
Optimization idea - https://github.com/rust-lang/rust/pull/72287 collects tokens only for expressions with attributes, and it resulted in speedups; the same can be done for items (which currently collect tokens unconditionally).
Analysis of breakage from partially fixing the span loss (cases related to passing interpolated tokens) - https://github.com/rust-lang/rust/issues/72622.
Some questions on this issue:

- Does `#[cfg(features="some_not_set")]` lose the spans?
- I would like to provide some help (if it's within my skill/time) with the `cfg` stuff.

Thanks all!
@thiolliere: Summary of the current progress:
Once all of the pieces of https://github.com/rust-lang/rust/pull/76130 are merged, any code that compiles on stable should have all spans available (except for any additional pretty-printer issues that may get discovered).
Using proc-macros as inner attributes will still cause spans to be lost, but such usage is nightly-only.