First implemented in https://github.com/rust-lang/rust/pull/43230, the compiler can now, when necessary, losslessly convert a few AST nodes back into tokens from the original token stream. Currently, however, this is somewhat buggy:
The "real bug" here is that we haven't actually implemented converting an AST to a token tree. The initial implementation in https://github.com/rust-lang/rust/pull/43230 was effectively just a heuristic, and not even a great one! As a result, we still basically need the ability to losslessly tokenize an AST node back to its original set of tokens.
Some bugs that arise from this are:
- `#[cfg]` modifies the AST but doesn't invalidate the cache -- https://github.com/rust-lang/rust/issues/48644
- `macro_rules!` and procedural macros don't play well together -- https://github.com/rust-lang/rust/issues/49846

There's an associated FIXME in the code right now, and to fix this we'll need to implement tokenization of an AST node. The current thinking on how to implement this is to save all `TokenStream` instances adjacent to an AST node, and use those instead of converting back into a token stream.
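As a rough illustration of that idea, caching the parsed tokens alongside the node and falling back to pretty-printing only for synthetic nodes might look like the sketch below. All type and field names here (`AstItem`, `cached_tokens`) are invented for illustration, not rustc's actual internals:

```rust
// Hypothetical sketch: each parsed AST node keeps the exact tokens it was
// parsed from, so the compiler can hand a proc macro the original token
// stream instead of re-tokenizing pretty-printed output.

#[derive(Debug)]
struct AstItem {
    name: String,
    // `None` for synthetic nodes created during expansion; those would
    // still fall back to pretty-printing under this scheme.
    cached_tokens: Option<String>,
}

impl AstItem {
    fn to_tokens(&self) -> String {
        match &self.cached_tokens {
            // Lossless path: return the tokens the parser saw.
            Some(tokens) => tokens.clone(),
            // Lossy fallback: pretty-print the node (spans are lost here).
            None => format!("struct {} {{ }}", self.name),
        }
    }
}

fn main() {
    let parsed = AstItem {
        name: "Bad".to_string(),
        cached_tokens: Some("struct Bad { inner : usize }".to_string()),
    };
    let synthetic = AstItem { name: "Generated".to_string(), cached_tokens: None };
    assert_eq!(parsed.to_tokens(), "struct Bad { inner : usize }");
    assert_eq!(synthetic.to_tokens(), "struct Generated { }");
    println!("ok");
}
```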
cc @dtolnay, @nrc, @jseyfried
As a note: I'm getting _zero_ span information on a proc-macro derive, making `Span`s useless. Every `Span` in the `TokenStream` is `0`. Here's the (unsurprising) code:
```rust
extern crate proc_macro;

use proc_macro::TokenStream;

#[proc_macro_derive(Testing)]
pub fn derive_uri_display(input: TokenStream) -> TokenStream {
    println!("Tokens: {:?}", input);
    TokenStream::new()
}
```
@SergioBenitez what does the code look like that you've attached the macro to? You can use `RUST_LOG=syntax` and you should get a warning about a "cached token stream not being used", and similarly if you run `rustc --pretty normal` you should get a different set of tokens than you see in the file itself (and that's the bug to fix)
@alexcrichton Just a structure with one field:
```rust
#[derive(Testing)]
pub struct Bad {
    inner: usize
}
```
Here's the log message:
```
INFO 2018-07-31T02:52:43Z: syntax::parse::token: cached tokens found, but they're not "probably equal", going with stringified version
```
And `rustc --pretty normal` just spits back the code with a `,` added to the end of the field:
```rust
#[macro_use]
extern crate proc_macro_bug;

#[derive(Testing)]
pub struct Bad {
    inner: usize,
}

pub fn main() { }
```
Indeed, if I manually add the comma to the end of the field in the source file, things work as expected. This is... very surprising.
Yeah, there are multiple weird quirks that can cause span information to be dropped. Another example is `extern fn a() {}` vs `extern "C" fn a() {}`. It would be nice if span loss could be detected and an appropriate message emitted to the user (as a temporary notice until this issue is fixed and Rust stops dropping spans).
@SergioBenitez ah yeah, that'd do it!
Fixing that would be done with an updated `probably_equal_for_proc_macro` implementation to handle trailing commas (and ignore them and/or skip over them).

@mjbshaw your issue is similar in the sense that we'd need to either ignore those tokens for equality purposes or we'd need to update the AST to store whether a string was specified or not.
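To make the trailing-comma idea concrete, here is a hedged toy sketch. rustc's real `probably_equal_for_proc_macro` operates on `TokenTree`s rather than strings, so this only illustrates the comparison strategy, not the actual implementation:

```rust
// Toy version of a "probably equal" check that tolerates a trailing comma
// inside a delimited group, so cached tokens are not discarded just because
// the parser normalized `{ inner: usize }` to `{ inner: usize, }`.

fn strip_trailing_comma(tokens: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = tokens.iter().map(|t| t.to_string()).collect();
    // Drop any comma that appears immediately before a closing delimiter.
    let mut i = 1;
    while i < out.len() {
        if out[i - 1] == "," && matches!(out[i].as_str(), ")" | "]" | "}") {
            out.remove(i - 1);
        } else {
            i += 1;
        }
    }
    out
}

fn probably_equal(a: &[&str], b: &[&str]) -> bool {
    strip_trailing_comma(a) == strip_trailing_comma(b)
}

fn main() {
    let source = ["struct", "Bad", "{", "inner", ":", "usize", "}"];
    let reparsed = ["struct", "Bad", "{", "inner", ":", "usize", ",", "}"];
    // The trailing comma added by pretty-printing no longer breaks equality.
    assert!(probably_equal(&source, &reparsed));
    println!("ok");
}
```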
Unfortunately no matter how we slice it this is a very difficult issue :(
> Unfortunately no matter how we slice it this is a very difficult issue :(
@alexcrichton or @petrochenkov would it be possible to make an order of magnitude estimate of effort for a real fix? Multiple person-weeks, multiple months, most of a year?
Does most of the difficulty come from performance implications? I imagine the naive fix of making the AST look more like Syn's AST where every token is individually represented would be pretty bad for compiler performance.
I've been hitting this issue over and over and over since it was filed. It wasn't so bad with only derive macros where the symptom was usually just error messages being printed in the wrong place, but now with attribute macros and function-like macros I often find it totally impossible to build composable macros (as seen in #57207) and the consequence of losing spans is no longer cosmetic, it is that code that should compile doesn't compile.
@dtolnay
I think we should start with attaching associated token streams to all AST fragments that can be used as interpolated tokens in macros (the list can be found in `enum Nonterminal`), like it's already done for `ast::Item`. This shouldn't require much time.
(Note that a significant part of `enum Nonterminal` is used only for legacy quasi-quoting, which is being removed right now.)
This may regress performance of expansion-heavy compilation by a few percent, but correctness seems more important, and this can be optimized later.
We need to start with attaching tokens to `ast::Expr` and running benchmarks.
Tokens attached to non-terminals will fix the decl macro issues, but not `#[cfg]` or inner attributes.
Longer term we need some dual AST/tokens representation with cfgs, built-in derives and other compiler transformations correctly transforming both parts.
Alternatively, AST as it exists now would only be created after expansion, and only tokens + things necessary for expansion, like attribute targets, would exist before that (unfortunately, too much is needed to resolve and expand an attribute).
Or some hybrid using AST only for expanded parts and being gradually folded from token-only to AST-only during expansion.
That's some design work that requires a focused person who knows things to sit and work through the design and proof-of-concept implementation.
I have no idea how much time it would take for me personally, I never did this kind of high-level work.
Thanks! That's very helpful.
> That's some design work that requires a focused person who knows things to sit and work through the design and proof-of-concept implementation.
> I have no idea how much time it would take for me personally, I never did this kind of high-level work.
Let's try to find the right person to at least be thinking passively about this problem, or else it will keep sitting here racking up mentions as every procedural macro library hits it. Would @eddyb be the most appropriate person?
cc @matklad who was also interested in an alternative AST.
Thanks for the ping @petrochenkov! It's interesting that I am experimenting with macro expansion in rust-analyzer in this very moment :)
I don't have much expertise here: I don't really know how rust macros actually work, and have taken only a cursory look at the spans/syntax context infrastructure. Below are my random thoughts about ast/token tree conversion in the context of IDEs:
In rust-analyzer, the syntax trees are not based on token trees (in the IDE context, braces are not always balanced, you need to account for trivia, etc). My current plan is to make token trees strictly a macro-specific interface and materialize them only when expanding macro invocations. And, because rust-analyzer syntax trees are pretty heavyweight, I also want (https://github.com/rust-analyzer/rust-analyzer/issues/386) to introduce a "post-expansion" AST based on a more traditional hierarchy-of-enums representation.
> Longer term we need some dual AST/tokens representation with cfgs, built-in derives and other compiler transformations correctly transforming both parts.
One way to achieve that would be libsyntax2 style tree, whose "real" representation is a sequence of token trees, and where AST nodes don't store data themselves besides pointers into the token tree.
I think we probably just haven't put enough weight on this issue yet because it wasn't really all that pressing earlier, but I definitely agree with @dtolnay that it's far more pressing now!
I've always thought that the hard part of this is that we'd basically have to make libsyntax's AST look like syn's, which I figured was a monumental task given the size and usage of libsyntax. I never really considered the performance implications, but I wouldn't be surprised if it had more than a few regressions.
Thinking on this for a bit though, we may be able to get to a more short-term solution. Up to now `#[cfg]` is the only (AFAIK) extension which actually modifies the AST pre-expansion. If that's true, then I think we can take a hybrid libsyntax/syn-like approach. We could perhaps say that all composite structures, those which can be internally `#[cfg]`'d, have the ability to generate their token stream. They do this by iterating over their components and recursively generating token streams. All terminal nodes have cached token streams from the parser (like `ast::Item` does today).

A strategy like that, which is specific to `#[cfg]` and falls apart if something else modifies the AST, may be an order of magnitude less work than "make the entire AST syn-like". I haven't thought this through too much though, and the numerous locations we can place `#[cfg]` may effectively mean that we're just implementing a `ToTokens` trait for the whole AST anyway.
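A hedged sketch of that hybrid idea, with invented names (`Node`, `to_token_string`) standing in for the real AST types: composite nodes rebuild their tokens by concatenating their children, while leaves return what the parser cached. If a child was removed by `#[cfg]`, it is simply skipped and everything else stays lossless:

```rust
// Toy model of the hybrid libsyntax/syn approach described above.
enum Node {
    // Leaf with its original tokens cached by the parser.
    Leaf { tokens: String },
    // Composite node whose tokens are derived from its (possibly
    // cfg-stripped) children.
    Composite { children: Vec<Node> },
}

impl Node {
    fn to_token_string(&self) -> String {
        match self {
            Node::Leaf { tokens } => tokens.clone(),
            Node::Composite { children } => children
                .iter()
                .map(|c| c.to_token_string())
                .collect::<Vec<_>>()
                .join(" "),
        }
    }
}

fn main() {
    // A module whose second item was stripped by `#[cfg]`.
    let module = Node::Composite {
        children: vec![
            Node::Leaf { tokens: "fn a() {}".to_string() },
            // `#[cfg(windows)] fn b() {}` was removed here.
            Node::Leaf { tokens: "fn c() {}".to_string() },
        ],
    };
    assert_eq!(module.to_token_string(), "fn a() {} fn c() {}");
    println!("ok");
}
```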
In any case the best way forward with this I think is to try to brainstorm a bit to figure out a few possibilities of what a "reasonable" solution might look like, and then bring this up with the broader compiler team because it has a lot of implications on the compiler!
How feasible is it to implement ToTokens for the existing AST without data structure changes? We care most about preserving spans of identifiers as those are what lead to compilation failures. As I understand it, for every identifier represented in the AST there should be a way to form a proc_macro::Span corresponding to its hygiene context. As a first pass, we care less about spans of other tokens like keywords and punctuation since those would typically only affect error message placement.
I'm not sure I've actually got a great idea of how feasible that would be. I feel like it would be difficult because we don't track all spans everywhere, but that's based on a mental snapshot of the AST from like 3 years ago. Diagnostics have come a long way since then which often comes with annotating more parts of the AST with more spans.
I'm pretty sure, however, that most punctuation in the AST doesn't have span information. For example, nothing in `#[foo]` or `foo::bar` or `foo + bar` has span information on just the punctuation itself. That may be the hardest part, because all identifiers may have spans at this point.
(oh right, also keywords aren't really tracked at all)
It may just not actually be that hard to implement a raw `ToTokens` for the AST. The pretty-printing module is basically already that, and we'd just switch it all to pretty-printing to a `TokenStream` and then pretty-printing that later on.
> all identifiers may have spans at this point.
That's what I mean. We can implement conversions from AST to token stream (similar to `ToTokens`) in a way that preserves spans of all identifiers while inventing spans (as currently done) for keywords and punctuation.
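That compromise could look something like the following sketch. The types here (`Span`, `AstToken`, `lower`) are stand-ins invented for illustration, not rustc's:

```rust
// When lowering the AST back to tokens, identifiers carry their recorded
// spans (so hygiene and error placement survive), while keywords and
// punctuation get an invented dummy span.

#[derive(Debug, PartialEq)]
struct Span(usize, usize);

const DUMMY_SPAN: Span = Span(0, 0);

enum AstToken {
    // The parser recorded a span for every identifier.
    Ident(&'static str, Span),
    // Keywords/punctuation have no span stored in the AST.
    Raw(&'static str),
}

fn lower(tokens: Vec<AstToken>) -> Vec<(&'static str, Span)> {
    tokens
        .into_iter()
        .map(|t| match t {
            AstToken::Ident(text, span) => (text, span),
            AstToken::Raw(text) => (text, DUMMY_SPAN),
        })
        .collect()
}

fn main() {
    let out = lower(vec![
        AstToken::Raw("struct"),
        AstToken::Ident("Bad", Span(7, 10)),
    ]);
    // The keyword gets a dummy span; the identifier keeps its real one.
    assert_eq!(out[0], ("struct", Span(0, 0)));
    assert_eq!(out[1], ("Bad", Span(7, 10)));
    println!("ok");
}
```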
> It may just not actually be that hard to implement a raw `ToTokens` for the AST. The pretty printing module is basically already that, and we'd just switch it all to pretty printing to a `TokenStream` and then pretty printing that later on
@alexcrichton or @petrochenkov would one of you be able to give this a shot for one of the simplest AST types? Then we could see whether other people can fill in the work for the rest of the AST.
Before we give this a stab I'd want to game out the viability of this strategy. I feel like the punctuation/keyword spans are basically unused for `TokenStream` but are critical for setting span information on all the rustc AST nodes (as the hi/lo is used all over the place in the parser). I'd be wary that if identifiers were correct and nothing else, we'd still be in a pretty bad place.
@eddyb had some ideas at the Rust All Hands for how to fix this. As I understand it, we want to treat `cfg` eliminations and other manipulations of the syntax tree as overlays, with each overlay also carrying a token range that identifies the contiguous group of source tokens involved. This overlay mechanism already exists to some extent but is collapsed at some point in a way that is not information-preserving. Based on the token range information captured in overlays, it should be possible to reconstruct accurate spans for every token in the syntax tree. And the performance will possibly even be better than the current stringify-reparse workaround.
This will be impactful not just for procedural macros. We identified that IDEs also care about span information captured accurately in macro-expanded code.
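A minimal sketch of what such overlays might look like. `Overlay` and `visible_tokens` are invented names for illustration; the key property is that the original tokens (and thus their spans) are never rewritten, consumers just skip the masked ranges:

```rust
use std::ops::Range;

// An overlay records a transformation (e.g. `#[cfg]` elimination) as a
// token range over the unmodified source token stream.
struct Overlay {
    // Half-open range of token indices eliminated by the transformation.
    removed: Range<usize>,
}

fn visible_tokens<'a>(tokens: &'a [&'a str], overlays: &[Overlay]) -> Vec<&'a str> {
    tokens
        .iter()
        .enumerate()
        .filter(|(i, _)| !overlays.iter().any(|o| o.removed.contains(i)))
        .map(|(_, t)| *t)
        .collect()
}

fn main() {
    let tokens = ["fn", "a", "(", ")", "{", "}", "fn", "b", "(", ")", "{", "}"];
    // `#[cfg]` eliminated the second function: tokens 6..12.
    let overlays = vec![Overlay { removed: 6..12 }];
    assert_eq!(visible_tokens(&tokens, &overlays), ["fn", "a", "(", ")", "{", "}"]);
    println!("ok");
}
```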
Mentioning @alexreg who may be interested in tackling this issue.
Just so I understand: this issue is about the `Span`? If so, is it anywhere close to being solved? This gives me a lot of problems in my attribute macro; every error in the function is shown as if it's on the macro.
i.e.
```
83 | #[logfn(INFO)]
   | ^^^^^^^^^^^^^^ expected struct `std::string::String`, found array of 20 elements
```
A concrete case arose while trying to use a raw identifier in an attribute:
`alpha/src/main.rs`:

```rust
use beta::Beta;

#[derive(Beta)]
enum Dummy {
    #[foo(bar(r#value))]
    Variant
}

fn main() {}
```
`alpha/beta/src/lib.rs`:

```rust
extern crate proc_macro;

use proc_macro::TokenStream;

#[proc_macro_derive(Beta, attributes(foo))]
pub fn beta_derive(input: TokenStream) -> TokenStream {
    println!("{:#?}", input);
    TokenStream::new()
}
```
This produces the ident without the raw identifier marking (`r#`):
```
Ident {
    ident: "value",
    span: #0 bytes(0..0)
}
```
Complete `TokenStream`:
```
TokenStream [
    Ident {
        ident: "enum",
        span: #0 bytes(0..0)
    },
    Ident {
        ident: "Dummy",
        span: #0 bytes(0..0)
    },
    Group {
        delimiter: Brace,
        stream: TokenStream [
            Punct {
                ch: '#',
                spacing: Alone,
                span: #0 bytes(0..0)
            },
            Group {
                delimiter: Bracket,
                stream: TokenStream [
                    Ident {
                        ident: "foo",
                        span: #0 bytes(0..0)
                    },
                    Group {
                        delimiter: Parenthesis,
                        stream: TokenStream [
                            Ident {
                                ident: "bar",
                                span: #0 bytes(0..0)
                            },
                            Group {
                                delimiter: Parenthesis,
                                stream: TokenStream [
                                    Ident {
                                        ident: "value",
                                        span: #0 bytes(0..0)
                                    }
                                ],
                                span: #0 bytes(0..0)
                            }
                        ],
                        span: #0 bytes(0..0)
                    }
                ],
                span: #0 bytes(0..0)
            },
            Ident {
                ident: "Variant",
                span: #0 bytes(0..0)
            },
            Punct {
                ch: ',',
                spacing: Alone,
                span: #0 bytes(0..0)
            }
        ],
        span: #0 bytes(0..0)
    }
]
```
Nominating for possible prioritization (sorry, the past few months have been pretty hectic).
cc @aturon @nikomatsakis wrt me focusing on a large-scale solution for this
@eddyb
(I've read the "AST(“CST”)/Parsing Evolution" paper.)
Both the paper and this issue talk about converting the AST to tokens, but it seems like we don't actually need the AST before macro expansion; we rather need to build the AST as a result of macro expansion (with a partial AST for already expanded code being available during expansion).
For input to macro expansion, something like "token streams on steroids" would be enough, I think:
```rust
struct PreexpTokenStream {
    preexp_tts: Vec<PreexpTokenTree>,
}

enum PreexpTokenTree {
    TokenStream(TokenStream),
    PreexpGroup(PreexpGroup),
    AttributeInvocation(Attribute, AttributeInvocationKind),
}

struct PreexpGroup {
    delimiter: Delimiter,
    preexp_ts: PreexpTokenStream,
}

enum AttributeInvocationKind {
    // #[attr] TOKENS
    Outer(PreexpTokenStream),
    // HEADER_TOKENS { #![attr] TOKENS }
    Inner(PreexpTokenStream, PreexpGroup),
}

struct Attribute {
    path: PreexpTokenStream,
    tts: TokenStream,
}
```
Attributes and their targets need to be available early due to eager expansion (`cfg` now and https://github.com/rust-lang/rfcs/pull/2320 in the future).
Unfortunately, we still need a full (or nearly full) Rust language parser to build such a `PreexpTokenStream`, because we need to interpret `#[attr] a b c d` as `(#[attr] a b /* part of the attribute target */) (c d /* not a part of the attribute target */)`, and we cannot do that without knowing the syntactic details of attribute targets.
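To illustrate why a real parser is needed here: given `#[attr] a b; c d`, the attribute's target is everything up to the end of the item, and only grammar knowledge tells us where the item ends. The toy version below (all names invented) simply treats `;` as the item terminator, which is exactly the kind of knowledge a full parser would have to supply for every attribute-target form:

```rust
// Split a token list into (attribute + its target item, remaining tokens).
fn split_attribute_target<'a>(tokens: &'a [&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    // Assume tokens[0] is the attribute, already lexed as one unit here.
    // Toy grammar: the target item ends at the first `;`.
    let end = tokens
        .iter()
        .position(|t| *t == ";")
        .map(|i| i + 1)
        .unwrap_or(tokens.len());
    (tokens[..end].to_vec(), tokens[end..].to_vec())
}

fn main() {
    let tokens = ["#[attr]", "a", "b", ";", "c", "d"];
    let (target, rest) = split_attribute_target(&tokens);
    // `a b;` belongs to the attribute target, `c d` does not.
    assert_eq!(target, ["#[attr]", "a", "b", ";"]);
    assert_eq!(rest, ["c", "d"]);
    println!("ok");
}
```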
@petrochenkov Yeah, I think some weaker parsing (preferably to some sort of Parse Tree) would be good here - however, I think that's longer term, unless you expect there to be a low-effort version?
Also, to be clear, I don't want to convert the AST (via "pretty"-printing), but rather keep track of the equivalent TokenStream, effectively having both the AST and what you're talking about, at the same time.
> I think that's longer term, unless you expect there to be a low-effort version
No, certainly not low-effort, we'd at least need to convert built-in derives to the token model and who knows what else.
(On the other hand, not many things in this area are low-effort).
> keep track of the equivalent TokenStream, effectively having both the AST and what you're talking about, at the same time.
Yes, even if the AST doesn't need to be produced before expansion, it still can be produced before expansion.
Building the AST and `PreexpTokenStream`s simultaneously seems pretty natural, given that they need to use the same parser.
I'm not sure how that can be done in detail, though.
(Hmm, perhaps the user-facing `TokenStream` could use `PreexpTokenStream` as a compiler counterpart; it's inspectable only through iteration anyway.)
@petrochenkov I see, yes, I have no problem with moving to that kind of a representation, as long as it's still embedded into today's AST (until we can get rid of that).
My plan with detecting mutation is to eventually force all synthetic AST nodes to come with their (`Preexp`)`TokenStream` representation already present in them, and stop pretty-printing altogether.
At that point, it might be easier to have some sort of limited quasi-quoting (kind of like the `quote` crate today - and it doesn't need to support literals, so it should be efficient) in built-in derives, and force them to be parsed - in fact, we could do that even sooner, if you think it's a good idea in the first place.
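As a toy illustration of such "limited quasi-quoting": built-in derives could assemble their output from fixed token fragments instead of pretty-printing. This stand-in just splices stringified tokens; the real mechanism (like the `quote` crate) would splice spanned tokens. The macro name `mini_quote!` is invented for the sketch:

```rust
// A macro_rules! stand-in for limited quasi-quoting: turn a literal token
// sequence into a list of token strings, one per input token tree.
macro_rules! mini_quote {
    ($($tok:tt)*) => {
        vec![$(stringify!($tok).to_string()),*]
    };
}

fn main() {
    // "Quoting" the skeleton of a derived impl.
    let tokens = mini_quote!(impl Clone for Foo);
    assert_eq!(tokens, ["impl", "Clone", "for", "Foo"]);
    println!("ok");
}
```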
Removing the `I-nominated` label -- but it seems like this would make a good design meeting topic, since the fix is non-trivial? I see that there is lots of active discussion.
https://github.com/rust-lang/rust/pull/60679 addresses the issue for literals (all literal tokens are passed to proc macros precisely).
Hi, how is this issue progressing? The core foundational crates (e.g. `cortex-m`) for the embedded ecosystem are heavily dependent on proc macros, and not getting sane error messages (with lost span information) really hinders development and adoption for beginners.
I'm not sure how to "rattle the tree" to get this issue more attention; if you have tips, I'm very open to them and happy to help improve awareness of this issue.
cc @therealprof @jamesmunns @japaric
For many cases, I think we can trivially get incremental wins here -- e.g. you'll note that https://github.com/rust-lang/rust/issues/64561 is E-easy and E-mentor -- but solving this problem entirely I think is somewhat up in the air as to how to do so, though we do have some ideas. @matklad in particular has been doing some work to move us over to using more TokenStream rather than ASTs as input to proc macros, I believe, which can help here.
If the embedded ecosystem is hitting specific pain points frequently (i.e., you have examples of doing X that causes breakage) it's quite possible we can selectively patch those and help y'all out without the more complete fix.
Are things at a point where it would be possible to start working on this (beyond the small improvements that @Mark-Simulacrum mentioned)? Or is there still design work that needs to be done?
I've been meaning to work on this for a while and made a few plans, but I didn't get the time to spend on it - feel free to talk to me elsewhere, maybe a few things can be done.
E.g. I wanted to add a `TokenStream` to `Nonterminal` to allow recovering the exact original input, and perhaps even get rid of AST nodes embedded in tokens altogether (at the cost of reparsing).
@petrochenkov: I'd like to work on this issue - is there anything I can do to help?
@Aaron1011
I still think https://github.com/rust-lang/rust/issues/43081#issuecomment-450641481 would be a good first step.
Attaching a `TokenStream` to `ast::Expr`, like it's already attached to `ast::Item`, and measuring the effect on performance.
If tokens are kept for all AST fragments in `token::Nonterminal`, it will eliminate at least one major source of pretty-printing.
With https://github.com/rust-lang/rust/pull/72287, https://github.com/rust-lang/rust/pull/72393, https://github.com/rust-lang/rust/pull/72388, and https://github.com/rust-lang/rust/pull/72306 merged, we should be hitting this issue much less frequently.
@Aaron1011
Optimization idea - https://github.com/rust-lang/rust/pull/72287 collects tokens only for expressions with attributes, and it resulted in speedups; the same can be done for items (which currently collect tokens unconditionally).
Analysis of breakage from partially fixing the span loss (cases related to passing interpolated tokens) - https://github.com/rust-lang/rust/issues/72622.
Some questions on this issue:

- Does `#[cfg(features="some_not_set")]` lose the spans?
- I would like to provide some help (if it's within my skill/time) with the `cfg` stuff.

Thanks all!
@thiolliere: Summary of the current progress:
Once all of the pieces of https://github.com/rust-lang/rust/pull/76130 are merged, any code that compiles on stable should have all spans available (except for any additional pretty-printer issues that may get discovered).
Using proc-macros as inner attributes will still cause spans to be lost, but such usage is nightly-only.