rustfmt 0.4.1-nightly (374dba8 2018-03-22) fails to format the following code:
#![feature(raw_identifiers)]
fn main() {
let r#x = "hello, world";
println!("{}", r#x);
}
It emits the following error:
error: found invalid character; only `#` is allowed in raw string delimitation: x
--> /home/topecongiro/test.rs:4:9
|
4 | let r#x = "hello, world";
| ^^
I want to figure out how rustfmt works, so going to try this out. Will check back in when I'm stuck or done! 馃巿
Alright, checking back in confused :)
I guess the line the failure occurs at is around 670 in src/lib.rs:
syntax::with_globals(|| format_input_inner(input, config, out));
This is a call to an external crate, syntax. Am I correct in assuming that we just need to make sure the correct configuration gets passed in config, and so resolving this issue is a matter of making sure the appropriate flag in there gets set? And also, I noticed that unstable_features is set as
unstable_features: (Cell { value: false }, false, false, true)
Is this the parameter I should be focusing on? I think I'll have to go back and interpret how that works together, but I figured it wouldn't hurt to ask.
@mtn The first thing to do is to upgrade rustc-ap-syntax. That is our version of the syntax crate and should mean that we can at least parse the raw identifier. I'm not sure if that is all that needs doing or whether something else will then be broken. I don't think you need to change the config, but I might be wrong.
Thanks @nrc! One more question -- should the feature only be used when it's enabled, or always? The reason I ask is I recall there being a note somewhere in the docs that rustfmt even is meant to work sometimes when the code wouldn't compile.
Edit: And another question: it seems like the default behavior is to rename identifiers to remove raw characters. I guess this isn't desirable, especially when the feature is enabled, right? But should they stay only when the feature is enabled, or always?
It should always work, we don't generally check features. Indeed Rustfmt works when the code.
I think we should never remove the 'raw'-ness, the output identifier should be the same as the input, the thing to check is that we don't crash and don't screw up any surrounding formatting.
Got it, that makes sense. Thanks for the help.
I'm still having some trouble -- I'll walk through what I've understood so far.
After updating the syntax crate version and making a few changes in macro.rs to pass raw as True for some identifiers, rustfmt would compile and run, but changes the identifiers by default to remove the raw prefix (eg. r#x -> x). To try and understand where this is happening, I've been reading the visitor code.
My input (from a file) is the same as the example above:
#![feature(raw_identifiers)]
fn main() {
let r#x = "hello, world";
println!("{}", r#x);
}
Once the input is parsed, printing out krate.module, I see
Mod { inner: Span { lo: BytePos(0), hi: BytePos(98), ctxt: #0 },
items: [Item { ident: main#0, attrs: [], id: NodeId(4294967295), node: Fn(FnDecl
{ inputs: [], output: Default(Span { lo: BytePos(40), hi: BytePos(40), ctxt: #0
}), variadic: false }, Normal, Spanned { node: NotConst, span: Span { lo:
BytePos(30), hi: BytePos(32), ctxt: #0 } }, Rust, Generics { params: [],
where_clause: WhereClause { id: NodeId(4294967295), predicates: [], span: Span
{ lo: BytePos(0), hi: BytePos(0), ctxt: #0 } }, span: Span { lo: BytePos(0),
hi: BytePos(0), ctxt: #0 } }, Block { stmts: [stmt(4294967295: let x = "hello,
world";), stmt(4294967295: println!("{}" , r#x);)], id: NodeId(4294967295),
rules: Default, span: Span { lo: BytePos(40), hi: BytePos(98), ctxt: #0 },
recovered: false }), vis: Spanned { node: Inherited, span: Span { lo:
BytePos(27), hi: BytePos(28), ctxt: #0 } }, span: Span { lo: BytePos(30), hi:
BytePos(98), ctxt: #0 }, tokens: Some(TokenStream { kind: Stream([TokenStream
{ kind: Tree(Token(Span { lo: BytePos(30), hi: BytePos(32), ctxt: #0 },
Ident(fn#0, false))) }, TokenStream { kind: Tree(Token(Span { lo: BytePos(33),
hi: BytePos(37), ctxt: #0 }, Ident(main#0, false))) }, TokenStream { kind:
Tree(Delimited(Span { lo: BytePos(37), hi: BytePos(39), ctxt: #0 }, Delimited
{ delim: Paren, tts: ThinTokenStream(None) })) }, TokenStream { kind:
Tree(Delimited(Span { lo: BytePos(40), hi: BytePos(98), ctxt: #0 }, Delimited
{ delim: Brace, tts: ThinTokenStream(Some([TokenStream { kind: Tree(Token(Span
{ lo: BytePos(46), hi: BytePos(49), ctxt: #0 }, Ident(let#0, false))) },
TokenStream { kind: Tree(Token(Span { lo: BytePos(50), hi: BytePos(53), ctxt: #0
}, Ident(x#0, true))) }, TokenStream { kind: Tree(Token(Span { lo: BytePos(54),
hi: BytePos(55), ctxt: #0 }, Eq)) }, TokenStream { kind: JointTree(Token(Span
{ lo: BytePos(56), hi: BytePos(70), ctxt: #0 }, Literal(Str_(hello, world),
None))) }, TokenStream { kind: Tree(Token(Span { lo: BytePos(70), hi:
BytePos(71), ctxt: #0 }, Semi)) }, TokenStream { kind: JointTree(Token(Span
{ lo: BytePos(76), hi: BytePos(83), ctxt: #0 }, Ident(println#0, false))) },
TokenStream { kind: Tree(Token(Span { lo: BytePos(83), hi: BytePos(84), ctxt: #0
}, Not)) }, TokenStream { kind: JointTree(Delimited(Span { lo: BytePos(84), hi:
BytePos(95), ctxt: #0 }, Delimited { delim: Paren, tts:
ThinTokenStream(Some([TokenStream { kind: JointTree(Token(Span { lo:
BytePos(85), hi: BytePos(89), ctxt: #0 }, Literal(Str_({}), None))) },
TokenStream { kind: Tree(Token(Span { lo: BytePos(89), hi: BytePos(90), ctxt: #0
}, Comma)) }, TokenStream { kind: Tree(Token(Span { lo: BytePos(91), hi:
BytePos(94), ctxt: #0 }, Ident(x#0, true))) }])) })) }, TokenStream { kind:
Tree(Token(Span { lo: BytePos(95), hi: BytePos(96), ctxt: #0 }, Semi)) }]))
})) }]) }) }] }
The raw parameter value in the ident tuple for x is true, which is good I think (and it's false if it isn't raw). However, already, the statement let r#x... seems to have been transformed (but not the raw identifier in the println. Tracing further, I couldn't figure out where the transformations were occurring.
I guess it isn't out of the question that more tracing through things would make this clear to me, but am I missing anything/where should I be looking to understand why the identifiers are displaying the way they are?
The working branch on my fork is here.
In order to properly format raw identifiers, first we need to implement rewrite_ident which takes ast::Ident and returns Option<String>. It should add #r prefix if it's a raw identifier. There is Token::from_ast_ident() defined in rustc-ap-syntax which should be useful.
Then, we need to change every code where we used to simply rewrite identifiers...this will be a bit daunting.
Let's look at let r#x = "hello, world";. let statement is represented as struct Local in AST:
// from rustc-ap-syntax-80.0.0
/// Local represents a `let` statement, e.g., `let <pat>:<ty> = <expr>;`
#[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Hash, Debug)]
pub struct Local {
pub pat: P<Pat>,
pub ty: Option<P<Ty>>,
/// Initializer expression to set the value, if any
pub init: Option<P<Expr>>,
pub id: NodeId,
pub span: Span,
pub attrs: ThinVec<Attribute>,
}
so we can tell that the r#x is held in Pat, which is another struct:
// from rustc-ap-syntax-80.0.0
#[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Hash)]
pub struct Pat {
pub id: NodeId,
pub node: PatKind,
pub span: Span,
}
Looking at PatKind, there is a variant called Ident, which holds SpannedIdent, which is a wrapper type over Ident (don't be confused!).
// from rustc-ap-syntax-80.0.0
#[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Hash, Debug)]
pub enum PatKind {
/* ... */
/// A `PatKind::Ident` may either be a new bound variable (`ref mut binding @ OPT_SUBPATTERN`),
/// or a unit struct/variant pattern, or a const pattern (in the last two cases the third
/// field must be `None`). Disambiguation cannot be done with parser alone, so it happens
/// during name resolution.
Ident(BindingMode, SpannedIdent, Option<P<Pat>>),
/* ... */
}
// from rustc-ap-syntax-80.0.0
pub type SpannedIdent = Spanned<Ident>;
The function related to rewriting patterns are in patterns.rs. The following is a portion which deals with PatKind::Ident:
https://github.com/rust-lang-nursery/rustfmt/blob/0f55350c7d7658cc0701b1b8b83ab7d72591bc43/src/patterns.rs#L68-L90
Here, we need to replace ident.node.to_string() with rewrite_ident(ident).
Thank you for the explanation. It was extremely helpful!
Unfortunately Token::from_ast_ident doesn't actually recover the information, but "guesses" at whether a raw identifier would have been used based on whether the name would be invalid without it. So let might fail as a variable name, and so it gets returned as raw, but x, which didn't need to be, gets rewritten.
However, I can recover the information from the ParseSess! It has a field called raw_identifier_spans, which records which spans were raw identifiers. I'm going to be making the change that way (and will backtrack later if this wasn't the optimal way).
Edit: I guess figuring out all the places the rewrite needs to be fixed isn't straightforward. I thought to just search for instances of ident.to_string, but found that all but the three I'd already replaced were ast::Ident which doesn't include the span, which was how I was figuring out which identifiers were raw. But as long as there's enough information to rewrite those correctly, I think there aren't too many more.
Current progress is on my fork.
@mtn Thank you for your work!
Using raw_identifier_spans sounds good, though I am a bit worried - the doc comment on raw_identifier_spans says that it is used for feature gating raw identifiers. It seems to me that the field may get removed after the raw_identifiers feature gets stabilized. We may need to ask rustc developers to keep the field around or add a field to the AST node of ident which tells whether it is raw identifier or not (cc @nrc).
Also looking up HashSet every time we encounter Ident seems to add a decent amount of overhead, but given how raw identifier is implemented in libsyntax, I cannot think of a better way than using raw_identifiers_spans.
Right, I see what you mean about the feature-gating. I also think I don't have enough information to use raw_identifier_spans everywhere (namely, I don't have access to the span in every place the an Ident is rewritten, though I think I have access to the list of raw spans everywhere. It might be straightforward to work around this, but I'm not sure.
Also, thanks for mentioning making the function pub in the review -- I thought that was probably natural but didn't want to make that call without hearing about the project style.
Edit: Also, I think it would be nice in general if ast::Ident indicated whether it was raw of not, because I think that would cover all cases.
Sorry for the silence over the last few days. I'm going to be coming back to this, and will be looking into whether I can tell where raw identifiers were used in all cases.
To address a question I left off with, how should I proceed if the main or only way to tell if something was used as a raw identifier was to check a vector that might be removed after the feature is added to stable?
That raw_identifier_spans should not be used.
r#ident for non-keyword identifiers is something that rustfmt can (and should) "normalize" into ident because they are not supposed to be used in code (some edition-specific tweaks to this rule may be necessary in the future when rustfmt discerns between Rust 2015 and Rust 2018 code), so from_ast_ident is the right choice.
One exception is printing identifiers in macros - they should be printed precisely, but they are kept as Tokens, so the is_raw flag is available for them.
Hm, do you mind clarifying what you mean? For example, in the first code sample, are you saying we wouldn't want to rewrite the identifier that is an argument to println!?
That is, this
#![feature(raw_identifiers)]
fn main() {
let r#x = "hello, world";
println!("{}", r#x);
}
would produce this
#![feature(raw_identifiers)]
fn main() {
let x = "hello, world";
println!("{}", r#x);
}
because the x in let x would get rewritten but not identifiers. Or maybe you meant that this only applies to the macro name?
I think I'm missing the motivation a bit, which is why my speculation might be completely wrong.
@petrochenkov Thank you for clarifying the context behind raw identifiers!
@mtn I think that we only need to preserve the r# prefix inside macro defs (macro_rules! and macro 2.0). So we need to keep r#catch inside the following macro_rules!, but we can remove #r prefix inside the println! call in the above example.
#![feature(raw_identifiers)]
macro_rules! foo {
($x:expr) => {
let r#catch = $x + 1;
println!("{}", r#catch);
}
}
fn main() {
foo!(3);
}
Thank you @topecongiro and @petrochenkov! I really appreciate the help.
Another question related to the implementation. I guess the reformatting where the raw identifiers are removed the macro body happens here (around line 630 or lib.rs):
// Wrap the given code block with `fn main()` if it does not have one.
let snippet = enclose_in_main_block(code_snippet, config);
let mut result = String::with_capacity(snippet.len());
let mut is_first = true;
// Trim "fn main() {" on the first line and "}" on the last line,
// then unindent the whole code block.
let formatted = format_snippet(&snippet, config)?;
// 2 = "}\n"
Before the call to format_snippet, the prefixes are there, and they are removed right after. Of course, these are the same methods that are used to format lots of other things. I can think of things I could do to pass in a flag that we're formatting the inside of a macro declaration. But I'm not sure if this is the desired way.
Am I missing the point or looking in the wrong place? I know there was a suggestion that there'd only be Token types, but I am not really seeing that (and I started tracing from rewrite_macro).
Hm, I started this at the beginning of the (school) quarter, and it remains undone as I'm approaching the end. Sorry -- will check back on this in a few days!
Closed via #2807.
Most helpful comment
That
raw_identifier_spansshould not be used.r#identfor non-keyword identifiers is something that rustfmt can (and should) "normalize" intoidentbecause they are not supposed to be used in code (some edition-specific tweaks to this rule may be necessary in the future when rustfmt discerns between Rust 2015 and Rust 2018 code), sofrom_ast_identis the right choice.One exception is printing identifiers in macros - they should be printed precisely, but they are kept as
Tokens, so theis_rawflag is available for them.