Literals:
Comments:
Path validators: I am not sure if we check this, but stuff like super::super::foo or use foo::{crate::bar}; should be forbidden.
These are still up for grabs if anybody wants to get their hands dirty and contribute something small but useful!
The relevant code is currently in crates/ra_syntax/src/validation.rs. The existing validators should serve as decent examples, but please don't hesitate to ask how to go about it, if you're interested in giving it a go!
I'll take a crack at this over the next couple of days.
@chuck-flowers, did you already start? I am also working on this, though not very fast ...
I haven't done much yet
@chuck-flowers I am not contributing very fast, but I suppose this issue is not the most important one anyway: it was opened more than a year ago, and I have already learned a decent amount of the codebase around ra_syntax, where the validation occurs.
Well, maybe you could tackle some other issue, to avoid conflicts with my changes? I also have some unfinished work left over from the previous refactoring...
No problem. I can find something else to work on.
Okay, thanks )
@matklad
@kiljacken
I see we are dropping some useful information from rustc_lexer when tokenizing the input.
rustc_lexer::TokenKind contains information about whether a token is terminated, whether a block comment has a closing */, and some other helpful details.
See its definition here (I wish I could embed code from foreign repos in github):
https://github.com/rust-lang/rust/blob/a916ac22b9f7f1f0f7aba0a41a789b3ecd765018/src/librustc_lexer/src/lib.rs#L121
pub enum LiteralKind {
/// "12_u8", "0o100", "0b120i99"
Int { base: Base, empty_int: bool },
/// "12.34f32", "0b100.100"
Float { base: Base, empty_exponent: bool },
/// "'a'", "'\\'", "'''", "';"
Char { terminated: bool },
/// "b'a'", "b'\\'", "b'''", "b';"
Byte { terminated: bool },
/// ""abc"", ""abc"
Str { terminated: bool },
/// "b"abc"", "b"abc"
ByteStr { terminated: bool },
/// "r"abc"", "r#"abc"#", "r####"ab"###"c"####", "r#"a"
RawStr { n_hashes: usize, started: bool, terminated: bool },
/// "br"abc"", "br#"abc"#", "br####"ab"###"c"####", "br#"a"
RawByteStr { n_hashes: usize, started: bool, terminated: bool },
}
But here we drop that knowledge:
https://github.com/rust-analyzer/rust-analyzer/blob/3a7724e44181ccd5c248589538bd82458b5a9407/crates/ra_syntax/src/parsing/lexer.rs#L17-L28
Well, saving this info in ra_parser::SyntaxKind is not possible, since its representation must be fixed to u16. I guess we would need to recompute that anyway, wouldn't we?
If we do need to recompute it, I think we could reuse rustc_lexer::first_token(), which would return the same token that was previously dropped. Or do we want to lessen the coupling and implement that logic in our own crate?
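For illustration, recomputing one of the dropped flags by re-lexing a single token's text might look like this (a minimal sketch against rustc_lexer's public API as quoted above; is_unterminated_str is a hypothetical helper, not existing code):

use rustc_lexer::{first_token, LiteralKind, TokenKind};

// Hypothetical helper: re-lex the text of a token we already know is a
// string literal and ask rustc_lexer whether it had a closing quote.
fn is_unterminated_str(token_text: &str) -> bool {
    match first_token(token_text).kind {
        TokenKind::Literal { kind: LiteralKind::Str { terminated }, .. } => !terminated,
        _ => false,
    }
}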
Ideally, the text tokenization function should produce a separate list of diagnostics, just like SourceFile::parse does.
@matklad so do you think we can rework our tokenize(text: &str) -> Vec<Token>
into something like tokenize(text: &str) -> impl Iterator<Item = (Token, Option<SyntaxError>)>? I also wonder how we would go about validation separately from tokenizing (though, in fact, why would the user need that at all, if to get the tokens they have to tokenize, which already returns the diagnostics?).
Okay, let me re-review the whole ra_syntax crate and think of the most rational decision...
I'd imagine tokenize(text: &str) -> (Vec<Token>, Vec<SyntaxError>)
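For concreteness, a minimal sketch of that shape (Token, SyntaxError, and lex_first_token here are simplified stand-ins, not the actual ra_syntax definitions):

#[derive(Debug)]
pub struct Token { pub len: usize }

#[derive(Debug)]
pub struct SyntaxError { pub msg: String, pub offset: usize }

// Stand-in for a single-token lexer that also reports an optional error
// attached to the token it produced.
fn lex_first_token(text: &str) -> (Token, Option<SyntaxError>) {
    // Real logic would delegate to rustc_lexer here.
    (Token { len: text.len() }, None)
}

pub fn tokenize(text: &str) -> (Vec<Token>, Vec<SyntaxError>) {
    let (mut tokens, mut errors) = (Vec::new(), Vec::new());
    let mut offset = 0;
    while offset < text.len() {
        let (token, err) = lex_first_token(&text[offset..]);
        offset += token.len;
        tokens.push(token);
        errors.extend(err); // Option<T> iterates over zero or one items
    }
    (tokens, errors)
}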
@matklad
Okay, I just want to clarify whether you do want to go with vectors and not iterators.
The reason I used impl Iterator in the first place was to let the caller decide whether they want all the tokens at once.
We already have such use cases:
rustc_lexer also exposes an iterator API.
But another solution would be to add an explicit fn first_token(text: &str) -> Option<Token> function; rustc_lexer does that as well.
A better name would be single_token as they require exactly one token.
@bjorn3 well, to me, first already means that it will return exactly one token.
In the mentioned examples there must not be two or more tokens, while first_token implies that more tokens may follow.
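For concreteness, the difference might look like this (single_token is a hypothetical helper; first_token is rustc_lexer's real entry point as of this discussion):

use rustc_lexer::{first_token, Token};

// Succeeds only if `text` lexes to exactly one token; unlike first_token,
// it rejects input that has trailing tokens.
fn single_token(text: &str) -> Option<Token> {
    if text.is_empty() {
        return None;
    }
    let token = first_token(text);
    if token.len == text.len() {
        Some(token)
    } else {
        None
    }
}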
Well, to conclude the previous ideas: I am not ready to choose which of the proposed design approaches would be best just yet.
From my perspective, I need to review the ra_syntax and ra_parser crates thoroughly first.
I don't want to jump straight to conclusions; I try to be rational, so I'll get there once I have enough context.
This shouldn't be too complicated, since the codebase is pretty neat and I already have more than a high-level view of these two crates.
I may ask more questions here if you don't mind (just for the record, in case of any unforeseen circumstances).
@Veetaha I think both iterators and vectors might work. I don't think we necessarily need iterators (we either dump everything into a Vec anyway, or tokenize a really short string which should lex to a single token). In these cases, in Rust, I usually just default to Vecs, as they are easier to work with (they don't carry a lifetime and don't require opaque return types).
It might be a good idea to move those two "odd" uses of tokenize inside the lexer module, so that we don't actually depend on the specifics of the interface.
Some of the required validations were implemented in PR #2911. Note that they are produced at tokenization time and were named TokenizeErrors.
Other errors, like "Doc comments are attached to nodes" or "Symbol path must not contain crate keyword in the middle", are soon to be implemented in a new PR. I'd like to separate the latter errors into some dedicated enum like SemanticError, following the pattern set by the SyntaxErrorKind definition (though one could argue that SemanticError is not a subtype of SyntaxError, and I'd invite them to propose a better name then...). I'll try to come up with a better name if you guys wish.
Here is the updated todo list:
- crate keyword in the middle [UPD according to PR review]
- (Token, Option<SyntaxError>) and (Vec<Token>, Vec<SyntaxError>) instead of ParsedToken[s]
- Location, and change SyntaxError from being an enum to a { msg: String, loc: TextRange } shape
- #[rustfmt::skip] when better formatting is obvious
- if {} else {}
- match ... { ... => return (val, Some(err)), ... => val } instead of local ok(), ok_if(), err() functions

I took a shot at validating the use of crate in paths (https://github.com/rust-analyzer/rust-analyzer/pull/4178), but it turns out it gets pretty subtle in the case of use items, because crate is allowed at the semantic root of a path, not the syntactic root. Example:
use {crate as bar, {{crate as cat, foo}}};
This is pretty obviously solvable with a traversal of the tree, but I'm not sure what performance implications I should be worried about, if any. Any guidance would be super helpful.
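To make the subtlety concrete, here is a toy sketch of the rule (the UseTree model below is illustrative, not rust-analyzer's actual syntax tree): a brace group does not consume a path segment, so its children remain at the semantic root.

enum UseTree {
    // `foo`, optionally followed by `::rest`
    Segment(String, Option<Box<UseTree>>),
    // `{a, b, c}`
    Group(Vec<UseTree>),
}

fn check(tree: &UseTree, at_semantic_root: bool, errors: &mut Vec<String>) {
    match tree {
        UseTree::Segment(name, rest) => {
            if name == "crate" && !at_semantic_root {
                errors.push("`crate` is only allowed at the start of a path".into());
            }
            // Anything after a real segment is no longer the root.
            if let Some(rest) = rest {
                check(rest, false, errors);
            }
        }
        // Braces are transparent: `use {crate as bar, {{crate as cat, foo}}};`
        // keeps every `crate` at the semantic root, even though syntactically
        // the segments are nested.
        UseTree::Group(children) => {
            for child in children {
                check(child, at_semantic_root, errors);
            }
        }
    }
}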
Well, my initial thought was not to have a separate validation step at all, and to collect errors like crate in an invalid position during parsing itself, where you have as much context available as you need, and even more. I am still not sure we cannot follow this idea, but I guess @matklad has a strong opinion against it. Anyway, this does seem like more of a semantic error; even the Rust reference doesn't prohibit the crate keyword in arbitrary positions at the grammar level.
From my naive perspective, the ultimate solution would be to desugar patterns like use {{{}}, {}}, where a path segment starts with a curly brace right away, by removing the curly braces completely and unwrapping this into multiple use statements. However, the desugaring part of rust-analyzer is not obvious, at least to me, these days. Our syntax trees preserve the source code form without any transformations (I may be mistaken; maybe @matklad can clarify), so this would require additional groundwork for desugaring plumbing...
Yeah, I also thought about moving this down to the parser, but I think it's important to move as many errors as possible out of the parser and into a separate pass/flow, so that in the long run we can give higher-quality error messages.
Also, I found the desugaring / lowering pass: https://github.com/rust-analyzer/rust-analyzer/blob/master/crates/ra_hir_def/src/path/lower.rs
Reading through it, I feel like an idiot: there's a PathSegmentKind that will tell me whether the segment is the crate keyword, so I'll rewrite in terms of that. For now, though, I think there's still value in the conservative validation pass, so long as we make sure it doesn't error on UseTrees.
Imo, this should just be a separate traversal of the syntax tree. I wouldn't worry about perf.
As there seems to have been no activity here for over a month, would it make sense for me to try to make some progress on this? @Veetaha, are you planning to do more work here?
@jacg yup, feel free to go :)
(no activity planned by me so far)
Looking at validation.rs, I'm really missing some tests to help explain what is going on, which cases have been taken into consideration, etc.
Would there be any objection to rewriting the validate_* functions so that they return Vec<SyntaxError> (or maybe just a single Option<SyntaxError>; it's not entirely clear whether multiple errors can accumulate, or whether at most one is generated), to be combined in the main validate function, and then adding a bunch of unit tests for the validate_*s?
Would there be any objection to rewriting the validate_* functions so that they return Vec<SyntaxError>
The functions by design accept a sink argument (&mut Vec<SyntaxError>) so as not to do N allocations. Returning an Option<SyntaxError> should be ok, though.
Actually, Option<SyntaxError> is a special case of Iterator<Item = SyntaxError> (one that yields at most one element), but I suppose using impl Iterator<Item = T> in validation will be hard, so &mut Vec doesn't seem like a bad idea.
@matklad So I can rely on each validate_* never being required to produce more than one SyntaxError per invocation?
@Veetaha
I suppose using impl Iterator<Item = T> in validation will be hard, so &mut Vec doesn't seem like a bad idea
I don't follow. What's wrong with the validate_*s returning Option<SyntaxError>, which can easily be combined into the errors inside validate?
So I can rely on each validate_* never being required to produce more than one SyntaxError per invocation?
I haven't looked at the code in a while, but I think producing more than one error is ok. Generally, a sink argument is the pattern that usually works best for producing a bunch of errors/warnings.
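A minimal sketch of the sink pattern being described (SyntaxError and validate_literal here are simplified stand-ins for the real ra_syntax code):

struct SyntaxError { msg: String }

// Each validator pushes zero or more errors into the shared accumulator
// instead of allocating and returning its own collection.
fn validate_literal(text: &str, acc: &mut Vec<SyntaxError>) {
    if text.starts_with('"') && !text[1..].ends_with('"') {
        acc.push(SyntaxError { msg: "unterminated string".into() });
    }
}

fn validate(literals: &[&str]) -> Vec<SyntaxError> {
    let mut acc = Vec::new(); // one allocation, shared by all validators
    for lit in literals {
        validate_literal(lit, &mut acc);
    }
    acc
}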
So, in order to write unit tests for these, do you recommend just leaving the sink argument, and having the tests provide an empty one? It would make expressing the tests a bit more of a pain, but I guess that could actually be hidden in a little macro.
I think those are best tested via data-driven tests, so that the tests are completely oblivious to the actual interface:
Yup, just adding a bunch of input .rs files somewhere under ra_syntax/test_data/parser/error should do the trick, I guess. Once you run cargo test, a .rast file will be autogenerated, and then you should check that it has the errors you expected at the end of the syntax tree.
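For example, a fixture for the unterminated raw string discussed below might look like this (the file name is hypothetical; the exact .rast output is produced by the test harness, so it isn't reproduced here):

// Hypothetical fixture: crates/ra_syntax/test_data/parser/err/0000_unterminated_raw_string.rs
// Running `cargo test` regenerates the sibling .rast file, whose trailing
// error lines are what the test asserts against.
fn main() {
    let x = r#"abc";
}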
Oh, that's excellent, I'm so glad this exists!
IIUC, the purpose of the validator is to generate syntax errors in the context of the parser being tolerant to invalid input: The parser accepts incorrect syntax, and the validator, in a separate traversal of the tree, should generate errors in such cases.
More specifically, in the context of this issue, let's consider RawString, which is the next unchecked SyntaxKind in the TODO list in the top comment.
I would expect validate_literal occasionally to receive tokens with SyntaxKind::RAW_STRING which do not correctly represent raw strings. The work to be done here would be to recognize such situations and generate an appropriate SyntaxError. However, I'm failing to find such a case. For example, if I type let x = r#"abc"; in my editor, I am immediately presented with
Syntax Error: Missing trailing `"` with `#` symbols to terminate the raw string literal
If I save this sample somewhere in test_data/parser/err, tests fail with the same message.
So it is unclear to me what needs to be done to complete raw string validation.
I think every raw string (and raw byte string) that parses correctly (has the right number of hashes and quotes) is valid.
An example of an invalid string would be "\c" or "\xgg".
Does that mean that there is nothing left to be done for raw strings?
I might be wrong, but I think there's nothing to be done for raw strings and raw binary strings.
@jacg
Does that mean that there is nothing left to be done for raw strings?
I am 90% sure that yes. Raw string errors are collected from rustc_lexer during the tokenization step: https://github.com/rust-analyzer/rust-analyzer/blob/bd61ad756cc0a7bfeaa5dae81ac5ab50a2e71697/crates/ra_syntax/src/parsing/lexer.rs#L218-L239.
I recommend looking at this comment with the remaining list of errors that are not yet reported, this is more up-to-date.
I think we've implemented the bulk of the work, so I'm closing this issue; the rest should be handled in specific issues.