Literals:
Comments:
Path validators: I am not sure if we check this, but stuff like super::super::foo or use foo::{crate::bar}; should be forbidden.
These are still up for grabs if anybody wants to get their hands dirty and contribute something small but useful!
The relevant code is currently in crates/ra_syntax/src/validation.rs. The existing validators should serve as decent examples, but please don't hesitate to ask how to go about it, if you're interested in giving it a go!
I'll take a crack at this over the next couple of days.
@chuck-flowers, did you already start? I am also working on this, though not very fast ...
I haven't done much yet
@chuck-flowers I am not contributing very fast, but I suppose this issue is not the most important one anyway: it was opened more than a year ago, and I have already learned a decent amount of the codebase around ra_syntax, where the validation occurs.
Well, maybe you could tackle some other issue, to avoid conflicts with my changes? I also have some unfinished work left over from the previous refactoring...
No problem. I can find something else to work on.
Okay, thanks )
@matklad
@kiljacken
I see we are dropping some useful information from rustc_lexer when tokenizing the input.
rustc_lexer::TokenKind contains information about whether a token is terminated, whether a block comment has a closing */, and some other helpful details.
See its definition here (I wish I could embed code from foreign repos in github):
https://github.com/rust-lang/rust/blob/a916ac22b9f7f1f0f7aba0a41a789b3ecd765018/src/librustc_lexer/src/lib.rs#L121
pub enum LiteralKind {
/// "12_u8", "0o100", "0b120i99"
Int { base: Base, empty_int: bool },
/// "12.34f32", "0b100.100"
Float { base: Base, empty_exponent: bool },
/// "'a'", "'\\'", "'''", "';"
Char { terminated: bool },
/// "b'a'", "b'\\'", "b'''", "b';"
Byte { terminated: bool },
/// ""abc"", ""abc"
Str { terminated: bool },
/// "b"abc"", "b"abc"
ByteStr { terminated: bool },
/// "r"abc"", "r#"abc"#", "r####"ab"###"c"####", "r#"a"
RawStr { n_hashes: usize, started: bool, terminated: bool },
/// "br"abc"", "br#"abc"#", "br####"ab"###"c"####", "br#"a"
RawByteStr { n_hashes: usize, started: bool, terminated: bool },
}
But here we drop that knowledge:
https://github.com/rust-analyzer/rust-analyzer/blob/3a7724e44181ccd5c248589538bd82458b5a9407/crates/ra_syntax/src/parsing/lexer.rs#L17-L28
Well, saving this info in ra_parser::SyntaxKind is not possible, since its representation must be fixed to u16. I guess we would need to recompute that anyway, wouldn't we?
If we do need to recompute it, I think we could reuse rustc_lexer::first_token(), which would return the same token that was previously dropped. Or do we want to lessen the coupling and implement that logic in our own crate?
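For illustration, recomputing one of the dropped flags by re-lexing a single token's text might look like this (a minimal sketch against rustc_lexer's public API as quoted above; is_unterminated_str is a hypothetical helper, not existing code):

use rustc_lexer::{first_token, LiteralKind, TokenKind};

// Hypothetical helper: re-lex the text of a token we already know is a
// string literal and ask rustc_lexer whether it had a closing quote.
fn is_unterminated_str(token_text: &str) -> bool {
    match first_token(token_text).kind {
        TokenKind::Literal { kind: LiteralKind::Str { terminated }, .. } => !terminated,
        _ => false,
    }
}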
Ideally, the text tokenization function should produce a separate list of diagnostics, just like SourceFile::parse does.
@matklad so do you think we can rework our tokenize(text: &str) -> Vec<Token>
into something like tokenize(text: &str) -> impl Iterator<Item = (Token, Option<SyntaxError>)>? I also wonder how we would go about validation separately from tokenizing (though, in fact, why would the user need that at all, if to get the tokens they have to tokenize, which already returns the diagnostics?).
Okay, let me re-review the whole ra_syntax crate and think of the most rational decision...
I'd imagine tokenize(text: &str) -> (Vec<Token>, Vec<SyntaxError>)
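For concreteness, a minimal sketch of that shape (Token, SyntaxError, and lex_first_token here are simplified stand-ins, not the actual ra_syntax definitions):

#[derive(Debug)]
pub struct Token { pub len: usize }

#[derive(Debug)]
pub struct SyntaxError { pub msg: String, pub offset: usize }

// Stand-in for a single-token lexer that also reports an optional error
// attached to the token it produced.
fn lex_first_token(text: &str) -> (Token, Option<SyntaxError>) {
    // Real logic would delegate to rustc_lexer here.
    (Token { len: text.len() }, None)
}

pub fn tokenize(text: &str) -> (Vec<Token>, Vec<SyntaxError>) {
    let (mut tokens, mut errors) = (Vec::new(), Vec::new());
    let mut offset = 0;
    while offset < text.len() {
        let (token, err) = lex_first_token(&text[offset..]);
        offset += token.len;
        tokens.push(token);
        errors.extend(err); // Option<T> iterates over zero or one items
    }
    (tokens, errors)
}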
@matklad
Okay, I just want to clarify whether you do want to go with vectors and not iterators.
The reason I used impl Iterator in the first place was to let the caller decide whether they want all the tokens at once.
We already have such use cases:
rustc_lexer also exposes an iterator API.
But another solution would be to add an explicit fn first_token(text: &str) -> Option<Token> function; rustc_lexer does that as well.
A better name would be single_token as they require exactly one token.
@bjorn3 well, to me, first already means that it will return exactly one token.
In the mentioned examples there must not be two or more tokens, while first_token implies that more tokens may follow.
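For concreteness, the difference might look like this (single_token is a hypothetical helper; first_token is rustc_lexer's real entry point as of this discussion):

use rustc_lexer::{first_token, Token};

// Succeeds only if `text` lexes to exactly one token; unlike first_token,
// it rejects input that has trailing tokens.
fn single_token(text: &str) -> Option<Token> {
    if text.is_empty() {
        return None;
    }
    let token = first_token(text);
    if token.len == text.len() {
        Some(token)
    } else {
        None
    }
}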
Well, to conclude the previous ideas: I am not ready to choose which of the proposed design approaches would be best just yet.
From my perspective, I need to review the ra_syntax and ra_parser crates thoroughly first.
I don't want to jump straight to conclusions; I try to be rational, so I'll get there once I have enough context.
This shouldn't be too complicated, since the codebase is pretty neat and I already have more than a high-level view of these two crates.
I may ask more questions here if you don't mind (just for the record, in case of any unforeseen circumstances).
@Veetaha I think both iterators and vectors might work. I don't think we necessarily need iterators (we either dump everything into a Vec anyway, or tokenize a really short string which should lex to a single token). In these cases, in Rust, I usually just default to Vecs, as they are easier to work with (they don't carry a lifetime and don't require opaque return types).
It might be a good idea to move those two "odd" uses of tokenize inside the lexer module, so that we don't actually depend on the specifics of the interface.
Some of the required validations were implemented in PR #2911. Note that they are produced at tokenization time and were named TokenizeErrors.
Other errors, like "Doc comments are attached to nodes" or "Symbol path must not contain crate keyword in the middle", are soon to be implemented in a new PR. I'd like to separate the latter errors into some dedicated enum like SemanticError, following the pattern set by the SyntaxErrorKind definition (though one could argue that SemanticError is not a subtype of SyntaxError, and I'd invite them to propose a better name then...). I'll try to come up with a better name if you guys wish.
Here is the updated todo list:
- crate keyword in the middle [UPD according to PR review]
- (Token, Option<SyntaxError>) and (Vec<Token>, Vec<SyntaxError>) instead of ParsedToken[s]
- Location, and change SyntaxError from being an enum to a { msg: String, loc: TextRange } shape
- #[rustfmt::skip] when better formatting is obvious
- if {} else {}
- match ... { ... => return (val, Some(err)), ... => val } instead of local ok(), ok_if(), err() functions

I took a shot at validating the use of crate in paths (https://github.com/rust-analyzer/rust-analyzer/pull/4178), but it turns out it gets pretty subtle in the case of use items, because crate is allowed at the semantic root of a path, not the syntactic root. Example:
use {crate as bar, {{crate as cat, foo}}};
This is pretty obviously solvable with a traversal of the tree, but I'm not sure what performance implications I should be worried about, if any. Any guidance would be super helpful.
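To make the subtlety concrete, here is a toy sketch of the rule (the UseTree model below is illustrative, not rust-analyzer's actual syntax tree): a brace group does not consume a path segment, so its children remain at the semantic root.

enum UseTree {
    // `foo`, optionally followed by `::rest`
    Segment(String, Option<Box<UseTree>>),
    // `{a, b, c}`
    Group(Vec<UseTree>),
}

fn check(tree: &UseTree, at_semantic_root: bool, errors: &mut Vec<String>) {
    match tree {
        UseTree::Segment(name, rest) => {
            if name == "crate" && !at_semantic_root {
                errors.push("`crate` is only allowed at the start of a path".into());
            }
            // Anything after a real segment is no longer the root.
            if let Some(rest) = rest {
                check(rest, false, errors);
            }
        }
        // Braces are transparent: `use {crate as bar, {{crate as cat, foo}}};`
        // keeps every `crate` at the semantic root, even though syntactically
        // the segments are nested.
        UseTree::Group(children) => {
            for child in children {
                check(child, at_semantic_root, errors);
            }
        }
    }
}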
Well, my initial thought was not to have a separate validation step at all, and to collect errors like crate in an invalid position during parsing itself, where you have as much context available as you need, and even more. I am still not sure we cannot follow this idea, but I guess @matklad has a strong opinion against it. Anyway, this does seem like more of a semantic error; even the Rust reference doesn't prohibit the crate keyword in arbitrary positions at the grammar level.
From my naive perspective, the ultimate solution would be to desugar patterns like use {{{}}, {}}, where a path segment starts with a curly brace right away, by removing the curly braces completely and unwrapping this into multiple use statements. However, the desugaring part of rust-analyzer is not obvious, at least to me, these days. Our syntax trees preserve the source code form without any transformations (I may be mistaken; maybe @matklad can clarify), so this would require additional groundwork for desugaring plumbing...
Yeah, I also thought about moving this down to the parser, but I think it's important to move as many errors as possible out of the parser and into a separate pass/flow, so that in the long run we can give higher-quality error messages.
Also, I found the desugaring / lowering pass: https://github.com/rust-analyzer/rust-analyzer/blob/master/crates/ra_hir_def/src/path/lower.rs
Reading through it, I feel like an idiot: there's a PathSegmentKind that will tell me whether the segment is the crate keyword, so I'll rewrite in terms of that. For now, though, I think there's still value in the conservative validation pass, so long as we make sure it doesn't error on UseTrees.
Imo, this should just be a separate traversal of the syntax tree. I wouldn't worry about perf.
As there seems to have been no activity here for over a month, would it make sense for me to try to make some progress on this? @Veetaha, are you planning to do more work here?
@jacg yup, feel free to go :)
(no activity planned by me so far)
Looking at validation.rs, I'm really missing some tests to help explain what is going on, which cases have been taken into consideration, etc.
Would there be any objection to rewriting the validate_* functions so that they return Vec<SyntaxError> (or maybe just a single Option<SyntaxError>; it's not entirely clear whether multiple errors can accumulate, or whether at most one is generated), to be combined in the main validate function, and then adding a bunch of unit tests for the validate_*s?
Would there be any objection to rewriting the validate_* functions so that they return Vec<SyntaxError>
The functions by design accept a sink argument (&mut Vec<SyntaxError>) so as not to do N allocations. Returning an Option<SyntaxError> should be ok, though.
Actually, Option<SyntaxError> is a special case of Iterator<Item = SyntaxError> (one that yields at most one element), but I suppose using impl Iterator<Item = T> in validation will be hard, so &mut Vec doesn't seem like a bad idea.
@matklad So I can rely on each validate_* never being required to produce more than one SyntaxError per invocation?
@Veetaha
I suppose using impl Iterator<Item = T> in validation will be hard, so &mut Vec doesn't seem like a bad idea
I don't follow. What's wrong with the validate_*s returning Option<SyntaxError>, which can easily be combined into the errors inside validate?
So I can rely on each validate_* never being required to produce more than one SyntaxError per invocation?
I haven't looked at the code in a while, but I think producing more than one error is ok. Generally, a sink argument is the pattern that usually works best for producing a bunch of errors/warnings.
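A minimal sketch of the sink pattern being described (SyntaxError and validate_literal here are simplified stand-ins for the real ra_syntax code):

struct SyntaxError { msg: String }

// Each validator pushes zero or more errors into the shared accumulator
// instead of allocating and returning its own collection.
fn validate_literal(text: &str, acc: &mut Vec<SyntaxError>) {
    if text.starts_with('"') && !text[1..].ends_with('"') {
        acc.push(SyntaxError { msg: "unterminated string".into() });
    }
}

fn validate(literals: &[&str]) -> Vec<SyntaxError> {
    let mut acc = Vec::new(); // one allocation, shared by all validators
    for lit in literals {
        validate_literal(lit, &mut acc);
    }
    acc
}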
So, in order to write unit tests for these, do you recommend just leaving the sink argument, and having the tests provide an empty one? It would make expressing the tests a bit more of a pain, but I guess that could actually be hidden in a little macro.
I think those are best tested via data-driven tests, so that the tests are completely oblivious to the actual interface:
Yup, just adding a bunch of input .rs files somewhere under ra_syntax/test_data/parser/error should do the trick, I guess. Once you run cargo test, a .rast file will be autogenerated, and then you should check that it has the errors you expected at the end of the syntax tree.
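For example, a fixture for the unterminated raw string discussed below might look like this (the file name is hypothetical; the exact .rast output is produced by the test harness, so it isn't reproduced here):

// Hypothetical fixture: crates/ra_syntax/test_data/parser/err/0000_unterminated_raw_string.rs
// Running `cargo test` regenerates the sibling .rast file, whose trailing
// error lines are what the test asserts against.
fn main() {
    let x = r#"abc";
}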
Oh, that's excellent, I'm so glad this exists!
IIUC, the purpose of the validator is to generate syntax errors in the context of the parser being tolerant to invalid input: The parser accepts incorrect syntax, and the validator, in a separate traversal of the tree, should generate errors in such cases.
More specifically, in the context of this issue, let's consider RawString, which is the next unchecked SyntaxKind in the TODO list in the top comment.
I would expect validate_literal occasionally to receive tokens with SyntaxKind::RAW_STRING which do not correctly represent raw strings. The work to be done here would be to recognize such situations and generate an appropriate SyntaxError. However, I'm failing to find such a case. For example, if I type let x = r#"abc"; in my editor, I am immediately presented with
Syntax Error: Missing trailing `"` with `#` symbols to terminate the raw string literal
If I save this sample somewhere in test_data/parser/err, tests fail with the same message.
So it is unclear to me what needs to be done to complete raw string validation.
I think every raw string (and raw byte string) that parses correctly (has the right number of hashes and quotes) is valid.
An example of an invalid string would be "\c" or "\xgg".
Does that mean that there is nothing left to be done for raw strings?
I might be wrong, but I think there's nothing to be done for raw strings and raw binary strings.
@jacg
Does that mean that there is nothing left to be done for raw strings?
I am 90% sure that yes. Raw string errors are collected from rustc_lexer during the tokenization step: https://github.com/rust-analyzer/rust-analyzer/blob/bd61ad756cc0a7bfeaa5dae81ac5ab50a2e71697/crates/ra_syntax/src/parsing/lexer.rs#L218-L239.
I recommend looking at this comment with the remaining list of errors that are not yet reported, this is more up-to-date.
I think we've implemented the bulk of the work, so I'm closing this issue; the rest should be handled in specific issues.