context: https://github.com/github/semantic/issues/461
context: https://github.com/github/semantic/blob/master/docs/adding-new-languages.md (specifically, step 1 - the other steps are about to change according to that team).
GitHub now has built in functionality for code navigation. The first step to supporting it for Dart is to provide a treesitter grammar implementation.
Right now, I do not believe there is a complete specification of Dart's grammar to work from if someone wants to do this. I'm also half-hoping that someone on the langauge team has already noticed this and may be working on it :)
I'd be happy to contribute to something like this, but would like to know I'm working from the right sources.
/cc @munificent @leafpetersen @eernstg @lrhn
We do have an ANTLR grammar, so if it really is easily transcribed into a tree-sitter grammar, then we should be able to get there.
Nice! That looks like it should help.
The ANTLR grammar should be helpful in this respect; however, note that well-written tree-sitter parsers often differ from idiomatic ANTLR parsers. The docs have more information about this, but the upshot is that the grammar should use the built-in tree-sitter combinators for operator precedence (prec), alternation (choice), and repetition (seq); the ANTLR/EBNF grammar keeps things like operator precedence implicit in its structure, as the Dart spec (p. 198) notes. Luckily, the subsequent page provides a table of operators and their precedence, which should provide the majority of information required to express the expression parser with prec and friends.
An idiomatic tree-sitter grammar is important beyond just maintainability reasons: the code generation steps we use to generate AST types from these grammars relies on the presence of these combinators to generate efficient types. Luckily, these combinators usually make a given grammar shorter and more readable.
Note that the precedence table in the Dart language specification comes with no guarantees: The actual notion of operator precedence must be derived from the grammar.
The grammar implies almost all the precedence relationships in the table: For example, when you parse await e then e must be a unary expression, so await 1 + 1 means (await 1) + 1 and not await (1 + 1) because 1 + 1 cannot be derived from <unaryExpression>.
However, when you parse x..y += z, the structure is <conditionalExpression> <cascadeSequence> where the latter is '..' <cascadeSelector> <cascadeAssignment>, whereas x += y..z is <assignableExpression> <assignmentOperator> <expression>. So a blanket assumption that += has a higher precedence than .. won't work, because these two examples just showed that .. and += can occur under/over each other in both directions in an abstract syntax tree, without any use of parentheses.
So you might be better off just using a grammar which is very similar to the ANTLR grammar, and avoid using the built-in prec specifications, because that will actually give you the precise notion of precedence. Do you have any idea why prec is recommended? Maybe it's cheaper (time/space-wise) to use prec than having a larger grammar?
the ANTLR/EBNF grammar keeps things like operator
precedence implicit in its structure
Right, we don't use anything like a precedence specification on derivation rules in the ANTLR grammar (and I don't think any such mechanism has direct support in ANTLR, but you never know ;-), but there is one more property of ANTLR grammars that we need to keep in mind: In general, the order of the alternatives in an ANTLR rule specifies a precedence relation (among alternatives, not among operators). This means that a successful parse using alternative _k_ of a rule _R_ will be used, even in the case where there is another successful parse using alternative _k+n_ of _R_. However reliance on this would emerge when the grammar is processed by some other parser generator tool than ANTLR, because it would then be an ambiguity. For instance, a reduce-reduce conflict in a LALR(1) parser generator. In that case you should be able to resolve the ambiguity by specifying that the conflict is eliminated by preferring alternative _k_.
@eernstg Thanks for the detailed analysis! Some points:
I don鈥檛 believe that the described x..y += z issue should pose a problem, though I appreciate the heads up. I know that this case is handled correctly in the Ruby and Python grammars. (The Python grammar definition is defined in a direct ANTLR style but was portable to a prec-having tree-sitter parser, so there鈥檚 historical precedent for such.) In any case, the tree-sitter-dart parser can check in its corpus file that this case is handled correctly.
you should be able to resolve the ambiguity by specifying that the conflict is eliminated by preferring alternative k.
tree-sitter has pretty sophisticated mechanisms for resolving ambiguities (it鈥檚 interactive, even!); you can check the docs here.
Maybe it's cheaper (time/space-wise) to use prec than having a larger grammar?
I believe that proper use of prec would generate a better parser than encoding precision manually, though @maxbrunsfeld is more qualified than I to comment on this. I know that avoiding prec would mean you don鈥檛 get to opt into the nice features that make tree-sitter grammars so pleasant (relatively speaking) to write.
Just curious - does the actual Dart compiler use the ANTLR parser, or is it maintained separately from the main parser?
In any case, there are multiple reasons that Tree-sitter grammars use operator precedence
annotations rather than implying operators' precedence from pure CFG rules.
The main reason is that Tree-sitter's output is a concrete syntax tree - a tree structure whose structure directly mirrors the structure of the source grammar. By using operator precedence, you can produce a tree that is far, far more intuitive to analyze than you can produce by using a pure CFG.
For example, suppose we wanted to parse a simple Dart expression containing just the identifier x. In a typical Tree-sitter grammar, we'd produce one node for the x variable:
(identifier)
But if we produced a concrete syntax tree directly from the Dart ANTLR grammar, the syntax tree for x would look like this:
(expression
(conditionalExpression
(ifNullExpression
(logicalOrExpression
(logicalAndExpression
(equalityExpression
(relationalExpression
(bitwiseOrExpression
(bitwiseXorExpression
(bitwiseAndExpression
(shiftExpression
(additiveExpression
(multiplicativeExpression
(unaryExpression
(postfixExpression (identifier))))))))))))))))
That would make code analysis much more tedious and verbose. It would also incur a large slowdown, both in parsing (producing the syntax tree) and in downstream analysis (consuming the syntax tree).
@maxbrunsfeld The ANTLR parser appears to be a reference work only: I found this parser, which appears to be a hand-rolled parser first generated from somewhere else, though Dart people should inform me if I鈥檓 wrong about that.
@patrickt @maxbrunsfeld the actual parser that is in use by our tools lives here. It is completely hand-written.
For creating the Tree-sitter dart parser, the approach I would recommend would be to not copy any existing document verbatim, but to try and create a grammar that results in a tree which resembles to some degree the parse trees created by the actual Dart parser.
The ANTLR grammar is still a useful guide, which can help to indicate what structures are valid, by manual inspection. But from our experience creating grammars for many other languages with similar kinds of language specs, it's not something that we would want to follow rigidly.
@maxbrunsfeld wrote:
does the actual Dart compiler use the ANTLR parser
@mraleph already mentioned that the tools use this hand-written parser.
Here's the other side of that coin: I created the ANTLR parser in order to improve on the degree to which we can keep the grammar snippets in the language specification consistent and correct, and also in order to make it easier to investigate the viability of proposals for new grammar rules.
[parse tree with a dozen levels or more] ..
That would make code analysis much more tedious and verbose
Sure, so there is in general a need for an abstract syntax tree, and different tools (parser generators) have different kinds and levels of support for expressing that.
But I think the main point here is the following, namely that there is no need to insist on a perfect result as long as the purpose is syntax highlighting and similar tasks:
create a grammar that results in a tree which _resembles_ to some degree
the parse trees created by the actual Dart parser
@patrickt wrote:
tree-sitter has pretty sophisticated mechanisms ..
Thanks for the ref, seems that they've got a very nice system!
I created a dart grammar for tree-sitter located here: https://github.com/UserNobody14/tree-sitter-dart/tree/master I haven't been able to access my npm account for a while after my phone broke, but I'll try to get it available there as well. It is still missing a few of the smaller features from the dart grammar (like rethrow error I believe) and I haven't tested many of the more fragile features extensively but at least its a start! I'm not sure how much work will have to be done after this to get it working fully with semantic.
I've been helping out with the tree-sitter-dart parser that @UserNobody14 has created.
I've run it on several of my projects with 0 errors found.
I've also parsed the entire flutter codebase with the parser.
On the output of the flutter codebase parse, we are only getting ~0.006% of lines in the entire parsed output having any errors. I think we have test cases for those failures. Any feedback would be helpful.
We are trying to stick close to the dart grammar, but also using the tree-sitter recommendations.
Most helpful comment
I created a dart grammar for tree-sitter located here: https://github.com/UserNobody14/tree-sitter-dart/tree/master I haven't been able to access my npm account for a while after my phone broke, but I'll try to get it available there as well. It is still missing a few of the smaller features from the dart grammar (like rethrow error I believe) and I haven't tested many of the more fragile features extensively but at least its a start! I'm not sure how much work will have to be done after this to get it working fully with semantic.