Julia: AST Format Cleanup

Created on 10 May 2017  Â·  14Comments  Â·  Source: JuliaLang/julia

Let's discuss cleaning up some kinks in Julia's AST format to make things easier for macro authors.

Specific Issues

In many cases, the parser does too much work, losing information:

  • :(1(2)) → :(1 * 2)
  • a ? b : c → if a ...
  • if a b elseif c d end → if a b else (if c d end) end
  • f() do ... → f(() -> ...)

I think we should aim for round-trip-ability here and have this work be done only at lowering time. Short function definitions and broadcasting syntax are examples of this being done right.

This is more of a parser issue than an AST issue, but it's not currently possible to write for $cond body end as the parser checks for an x = y iteration specification; instead you must use $(Expr(:for, cond, :body). For the same reason, it's not possible to write @capture(ex, for cond_ body_ end) with MacroTools.

let and for have the opposite ordering for declarations and body.

The existence of both kw and = is strange as it gives us the opposite problem to that above: we have two ways to represent the same surface syntax. Perhaps this is necessary; MacroTools normalises kw to = to avoid surprising failures.

We currently have two ways to represent keyword arguments in calls: f(a, b = c) which has little processing, and f(a; b = c) which becomes f($(Expr(:parameters, Expr(:kw, :a, :b)), a). The latter is one of the few places (let is another) where the ordering/structure of the expression mistmatches that of the surface syntax. I suspect a nicer design is possible but I don't have an exact proposal. A matching pattern like f(a, xs__) currently matches kwargs style and not the other, so at least putting the params at the end of the call consistently would help with that.

Motivation

Losing information about the AST limits what macros can do, and/or adds strange edge cases to what they can do. Consider @static if for example:

@static if foo
  f()
elseif bar
  g()
else
  h()
end

@static if foo
  f()
else
  bar ? g() : h()
end

The fact that we can't tell these two examples apart means that we have to choose a surprising behaviour for one of them.

As another example, due to the parsing of 1(2) you can't use DataFlow.jl's syntax for graphs if the graph contains literal numbers. This isn't a crucial use case, but it illustrates the kind of surprising edge cases this leads to, even for innocuous-looking transformations.

Concerns

I'm not sure to what extent this can be done after 1.0. What levels of AST are part of Julia's public API? People will be relying on it in practice, although the users should be few enough and expert enough to alleviate the impact.

This adds some work for macro authors similar to that caused by short/long-form function definitions, in cases where the difference doesn't matter. This should be very manageable, as it's easy to provide utility functions that do various parts of normalisation (as MacroTools does already for longdef and shortdef).

Let me know if I've missed anything. Input from other macro writers welcome.

breaking parser

Most helpful comment

Please mention more specific examples of weird ASTs if anybody finds any, but otherwise I think we're done here.

All 14 comments

I had assumed it would be easier to write macros if parsing normalized things to some extent, but if people feel that hasn't been the case then ok.

I agree we should try to get rid of the kw head and only use =.

elseif, ?, and do should be pretty easy to change.

I assume the alternative parsing for 1(2) would be as a call. However given how common call nodes are, it seems extremely taxing to check every call to see if the first argument is a numeric literal, in which case you have to know that it will be a multiply instead. We could eliminate implicit multiplication with parens, but then we'd be left with 2(...) as wasted syntax.

I like these being pure sugar and not having to deal with them as separate cases in macros. A tokenizer should be able to round trip these things, but macros shouldn't need to worry about what are semantically just formatting differences in their input.

@JeffBezanson glad to hear that you're on board.

it seems extremely taxing to check every call to see if the first argument is a numeric literal

Assuming you're talking about the burden on macro writers, it shouldn't be that bad. It's a trivial utility function in fact (and this generalises to all the discussed changes):

lower_muls(ex) = postwalk(x -> @capture(x, n_Number(y_)) ? :($n*$y) : x, ex)

Then just call lower_muls once and you get the same AST you would've before. Of course, most macros won't have to do this, in the same way that most macros don't currently have to normalise function definitions or broadcast syntax.

I think it's worth emphasising that macros have to deal with these kinds of things already; this just reduces the number of awkward special cases. I don't expect this to increase the burden on macros (or I wouldn't propose it) so if anyone has concerns about that I'm happy to see examples and work it through.

macros shouldn't need to worry about what are semantically just formatting differences

They're only "semantically just formatting differences" if you aren't changing the semantics, which is exactly what things like @static if do.

Yichao had a good point about kw: https://github.com/JuliaLang/julia/issues/20000#issuecomment-272288503
It avoids changing the meaning of an expression based on the context it's spliced into.

if you aren't changing the semantics, which is exactly what things like @static if do.

@static if doesn't change that the semantics of if and the ternary operator are completely interchangeable - the fact that they're treated exactly the same in the AST means @static was easier to write in a way that works on ternary expressions with no extra effort.

I would point out that the round-tripping issue could be addressed by choosing which form to print with based on properties of the expressions. A conditional with multiple statements has to be an if/else; a conditional with single-expression branches can be either one, if the expressions are short enough, we can format them as a ternary operator.

There are quite a few more ambiguous cases , e.g.
import A: b, c vs import A.b, A.c
or
a ? (b;c) : d
vs

if a 
    b
    c
else
    d
end

A couple more things to consider:

(a...; b...) -> f(a...; b...)

gets parsed confusingly as

begin
    a...
    b...
end -> f(a...; b...)
type a
    b
    c
end

shouldn't get parsed with a begin block IMHO because b and c are not executed sequentially.

I've been using NumberedLines.jl for working with numbered lines but I'd argue that at least some of its features should be included in base. Having a NumberedLine type (or AST node) which includes both a line number and expression would be I think both much safer and much more intuitive.

Parsing code blocks as Expr(:block) should not be a problem here.

  1. It doesn't affect round trip at all.
  2. The printing of the (a..., b...) -> ... could be improved, but not necessarily the parsing
  3. Expr(:block) does not imply evaluation so not executed sequentially is not a good argument for parsing differently as long as there's no ambiguity from the context. By this argument

    • begin end in quote shouldn't be parsed as Expr(:block) since it's not evaluted, in fact nothing should parse in quote since none of them are evaluated as usual.
    • The first arg of function f() end shouldn't be parsed as a call either since it's not evaluated.
    • (a, b) -> 2 the argument list shouldn't be parsed as tuple since it's not evaluated.
    • let a = 1, b = 2, c::Int end should not use Expr(:block) for the var list since they are not evaluated with the same rule as a normal block (in fact, this block uses a similar rule to a struct definition).

    None of the cases above has any ambiguity and there's no reason to create many more expression types for people to handle in any of these cases.

One more is the arguments to for, which hopefully could be similar to the arguments to :generator

e = quote
    result = 0
    for i = 1:10, j = 11:20
        result = result + i + j
    end
    (i + j for i = 1:10, j = 11:20)
end
e.args[4].args
e.args[6].args

Status of this:

  • elseif and let have been fixed
  • Seems we are leaning against changing kw vs. =
  • I don't really want to parse 1(2) as a call, but perhaps there should be an expression head for juxtaposition?
  • ? and do are still up for debate

Actual round-trip-ability is not on the table here; CSTParser handles that very well and moving fully to that is way out of scope for 1.0.

From triage: preserving do sounds good, that's next on the list.

Please mention more specific examples of weird ASTs if anybody finds any, but otherwise I think we're done here.

Awesome. Do I take it we are not in favour of preserving ?, or of delaying for $cond end errors later in the pipeline?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

manor picture manor  Â·  3Comments

sbromberger picture sbromberger  Â·  3Comments

yurivish picture yurivish  Â·  3Comments

wilburtownsend picture wilburtownsend  Â·  3Comments

felixrehren picture felixrehren  Â·  3Comments