Dhall-haskell: Can't use `forall` as identifier

Created on 15 Nov 2018  Â·  14Comments  Â·  Source: dhall-lang/dhall-haskell

I get parse error when trying to make variable named forall.

$ echo 'let forall = 1 in forall' | dhall

Error: Invalid input

(stdin):1:11:
  |
1 | let forall = 1 in forall
  |           ^
expecting label or whitespace

$ dhall version
1.18.0

The expression is currently allowed by dhall standard (ABNF grammar there).

Most helpful comment

My project uses an Earley parser, which is a chart parser that parses all possible parses, returning them all when there is ambiguity. Which is nice to see. The only problem is that the parses are returned in no particular order, so I have to disambiguate them based on the AST results. I've been working on that today, and I have made some progress. It's a delicate art. https://github.com/MonoidMusician/dhall-purescript/commit/ceda196ce180b15333c9001cada5ba649be2aa24#diff-ad13855cb32895cb29d761e0a378b686R285

Regarding this particular issue, I've come to the conclusion that supporting pathological expressions like this is probably not worth the effort, especially for parsers where that could incur much backtracking: https://github.com/MonoidMusician/dhall-purescript/blob/ceda196ce180b15333c9001cada5ba649be2aa24/ambiguity.txt#L1

I mean, can you even tell what that's supposed to parse as? I can, but I'm the one who created that monster to test edge cases that I initially failed ;)

Personally, I think leaving forall/etc. as reserved identifiers is probably the simplest solution. It makes it clear what they are supposed to be in the syntax, and what the expected parse is when they occur. Otherwise I am concerned that parsing will become fragile, especially with if…then…else, where what parsing path is chosen could depend on the presence of a keyword much later in the expression.

Obviously (if a then b else c) is a BoolIf, but is it obvious that (if a then b else) would not be (it's repeated function application of variables), and neither would (if a then b + else c) and (if a then b else + c) (those are sums of applications of variables)? Users might find the errors confusing when they say Unbound variable: else instead of a syntax error.

Anyways. Enough of my opinions. If anyone wants to try using my parser to find ambiguous parses, I can throw it up it in a webpage, so you can see what all the results are. Or you can clone my project :)

EDIT: I should mention that how I'm parsing identifiers is to reject any that match a list of reserved words: https://github.com/MonoidMusician/dhall-purescript/blob/ceda196ce180b15333c9001cada5ba649be2aa24/grammar.ne#L350

The list includes the ones listed in the Haskell implementation, but also I had to add some others (env) to reject ambiguous parses (before space was required around annotations), and I disabled some for the purposes of investigating this issue: https://github.com/MonoidMusician/dhall-purescript/blob/ceda196ce180b15333c9001cada5ba649be2aa24/grammar.ne#L44-L89

All 14 comments

Similar problem occurs for other keywords (eg if, let, merge). I'm not sure if dhall-haskell should be adjusted to support those or the standard should forbid "keyword identifiers".

The standard already "forbids" them, in the sense that the ABNF grammar doesn't exclude the possibility of parsing them, but a conjunction of the left-to-right parsing precedence and the reserved rule being before the label rule doesn't let this happen in practice

EDIT: it looks like if, let, etc are not in the list of reserved words. I think I was the one that removed them from there, in order to reduce the ambiguities, but this means that we don't forbid them anymore

EDIT 2: the removal happened in https://github.com/dhall-lang/dhall-lang/pull/222 to reduce ambiguities, and I was confused about it at the time, and now I realize why: the grammar now effectively allows things like let if = 1 in if, which were still possible before the change, but gave nonsensical parses, while this returns well-formed parses. I think we should add these words back in the list

In the end the solutions for removing this ambiguity are:

  • keep a list of reserved words and add a check on labels after the parse (the current solution here, doesn't actually remove the ambiguity from the grammar, but we can standardize this additional step)
  • add a lexing step before parsing

@f-f left-to-right rule applies only for valid parses. For expression let forall = 1 in forall there is only one valid parse allowed by grammar so the one valid parse should be selected, because it is leftmost.

I understand that left-to-right rule was intended to work inside current grammar production, without context of whole expression (which more or less reflects parsing by recursive descent without backtracking, I believe). However this is not clearly described in ABNF file comments. Also, if it would be described, then there is no clean and formal way of specifying where to backtrack and where not to. We can say "backtrack only when comment above says 'BACKTRACK now'", but it isn't a clean approach in my opinion.

My proposal is to add a comment for simple-label rule that it should reject keywords. It's not completely clean either, but I think it's better. And it is more friendly for implementing non-recursive-descent-parser for dhall.

@SupraSummus we had a comment before I introduced the reserved words directly in the grammar, and IMHO that's not a satisfying solution (as you have to read carefully the whole grammar to figure out that you have to add another parsing step)

Also I'll note that I'm using an always-backtracking parser, and that works well. I assume the "backtrack now" indications in the grammar are for efficiency (so you don't have to _always_ backtrack)

The left-to-right rule would work if we'd add back the missing reserved keywords to the reserved-raw rule

Also I'll note that I'm using an always-backtracking parser, and that works well.

You mean that

  • your parser rejects input let forall = 1 in forall and
  • it's built exactly for current dhall grammar and
  • it is always-backtracking?

Hmm, that's interesting. If a grammar allows for given input then always-backtracking parser should parse the input ok, right? So if your parser rejects let forall = 1 in forall that means dhall grammar forbids it. So then I am wrong.

But I am pretty sure that I'm not. Let's take simpler expression forall into consideration. It can be parsed as complete-expression -> ... -> primitive-expression -> identifier -> label -> simple-label -> "forall". And it's the only successful parse, so it is the preferred one.

The left-to-right rule would work if we'd add back the missing reserved keywords to the reserved-raw rule

I don't think so. Adding productions to the grammar can only increase the language size, never decrease.

Left-to-right rule works only for successful parses. It can be used to select one parse from many possible parses but not to forbid valid parses. If the rule is intended to disallow parses allowed by bare ABNF grammar then it needs to be redefined.

Hmm, that's interesting. If a grammar allows for given input then always-backtracking parser should parse the input ok, right? So if your parser rejects let forall = 1 in forall that means dhall grammar forbids it. So then I am wrong.

You are right: I was just following up on your comment about backtracking, by saying that I don't think the backtracking is an issue.
I do generate the parser from the grammar, and the parser accepts let forall = 1 in forall, and it's the only parse possible currently

I don't think so. Adding productions to the grammar can only increase the language size, never decrease.
Left-to-right rule works only for successful parses. It can be used to select one parse from many possible parses but not to forbid valid parses. If the rule is intended to disallow parses allowed by bare ABNF grammar then it needs to be redefined.

If we add back forall and all the other keywords to the reserved-raw rule, there would be another parsing path: complete-expression -> ... -> primitive-expression -> reserved -> reserved-raw -> forall -> "forall", and this parsing path would come the before the identifier one, thus should be preferred.
The other parse would still be valid, but wouldn't happen in practice.

So you're right that we'd be expanding the grammar, but this rule would allow us to just match on the parse tree (by throwing a parse error when finding a reserved there), instead of traversing the tree checking if any identifier is actually a reserved word

Oh, I see. Adding "forall" to reserved-raw will result in a proper parse of "forall" to be detected before the "forall as an identifier" one. Then this case ("forall as a reserved") can be detected and communicated as an error. That makes sense. Thanks for explaining.

I'll throw out a weird solution: accept keywords as a valid identifiers, but prioritize parsing them as keywords over parsing them as identifiers.

Or in other words, make no changes to the standard grammar (since that is my understanding of what it already specifies) and fix the Haskell implementation to permit keywords as identifiers.

Specifically: that means that the following expression:

let if = 1 in if

... would be interpreted as 1. However, this expression:

let if = 1 in if True then "A" else "B"

... would still be interpreted as "A"

@Gabriel439 yeah I think this is the best solution, so we don't have to worry about having special handling for keywords 👍

@Gabriel439 I don’t think this is a weird solution at all. It’s how I prefer to define things.

So the main issue I'm running into is that I don't know a good way to support this efficiently using the current parser because it requires substantial backtracking.

Part of the problem is that I can't easily switch to a LR parser generator like happy because of the support for injecting custom parsers. If I could use something like happy then I could probably implement this efficiently and correctly

My project uses an Earley parser, which is a chart parser that parses all possible parses, returning them all when there is ambiguity. Which is nice to see. The only problem is that the parses are returned in no particular order, so I have to disambiguate them based on the AST results. I've been working on that today, and I have made some progress. It's a delicate art. https://github.com/MonoidMusician/dhall-purescript/commit/ceda196ce180b15333c9001cada5ba649be2aa24#diff-ad13855cb32895cb29d761e0a378b686R285

Regarding this particular issue, I've come to the conclusion that supporting pathological expressions like this is probably not worth the effort, especially for parsers where that could incur much backtracking: https://github.com/MonoidMusician/dhall-purescript/blob/ceda196ce180b15333c9001cada5ba649be2aa24/ambiguity.txt#L1

I mean, can you even tell what that's supposed to parse as? I can, but I'm the one who created that monster to test edge cases that I initially failed ;)

Personally, I think leaving forall/etc. as reserved identifiers is probably the simplest solution. It makes it clear what they are supposed to be in the syntax, and what the expected parse is when they occur. Otherwise I am concerned that parsing will become fragile, especially with if…then…else, where what parsing path is chosen could depend on the presence of a keyword much later in the expression.

Obviously (if a then b else c) is a BoolIf, but is it obvious that (if a then b else) would not be (it's repeated function application of variables), and neither would (if a then b + else c) and (if a then b else + c) (those are sums of applications of variables)? Users might find the errors confusing when they say Unbound variable: else instead of a syntax error.

Anyways. Enough of my opinions. If anyone wants to try using my parser to find ambiguous parses, I can throw it up it in a webpage, so you can see what all the results are. Or you can clone my project :)

EDIT: I should mention that how I'm parsing identifiers is to reject any that match a list of reserved words: https://github.com/MonoidMusician/dhall-purescript/blob/ceda196ce180b15333c9001cada5ba649be2aa24/grammar.ne#L350

The list includes the ones listed in the Haskell implementation, but also I had to add some others (env) to reject ambiguous parses (before space was required around annotations), and I disabled some for the purposes of investigating this issue: https://github.com/MonoidMusician/dhall-purescript/blob/ceda196ce180b15333c9001cada5ba649be2aa24/grammar.ne#L44-L89

Probably best to stay with “tried and tested” and disallow keywords. This makes implementations that use existing parser frameworks easier for dhall implementations in different languages/contexts.

Yeah, I think this is the most compelling reason to disallow keywords as identifiers:

Obviously (if a then b else c) is a BoolIf, but is it obvious that (if a then b else) would not be (it's repeated function application of variables), and neither would (if a then b + else c) and (if a then b else + c) (those are sums of applications of variables)? Users might find the errors confusing when they say Unbound variable: else instead of a syntax error.

The main issue we have is that we need a way to standardize in the grammar that those keywords are disallowed for identifiers. The only solution I'm aware of is to mention that in a comment (like we used to do at one point), but if people know of a better solution just let me know.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

vmchale picture vmchale  Â·  5Comments

akshaymankar picture akshaymankar  Â·  3Comments

Profpatsch picture Profpatsch  Â·  4Comments

sjakobi picture sjakobi  Â·  5Comments

chris-martin picture chris-martin  Â·  5Comments