Pegjs: Redesign error reporting

Created on 14 Aug 2013  ·  7Comments  ·  Source: pegjs/pegjs

I see three problems with the current error reporting system in PEG.js:

  1. Actions report errors by returning null
Reporting using `null` is inflexible (it doesn't allow to carry any information about the kind of error) and makes it impossible to return `null` as a regular value from actions, which would be useful sometimes (for example [in a JSON parser](https://github.com/dmajda/pegjs/blob/791034fad92c9cd7a9d1c71187df03441bbfd521/examples/json.pegjs#L45)).

See also #17.

  1. There is no way to produce errors with completely custom messages
For example, imagine a HTTP 1.1 parser that wants to report an error about missing `Host` header in a HTTP request with a message like "A HTTP 1.1 request must contain a Host header". Currently the only way to do this is to throw an exception with desired message manually. This is cumbersome and it does not interact well with the backtracking behavior (throwing an exception halts the parsing completely, even when it is possible to backtrack and try another alternatives).

  1. The expected property of SyntaxError is hard to process mechanically
The `expected` property of `SyntaxError` is either an array of expectations (each describing one alternative which the parser expected at the position of error) or `null` (meaning that the end of input was expected).

Each expectation is represented by a string. This string can mean an expected literal (represented using its PEG.js syntax — a quoted string), character class (represented using its PEG.js syntax — [...]), or any character (represented by string "any"). To differentiate between these cases, one has to parse the string representations manually, which makes building tools based on PEG.js such as autocompleters unnecessarily hard.

I plan to redesign the error reporting system to fix these problems. Before I describe the changes I am thinking about, a little description how the current system works is needed.

How Error Reporting Currently Works?

When a PEG.js parser tries to match a string literal, character class, or ., and fails, it produces a _match failure_. It consists of a _position_ and an _expectation_ (description of what the parser tried to match).

After producing a match failure, the parser backtracks and possibly tries another alternatives. If none of them succeeds and there is nothing more to try, the parser throws the SyntaxError exception. When setting its properties (such as position, expectations, error message, etc.), the parser considers only the furthest match failure (the one with the greatest position). If there are more such failures, their expectations are combined.

The situation is little bit more complicated if named rules are used, but this is irrelevant for this discussion.

Proposed Changes

I am thinking about the following changes in the error reporting system:

  1. Expectations in SyntaxError.expected won't be represented by strings, but by objects. The objects will have a type property (with values like "literal", "class", "any", "eof") which will determine expectation type. For some expectation types, other properties will conain the details (e.g. for "literal" expectation where would be a value property containing the expected string). This will simplify mechanical processing of the expectations.

An alternative to the type property is to use classses. But I think the type property will be easier to handle for users.

  1. Actions will be allowed to return null. It will be a regular value and won't signify an error.
  2. Actions will be able to trigger a match failure using a new error function. It will take an error message as its parameter. Failures triggered by this function (called _custom match failures_) will consist of a _position_ and an _error message_. They won't have any expectation.

The error function won't interrupt the action execution, it will just mark it as failing and save the error message. The actual failure will be produced only after the action execution finishes. If the error function is invoked multiple times, the last invocation will win (its error message will be used).

Custom match failures will be handled like regular match failures, meaning they won't halt the parsing completely and let the parser backtrack and possibly try another alternatives. There is one difference though — when finally throwing the SyntaxError exception, the expectation combination rule which applies to regular match failures won't apply to the custom ones. If in the set of match failures with the furthest position there is at least one custom one, it will simply override the regular ones completely. If there are more custom failures, the one that was produced last will win.

A SyntaxError exception based on a custom match failure will differ from exceptions based on a regular failure. It's message property will be equal to the failure's error message and its expected property will be null.

Example

```
start = sign:[+-]? digits:[0-9]+ {
var result = parseInt((sign || "") + digits.join(""), 10);

 if (result % 2 == 0) {
   error("The number must be an odd integer.");
 }

 return result;

}
```

On input 2, the parser generated from the grammar above will produce a SyntaxError with message set to "The number must be an odd integer." and expected set to null

  1. Actions will also be able to to trigger a regular match failure using the expected function. It will take an expected value description as a parameter. This function will be provided mainly as a convenience for situations where one doesn't need to generate a full error message and automatically generated form "Expected _X_ but "2" found." is enough.

A SyntaxError exception based on a match failure generated by the expected function will be similar to exceptions based on a regular failure.

Example

```
start = sign:[+-]? digits:[0-9]+ {
var result = parseInt((sign || "") + digits.join(""), 10);

 if (result % 2 == 0) {
   expected("odd integer");
 }

 return result;

}
```

On input 2, the parser generated from the grammar above will produce a SyntaxError with message set to "Expected odd integer but "2" found." and expected set to [ { type: "user", description: "odd integer" } ]

Next Steps

I welcome any notes to the proposed changes — please add them as comments. I plan to start implementing the proposal (or some modified version based on feedback) soon.

feature

Most helpful comment

So, there is no "warning" function yet?

All 7 comments

It would be better if multiple invocations of the error function lead to multiple errors. You could then continue parsing as much as possible.

string = '"' value:(!(eol / '"') .)+ '"' { return value; }
       / '"' value:(!(eol / '"') .)+     { error('unterminated string constant'); return value; }

I would also recommend adding support for warnings as well.

It would be better if multiple invocations of the error function lead to multiple errors. You could then continue parsing as much as possible.

Could you name some use case(s) where you would need this? It seems to me that the example you provided would work with my proposal too.

I already have some use cases, but I don't know how representative they are, so I'd like to see some more.

I would also recommend adding support for warnings as well.

I am also thinking about it.

A problem with multiple errors and warnings is that they would need a different interface than simple and intuitive “parse returns some value on success or an exception on error”. The parser would need to report multiple errors on unsuccessful parse, plus warnings on both successful and unsuccessful parse.

What kind of API would you find most inutitive here? Again, I have some ideas, but I'd like to see what others think.

Could you name some use case(s) where you would need this? It seems to me that the example you provided would work with my proposal too.

My primary use case is for non-fatal parse errors. Non-terminated strings, identifiers that start with a number, nested comments, missing semicolons, etc.

Generally, it is best to let the parser advance as far as possible, so that as many errors can be shown to the user as possible. If the parser could even finish the parse, and let the next steps (e.g. compilation) report errors as well, that would be great.

This is now implemented. The tools/impact script reports the following performance impact of the whole set of commits:

Speed impact
------------
Before:     1144.21 kB/s
After:      999.89 kB/s
Difference: -12.62%

Size impact
-----------
Before:     863523 b
After:      1019968 b
Difference: 18.11%

(Measured by /tools/impact with Node.js v0.6.18 on x86_64 GNU/Linux.)

I think that 12.62% speed penalty and 18.11% size penalty is fine for resolving such a long-standing set of problems.

Closing.

@dmajda: That's great news! I'm so glad null no longer signals failure.

So, there is no "warning" function yet?

The warning function topic is tracked at #325

Was this page helpful?
0 / 5 - 0 ratings