I see three problems with the current error reporting system in PEG.js:
null
Reporting using `null` is inflexible (it doesn't allow to carry any information about the kind of error) and makes it impossible to return `null` as a regular value from actions, which would be useful sometimes (for example [in a JSON parser](https://github.com/dmajda/pegjs/blob/791034fad92c9cd7a9d1c71187df03441bbfd521/examples/json.pegjs#L45)).
See also #17.
For example, imagine a HTTP 1.1 parser that wants to report an error about missing `Host` header in a HTTP request with a message like "A HTTP 1.1 request must contain a Host header". Currently the only way to do this is to throw an exception with desired message manually. This is cumbersome and it does not interact well with the backtracking behavior (throwing an exception halts the parsing completely, even when it is possible to backtrack and try another alternatives).
expected
property of SyntaxError
is hard to process mechanicallyThe `expected` property of `SyntaxError` is either an array of expectations (each describing one alternative which the parser expected at the position of error) or `null` (meaning that the end of input was expected).
Each expectation is represented by a string. This string can mean an expected literal (represented using its PEG.js syntax — a quoted string), character class (represented using its PEG.js syntax — [...]
), or any character (represented by string "any"
). To differentiate between these cases, one has to parse the string representations manually, which makes building tools based on PEG.js such as autocompleters unnecessarily hard.
I plan to redesign the error reporting system to fix these problems. Before I describe the changes I am thinking about, a little description how the current system works is needed.
When a PEG.js parser tries to match a string literal, character class, or .
, and fails, it produces a _match failure_. It consists of a _position_ and an _expectation_ (description of what the parser tried to match).
After producing a match failure, the parser backtracks and possibly tries another alternatives. If none of them succeeds and there is nothing more to try, the parser throws the SyntaxError
exception. When setting its properties (such as position, expectations, error message, etc.), the parser considers only the furthest match failure (the one with the greatest position). If there are more such failures, their expectations are combined.
The situation is little bit more complicated if named rules are used, but this is irrelevant for this discussion.
I am thinking about the following changes in the error reporting system:
SyntaxError.expected
won't be represented by strings, but by objects. The objects will have a type
property (with values like "literal"
, "class"
, "any"
, "eof"
) which will determine expectation type. For some expectation types, other properties will conain the details (e.g. for "literal"
expectation where would be a value
property containing the expected string). This will simplify mechanical processing of the expectations.An alternative to the type
property is to use classses. But I think the type
property will be easier to handle for users.
null
. It will be a regular value and won't signify an error.error
function. It will take an error message as its parameter. Failures triggered by this function (called _custom match failures_) will consist of a _position_ and an _error message_. They won't have any expectation.The error
function won't interrupt the action execution, it will just mark it as failing and save the error message. The actual failure will be produced only after the action execution finishes. If the error
function is invoked multiple times, the last invocation will win (its error message will be used).
Custom match failures will be handled like regular match failures, meaning they won't halt the parsing completely and let the parser backtrack and possibly try another alternatives. There is one difference though — when finally throwing the SyntaxError
exception, the expectation combination rule which applies to regular match failures won't apply to the custom ones. If in the set of match failures with the furthest position there is at least one custom one, it will simply override the regular ones completely. If there are more custom failures, the one that was produced last will win.
A SyntaxError
exception based on a custom match failure will differ from exceptions based on a regular failure. It's message
property will be equal to the failure's error message and its expected
property will be null
.
Example
```
start = sign:[+-]? digits:[0-9]+ {
var result = parseInt((sign || "") + digits.join(""), 10);
if (result % 2 == 0) {
error("The number must be an odd integer.");
}
return result;
}
```
On input 2
, the parser generated from the grammar above will produce a SyntaxError
with message
set to "The number must be an odd integer."
and expected
set to null
expected
function. It will take an expected value description as a parameter. This function will be provided mainly as a convenience for situations where one doesn't need to generate a full error message and automatically generated form "Expected _X_ but "2" found." is enough.A SyntaxError
exception based on a match failure generated by the expected
function will be similar to exceptions based on a regular failure.
Example
```
start = sign:[+-]? digits:[0-9]+ {
var result = parseInt((sign || "") + digits.join(""), 10);
if (result % 2 == 0) {
expected("odd integer");
}
return result;
}
```
On input 2
, the parser generated from the grammar above will produce a SyntaxError
with message
set to "Expected odd integer but "2" found."
and expected
set to [ { type: "user", description: "odd integer" } ]
I welcome any notes to the proposed changes — please add them as comments. I plan to start implementing the proposal (or some modified version based on feedback) soon.
It would be better if multiple invocations of the error function lead to multiple errors. You could then continue parsing as much as possible.
string = '"' value:(!(eol / '"') .)+ '"' { return value; }
/ '"' value:(!(eol / '"') .)+ { error('unterminated string constant'); return value; }
I would also recommend adding support for warnings as well.
It would be better if multiple invocations of the error function lead to multiple errors. You could then continue parsing as much as possible.
Could you name some use case(s) where you would need this? It seems to me that the example you provided would work with my proposal too.
I already have some use cases, but I don't know how representative they are, so I'd like to see some more.
I would also recommend adding support for warnings as well.
I am also thinking about it.
A problem with multiple errors and warnings is that they would need a different interface than simple and intuitive “parse
returns some value on success or an exception on error”. The parser would need to report multiple errors on unsuccessful parse, plus warnings on both successful and unsuccessful parse.
What kind of API would you find most inutitive here? Again, I have some ideas, but I'd like to see what others think.
Could you name some use case(s) where you would need this? It seems to me that the example you provided would work with my proposal too.
My primary use case is for non-fatal parse errors. Non-terminated strings, identifiers that start with a number, nested comments, missing semicolons, etc.
Generally, it is best to let the parser advance as far as possible, so that as many errors can be shown to the user as possible. If the parser could even finish the parse, and let the next steps (e.g. compilation) report errors as well, that would be great.
This is now implemented. The tools/impact
script reports the following performance impact of the whole set of commits:
Speed impact
------------
Before: 1144.21 kB/s
After: 999.89 kB/s
Difference: -12.62%
Size impact
-----------
Before: 863523 b
After: 1019968 b
Difference: 18.11%
(Measured by /tools/impact with Node.js v0.6.18 on x86_64 GNU/Linux.)
I think that 12.62% speed penalty and 18.11% size penalty is fine for resolving such a long-standing set of problems.
Closing.
@dmajda: That's great news! I'm so glad null
no longer signals failure.
So, there is no "warning" function yet?
The warning function topic is tracked at #325
Most helpful comment
So, there is no "warning" function yet?