Add "unexpected" rule to override standard error message.
unexpected =
keywords { return "Unexpected keyword "+text()+"."; }
/ expression { return "Unexpected expression «"+text()+"»."; }
/ lamdba_function { return "Unexpected lambda function."; } ;
To implement it just change peg$buildError by:
function peg$buildError() {
var expected = peg$expected[0];
var failPos = expected.pos;
// if "unexpected" rule exist this might throw the appropriate error.
if (typeof peg$parseunexpected !== 'undefined') {
// make peg$expect temporary unavailable, set the cursor position to the fail position
// next get the output of the rule, if it's a string, return it.
const tmp = peg$expect;
peg$expect = new Function();
peg$currPos = failPos;
const unexpected = peg$parseunexpected();
peg$expect = tmp;
if (typeof unexpected === 'string') {
const length = peg$currPos - failPos;
const location = failPos < input.length
? peg$computeLocation(failPos, failPos + length)
: peg$computeLocation(failPos, failPos);
return new peg$SyntaxError(unexpected, expected.variants, unexpected, location);
}
}
// else return standard error.
const unexpected = input.charAt(failPos);
const location = failPos < input.length
? peg$computeLocation(failPos, failPos + 1)
: peg$computeLocation(failPos, failPos);
return new peg$SyntaxError(
peg$SyntaxError.buildMessage(expected.variants, unexpected), expected.variants, unexpected, location);
}
Expected behavior:
Improve error handling.
If it's not ethical, is there anyone who can tell me how to create a plugin that would make the change?
Thanks a lot
I agree that this feature could be added, but until now a way to do it is using the error(...)
function.
For example, you have something could not match:
= FirstMatch
/ AnotherMatch
You could add the following expression, matching every char to be sure that no other match will be done:
= .+
{
error("Unexpected thing: " + text().substring(0,1));
}
And then you could call it when something that should match do not match.
= FirstMatch
/ AnotherMatch
/ UnexpectedThing
your UnexpectedThing
rule cannot handle keywords, expressions or others atoms of your grammar, it just return a single character: text().substring(0,1)
, so what you do is exactly the same thing that the pegjs parser produce. The feature that i proposed is a quite bit different because it handle all specified rules inside the unexpected
rule at any point of the input, fully implicitly and in a non-redondant way.
Here is an example of what the code above will produce for my typescript parser:
Input:
public function some_func(){
public function wtf_here_func(){
}
}
Output error:
Line X, Column 4: Expected ... but "p" found.
Output error: (with unexpected rule feature)
Line X, Column 4: Unexpected "wtf_here_func" method declaration.
The only thing that i added to my grammar is:
unexpected = m:method_declaration { return `Unexpected "${m.identifier}" method declaration.` };
I notice that using the error
function is more elegant than return statement, you're right. So the above should become:
unexpected = m:method_declaration { error(`Unexpected "${m.identifier}" method declaration.`) };
Try it yourself and you will see all that it can bring to our parsers.
You always can write UnexpectedThing
as required. Actually, you already do this, just name it unexpected
. But in your implicit variant you can not have different unexpected rules for different parts of grammar
@Mingun
What you say is wrong, you can override my implicit "unexpected rule" by adding an "UnexpectedThing" in the concerned rule, as you said yourself.
An implicit rule is fully necessary, here is a concrete example, the rule for method declaration:
method_declaration = (privacy __)? "function" _ identifier _ '(' _ args _ ')' _ '{' _ instruction* _ '}';
With the explicit UnexpectedThing at the end:
method_declaration = (privacy __)? "function" _ identifier _ '(' _ args _ ')' _ '{' _ instruction* _ '}' / UnexpectedThing;
This does'nt work if the unexpected thing appear between "function" and the identifier, or between args and ')' etc.
So you should do this:
method_declaration = (privacy __)? ("function"/UnexpectedThing) _ identifier _ ('('/UnexpectedThing) _ args _ (')'/UnexpectedThing) _ ('{'/UnexpectedThing) _ instruction* _ ('}'/UnexpectedThing) / UnexpectedThing;
privacy = ... / UnexpectedThing;
_ = ... / UnexpectedThing;
identifier = ... / UnexpectedThing;
instruction = ... / UnexpectedThing;
...
Why this is bad?
Expected ... or UnexpectedThing1, UnexpectedThing2..., but found "x".
Do you understand my argument now?
It does not matter if you use my code or not, we need an implicit rule for our parsers.
If you want just instead of one symbol in the message Expected... but found X to see a word Expected... but found XXX, it becomes elementary and does not demand any changes. Just catch SyntaxError
and reparse input from error position with special "lexer" parser. You can even define it in the same file as main grammar and reuse some rules. In the code:
let parser = PEG.generate(<main grammar>);
// For correct work this parser must parse any input and return string as result
let lexer = PEG.generate(<lexer grammar>);
try {
return parser.parse(<input>);
} catch (e) {
if (!(e instanceof parser.SyntaxError)) throw e;
// lexer must return string
let found = lexer.parse(input.substr(e.location.start.offset));
// Or you can use specific rule from the same parser
//let found = parser.parse(input.substr(e.location.start.offset), { startRule: "unexpected" });
throw new parser.SyntaxError(
parser.SyntaxError.buildMessage(e.expected, found),
e.expected,
found,
e.location
);
// or you can throw you own exception type
}
To introduce some special support in the generator for this purpose to me sees excessive though I not against that there was a annotation which will mark the rule as a lexer entry point ~when~ if annotations will be implemented.
It is a solution certainly, but too little accessible and difficult to maintain. Anyway thank you for your code, it's always good to take.
I agree that the direct implementation of a lexer in the generator would be welcome.
What concerns do you have so that annotations cannot be implemented?
What concerns do you have so that annotations cannot be implemented?
Unfortunately, as you can see, the project is dead or, at least, in a deep stazis
I actually like the idea of an unexpected
rule, but I'm thinking of putting it in as optional via the features
option. Is that fine with you @log4b0at?
@futagoza Surely
@log4b0at - The unexpected
rule already exists.
The problem with having an unexpected rule specialize on the nature of the thing it failed to parse is that it doesn't know what that is, because it failed to parse it.
Consider the following grammar:
Document = (DadJoke "\n"?)+
Kind = [a-zA-Z0-9 ]+
Car = "car: " Kind ".";
Insect = "insect: " Kind ".";
Annoy = "annoy: " Kind ".";
DadJoke = Car / Insect / Annoy
(This is a dad joke because, obviously, the correct answer for every rule is "bug.")
This should without problems parse input like
car: Honda
insect: Beetle
annoy: Whine
There are two ways to read handling the unexpected
. Either it's that the carrier rule is wrong and you'll handle that, or the subsumed rule is wrong and you'll handle that.
As I understand it, what you're asking for is a rule for unexpected
that lets you give different output based on whether what's unexpected was a car, an insect, or an annoyance.
Let's suppose you have a specialization like:
const UnexpectedCustoms = { // no, not shoes off
'annoy' : 'Unexpected annoyance',
'insect' : 'Unexpected insect',
'car' : 'Unexpected car'
};
So what should it give as an error message when I give it this input?
defect: bug
microphone: bug
disease: bug
hobbyist: bug
Which of those are cars?
What should the error messages there be?
This isn't solvable. This problem is equivalent to saying "hey peg, given that the next thing can't be interpreted, why don't you tell me what it is so I can tell someone?"
First you need to teach it how to interpret that. Next you don't need anything.
This is, in essence, the reason some things are set up as a tokenizer then a lexer. Just use peg
as a tokenizer, in this case, then write a lexer that parses the generated AST and says "uh, you're ... you're not allowed to have a close parenthesis without an open one"
Alternately, if it's about the subsumed rule, rather than the carrier rule, then instead you have an input of the form
car: car: car: ...
Is there any way for pegjs
to know that that's a car, other than writing the rule you can already write to match that out, then handling it?
That's pretty straightforward to handle in-grammar today, and many grammars do. Why do you want extra features for that?
Just write a rule with the name of the feature you're asking for. Pow: done.
No extra peg.js
complexity needed.
Here's the other way to say it.
In order to give an error message for the specific incorrect parsing, you'd need a correct parsing of the incorrect stuff to interpret. Either write a parser that accepts the wrong things and rejects them in the handlers, or write a secondary parser to handle the partial, or write an AST that can accept the wrong thing then interpret the AST to be wrong."
Finally, this really shouldn't be done, because unknown
is the fifth or sixth most common existing name for a rule, after things like document
and operator
I would warrant more than a third of grammars already have this, because the language is already able to express it without features
If you try to add this, all you do is break the existing grammars to add something we already have
This should be declined
@Mingun - I want to resurrect this project. There's no good reason for it to be dead
I think this feature is meant to make the errors more clear, even if an unexpected
rule might not be the right way to implement this sort of thing.
The point is that having an Unexpected X
error seems to make it more practical to spot the parse errors (at least in many cases) rather than having Expected A, B, C, D, E, X, Y or Z
or Expected expression
.
That would be great to see PEG.js being able to do such a thing.
So how do you identify what the X that's unexpected is?
That’s probably the issue of this feature: being able to clearly identify what’s wrong.
The problem here is not trying to identify what X could be, but being sure of what X is.
Like I tried to explain above, that's called "parsing," and the way to do that is to specify it in the grammar
I dont understand well the problem that you trying to raise, can you give me more examples about ?
They'll be literally identical to the existing one.
Try answering the question. It's there socratically and rhetorically; you should learn what the problem is in trying to answer.
Consider the following grammar:
Document = (DadJoke "\n"?)+
Kind = [a-zA-Z0-9 ]+
Car = "car: " Kind ".";
Insect = "insect: " Kind ".";
Annoy = "annoy: " Kind ".";
DadJoke = Car / Insect / Annoy
So what should it give as an error message when I give it this input?
defect: bug
microphone: bug
disease: bug
hobbyist: bug
As I understand your request, the parser is supposed to say something like "I found a disease when I was expecting a car, an insect, or an annoyance."
How's it supposed to know that's a disease?
Parsing fails when the parser doesn't know what the next thing is.
An error message for parsing failure that requires it to know what the next thing is is contradictory to the contextual situation
defect: bug
microphone: bug
disease: bug
hobbyist: bug
"What should the error messages there be?"
You get that with actual error handling:
Line 1, column 1: Expected "annoy: ", "car: ", or "insect: " but "d" found.
If i define an unexpected rule like that
Identifier = [a-zA-Z]+;
unexpected =
DadJoke { error("Unexpected dad joke here"); }
/ i:$Identifier { error(`Unexpected identifier "${i}"`); };
I will get
Line 1, column 1: Unexpected identifier "defect"
"Alternately, if it's about the subsumed rule, rather than the carrier rule, then instead you have an input of the form"
car: car: car:
Here you will get the default message, because no unexpected-rule match ":" punctuator
Line 1, column 8: Expected ... blabla ... but found ":"
Moreover the processus of detecting unexpected things is totally passive and happen only when pegjs detect an error, and don't add any overhead in term of performances.
Does this answer your question correctly?
if you want try yourself, I quickly made a code for 0.10 version of pegjs, replace (in your parser) peg$buildStructuredError
by
function peg$buildStructuredError(expected, found, location) {
if (typeof peg$parseunexpected !== 'undefined') {
peg$fail = new Function();
peg$currPos = location.start.offset;
peg$parseunexpected();
}
return new peg$SyntaxError(peg$SyntaxError.buildMessage(expected, found), expected, found, location);
}
Isn't that roughly what mingun said in 2019?
Now I worry that I'm misunderstanding something here
Use a tokenizer has a cost.
such feature is simple to implement and cost nothing to nobody
okay, that's a fair point
Just some bikeshedding, but when such feature is implemented, it should be let to the user the choice of the rule to be defined as the “unexpected” rule, e.g. with
pegjs.generate( grammar, { unexpected: "unrecognised_token" } )
Hello, I just made a pull request for this functionality, following your advices, namely the use of the error function, much more consistent than a return, suggested by @norech.
And the use of an option to change the name of the rule, suggested by @Seb35
As @futagoza said, the feature is optional and is disabled by default. (I don't know about the features option but by default there is no unexpected rule)
See #661 pull
Most helpful comment
Just some bikeshedding, but when such feature is implemented, it should be let to the user the choice of the rule to be defined as the “unexpected” rule, e.g. with