Antlr4: Closure error message while compiling Rfc5322 grammar

Created on 1 Sep 2013 · 9Comments · Source: antlr/antlr4

I converted the Rfc5322 grammar (Internet Message Format) from Abnf to Antlr using my AbnfToAntlr tool (http://sourceforge.net/projects/abnf-to-antlr).

The resulting grammar causes a compile error...

rule 'obs_body' contains a closure with at least one alternative that can match an empty string

Rfc5322 allows for an empty string as the message body. Why is this considered an error in the grammar?

Here is an excerpt from the converted Rfc5322 grammar to reproduce the problem...

// Internet Message Format (RFC 5322)
//
// ABNF grammar extracted from http://tools.ietf.org/html/rfc5322

grammar rfc5322excerpt;

start           : obs_body;

obs_body        :   ((lf* cr* ((U_0000 | text) lf* cr*)*) | crlf)*;

text            :   CL_A | CL_B;  // remaining characters omitted for clarity

crlf           :  cr lf;
                    // Internet standard newline

lf             :  LF;
                    // line feed

cr             :  CR;
                    // carriage return

//////////////////////////////////////////////////////////////////////////
// Lexer rules generated for each distinct character in original grammar
// per http://www.unicode.org/charts/PDF/U0000.pdf
//////////////////////////////////////////////////////////////////////////

LF : '\u000A';
CR : '\u000D';
CL_A : 'A';
CL_B : 'B';
U_0000 : '\u0000';

Source

rpinchbeck

Most helpful comment

@xied75 The Abnf-To-Antlr project has moved to GitHub...
https://github.com/rpinchbeck/Abnf-To-Antlr

rpinchbeck on 21 Sep 2017

🎉2 😄2

All 9 comments

An optional alternative is not a problem. It's only a problem when you do stuff like this ( )* which has two optional branches. I can imagine how that something you want. You will see that's exactly what you have in obs_body. simply change start to be obs_body? and make obs_body not optional in the loop. Use ()+ maybe?

parrt on 2 Sep 2013

Some documentation for the error is found in the documentation for ErrorType.EPSILON_CLOSURE. To resolve the problem in your code, you need to make sure the body of a closure (either (...)* or (...)*) must match at least one token (in the parser) or character (in the lexer). You can _locate_ the problem by looking at each closure in the obs_body rule individually (starting with the innermost closures):

lf*: not a problem; lf will match exactly one LF token
cr*: not a problem; cr will match exactly one CR token
((U_0000 | text) lf* cr*)*: not a problem; the block will match at least one U_0000, CL_A, or CL_B token regardless of whether or not any LF or CR tokens are matched
((lf* cr* ((U_0000 | text) lf* cr*)*) | crlf)*: here is the problem; the first alternative might not match anything.

You can correct the problem by altering the following block (the first alternative of the problematic closure) to require at least one token match:

(lf* cr* ((U_0000 | text) lf* cr*)*)

This can be done as follows:

( lf+ cr* ((U_0000 | text) lf* cr*)*
| cr+ ((U_0000 | text) lf* cr*)*
| ((U_0000 | text) lf* cr*)+
)

The final rule might look like this:

  : ( ( lf+ cr* ((U_0000 | text) lf* cr*)*
      | cr+ ((U_0000 | text) lf* cr*)*
      | ((U_0000 | text) lf* cr*)+
      )
    | crlf
    )*
  ;

sharwell on 2 Sep 2013

Wow, thanks for the quick and detailed response. Apparently, the IETF did not use Antlr4.1 and AntlrWorks2.1 to detect closure problems in their grammars. :-)

I already fixed the grammar manually before I reported the issue; however, I want to translate IETF grammars and generate parsers without any manual intervention. Antlr 4.1 is working perfectly for Rfc5234 and Rfc3986 -- I would like to get it working for other grammars including Rfc5322.

Which begs the question... is there a general way to detect and correct these kinds of errors given an abstract syntax tree of the grammar? If there is, then I could build it into my translator... although I expect that it will require me to advance my skill level significantly. Alternatively, I could drop a hint, possibly in this very sentence, to include such a feature in Antlr. :-D

rpinchbeck on 3 Sep 2013

You do get automatic reporting of this problem in ANTLR 4 and when using ANTLRWorks 2.1. You could request a new transformation feature for ANTLRWorks 2 (to automatically correct the problem) on its GitHub issue tracker.

sharwell on 3 Sep 2013

The error message was definitely helpful... it just seemed strange that the honey-badger didn't handle it automatically. :-)

Anyway, I will focus on fixing it in my translator. I think my translator can detect it and fix it automatically... but I will have to research how to transform the abstract syntax tree prior to walking it with the output visitor.

Of course, I am happy to submit a transformation feature request for AntlrWorks2 (it won't help me reach my goal of fully automatic parser generation, but I am happy to submit it if you think it's worth doing).

Thanks again for the help.

rpinchbeck on 3 Sep 2013

In 4.0, it did silently handle it. However, when writing the paper it became clear that one allowable way to "handle" the situation is an infinite loop, matching copies of the empty string until the parse tree became so large that the system couldn't allocate more memory. In fact, if the loop is greedy, then that would be the _expected_ behavior.

ANTLR 4.1 will not compile such a grammar, so we are now able to _assume_ that the ATN does not contain such a construct and in turn added a major optimization to the runtime grammar analysis code. Issue #282 contains additional details.

sharwell on 3 Sep 2013

start
: relacionalExpression
;

relacionalExpression
: term relop term
| term
;

relop
: EQ | GT | LT | LE | GE | NEQ
;

term
: factor relop factor |
| factor
;

factor
: LPAREN relacionalExpression RPAREN
| atom
;

atom
: ID
| NUMBER
| LPAREN relacionalExpression RPAREN
;
/*
letter : ('a'..'z')|('A'..'Z')+ ;

number : ('-')?('0'..'9')+('.')('0'..'9')+ ( E('ADD'|'MINUS')?('0'..'9')+) ; */

bool_gram
: expr_boolean
;

expr_boolean
: bool_term bool_op bool_term
| bool_term
;

bool_op
: OR
| AND
;

bool_term
: boolean
| expr_boolean
| unary_op bool_term
;

unary_op
: NOT
;

The following sets of rules are mutually left-recursive [expr_boolean, bool_term]

please help

Zudark on 5 Nov 2015

@rpinchbeck Sorry for commenting in an old issue here. Any chance you could put your Abnf to Antlr on GitHub other than sourceforge? That would be great for the community.