Pegjs: Implement parametrizable rules

Created on 25 Aug 2011  路  29Comments  路  Source: pegjs/pegjs

It would be great to be able to parametrize rules with variables ;

   = '\"' parse_contents '\"' ->
   / '\'' parse_contents('\'') '\'' ->
   / '+' parse_contents('+') '+' -> /* sure why not :) */

parse_contents(terminator='\"')
    = ('\\' terminator / !terminator .)+ -> return stuff
feature

Most helpful comment

Thank you for your time in explanation

I'll probably have another question for you in ten years. Have a good 2020s

All 29 comments

Do you have specific use case where this would save you significant amount of work or make something currently impossible possible?

It makes parsing indentation levels much easier by calling rules that have the level passed as an argument.

Also, in a pure DRY logic, when doing things like "stuff delimited by this character with escape sequence such", it is nicer to call something like delimited('\'', '\\') than just do the rule (and its actions !) three times.

I should have been more clear. By "specific" I was looking for something like "I was working on a grammar of language X and there are 5 rules in there that could have been combined in one, here they are:" That is, I wanted to see real-world use case and real-world code. From that I can judge better in what cases this feature would be useful and for how many people.

Please don't take this as I am opposed to this feature per se. I just generally don't want to implement features useful only for a tiny fraction of languages or developers because of complexity and implementation cost. And in this case the cost is relatively high.

Just writing a parser for javascript, I could have string = delimited_by('\'') / delimited_by('\"') and then later regexp = delimited_by('/').

Lately, I've been writing a parser for a language of my own. I have something like this in a PEG framework I wrote for python :

LeftAssociative(op, subrule): l:subrule rest:(op subrule)* -> a_recursive_function_that_reverses_rest_and_makes_bintrees(l, op, subrule)

And I can then write:

...
exp_mult = LeftAssociative(/[\*\/%]/, exp_paren)
exp_add = LeftAssociative(/[\+-]/, exp_mult)

Since I have about as many levels of priority as in C++ (all its operators plus a few more), I'll let you imagine how useful in can be. I'm not done yet in the parsing expressions, yet I already use it 12 times.

This would be great if combined with an 'import' feature

require(CommaSeparatedList.pegjs)
require(StringLiteral.pegjs)
require(IntegerLiteral.pegjs)

...

Function
 = name:FuncName "(" args:CommaSeparatedList(Literal)  ")" 

Hash
= "{"   props:CommaSeparatedList(Prop)   "}"

Prop
= Key ":" Literal

Literal =
  StringLiteral / IntegerLiteral

(This is a bit more complicated than OP's request, but it seemed too close to justify its own thread.)

I'm building an R5RS Scheme parser with the help of PEG.js. Everything's rosy except for quasiquotations, which require context-aware parsing. It would be useful to be able to parameterize rules for the sake of on-the-fly rule generation from templates, avoiding a large amount of awkward post-processing. For example, a simplified quasiquotation grammar might look like:

    quasiquotation = qq[1]
    qq[n] = "`" qq_template[n]
    qq_template[0] = expression
    qq_template[n] = simple_datum / list_qq_template[n] / unquotation[n]
    list_qq_template[n] = "(" qq_template[n]* ")" / qq[n+1]
    unquotation[n] = "," qq_template[n-1]

I am interested in contributing to the development of this feature if there's any interest in adding it to the tool.

The main reason to do this would be to support grammars sensitive to context, which if I'm not mistaken, most popular languages are (I know for sure that C and python have context specific stuff). According to Trevor Jim, Haskell is also not context free, and asserts that most langauges aren't:

http://trevorjim.com/haskell-is-not-context-free/
http://trevorjim.com/how-to-prove-that-a-programming-language-is-context-free/

Using external state in a parser that can backtrack (like PEG can) is dangerous, and can produce issues such as can be seen in this parser:

{   var countCs = 0;
}

start = ((x/y) ws*)* { return countCs }

x = "ab" c "d"
y = "a" bc "e"

c = "c" { countCs++; }
bc = "bc" { countCs++; }

ws = " " / "\n"

The above returns 2 instead of the correct answer of 1. Issues like this can be hard to reason about, can create insidious hard-to-find bugs, and when found can be very hard to work around at all, much less doing it elegantly. It unclear to me how to even do this without doing post-processing of data returned by PEG. If somehow your parser itself needs the count, its simply out of luck.

Currently, (dangerously) using external state is the only way to parse grammar that are sensitive to context. With parameterized rules, a parser could parse this without risking invalid state:

start = countCs:((x<0>/y<0>) ws*)* { return countCs.reduce(function(a,b){return a+b[0];}, 0); }

x<count> = "ab" theCount:c<count> "d" { return theCount; }
y<count> = "a" theCount:bc<count> "e" { return theCount; }

c<count> = "c" { return count++; }
bc<count> = "bc" { return count++; }

ws = " " / "\n"

David, you asked for real situations, and python's whitespace indentation syntax is clearly an example here. I want to do similar white-space indentation syntax in Lima (the programming language I'm making with PEG). But I would not want to implement anything like that when I could inadvertantly create invalid state that blows everything to hell. I could name any parsing construct that requires context, like C's x* y (is it x times y or y being defined as a pointer to an x-typed value?).

Note that for grammars sensitive to context to be parsable, one would necessarily need to pass information returned from subexpressions already matched into a parameterized rule - otherwise the parser can't actually use any of the context. Here's a real example of a string type I'm considering for Lima that only works if parameterized parsing is available and can access (as variables) lablels of previously matched expressions:

literalStringWithExplicitLength = "string[" n:number ":" characters<n> "]"
number = n:[0-9]* {return parseInt(n.join(''));}
characters<n> = c:. { // base case
  if(n>0) return null; // don't match base case unless n is 0
  else return c;
}
/ c:. cs:characters<n-1> {
  ret c+cs
}

This would be able parse a string like string[10:abcdefghij] . You can't do that with nice pure PEG.js as it stands. You have do something awful like:

{ var literalStringLengthLeft=undefined;
}
literalStringWithExplicitLength = "string[" n:specialNumber ":" characters "]"
specialNumber = n:number {
  literalStringLengthLeft = n;
  return n;
}
number = n:[0-9]* {return parseInt(n.join(''));}
characters = c:character cs:characters? {
  return c + cs
}
character = c:. {
  if(literalStringLengthLeft > 0) {
    literalStringLengthLeft--;
    return c;
  } else {
    literalStringLengthLeft = undefined;
    return null; // doesn't match
  }
}

Many many protocols have this kind of parsing need - for example, IPv4 packets have a field describing its total length. You need that context to properly parse the rest of the packet. Same is true for IPv6, UDP, and probably any other packet-based protocol. Most protocols using TCP are also going to need something like this, since one needs to be able to trasmit multiple completely separate objects using the same conceptual character stream.

Anyways, I hope I've given some good examples and reasons why I think this is not only a nice feature, not only a powerful feature, but really an essential feature that many parsers are missing (including, for the moment, PEG.js).

Pegasus (a project that shares most of its syntax with peg.js) solves this by having a #STATE{} expression that is given the ability to mutate a dictionary of state. This dictionary of state is backtracked when rules are backtracked. This allows it to support significant whitespace parsing (see the wiki entry about Significant Whitespace for the details).

Also, by backtracking the state along with the parsing cursor, memoization can be accomplished for stateful rules as well.

Peg.js could easily do the same, I reckon.

How does Pegasus manage backtracking state when rules backtrack? I can imagine that you could keep a snapshot of the whole program state that changed, and revert it back, but that would be expensive. I could imagine keeping a snapshot of only the variables that changed, but that would either require the user to specify it which would add complexity to creating parsers, or would require the parser to somehow figure out all the state changed in some bit of code. None of these sound ideal, so how does Pegasus do it?

Theoretically, the parser could avoid invalidly executed actions if A. actions are queued up in closures and only executed once the parser has fully completed, and B. because they execute after the parser has completed they couldn't cancel a rule match. Perhaps that scheme would be more optimal than the state backtracking done in pegasus?

Also, fixing the problem of invalid state is very nice indeed, but it doesn't solve the problem of expressibility I brought up related to a string literal like string[10:abcdefghij], but I'm definitely interested in how it works

It doesn't backtrack the whole program's state. It maintains an immutable dictionary of state. It saves the current state dictionary along with the cursor and whenever the cursor is backtracked, the state dictionary gets backtracked with it. The dictionary is immutable anywhere outside of #STATE{} actions, and is COPIED just before each state change.

There is a small performance penalty for setting an extra variable every time you advance the cursor, but this is far outweighed by the ability to memoize stateful rules. Also, this doesn't lead to tons of memory allocation, because the immutable nature of the state dictionary allows it to be shared until it is mutated. For example, if you didn't have state in your parser, there would be only one allocation: a single (empty) state dictionary.

JavaScript doesn't (to my knowledge) have the ability to make an object immutable, but that was mostly a safety feature. Peg.js would just need to copy a state dictionary before processing each #STATE{} code block and backtrack that whenever the cursor is backtracked.

Oh ok, so the user basically does have to specify what state they're changing. Thats pretty cool. But I still don't think it really covers the same benefits that parameterization does. It sounds like its probably useful in its own right for other things.

I just wrote a fork that supplies an environment, accessible using the variable env: https://github.com/tebbi/pegjs
This is the same as the #STATE{} object suggested above.
It is a quick hack, using a (package-)global variable, which is restored whenever a parsing function is left. The copying of env is accomplished with Object.create().

Here is an example grammar that uses it to parse whitespace-defined blocks a la Python:

{
  env.indLevel = -1
}

block =
  empty
  ind:ws* &{
    if (ind.length <= env.indLevel) return false;
    env.indLevel = ind.length;
    return true;
  }
  first:statement? rest:indStatement*
  {
    if (first) rest.unshift(first);
    return rest;
  }

indStatement =
  "\n" empty ind:ws* &{ return env.indLevel === ind.length; }
  stm:statement
  {return stm; }

statement =
    id:identifier ws* ":" ws* "\n"
    bl:block { return [id, bl]; }
  / identifier

identifier = s:[a-z]* { return s.join(""); }

empty = (ws* "\n")*

ws = [ \t\r]

Here is an example input for the resulting parser:

b:
   c
   d:
       e
   f
g

I have the impression that PEG.js doesn't support parameters of any kind on rules - which is surprising. This feature is very important to me.

What I need is simpler than the OP's request - the OP wants to modify the grammar itself depending on the parameter, but at a minimum I just need to pass an integer into a rule. Basically I want to translate an LLLPG rule that looks like this (where PrefixExpr is a high-precedence expression such as a prefix expression like -x, or an identifier...):

@[LL(1)]
rule Expr(context::Precedence)::LNode @{
    {prec::Precedence;}
    e:PrefixExpr(context)
    greedy
    (   // Infix operator
        &{context.CanParse(prec=InfixPrecedenceOf(LT($LI)))}
        t:(NormalOp|BQString|Dot|Assignment)
        rhs:Expr(prec)
        { ... }
    |   // Method_calls(with arguments), indexers[with indexes], generic!arguments
        &{context.CanParse(P.Primary)}
        e=FinishPrimaryExpr(e)
    |   // Suffix operator
        ...
    )*
    {return e;}
};
// Helper rule that parses one of the syntactically special primary expressions
@[private] rule FinishPrimaryExpr(e::LNode)::LNode @{
(   // call(function)
    "(" list:ExprList(ref endMarker) ")"
    { ... }
    |   // ! operator (generic #of)
        "!" ...
    |   // Indexer / square brackets
        {var args = (new VList!LNode { e });}
        "[" args=ExprList(args) "]"
        { ... }
    )
    {return e;}
};

My language has 25 precedence levels, and with these rules I have collapsed almost all of them to be processed by a single rule (you can think of Precedence as a wrapper around a couple of integers that describe the precedence of an operator). Moreover, my language has an infinite number of operators (basically any sequence of punctuation) and the precedence of an operator is decided after it is parsed. While it would be _technically_ possible to parse the language in the usual way, with a separate rule for each of the 25 kinds of operator, that would be a horrible way to do it.

Also you can see here that the inner rule FinishPrimaryExpr constructs a syntax tree that incorporates a parameter passed from the enclosing rule.

So ... is there any way to pass parameters to a PEG.js rule?

+1! In my case, I simply want to generate a parser for a syntax, where some delimiters are globally configurable. In this case, I can achieve this by replacing the delimiter literals by match anything expressions combined with a predicated, but it would be much more elegant (and also more efficient) if the match-everything could simply be replaced by a variable.

+1, any chance to see this implemented in the foreseeable future?

Another use-case. This is from your javascript.pegjs example:

(...)

RelationalExpression
  = head:ShiftExpression
    tail:(__ RelationalOperator __ ShiftExpression)*
    { return buildBinaryExpression(head, tail); }

RelationalOperator
  = "<="
  / ">="
  / $("<" !"<")
  / $(">" !">")
  / $InstanceofToken
  / $InToken

RelationalExpressionNoIn
  = head:ShiftExpression
    tail:(__ RelationalOperatorNoIn __ ShiftExpression)*
    { return buildBinaryExpression(head, tail); }

RelationalOperatorNoIn
  = "<="
  / ">="
  / $("<" !"<")
  / $(">" !">")
  / $InstanceofToken

(...)

  (...)
  / ForToken __
    "(" __
    init:(ExpressionNoIn __)? ";" __
    test:(Expression __)? ";" __
    update:(Expression __)?
    ")" __
    body:Statement
  (...)

(...)

All these ...NoIn rules (and there are a LOT of them) are required simply because of the for in statement. Wouldn't a much better approach to this be something like:

(...)

RelationalExpression<allowIn>
  = head:ShiftExpression
    tail:(__ RelationalOperator<allowIn> __ ShiftExpression)*
    { return buildBinaryExpression(head, tail); }

RelationalOperator<allowIn>
  = "<="
  / ">="
  / $("<" !"<")
  / $(">" !">")
  / $InstanceofToken
  / &{ return allowIn; } InToken
    {return "in";}

(...)

  (...)
  / ForToken __
    "(" __
    init:(Expression<false> __)? ";" __
    test:(Expression<true> __)? ";" __
    update:(Expression<true> __)?
    ")" __
    body:Statement
  (...)

(...)

This feels very similar to how, for example, the JavaScript grammar is specified: https://tc39.github.io/ecma262/#prod-IterationStatement (note the ~In)

A language I'm currently developing has this exact problem: I'd like to disable/enable some rules at certain points only. I'd very much like to refrain from duplicating every affected rule like you did for the JavaScript grammar.

Is there any alternative way to achieve this without duplicating the rules?

+1, any chance to see this implemented in the foreseeable future?

This issue has a milestone assigned (post-1.0.0). The current version of PEG.js is 0.10.0. Obviously, post-1.0.0 issues will be dealt with after releasing 1.0.0, which will happen at some point after releasing 0.11.0 according to the roadmap.

This should answer your question. The best way to accelerate the whole process is to help with issues targeted for 0.11.0 and 1.0.0.

Is there any alternative way to achieve this without duplicating the rules?

A possible way is to track the state manually and then use semantic predicates. But this approach has issues with backtracking and I would not recommend it (in other words, when given a choice between rule duplication and manual state tracking, I would choose duplication).

There are two types of arguments that could be passed to parsers:

  1. Values. Grammars for languages like Python, Nim and Haskell (and also Scheme in a different way) depend on "depth" of the expression inside of the tree. Writing correct grammar requires to somehow pass this context.
  2. Parsers. leftAssociative(element, separator), escapedString(quote) and withPosition(parser) are good examples of that.

There should be the way to somehow mark whether the argument is a parser or a value. When I tried to figure out the correct approach, I ended up using global variables for context, and that's obviously a dead end. Does anyone have any ideas on that?

How about macros?

Given:

Add <Expression, Add>
  = left:Expression _ '+' _ right:Add
    { return { type: 'add', left, right } }
  / Expression

When:

  = Add <MyExpression, MyAdd>

MyExpression
  = [0-9]+

Then:

  = left:MyExpression _ '+' _ right:MyAdd
    { return { type: 'add', left, right } }
  / MyExpression

MyExpression
  = [0-9]+

This allow us to build rules from the bottom-up :smirk:

I agree, I recommend that developers add this feature :)

I really need this feature for an updated JavaScript grammar I'm writing, so this is high on my wish list. Will give it a go and see how it works out.

@samvv I've come across this from a very different route, and haven't read the whole thread yet.
However, in #572, from which I referred to here, I'm showing a technique with which you can simulate parameterized rules.

That is, in essence: return functions as intermediate parse results.

That "trick" is by no means my invention, and probably rather clunky for your purpose I guess. But it might be a work-around for you. I mean until "post v1.0"... :)

@meisl Cool, thanks for the tip! Will try it out when I find some time.

@samvv Ooh, ah... I'm afraid I have overlooked something rather important:

It makes quite a difference whether you want the parameterized rule to

  • only be able to produce values, which depend on the parameter
  • or (also) to have its parsing decisions depend on the parameter

What I was proposing only helps with the former - while the latter is the actual problem of the OP...
Sorry for that.

However, there is a workaround even for the latter, albeit even MORE clunkier.
And, the "dependent decisions" part doesn't have anything to do with returning functions...

I'm appending an example for you to try out in https://pegjs.org/online

The basic idea is: use global state to remember the current "terminator". That's quite a hack, admittedly, and repetitive.
But: adding yet another delimiter, say | would just mean to add one more alternative to str:

  / (t:'|' {term = t}) c:conts t:.&{ return isT(t) }  { return c }

which differs from the others only in that one very character |


{
  var term;
  function isT(ch) { return ch === term }
  function isE(ch) { return ch === '\\' }
}
start = str*

str
  = (t:'\"' {term = t}) c:conts t:.&{ return isT(t) }  { return c }
  / (t:'\'' {term = t}) c:conts t:.&{ return isT(t) }  { return c }

conts
  = c:(
        '\\' t:.&{ return isT(t) || isE(t) } { return t }
      /      t:.!{ return isT(t)           } { return t }
    )*
    { return c.join('') }

... on inputs

  • "abc" -> ["abc"]
  • "a\"bc" -> ["a\"bc"]
  • "a\\bc" -> ["a\bc"]
  • "a\\b\"c"'a\\b\'' -> ["a\b\"c", "a\b'c"]

p.s.: that is really NOT something one would want to write by hand, I know. But hey, imagine it'd be generated for you... I think in principle that is how.

@ceymard - I realize it's ten years later, but I'm curious how this is different than #36

Wow it took me a while to remember. 10 years !

In this PR, rules take arguments and can be parametrized. This is meant to be used by the grammar itself to avoid repeating similar-yet-different rules.

In #36, the rules are specified outside of the grammar itself. The grammar is thus itself parametrized.

I think the scope is different, though one could argue that a grammar is itself a rule and thus this is the same issue. I think it is not though, as #36 would probably mean some slight API changes, while this PR would not.

So, to abuse C++ ish terminology in a deeply incorrect way, the former are template statics, whereas the latter are constructor calls?

I guess this analogy somewhat works, yes.

Thank you for your time in explanation

I'll probably have another question for you in ten years. Have a good 2020s

It would be really useful in removing redundancy of my parser definition. I have a custom grammar that's on purpose very relaxed, and some rules need to be applied in slightly different contexts.

Was this page helpful?
0 / 5 - 0 ratings