Pegjs: Implement a simpler way to express lists with a separator

Created on 21 Sep 2012  ·  25 Comments  ·  Source: pegjs/pegjs

When we have a right recursion we have to do something like this:

Statements
  = head:Statement tail:(__ Statement)* {
      var result = [head];
      for (var i = 0; i < tail.length; i++) {
        result.push(tail[i][1]);
      }
      return result;
    }
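For reference, the action above amounts to a plain JavaScript helper. A minimal sketch (`buildList` is a name introduced here for illustration, not a PEG.js API):

```javascript
// Sketch of the list-building boilerplate from the rule above:
// `head` is the first Statement's value, `tail` is an array of
// [__, Statement] pairs; keep only each pair's second element.
function buildList(head, tail) {
  var result = [head];
  for (var i = 0; i < tail.length; i++) {
    result.push(tail[i][1]);
  }
  return result;
}
```

For example, `buildList('a', [[' ', 'b'], [' ', 'c']])` returns `['a', 'b', 'c']`.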

I would find it useful to be able to do the same as follows:

Statements
  = st:Statement (__ st:Statement)* { return st; /*where st is an array*/ }

All 25 comments

Related: #69

This is not really about a right-recursion helper, but about lists of items separated by a separator. This is a fairly common pattern in language grammars and as such it probably deserves a simplification. The question is how to do it.

I don't like the solution proposed in the issue description. This is just black magic and is not in line with label scope rules as outlined in #69. Instead, I am currently thinking about the following two solutions:

Special syntax

Imagine something like this:

Args = args:Arg % ","

The meaning would be "a list of Args separated by ","s". The args variable would contain an array of whatever Arg produces; the separators would be discarded.

One question is how to distinguish separated lists allowing zero or more items from the ones allowing one or more items. Experience shows both are needed. A possible answer is to append or prepend * or + to the % operator:

Args0 = args:Arg %* ","
Args1 = args:Arg %+ ","
Args0 = args:Arg *% ","
Args1 = args:Arg +% ","

The % operator could be even implemented as a modifier of the existing * and + operators:

Args0 = args:Arg* % ","
Args1 = args:Arg+ % ","

Pros: Simple to implement, introduces no new concepts.
Cons: Specific solution to a specific problem, not generic.
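To make the intended semantics concrete, here is a hand-rolled combinator sketch (hypothetical helpers, not the PEG.js API): `sepBy1(item, sep)` matches `item (sep item)*` and keeps only the item values, as `+%` would.

```javascript
// Hypothetical literal matcher: succeeds if `str` occurs at `pos`.
function lit(str) {
  return function (input, pos) {
    return input.substr(pos, str.length) === str
      ? { value: str, pos: pos + str.length }
      : null;
  };
}

// Sketch of `item +% sep`: one or more items, separators dropped.
function sepBy1(item, sep) {
  return function (input, pos) {
    var first = item(input, pos);
    if (!first) return null;
    var values = [first.value];
    var p = first.pos;
    for (;;) {
      var s = sep(input, p);
      if (!s) break;
      var next = item(input, s.pos);
      if (!next) break; // a trailing separator is left unconsumed
      values.push(next.value);
      p = next.pos;
    }
    return { value: values, pos: p };
  };
}
```

For example, `sepBy1(lit('a'), lit(','))('a,a,a', 0)` yields `{ value: ['a', 'a', 'a'], pos: 5 }`; note that, unlike `(Arg ",")*`, a trailing separator is not consumed.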

Parametric Rules

The second solution is to use parametric rules, already proposed in #45. My current idea about the syntax:

// Template definition
List<item, separator> = head:item tail:(separator item)* { ... boilerplate ... }

// Template use
Args = List<Arg, ",">

This way boilerplate code would be repeated at most once in the grammar. The issue with two kinds of lists can be solved by two templates. These templates could even be built-in.

Pros: Generic, can eliminate other types of boilerplate too.
Cons: Complex to implement, introduces a new concept.


I have not decided which way to go yet. I'd love to hear any thoughts/suggestions/alternative proposals.

Template definition seems like the way to go for this, especially if there can be some (optional) built-in templates, like list.

I have a sister project to peg.js (otac0n/pegasus, it's basically a port for C#) that already uses angle brackets in that position for the data-type of the rule, but it seems like I could figure something out if you went with this.

Just my two cents:

Special syntax

This syntax should also be able to distinguish lists allowing two or more items. It's a pretty common pattern that when the list has two or more items, you want to wrap it inside a container node, whereas when it has only one item, you just return that item.

Also, the separator might not always be dropped:

complexSelector = simpleSelectors:simpleSelector % (ws* [>+~] ws* / ws+)

This is a CSS selector example, in which you probably want to know which combinators were used.

Parametric Rules

I like this idea, but it seems that what templates offer is still pretty limited: only expressions can be parameterized. If you only want to swap * with +, you have to use a different template. You can of course nest templates like this:

AbstractList<head, tail> = head:head tail:tail { tail.unshift(head); return tail;  }
List<item, separator> = AbstractList<item, (separator item)*>
List2<item, separator> = AbstractList<item, (separator item)+>

but:

  • naming templates is difficult

    • the name List2 says little about its behavior, and with its definition abstracted away, the situation gets worse

    • you certainly don't want to use ListWithTwoOrMoreItems

  • is it really better than:

{ var list = function(head, tail) { tail.unshift(head); return tail; } }

args = head:arg tail:(',' a:arg {return a})* { return list(head, tail) }
args2 = head:arg tail:(',' a:arg {return a})+ { return list(head, tail) }

I personally find this more explicit and hence more readable.

  • even though expressions can be parameterized, the action is likely to assume they have certain properties. For example, if head is an array, AbstractList will fail. So it's likely that even though a template's expressions match the current rule, you won't actually be able to use it.

The real problem

What this issue is really about is that users want pegjs to merge values of expressions for them, so they don't have to do it manually in actions.

I'm wondering: if labels are reused like this, does it make it obvious that the user wants certain values to be merged?

args = args:arg args:(',' a:arg {return a})* { /* args is an array of "arg"s */ }
args2 = args:arg args:((',' / ';') a:arg)* { /* args is an array of "arg"s and separators "," or ";" */ }

Notice the second rule flattens the second value of args

The first rule implies pegjs needs to test the types of values:

items = items:item1 items:item2

If neither item1 nor item2 is an array, items is [item1, item2]; otherwise items is the concatenation of the two.
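That merge rule can be sketched as a plain JavaScript function (illustrative only; `mergeLabels` is not a proposed API name):

```javascript
// Sketch of the proposed label-merging rule: if neither value is an
// array, pair them; otherwise concatenate, wrapping scalars as needed.
function mergeLabels(a, b) {
  if (!Array.isArray(a) && !Array.isArray(b)) return [a, b];
  var left = Array.isArray(a) ? a : [a];
  var right = Array.isArray(b) ? b : [b];
  return left.concat(right);
}
```

For example, `mergeLabels(1, 2)` gives `[1, 2]`, while `mergeLabels([1], [2, 3])` gives `[1, 2, 3]`.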

The second rule, however, implies some strange behavior that might need to be changed.

items = items:item1 items:item2

If either item1 or item2 is an array of arrays, it needs to be flattened, but it should stay as it is when its label is unique:

items = items:item1 other:item2

But since users who use one label for two expressions have likely already switched their minds into "merge mode", it might be less confusing than it appears.

curvedmark, I don't agree with your bulleted cons to parametric rules. It seems that your first two bullet points both stem from the idea that modularity and abstraction somehow make things harder to read or understand. This is clearly false, as 50 years of computer science has shown. Abstractions are the root of all power in programming.

If you have problems naming variables, that's not the fault of the language. Expecting someone to read the internals of your rule to understand what your poorly named variables do is not a solution, it is a kludge. I would argue that you certainly do want names like "ListWithTwoOrMoreItems" - it documents your code, meaning you don't then have to write a comment saying what "List2" means.

"is it really better than ..." - yes, it is: far cleaner, easier to read, and more maintainable. The situation becomes even clearer with even _slightly_ more complicated rules.

David, I don't understand the need for special syntax here. This:
Args = args:Arg % ","
could be done like this:
Args = args:(Arg ",")* { return args.map(function(v){ return v[0]; }) }

Map (and of course reduce as well) is a super useful function I've used in quite a few places when using PEG.js. Yes, it's a little longer than the special syntax, but (a) it's far more flexible, and (b) it doesn't require anyone to learn new PEG syntax. Certainly there is no need to create short and ugly loops for behavior like this.

It seems that your first two bullet points both stem from the idea that modularity and abstraction somehow makes things harder to read or understand.

No, that's not what I meant. PEG already contains a fantastic abstraction mechanism called rules:

    = number operator number

in this case, both number and operator are abstractions, and I certainly love it.

The template syntax, on the other hand, tries to abstract this rule by parameterizing expressions. But remember, PEG is a declarative language; the grammar part has no knowledge of the nature of expressions and allows no conditional syntax. Parameterizing simply means replacing here. Compared to rules, this hardly abstracts anything.

If the goal is to reuse the structure of the rule, you might as well write a more general rule and fail it in the action.

The situation becomes even more clear with even slightly more complicated rules

Could you provide a few real-world use cases where the template syntax could be useful, other than lists with separators or strings with different quotes, which essentially deal with merging expressions and should have a more targeted syntax?

David, I don't understand the need for special syntax here.

Args = args:(Arg ",")* is different than Args = args:Arg % ",". The former allows the list to end with a trailing ","; the latter doesn't.

Parameterizing simply means replacing here.

Couldn't you say the same for functions in any programming language? I'm sure we both agree functions are useful, so why not in a parser?

Could you provide a few real-world use cases

I've already written some examples in the main issue here: https://github.com/dmajda/pegjs/issues/45

The former allows the list to end with a trailing ","

Ah I see, you're right. Anyways my point still stands that it could be done like this:
Args = first:Arg rest:("," Arg)* { return [first].concat(rest.map(function(v){ return v[1]; })) }

The nice thing about functions is that if you're doing this a lot, you can create a function for it and knock it down to:
Args = first:Arg rest:("," Arg)* { return yourFancyFunction(first,rest) }
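A sketch of what such a helper might look like, keeping the hypothetical name `yourFancyFunction` from the comment above and assuming `rest` holds `["," , Arg]` pairs:

```javascript
// Hypothetical helper: prepend `first` to the item part of each
// (separator, item) pair in `rest`, dropping the separators.
function yourFancyFunction(first, rest) {
  return [first].concat(rest.map(function (v) { return v[1]; }));
}
```

For example, `yourFancyFunction('a', [[',', 'b'], [',', 'c']])` returns `['a', 'b', 'c']`.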

And if you had rule templates you could get even simpler: Args = list<Arg,",">

It seems you are talking about a totally different feature.

The syntax proposed by David is to use the parameters in the grammar part, and as I said, it simply replaces things. You can't test a parameter's value, and you can't specify different grammar structures for different parameter values.

What you are suggesting is to use them in the action (if I understand it correctly), but I'm not sure how it works. The first "count" example at #45 doesn't say where the initial value for count comes from - or are you just assuming all parameters have a default value of 0?

Maybe you should open a new issue for it.

@dmajda, I'm thinking of another syntax to solve the merging problem, which behaves much like the $ expression syntax.

The $ expression syntax returns the matched string, no matter how the expression is structured. Likewise, how about introducing a, say, # expression syntax (or anything of that sort), which returns an array of matched sub-expressions, no matter how the expression is structured:

args = #(arg (',' a:arg {return a})*) // args is [arg, arg...]

args = #(arg (',' arg)*) // args is [arg, ',', arg, ',', ...]

However, if you write it this way:

args = #(arg restArgs) // args is [arg, [',', arg, ',', arg, ...]]

restArgs = #(',' arg)* // restArgs is [',', arg, ',', arg, ...]

it doesn't behave exactly like the $ expression.
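One way to see the problem: a naive value-level flatten (a sketch, not a proposed PEG.js API) cannot tell whether an array came from a sub-expression inside the `#` or from a separate rule like `restArgs`:

```javascript
// Naive sketch of `#` as a deep flatten over the match result. It
// cannot distinguish an array produced by a sub-expression (which `#`
// should flatten) from one returned by another rule such as restArgs
// (which arguably it should not) - hence the asymmetry noted above.
function hashFlatten(value) {
  if (!Array.isArray(value)) return [value];
  return value.reduce(function (acc, v) {
    return acc.concat(hashFlatten(v));
  }, []);
}
```

For example, `hashFlatten(['a', [[',', 'b'], [',', 'c']]])` returns `['a', ',', 'b', ',', 'c']` regardless of whether the nested array came from an inline `(',' arg)*` or from a `restArgs` rule.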

@curvedmark, he mentioned in an email to me that his proposal here matched what he understood my proposal to be. But maybe you're right. Regardless, my proposal isn't "completely different" - it's a generalization of what you think he's proposing. Maybe @dmajda can clarify himself. Should I open a new issue for this, David?

You were right about my count example. I updated my comment to have the (hopefully) correct structure. Thanks.

For what it's worth, I'd be thrilled for either of David's two suggestions. Parametric rules are extremely powerful and would be useful for many things, so I think I'd lean towards that solution. While I agree with @otac0n in principle that it'd be nice to have builtin abstractions or even for people to be able to share abstractions modularly, I'd Keep It Simple and just start with the abstraction facility. You can solve those additional problems in the future. Just providing template abstraction would be a net improvement in conciseness and eliminating code duplication.

Perl's Regexp::Grammars does this with the modulus operator as well:

# a list of one or more "item", separated by "separator", returning an array:
<rule: list>
        <[item]>+ % <separator>

It's sort of an arguable use of the modulus. The floating point modulus defines a quotient group, a half-open interval of reals (IEEE 754 floats really). It's a sloppy analogy since the items returned by the Regexp::Grammar modulus are only identical up to a pattern (and more importantly since strings aren't a group) but it's close enough.

Instead of making it a built-in operator, I've just made parsers that could be parameterized by parsers.
Metagrammar here.

@futagoza Is there a ticket for parameterized parsers? I think there was one, but I couldn't find it.

The most related one I could find is #45 but the OP of that issue proposes a different syntax (similar to what you've implemented in your metagrammar @polkovnikov-ph), but I plan to go with parametric rules (like @dmajda suggested above) that use the more common syntax of < .. > (templates, generics, etc).

@futagoza Syntax doesn't really matter. Whatever you end up doing with this issue, I greatly appreciate that.

The choice of symbol matters a lot, as this can interfere with other high priority features, like @Mingun 's range patch

There isn't a particularly sensible alternative for ranges, so my inclination would be to preserve angle braces for those

Being honest I'm kind of a parser noob. My solution to this seems simple and I'm a little worried that it might not work at scale, but why not do something like

fname = "bob" / "dan" / "charlie"
namelist = (WS? fname ","?)+

If there's an actual need here, I support it - lists are among the most common things a parser has to do, and it seems like there probably is an actual need here, because several experienced users are in this thread and none of them has said otherwise.

So my solution probably isn't good enough, but I'd like to learn why

So my solution probably isn't good enough, but I'd like to learn why

Mostly because it accepts some inputs that in real life must be rejected. For example, it parses bobdan.

Fair. That was naïve of me.

This allows bob, dan, and bob,dan, but not bobdan. What did I miss here?

Document = names
WS = [ \r\n]+

fname = "bob" / "dan" / "charlie"

nameitem = (WS? fname ",")+ 
namelastitem = (WS? fname)

namelist = nameitem? namelastitem

@polkovnikov-ph - there's also #36, which is similar to #45 but not identical

The author's explanation seems to suggest that #36 is closer to what you meant.

Now it parses only what is expected, but the result is not a single array. This is the main motivation for having a special construct for parsing delimited data.

Okay, this is starting to look like something that genuinely should have an operator, as an ease-of-use thing. Most users won't have a Russian expert to ask dumb questions to 😁

What about

Document = names

WS = [ \r\n]+

fname = "bob" / "dan" / "charlie"

namelist = nl:(namelast ",")+ { return nl.map(function(v){ return v[0]; }); }
namelast = WS? fn:fname       { return fn; }

names = nl:namelist? na:namelast { return [].concat(nl || [], na); }

Or Ukrainian or whatever. I apologize: I saw a Cyrillic name. I shouldn't guess like that. Getting that wrong can be offensive.

Works, of course, but here we come back to the question in the issue title -- a _simpler way_. Separated data occur often enough to deserve a dedicated syntax. In my practice this is the second most common use for my ranges; the first is repetitions with variable length (peg = len:number someData|len|; in my range syntax).

And yes, I'm from Russia. Don't worry, no offense

Okay, this is starting to look like something that genuinely should have an operator, as an ease of use thing

here we approach the question put in the headline issue -- a simpler way.

Yeah, I agree now. It's just that my first wrong attempt looked really simple, and if it wasn't wrong, that would have been simple enough to not bother

but now that I see what's required, I agree
