When we have a right recursion we have to do something like this:
Statements
  = head:Statement tail:(__ Statement)* {
      var result = [head];
      for (var i = 0; i < tail.length; i++) {
        result.push(tail[i][1]);
      }
      return result;
    }
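For reference, the same thing can be written a bit more compactly today with the usual inner-action idiom (still boilerplate, just shorter):
Statements
  = head:Statement tail:(__ s:Statement { return s; })* { return [head].concat(tail); }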
I would find it useful to be able to do the same as follows:
Statements
  = st:Statement (__ st:Statement)* { return st; /* where st is an array */ }
Related: #69
This is not really about a right recursion helper, but about lists of items separated by a separator. This is a fairly common pattern in language grammars and as such it probably deserves a simplification. The question is how to do it.
I don't like the solution proposed in the issue description. This is just black magic and is not in line with label scope rules as outlined in #69. Instead, I am currently thinking about the following two solutions:
Imagine something like this:
Args = args:Arg % ","
The meaning would be "a list of `Arg`s separated by `","`s". The `args` variable would contain an array of whatever `Arg` produces; the separators would get forgotten.
One question is how to distinguish separated lists allowing zero or more items from the ones allowing one or more items. Experience shows both are needed. A possible answer is to append or prepend `*` or `+` to the `%` operator:
Args0 = args:Arg %* ","
Args1 = args:Arg %+ ","
Args0 = args:Arg *% ","
Args1 = args:Arg +% ","
The `%` operator could even be implemented as a modifier of the existing `*` and `+` operators:
Args0 = args:Arg* % ","
Args1 = args:Arg+ % ","
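Under this proposal, a rule like `Args1 = args:Arg+ % ","` would be roughly a shorthand for the following handwritten rule (a sketch of the intended expansion, assuming the separators are simply dropped):
Args1
  = head:Arg tail:("," a:Arg { return a; })* { return [head].concat(tail); }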
Pros: Simple to implement, introduces no new concepts.
Cons: Specific solution to a specific problem, not generic.
The second solution is to use parametric rules, already proposed in #45. My current idea about the syntax:
// Template definition
List<item, separator> = head:item tail:(separator item)* { ... boilerplate ... }
// Template use
Args = List<Arg, ",">
This way boilerplate code would be repeated at most once in the grammar. The issue with two kinds of lists can be solved by two templates. These templates could even be built-in.
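For illustration, the two built-in templates could look roughly like this (hypothetical syntax and names, following the `List<item, separator>` shape above):
// One or more items, separators dropped
List1<item, separator>
  = head:item tail:(separator i:item { return i; })* { return [head].concat(tail); }
// Zero or more items
List0<item, separator>
  = list:List1<item, separator>? { return list || []; }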
Pros: Generic, can eliminate other types of boilerplate too.
Cons: Complex to implement, introduces a new concept.
I have not decided which way to go yet. I'd love to hear any thoughts/suggestions/alternative proposals.
Template definition seems like the way to go for this, especially if there can be some (optional) built-in templates, like list.
I have a sister project to peg.js (otac0n/pegasus, it's basically a port for C#) that already uses angle brackets in that position for the data-type of the rule, but it seems like I could figure something out if you went with this.
Just my two cents:
This syntax should also be able to distinguish lists allowing two or more items. It's a pretty common pattern that when the list has two or more items, you'd want to wrap it inside a container node, whereas when it has only one item, you just return that item.
Also, the separator might not always be dropped:
complexSelector = simpleSelectors:simpleSelector % (ws* [>+~] ws* / ws+)
This is a CSS selector example, in which you probably want to know which combinators are used.
I like this idea, but it seems that what templates offer is still pretty limited: only expressions can be parameterized. If you only want to swap `*` with `+`, you have to use a different template. You can of course nest templates like this:
AbstractList<head, tail> = head:head tail:tail { tail.unshift(head); return tail; }
List<item, separator> = AbstractList<item, (separator item)*>
List2<item, separator> = AbstractList<item, (separator item)+>
but:

- `List2` says little about its characteristic, and with its definition being abstracted, the situation is made worse
- even with a name like `ListWithTwoOrMoreItems`, is it really better than

  { var list = function(head, tail) { tail.unshift(head); return tail; } }
  args = head:arg tail:(',' a:arg {return a})* { return list(head, tail) }
  args2 = head:arg tail:(',' a:arg {return a})+ { return list(head, tail) }

  I personally find this more explicit and hence more readable.
- if `head` is an array, `AbstractList` will fail. So it's likely that although an expression in a template matches the current rule, you won't actually use it.

What this issue is really about is that users want pegjs to merge values of expressions for them, so they don't have to do it manually in actions.
I'm wondering: if labels are reused like this, does it make it obvious that users want certain values to be merged?
args = args:arg args:(',' a:arg {return a})* { /* args is an array of "arg"s */ }
args2 = args:arg args:((',' / ';') a:arg)* { /* args is an array of "arg"s and separators "," or ";" */ }
Notice the second rule flattens the second value of `args`. The first rule implies pegjs needs to test the types of values:
items = items:item1 items:item2
If neither of `item1` and `item2` is an array, `items` is `[item1, item2]`; otherwise `items` is the concatenation of the two.
The second rule, however, implies some strange behavior that might need to be changed.
items = items:item1 items:item2
If either of `item1` or `item2` is an array of arrays, it needs to be flattened, but it would stay as it is when its label is unique:
items = items:item1 other:item2
But since users who reuse one label for two expressions have likely already switched into "merge mode", it might be less confusing than it appears.
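To make the intended behavior concrete, this is what the two rules above would be expected to produce (purely hypothetical, nothing like this is implemented):
// args  on input "a,b,c" -> ["a", "b", "c"]
// args2 on input "a,b;c" -> ["a", ",", "b", ";", "c"]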
curvedmark, I don't agree with your bulleted cons to parametric rules. It seems that your first two bullet points both stem from the idea that modularity and abstraction somehow make things harder to read or understand. This is clearly false, as 50 years of computer science has shown. Abstractions are the root of all power in programming.
If you have problems naming variables, that's not the fault of the language. Expecting someone to read the internals of your rule to understand what your poorly named variables do is not a solution, it is a kludge. I would argue that you certainly do want names like "ListWithTwoOrMoreItems" - it documents your code, meaning you don't then have to write a comment saying what "List2" means.
"is it really better than ..." - yes is it is, far cleaner, easier to read, and more maintainable. The situation becomes even more clear with even _slightly_ more complicated rules
David, I don't understand the need for special syntax here. This:
Args = args:Arg % ","
could be done like this:
Args = args:(Arg ",")* { return args.map(function(v){ return v[0]; }) }
Map (and of course reduce as well) is a super useful function I've used in quite a few places when using PEG.js. Yes, it's a little longer than the special syntax, but A. it's far, far more flexible, and B. it doesn't require anyone to learn new PEG syntax. Certainly there is no need to create short and ugly loops for behavior like this.
It seems that your first two bullet points both stem from the idea that modularity and abstraction somehow makes things harder to read or understand.
no, that's not what I meant. PEG already contains a fantastic abstraction mechanism called rules:
= number operator number
In this case, both `number` and `operator` are abstractions, and I certainly love it.
The template syntax, on the other hand, tries to abstract this rule by parameterizing expressions. But remember, PEG is a declarative language: the grammar part has no knowledge of the nature of expressions and allows no conditional syntax. Parameterizing simply means replacing here. Compared to rules, this hardly abstracts anything.
If the goal is to reuse the structure of the rule, you might as well write a more general rule and fail it in the action.
The situation becomes even more clear with even slightly more complicated rules
Could you provide a few real-world use cases where the template syntax could be useful, except for lists with separators or strings with different quotes, which essentially deal with merging expressions and should have a more targeted syntax?
David, I don't understand the need for special syntax here.
`Args = args:(Arg ",")*` is different than `Args = args:Arg % ","`. The former allows the rule to end with `,`, the latter doesn't.
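In other words (assuming `Arg` matches a single letter and `Args` must consume the whole input):
// Args = args:(Arg ",")*   accepts "a,b," (every Arg must be followed by ","), but not "a,b"
// Args = args:Arg % ","    accepts "a,b" and would reject a trailing ","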
Parameterizing simply means replacing here.
Couldn't you say the same for functions in any programming language? I'm sure we both agree functions are useful, so why not in a parser?
Could you provide a few real-world use cases
I've already written some examples in the main issue here: https://github.com/dmajda/pegjs/issues/45
The former allows the rule to end with `,`
Ah I see, you're right. Anyway, my point still stands that it could be done like this:
Args = first:Arg rest:("," Arg)* { return [first].concat(rest.map(function(v){ return v[1]; })) }
The nice thing about functions is that if you're doing this a lot, you can create a function for it and knock it down to:
Args = first:Arg rest:("," Arg)* { return yourFancyFunction(first,rest) }
And if you had rule templates you could get even simpler: Args = list<Arg,",">
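For completeness, here is a minimal sketch of such a helper living in the grammar initializer (`yourFancyFunction` is just the placeholder name used above):
{
  // Collect the Arg values: "first" is the first Arg, each element of "rest" is [",", Arg].
  function yourFancyFunction(first, rest) {
    return [first].concat(rest.map(function(v) { return v[1]; }));
  }
}

Args = first:Arg rest:("," Arg)* { return yourFancyFunction(first, rest); }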
It seems you are talking about a totally different feature.
The syntax proposed by David is to use the parameters in the grammar part, and as I said, it simply replaces things. You can't test a parameter's value, and you can't specify different grammar structures for different parameter values.
What you are suggesting is to use them in the action (if I understand it correctly), but I'm not sure how it works. The first "count" example at #45 doesn't say where the initial value for `count` comes from, or are you just assuming all parameters have a default value of `0`?
Maybe you should open a new issue for it.
@dmajda, I'm thinking of another syntax to solve the merging problem, which behaves much like the `$ expression` syntax.
The `$ expression` syntax returns the matched string, no matter how the expression is structured. Likewise, how about introducing a, say, `# expression` syntax (or anything of that sort), which returns an array of matched sub-expressions, no matter how the expression is structured:
args = #(arg (',' a:arg {return a})*) // args is [arg, arg...]
args = #(arg (',' arg)*) // args is [arg, ',', arg, ',', ...]
However, if you write it this way:
args = #(arg restArgs) // args is [arg, [',', arg, ',', arg, ...]]
restArgs = #(',' arg)* // restArgs is [',', arg, ',', arg, ...]
it doesn't behave exactly like `$ expression`.
@curvedmark, he mentioned in an email to me that his proposal here matched what he understood my proposal to be. But maybe you're right. Regardless, my proposal isn't "completely different" - it's a generalization of what you think he's proposing. Maybe @dmajda can clarify himself. Should I open a new issue for this, David?
You were right about my count example. I updated my comment to have (hopefully) the correct structure. Thanks.
For what it's worth, I'd be thrilled for either of David's two suggestions. Parametric rules are extremely powerful and would be useful for many things, so I think I'd lean towards that solution. While I agree with @otac0n in principle that it'd be nice to have builtin abstractions or even for people to be able to share abstractions modularly, I'd Keep It Simple and just start with the abstraction facility. You can solve those additional problems in the future. Just providing template abstraction would be a net improvement in conciseness and eliminating code duplication.
Perl's Regexp::Grammars does this with the modulus operator as well:
# a list of one or more "item", separated by "separator", returning an array:
<rule: list>
<[item]>+ % <separator>
It's sort of an arguable use of the modulus. The floating point modulus defines a quotient group, a half-open interval of reals (IEEE 754 floats, really). It's a sloppy analogy, since the items returned by the Regexp::Grammars modulus are only identical up to a pattern (and more importantly since strings aren't a group), but it's close enough.
Instead of making it a built-in operator, I've just made parsers that could be parameterized by parsers.
Metagrammar here.
@futagoza Is there a ticket for parameterized parsers? I think there was one, but I couldn't find it.
The most related one I could find is #45, but the OP of that issue proposes a different syntax (similar to what you've implemented in your metagrammar @polkovnikov-ph), whereas I plan to go with parametric rules (like @dmajda suggested above) that use the more common `< .. >` syntax (templates, generics, etc.).
@futagoza Syntax doesn't really matter. Whatever you end up doing with this issue, I greatly appreciate that.
The choice of symbol matters a lot, as this can interfere with other high priority features, like @Mingun's range patch.
There isn't a particularly sensible alternative for ranges, so my inclination would be to preserve angle braces for those.
Being honest, I'm kind of a parser noob. My solution to this seems simple and I'm a little worried that it might not work at scale, but why not do something like:
fname = "bob" / "dan" / "charlie"
namelist = (WS? fname ","?)+
If there's an actual need here, I support it - lists are amongst the most common things a parser has to do, and it seems like there probably is an actual need here, because several strong users are in this thread and none of them have said otherwise.
So my solution probably isn't good enough, but I'd like to learn why
So my solution probably isn't good enough, but I'd like to learn why
Mostly because it accepts some inputs that in real life must be unacceptable. For example, it parses `bobdan`.
Fair. That was naïve of me.
This allows `bob`, `dan`, and `bob,dan`, but not `bobdan`. What did I miss here?
Document = namelist
WS = [ \r\n]+
fname = "bob" / "dan" / "charlie"
nameitem = (WS? fname ",")+
namelastitem = (WS? fname)
namelist = nameitem? namelastitem
@polkovnikov-ph - there's also #36, which is similar to #45 but not identical
The author's explanation seems to suggest that #36 is closer to what you meant.
Now it parses only what is expected, but the result is not a single array. This is the main motivation for having a special construct to parse delimited data.
Okay, this is starting to look like something that genuinely should have an operator, as an ease of use thing. Most users won't have a Russian expert to ask dumb questions to 😁
What about
Document = names
WS = [ \r\n]+
fname = "bob" / "dan" / "charlie"
namelist = nl:(namelast ",")+ { return nl.map(function(v) { return v[0]; }); }
namelast = WS? fn:fname { return fn; }
names = nl:namelist? na:namelast { return [].concat(nl || [], na); }
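If I've read that right, an input like `bob, dan,charlie` should now come back as one flat array (my expectation, not tested):
// parse("bob, dan,charlie") -> ["bob", "dan", "charlie"]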
Or Ukrainian or whatever. I apologize: I saw a Cyrillic name. I shouldn't guess like that. Getting that wrong can be offensive.
Works, of course, but here we approach the question put in the headline issue: a _simpler way_. Separated data come up often enough to deserve a separate syntax. In my practice this is the second most common place where I use ranges; the first is repetitions with variable data (`peg = len:number someData|len|;` in my range syntax).
And yes, I'm from Russia. Don't worry, no offense taken.
Okay, this is starting to look like something that genuinely should have an operator, as an ease of use thing
here we approach the question put in the headline issue -- a simpler way.
Yeah, I agree now. It's just that my first wrong attempt looked really simple, and if it hadn't been wrong, it would have been simple enough not to bother; but now that I see what's required, I agree.