When an alternative in a production of the lexical grammar or the numeric string grammar appears to be a multi-code point token, it represents the sequence of code points that would make up such a token.
What does "multi-code point token" mean? An example would help.
I have not found a definition of what a multi-code point token is.
Any token that consists of more than one Unicode code point.
@mathiasbynens
If I understand correctly, multi-code point tokens may be:
_reserved words_ (`this`, `true`, `null`, etc.)
_identifiers_ (`foo`, `bar`, etc., but not `f`, `b`, etc.)
_literals_ (`null`, `true`, `123`, `"text"`, `/ab+c/`, `` `string text ${expression} string text` ``, etc., but not `1`, `"a"`, etc.)
_punctuators_ (`<=`, `>=`, `==`, `...`, `===`, etc., but not `{`, `(`, `)`, `[`, `]`, `.`, etc.)
But I do not understand why this rule applies to an alternative in a production of the lexical grammar or the numeric string grammar.
What is special about an alternative in a production of those two grammars? And why can't this rule be applied to grammars in general?
The sentence in question dates back to the first edition, and could probably be worded better. The phrase "appears to be" is sort of a signal that the sentence is somewhat casual. In particular, the precise meaning of "token" is not important here. What it's talking about is a run of multiple characters rendered in a fixed-width font in a production of either of the designated grammars. The wording problem arises because we don't have a name for such a thing. It's tempting to call it a terminal symbol, but it isn't, because the spec is clear that the terminal symbols of those grammars are individual Unicode code points. In fact, that distinction is basically what this sentence is addressing. Roughly speaking, it's saying that one of those multi-character runs is just a shorthand for a sequence of single-character terminal symbols. (It changed from "character" to "code point" in the 6th edition, for reasons that are not important to this question.)
E.g., it's saying that the lexical production:
NullLiteral ::
    null
is equivalent to:
NullLiteral ::
    n u l l
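A minimal JavaScript sketch of that equivalence. The helper names `expandRun` and `sameTerminals` are hypothetical and appear nowhere in the spec; they only illustrate reading a fixed-width run as shorthand for its individual code points.

```javascript
// Hypothetical helper: expand a fixed-width run from a lexical production
// into the sequence of single-code-point terminal symbols it stands for.
function expandRun(run) {
  // Array.from iterates a string by code point, not by UTF-16 code unit.
  return Array.from(run);
}

// Two right-hand sides (written in grammar notation, where spaces only
// separate symbols) are treated as equivalent here when they expand to
// the same sequence of code-point terminals.
function sameTerminals(rhsA, rhsB) {
  const a = rhsA.split(/\s+/).filter(Boolean).flatMap(expandRun);
  const b = rhsB.split(/\s+/).filter(Boolean).flatMap(expandRun);
  return a.length === b.length && a.every((cp, i) => cp === b[i]);
}

console.log(expandRun("null"));                 // [ 'n', 'u', 'l', 'l' ]
console.log(sameTerminals("null", "n u l l"));  // true
```

Under this reading the spaces in `n u l l` belong to the notation only; they say nothing about spaces in the source text.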
This doesn't apply to the syntactic grammar because the syntactic-level parser allows whitespace and comments between tokens.
E.g., the syntactic production
BreakStatement :
    break ;
means the parser is looking for a "break" token and a ";" token, with possible whitespace and comments between. But the production
BreakStatement :
    b r e a k ;
would mean that the parser is looking for 6 single-character tokens, with whitespace and comments anywhere, which is certainly not what we want.
@jmdyck
I understood the first part of your answer.
Why not name it something like _terminal sequence_ or _sequence of terminal symbols_?
As for the second part, I have a question. The spec says:

> Simple white space and single-line comments are discarded and do not appear in the stream of input elements for the syntactic grammar.

Hence the question: why can't this statement also be applied to the syntactic grammar, given that the input elements for the syntactic grammar contain neither white space nor comments?
> I understood the first part of your answer.
> Why not name it something like _terminal sequence_ or _sequence of terminal symbols_?

Those terms would seem to apply equally well to (e.g.)
    null
as to
    n u l l
whereas the sentence in question is only referring to the first. We could say something like "sequence of terminal symbols without embedded spaces", but I'm not sure it would be any clearer than the current wording.

> In 5.1.2 it is said that white space and comments do not reach the syntactic grammar:
> > Simple white space and single-line comments are discarded and do not appear in the stream of input elements for the syntactic grammar.
> Hence the question: why can't this statement also be applied to the syntactic grammar, given that the input elements for the syntactic grammar contain neither white space nor comments?

Roughly speaking, whitespace and comments are allowed in the gaps between successive terminal symbols of the syntactic grammar. So if the spec were to say that the syntactic production:
BreakStatement :
    break ;
is equivalent to:
BreakStatement :
    b r e a k ;
then that would allow whitespace and comments between successive letters of the keyword `break`, which we certainly don't want to allow.
@jmdyck
I apparently do not understand something. As I wrote above, we cannot encounter any white space or comments in the input stream, since they are discarded during lexical analysis. I still cannot understand how there can be gaps in the syntactic grammar if the lexical grammar has already removed white space and comments.
That is, in the production:
BreakStatement :
    break ;
we will not find any white space or comments in the syntactic grammar, since the lexical analyzer has already removed them.
I am apparently missing something here; please explain.
Say your input source is `b r e a k ; // foo`
The input elements that the lexical parser finds will be:
`b`, WhiteSpace, `r`, WhiteSpace, `e`, WhiteSpace, `a`, WhiteSpace, `k`, WhiteSpace, `;`, WhiteSpace, `// foo`
The WhiteSpace and Comment elements will be discarded, leaving the tokens:
`b`, `r`, `e`, `a`, `k`, `;`
But what the syntactic parser is looking for is:
`break`, `;`
and that's not the same thing.
So even though the syntactic parser doesn't "see" whitespace and comments, their presence can affect the tokens that it does see.
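The effect can be reproduced with a toy lexer. This is only a sketch under strong simplifying assumptions: the rule table `RULES` and the functions `inputElements` and `tokens` are hypothetical, and they cover just white space, comments, ASCII identifier-like names, and the `;` punctuator, not the real ECMAScript lexical grammar.

```javascript
// Hypothetical mini-lexer covering only a tiny subset of ECMAScript.
// Each rule uses the sticky flag (y) so it can only match at lastIndex.
const RULES = [
  ["WhiteSpace", /\s+/y],
  ["Comment", /\/\/[^\n]*|\/\*[\s\S]*?\*\//y],
  ["IdentifierName", /[A-Za-z_$][A-Za-z0-9_$]*/y],
  ["Punctuator", /;/y],
];

// Split the source into input elements (tokens, white space, comments).
function inputElements(source) {
  const elements = [];
  let pos = 0;
  while (pos < source.length) {
    const hit = RULES.map(([type, re]) => {
      re.lastIndex = pos;
      const m = re.exec(source);
      return m && { type, text: m[0] };
    }).find(Boolean);
    if (!hit) throw new SyntaxError(`Unexpected character at ${pos}`);
    elements.push(hit);
    pos += hit.text.length;
  }
  return elements;
}

// Discard white space and comments; what remains is the token stream
// that a syntactic parser would receive.
function tokens(source) {
  return inputElements(source)
    .filter((e) => e.type !== "WhiteSpace" && e.type !== "Comment")
    .map((e) => e.text);
}

console.log(tokens("b r e a k ; // foo")); // [ 'b', 'r', 'e', 'a', 'k', ';' ]
console.log(tokens("break; // foo"));      // [ 'break', ';' ]
```

Neither token stream contains white space or comments, yet the two streams differ because the white space determined where each token ended.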
If we said that
BreakStatement :
    break ;
were equivalent to:
BreakStatement :
    b r e a k ;
then the syntactic parser would be looking for:
`b`, `r`, `e`, `a`, `k`, `;`
and so it would succeed on the input `b r e a k ; // foo`, which is not what we want.
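A small sketch of that consequence, assuming token streams are plain arrays of strings. The names `BREAK_STATEMENT`, `BREAK_STATEMENT_EXPANDED`, and `matches` are hypothetical, and only the simplest form of the statement is modelled (no label).

```javascript
// Terminal symbols of the syntactic production as it is meant: two tokens.
const BREAK_STATEMENT = ["break", ";"];
// Hypothetical (unwanted) expansion into single-character terminals.
const BREAK_STATEMENT_EXPANDED = ["b", "r", "e", "a", "k", ";"];

// A token stream matches when it equals the production's terminals exactly.
function matches(production, tokenStream) {
  return (
    production.length === tokenStream.length &&
    production.every((terminal, i) => terminal === tokenStream[i])
  );
}

// Token streams as a lexer would deliver them, with white space and
// comments already discarded.
const fromBreak = ["break", ";"];                   // source: break;
const fromSpaced = ["b", "r", "e", "a", "k", ";"];  // source: b r e a k ; // foo

console.log(matches(BREAK_STATEMENT, fromBreak));            // true
console.log(matches(BREAK_STATEMENT, fromSpaced));           // false
console.log(matches(BREAK_STATEMENT_EXPANDED, fromSpaced));  // true
console.log(matches(BREAK_STATEMENT_EXPANDED, fromBreak));   // false
```

With the expanded production, the spaced-out source would be accepted and the ordinary `break;` would be rejected, which is the opposite of the intended behavior.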
@jmdyck
Hmm, if I have captured the essence correctly, you are trying to say that the input stream of elements affects the construction of tokens after lexical analysis.
For example: `t h i/*text*/s;`
The lexical analyzer will have to break it into the following tokens:
* IdentifierName (`t`)
* WhiteSpace
* IdentifierName (`h`)
* WhiteSpace
* IdentifierName (`i`)
* MultiLineComment (`/*text*/`)
* IdentifierName (`s`)
* Punctuator (`;`)

But discarding some parts:
* IdentifierName (`t`)
* IdentifierName (`h`)
* IdentifierName (`i`)
* IdentifierName (`s`)
* Punctuator (`;`)

As a result, the syntactic analyzer receives a sequence of IdentifierName tokens, which does not fit the parser's grammar rule:
PrimaryExpression : this
But if the input stream is as follows: `this;`
then the lexical analyzer will define it as:
* ReservedWord (`this`)
* Punctuator (`;`)

Conclusion: It all depends on the tokens received from the lexical analyzer.
And again I come back to the question I asked earlier:

> Hence the question: why can't this statement also be applied to the syntactic grammar, given that the input elements for the syntactic grammar contain neither white space nor comments?
> @jmdyck
> Hmm, if I have captured the essence correctly, you are trying to say that the input stream of elements affects the construction of tokens after lexical analysis.

No, that's not what I'm trying to say. For instance, I'd be more inclined to say that tokens are the result of lexical analysis, not "constructed after" it. I'm not sure exactly what you're trying to convey with that sentence.

> For example: `t h i/*text*/s;`
> The lexical analyzer will have to break it into the following tokens:
> * IdentifierName (`t`)
> * WhiteSpace
> * IdentifierName (`h`)
> * WhiteSpace
> * IdentifierName (`i`)
> * MultiLineComment (`/*text*/`)
> * IdentifierName (`s`)
> * Punctuator (`;`)

Yes, except that those are "input elements", only some of which are tokens.

> But discarding some parts:
> * IdentifierName (`t`)
> * IdentifierName (`h`)
> * IdentifierName (`i`)
> * IdentifierName (`s`)
> * Punctuator (`;`)

Right. Those are the tokens.

> As a result, the syntactic analyzer receives a sequence of IdentifierName tokens, which does not fit the parser's grammar rule:
> PrimaryExpression : this

Right.

> But if the input stream is as follows: `this;`
> then the lexical analyzer will define it as:
> * ReservedWord (`this`)
> * Punctuator (`;`)

Yes.

> Conclusion: It all depends on the tokens received from the lexical analyzer.

Well, I'm not sure what you mean by "It" there, but probably yes.

> And again I come back to the question I asked earlier:
> > Hence the question: why can't this statement also be applied to the syntactic grammar, given that the input elements for the syntactic grammar contain neither white space nor comments?

It sounds like you now have enough understanding to answer that question. If you're unsure, please review my previous 3 answers.
@jmdyck Damn, it looks like I'm going in circles. I did reread your answers above, but that did not bring clarity.
As you can see from my answer above, I showed an understanding of what happens after processing by the lexical analyzer. But I do not understand how this relates to the syntactic grammar.

> This doesn't apply to the syntactic grammar because the syntactic-level parser allows whitespace and comments between tokens.
> E.g., the syntactic production
> BreakStatement :
>     break ;
> means the parser is looking for a "break" token and a ";" token, with possible whitespace and comments between. But the production
> BreakStatement :
>     b r e a k ;
> would mean that the parser is looking for 6 single-character tokens, with whitespace and comments anywhere, which is certainly not what we want.

Yes, we found out that the lexical analyzer will issue different tokens for `b r e a k; // foo` and `break;` because of comments and spaces. But handling spaces and comments is the concern of the lexical analyzer, not the syntactic analyzer. When the syntactic analyzer reads tokens, it will not find any spaces or comments anymore. Whether the syntactic analyzer will find the `break` token depends on the input stream and the lexical analyzer; that is, by the time the syntactic analyzer comes in, it can no longer affect the tokens.
I then do not quite understand the restriction here:

> When an alternative in a production of the lexical grammar or the numeric string grammar appears to be a multi-code point token, it represents the sequence of code points that would make up such a token.

Perhaps there is something that I did not catch, some subtle point?
> we found out that the lexical analyzer will issue different tokens for `b r e a k; // foo` and `break;` because of comments and spaces.

Correct.

> But handling spaces and comments is the concern of the lexical analyzer, not the syntactic analyzer.

Correct.

> When the syntactic analyzer reads tokens, it will not find any spaces or comments anymore.

Correct.

> Whether the syntactic analyzer will find the `break` token depends on the input stream and the lexical analyzer; that is, by the time the syntactic analyzer comes in, it can no longer affect the tokens.

Correct.

> I then do not quite understand the restriction here:
> > When an alternative in a production of the lexical grammar or the numeric string grammar appears to be a multi-code point token, it represents the sequence of code points that would make up such a token.

Well, I'll give it one more try.
You understand that:
* IdentifierName (`this`)
* Punctuator (`;`)

and:
* IdentifierName (`t`)
* IdentifierName (`h`)
* IdentifierName (`i`)
* IdentifierName (`s`)
* Punctuator (`;`)

are different token streams, right? And you understand that the parser is looking for the first, and is not looking for the second, right? If it sees the second, that's a syntax error.
So if we change the spec to say that the parser is looking for the second and not for the first, that just breaks everything. Suddenly pretty much every currently-valid source text would become invalid (and many vice versa).
I don't think I can make it any plainer. If that doesn't make sense to you, you either need to do a better job of explaining exactly what part you don't understand, or maybe just give up for now.
@jmdyck
Well, let me go through the parts of this sentence.
Why is the word _alternative_ at the very beginning? Does this word have any special meaning here? Which example would be an alternative, and which would not?
In addition to the lexical grammar, the numeric string grammar is mentioned. Why? For example, why can't the RegExp grammar be specified?
Is _multi-code point token_ what we talked about, i.e. IdentifierName (`this`)? Right?
> Why is the word _alternative_ at the very beginning? Does this word have any special meaning here? Which example would be an alternative, and which would not?

Presumably, "alternative" means the same here as it does in the rest of the "Grammar Notation" clause: one of possibly several right-hand-sides of a production.
E.g. the lexical production:
DivPunctuator ::
    /
    /=
has two alternatives.
(Note that a "one of" production like:
Keyword :: one of
    await break case catch
is just an abbreviation for:
Keyword ::
    await
    break
    case
    catch
so it has multiple alternatives too.)
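The abbreviation can be spelled out mechanically. In this sketch, `expandOneOf` is a hypothetical helper, and the input is just the four keywords quoted above rather than the full Keyword list.

```javascript
// Hypothetical helper: expand the body of a "one of" production into
// the list of alternatives it abbreviates.
function expandOneOf(body) {
  return body.trim().split(/\s+/);
}

const alternatives = expandOneOf("await break case catch");
console.log(alternatives); // [ 'await', 'break', 'case', 'catch' ]

// Because Keyword is a lexical production, each alternative is in turn
// shorthand for a sequence of single-code-point terminals.
const asTerminals = alternatives.map((alt) => Array.from(alt));
console.log(asTerminals[1]); // [ 'b', 'r', 'e', 'a', 'k' ]
```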
> In addition to the lexical grammar, the numeric string grammar is mentioned. Why? For example, why can't the RegExp grammar be specified?

The RegExp grammar should be included. (The sentence dates back to the first edition, which didn't have RegExps, and wasn't updated when RegExps were added.) In general, the sentence applies to any grammar whose terminal symbols are individual Unicode code points, i.e. every grammar except the syntactic grammar.

> Is _multi-code point token_ what we talked about, i.e. IdentifierName (`this`)? Right?

The latter is talking about a token in the normal, precise sense. However, the phrase "appears to be a multi-code point token" doesn't care about that sense. Like I said a week ago, what it's talking about is a run of multiple characters rendered in a fixed-width font in a production of any of the designated grammars. (Whether that run of characters happens to conform to the syntax for an ECMAScript token is irrelevant.)
@jmdyck
As for _alternative_: it was difficult to guess exactly what was meant by this word in this context. Thanks for the detailed explanation.

> The latter is talking about a token in the normal, precise sense.

I assume that we are talking about IdentifierName (`this`).

> However, the phrase "appears to be a multi-code point token" doesn't care about that sense.

The thing is that no matter what token is obtained?

> Like I said a week ago, what it's talking about is a run of multiple characters rendered in a fixed-width font in a production of any of the designated grammars. (Whether that run of characters happens to conform to the syntax for an ECMAScript token is irrelevant.)

I understood everything except the sentence in brackets. "syntax for an ECMAScript token": is this talking about the syntax grammar for ECMAScript tokens?
By the way, I still do not understand this in your very first answer:

> E.g., it's saying that the lexical production:
> NullLiteral ::
>     null
> is equivalent to:
> NullLiteral ::
>     n u l l

Why are they equivalent? What are they in your example?

> Hence the question: why can't this statement also be applied to the syntactic grammar, given that the input elements for the syntactic grammar contain neither white space nor comments?

The syntactic grammar is not included in this sentence because it describes the correct construction of ECMAScript tokens, but not their formation. Right?
> > However, the phrase "appears to be a multi-code point token" doesn't care about that sense.
> The thing is that no matter what token is obtained?

I don't understand that question.

> > Like I said a week ago, what it's talking about is a run of multiple characters rendered in a fixed-width font in a production of any of the designated grammars. (Whether that run of characters happens to conform to the syntax for an ECMAScript token is irrelevant.)
> I understood everything except the sentence in brackets. "syntax for an ECMAScript token": is this talking about the syntax grammar for ECMAScript tokens?

Yes, the precise syntax for tokens defined by ECMAScript's lexical grammar is what's irrelevant in this particular context.

> By the way, I still do not understand this in your very first answer:

You should have said so back then.

> > E.g., it's saying that the lexical production:
> > NullLiteral ::
> >     null
> > is equivalent to:
> > NullLiteral ::
> >     n u l l
> Why are they equivalent?

They're equivalent because the sentence in question says they are. You asked what it means and I gave you an example of what it means.

> What are they in your example?

I'm not sure what you're asking, but I'll try this:
- `NullLiteral :: null` is a production of the lexical grammar.
- The `null` there is a run of 4 characters rendered in fixed-width font.
- So it is an alternative in a production of the lexical grammar or the numeric string grammar that "appears to be a multi-code point token".
- The code points that make it up are `n u l l`.
- So, according to the sentence in question, its occurrence in that production represents `n u l l`.

> The syntactic grammar is not included in this sentence because it describes the correct construction of ECMAScript tokens, but not their formation. Right?

You're making a distinction between the words "construction" and "formation", but for me they're basically synonyms. The syntactic grammar is not included in the sentence in question because its terminal symbols are not individual Unicode code points.
@jmdyck
> > > However, the phrase "appears to be a multi-code point token" doesn't care about that sense.
> > The thing is that no matter what token is obtained?
> I don't understand that question.

Then I did not understand what your first phrase means (in the quoted text).

> They're equivalent because the sentence in question says they are.

Which sentence in question? I don't understand.

> What does "multi-code point token" mean? An example would help.

This one?

> - `NullLiteral :: null` is a production of the lexical grammar.
> - The `null` there is a run of 4 characters rendered in fixed-width font.
> - So it is an alternative in a production of the lexical grammar or the numeric string grammar that "appears to be a multi-code point token".
> - The code points that make it up are `n u l l`.
> - So, according to the sentence in question, its occurrence in that production represents `n u l l`.

No, I still understand that `NullLiteral :: null` is a production of the lexical grammar.
But why would `NullLiteral :: null` be equivalent to `NullLiteral :: n u l l`? Either you are explaining something strange to me or I don't understand something. We found out before that `BreakStatement : break ;` and `BreakStatement : b r e a k ;` are completely different things.

> The syntactic grammar is not included in the sentence in question because its terminal symbols are not individual Unicode code points.

Yes, this is exactly what I wanted to say.
> > They're equivalent because the sentence in question says they are.
> Which sentence in question? I don't understand.

Whenever I've said "the sentence in question" in this discussion, I've meant the sentence you quoted in your first post ("When an alternative in a production [etc]"). I'll abbreviate it as TSIQ from now on.

> > - `NullLiteral :: null` is a production of the lexical grammar.
> > - The `null` there is a run of 4 characters rendered in fixed-width font.
> > - So it is an alternative in a production of the lexical grammar or the numeric string grammar that "appears to be a multi-code point token".
> > - The code points that make it up are `n u l l`.
> > - So, according to TSIQ, its occurrence in that production represents `n u l l`.
> No, I still understand that `NullLiteral :: null` is a production of the lexical grammar.

Good.

> But why would `NullLiteral :: null` be equivalent to `NullLiteral :: n u l l`?

Well, I could say "because TSIQ says it is!", but clearly that doesn't work for you. So I'll try this: I can't think of anything else reasonable that it could mean. (i.e., I'd go so far as to say it goes without saying, and TSIQ could probably be deleted without incident.) If you're having trouble understanding that those 2 productions are equivalent, then presumably they mean something different to you. If so, say what that is.

> We found out before that `BreakStatement : break ;` and `BreakStatement : b r e a k ;` are completely different things.

Yup, because that production is in the syntactic grammar.
@jmdyck Well, I can paraphrase; this is not a problem.

> > We found out before that `BreakStatement : break ;` and `BreakStatement : b r e a k ;` are completely different things.
> Yup, because that production is in the syntactic grammar.

Well, I heard what I wanted to hear, although I could have written it in advance (what mattered was your answer).
Since BreakStatement is a syntactic production, `break ;` and `b r e a k ;` are incompatible for it (because they produce different tokens).
But as you stated and confirmed, according to the sentence I am asking about here, the productions `null` and `n u l l` are equivalent.

> NullLiteral ::
>     null
> is equivalent to:
> NullLiteral ::
>     n u l l

Then why do you make a distinction between them using the gaps in the second production (while saying that they are the same)? What do you show with these gaps? This is understandable in the syntactic grammar, but confusing in the lexical grammar.
> > NullLiteral ::
> >     null
> > is equivalent to:
> > NullLiteral ::
> >     n u l l
> Then why do you make a distinction between them using the gaps in the second production (while saying that they are the same)? What do you show with these gaps? This is understandable in the syntactic grammar, but confusing in the lexical grammar.

The gaps in the second production make it obvious that each letter is a separate terminal symbol. With the first production, the hypothetical reader might mistakenly think that `null` is a terminal symbol, or wonder what it means, so TSIQ is there to reassure that reader that it's just a convenient shorthand for the second production.
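To make the role of the gaps concrete, here is a sketch of matching source text against a lexical alternative written in grammar notation. `matchesLexicalAlternative` is a hypothetical function for illustration only, and it assumes every symbol in the alternative is a terminal (no nonterminals).

```javascript
// Hypothetical matcher for a lexical alternative written in grammar
// notation, e.g. "n u l l". Spaces separate terminal symbols in the
// notation but are not part of what gets matched in the source.
function matchesLexicalAlternative(notation, source) {
  const terminals = notation
    .split(/\s+/)
    .filter(Boolean)
    .flatMap((run) => Array.from(run)); // one terminal per code point
  const codePoints = Array.from(source);
  return (
    terminals.length === codePoints.length &&
    terminals.every((t, i) => t === codePoints[i])
  );
}

console.log(matchesLexicalAlternative("null", "null"));       // true
console.log(matchesLexicalAlternative("n u l l", "null"));    // true
console.log(matchesLexicalAlternative("n u l l", "n u l l")); // false
```

Both spellings of the alternative accept exactly the source text `null`, and neither accepts a source with spaces between the letters, since a space in the source is itself a code point that no terminal matches.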
@jmdyck Now everything is clear. Thank you very much for your efforts.
P.S. I wish there were more people like @jmdyck here.
Closing as answered.