ecma262 - What does "multi-code point token" mean in 5.1.5 Grammar Notation?

Any token that consists of more than one Unicode code point.

mathiasbynens on 15 Apr 2019

@mathiasbynens
If I understand correctly, multi-code point tokens may be:
_reserved words_ (this, true, null, etc.)
_identifiers_ (foo, bar, etc., but not f, b, etc.)
_literals_ (null, true, 123, "text", /ab+c/, `string text ${expression} string text`, etc., but not 1, "a", etc.)
_punctuators_ (<=, >=, ==, ..., ===, etc., but not {, (, ), [, ], ., etc.)

But I do not understand why this rule applies only to an alternative in a production of the lexical grammar or the numeric string grammar.
What is special about the lexical grammar and the numeric string grammar? And why can't this rule be applied to every grammar?

dSalieri on 16 Apr 2019

The sentence in question dates back to the first edition, and could probably be worded better. The phrase "appears to be" is sort of a signal that the sentence is somewhat casual. In particular, the precise meaning of "token" is not important here. What it's talking about is a run of multiple characters rendered in a fixed-width font in a production of either of the designated grammars. The wording problem arises because we don't have a name for such a thing. It's tempting to call it a terminal symbol, but it isn't, because the spec is clear that the terminal symbols of those grammars are individual Unicode code points. In fact, that distinction is basically what this sentence is addressing. Roughly speaking, it's saying that one of those multi-character runs is just a shorthand for a sequence of single-character terminal symbols. (It changed from "character" to "code point" in the 6th edition, for reasons that are not important to this question.)

E.g., it's saying that the lexical production:

NullLiteral :: null

is equivalent to:

NullLiteral :: n u l l
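
To make the shorthand concrete, here is a minimal JavaScript sketch. `expandRun` is a hypothetical helper invented for this illustration, not something the spec defines; it only splits a fixed-width run into the individual code points that the lexical grammar treats as terminal symbols.

```javascript
// Hypothetical helper: expand a fixed-width run from a lexical production
// into the individual code points it stands for.
function expandRun(run) {
  // Array.from iterates a string by code point, not by UTF-16 code unit.
  return Array.from(run);
}

const shorthand = expandRun("null");
console.log(shorthand.join(" ")); // n u l l
```

Under this reading, the run `null` in the production is just a compact way of writing four separate terminal symbols.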

This doesn't apply to the syntactic grammar because the syntactic-level parser allows whitespace and comments between tokens.

E.g., the syntactic production

BreakStatement : break ;

means the parser is looking for a "break" token and a ";" token, with possible whitespace and comments between. But the production

BreakStatement : b r e a k ;

would mean that the parser is looking for 6 single-character tokens, with whitespace and comments anywhere, which is certainly not what we want.

jmdyck on 16 Apr 2019

@jmdyck
As for the first part of your answer, I understand it.
Why not call it something like a _terminal sequence_ or a _sequence of terminal symbols_?

As for the second part, I have a question.

In 5.1.2 it is said that white space and comments do not reach the syntactic grammar:

Simple white space and single-line comments are discarded and do not appear in the stream of input elements for the syntactic grammar.

Hence the question: why can't this statement be applied to the syntactic grammar, given that neither white space nor comments appear among the input elements for the syntactic grammar?

dSalieri on 20 Apr 2019

As for the first part of your answer, I understand it.
Why not call it something like a _terminal sequence_ or a _sequence of terminal symbols_?

Those terms would seem to apply equally well to (e.g.)
null
as to
n u l l
whereas the sentence in question is only referring to the first. We could say something like "sequence of terminal symbols without embedded spaces", but I'm not sure it would be any clearer than the current wording.

In 5.1.2 it is said that white space and comments do not reach the syntactic grammar:

Simple white space and single-line comments are discarded and do not appear in the stream of input elements for the syntactic grammar.

Hence the question: why can't this statement be applied to the syntactic grammar, given that neither white space nor comments appear among the input elements for the syntactic grammar?

Roughly speaking, whitespace and comments are allowed in the gaps between successive terminal symbols of the syntactic grammar. So if the spec were to say that the syntactic production:

BreakStatement : break ;

is equivalent to:

BreakStatement : b r e a k ;

then that would allow whitespace and comments between successive letters of the keyword break, which we certainly don't want to allow.

jmdyck on 20 Apr 2019

@jmdyck
I am apparently misunderstanding something. As I wrote above, the syntactic parser cannot encounter any white space or comments in its input stream, since they are discarded during lexical analysis. I still do not understand how there can be gaps in the syntactic grammar if the lexical grammar has already removed white space and comments.
That is, in the production:

BreakStatement : break ;

we will not find any white space or comments at the syntactic level, since the lexical analyzer has already removed them.

I am apparently missing something here; please explain.

dSalieri on 21 Apr 2019

Say your input source is b r e a k ; // foo
The input elements that the lexical parser finds will be:

IdentifierName (for b)
WhiteSpace
IdentifierName (for r)
WhiteSpace
IdentifierName (for e)
WhiteSpace
IdentifierName (for a)
WhiteSpace
IdentifierName (for k)
WhiteSpace
Punctuator (for ;)
WhiteSpace
SingleLineComment (for // foo)

The WhiteSpace and Comment elements will be discarded, leaving the tokens:

IdentifierName (for b)
IdentifierName (for r)
IdentifierName (for e)
IdentifierName (for a)
IdentifierName (for k)
Punctuator (for ;)

But what the syntactic parser is looking for is:

IdentifierName (for break)
Punctuator (for ;)

and that's not the same thing.

So even though the syntactic parser doesn't "see" whitespace and comments, their presence can affect the tokens that it does see.

If we said that

BreakStatement : break ;

were equivalent to:

BreakStatement : b r e a k ;

then the syntactic parser would be looking for:

IdentifierName (for b)
IdentifierName (for r)
IdentifierName (for e)
IdentifierName (for a)
IdentifierName (for k)
Punctuator (for ;)

and so it would succeed on the input b r e a k ; // foo, which is not what we want.
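
The two token streams above can be reproduced with a deliberately simplified lexer. This is only a sketch: the patterns below cover just the input-element kinds that appear in this thread, and the names `ELEMENT_PATTERNS`, `inputElements`, and `tokens` are made up for the example. The real lexical grammar is much richer.

```javascript
// Simplified lexer: handles only the input-element kinds used in this thread.
const ELEMENT_PATTERNS = [
  ["WhiteSpace", /^[ \t]+/],
  ["SingleLineComment", /^\/\/[^\n]*/],
  ["MultiLineComment", /^\/\*[\s\S]*?\*\//],
  ["IdentifierName", /^[A-Za-z_$][A-Za-z0-9_$]*/],
  ["Punctuator", /^;/],
];

// Split the source text into input elements, longest run per element.
function inputElements(source) {
  const elements = [];
  let rest = source;
  while (rest.length > 0) {
    const hit = ELEMENT_PATTERNS
      .map(([kind, re]) => [kind, re.exec(rest)])
      .find(([, match]) => match !== null);
    if (!hit) throw new SyntaxError(`Unexpected input: ${rest}`);
    const [kind, match] = hit;
    elements.push({ kind, text: match[0] });
    rest = rest.slice(match[0].length);
  }
  return elements;
}

// Discard white space and comments; what remains is the token stream.
function tokens(source) {
  const discarded = new Set(["WhiteSpace", "SingleLineComment", "MultiLineComment"]);
  return inputElements(source)
    .filter((element) => !discarded.has(element.kind))
    .map((element) => `${element.kind}(${element.text})`);
}

console.log(tokens("break ; // foo"));     // 2 tokens
console.log(tokens("b r e a k ; // foo")); // 6 tokens
```

The white space never reaches the token stream, yet it decides whether the lexer produces one IdentifierName or five.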

jmdyck on 21 Apr 2019

@jmdyck
Hmm, if I have grasped the essence correctly, you are saying that the input stream of elements affects the construction of tokens after lexical analysis.

For example: t h i/*text*/s;
The lexical analyzer will have to break it into the following tokens:

IdentifierName (t)
WhiteSpace
IdentifierName (h)
WhiteSpace
IdentifierName (i)
MultiLineComment(/*text*/)
IdentifierName (s)
Punctuator(;)

After discarding some parts:

IdentifierName (t)
IdentifierName (h)
IdentifierName (i)
IdentifierName (s)
Punctuator(;)

As a result, the syntactic analyzer receives a sequence of IdentifierName tokens, which does not fit the parser's grammar for the rule:

PrimaryExpression : this

But if the input stream is as follows: this;
then the lexical analyzer will classify it as:

ReservedWord (this)
Punctuator(;)

Conclusion: It all depends on the tokens received from the lexical analyzer.

And again I come back to the question I asked earlier:

Hence the question: why can't this statement be applied to the syntactic grammar, given that neither white space nor comments appear among the input elements for the syntactic grammar?

dSalieri on 22 Apr 2019

@jmdyck
Hmm, if I have grasped the essence correctly, you are saying that the input stream of elements affects the construction of tokens after lexical analysis.

No, that's not what I'm trying to say. For instance, I'd be more inclined to say that tokens are the result of lexical analysis, not "constructed after" it. I'm not sure exactly what you're trying to convey with that sentence.

For example: t h i/*text*/s;
The lexical analyzer will have to break it into the following tokens:
* IdentifierName (`t`)
* WhiteSpace
* IdentifierName (`h`)
* WhiteSpace
* IdentifierName (`i`)
* MultiLineComment(`/*text*/`)
* IdentifierName (`s`)
* Punctuator(`;`)

Yes, except that those are "input elements", only some of which are tokens.

After discarding some parts:

* IdentifierName (`t`)
* IdentifierName (`h`) 
* IdentifierName (`i`)
* IdentifierName (`s`)
* Punctuator(`;`)

Right. Those are the tokens.

As a result, the syntactic analyzer receives a sequence of IdentifierName tokens, which does not fit the parser's grammar for the rule:

PrimaryExpression : this

Right.

But if the input stream is as follows: this;
then the lexical analyzer will classify it as:
* ReservedWord (`this`)
* Punctuator(`;`)

Yes.

Conclusion: It all depends on the tokens received from the lexical analyzer.

Well, I'm not sure what you mean by "It" there, but probably yes.

And again I come back to the question I asked earlier:

Hence the question: why can't this statement be applied to the syntactic grammar, given that neither white space nor comments appear among the input elements for the syntactic grammar?

It sounds like you now have enough understanding to answer that question. If you're unsure, please review my previous 3 answers.

jmdyck on 22 Apr 2019

@jmdyck Damn, it looks like I'm going in circles. I did reread your answers above, but that did not bring clarity.

As you can see from my answer above, I understand what happens after processing by the lexical analyzer. But I do not understand how this relates to the syntactic grammar.

This doesn't apply to the syntactic grammar because the syntactic-level parser allows whitespace and comments between tokens.

E.g., the syntactic production

BreakStatement : break ;

means the parser is looking for a "break" token and a ";" token, with possible whitespace and comments between. But the production

BreakStatement : b r e a k ;

would mean that the parser is looking for 6 single-character tokens, with whitespace and comments anywhere, which is certainly not what we want.

Yes, we established that the lexical analyzer will produce different tokens for b r e a k; // foo and break; because of the comments and spaces. But handling spaces and comments is the concern of the lexical analyzer, not the syntactic analyzer. When the syntactic analyzer reads tokens, it will no longer find any spaces or comments. Whether the syntactic analyzer finds the break token depends on the input stream and the lexical analyzer; that is, by the time the syntactic analyzer runs, it can no longer affect the tokens.

I then do not quite understand the restriction here:

When an alternative in a production of the lexical grammar or the numeric string grammar appears to be a multi-code point token, it represents the sequence of code points that would make up such a token.

Perhaps there is something that I did not catch, some subtle point?

dSalieri on 22 Apr 2019

we established that the lexical analyzer will produce different tokens for b r e a k; // foo and break; because of the comments and spaces.

Correct.

But handling spaces and comments is the concern of the lexical analyzer, not the syntactic analyzer.

Correct.

When the syntactic analyzer reads tokens, it will no longer find any spaces or comments.

Correct.

Whether the syntactic analyzer finds the break token depends on the input stream and the lexical analyzer; that is, by the time the syntactic analyzer runs, it can no longer affect the tokens.

Correct.

I then do not quite understand the restriction here:

When an alternative in a production of the lexical grammar or the numeric string grammar appears to be a multi-code point token, it represents the sequence of code points that would make up such a token.

Well, I'll give it one more try.

You understand that:

* IdentifierName (`this`)
* Punctuator(`;`)

and:

* IdentifierName (`t`)
* IdentifierName (`h`) 
* IdentifierName (`i`)
* IdentifierName (`s`)
* Punctuator(`;`)

are different token streams, right? And you understand that the parser is looking for the first, and is not looking for the second, right? If it sees the second, that's a syntax error.

So if we change the spec to say that the parser is looking for the second and not for the first, that just breaks everything. Suddenly pretty much every currently-valid source text would become invalid (and many vice versa).
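
As a rough illustration of why the two streams are not interchangeable, the sketch below compares a token stream against a fixed list of expected terminal symbols. `matchesTerminals` and the flat lists of expected terminals are simplifications assumed for this example; a real parser works from the full syntactic grammar.

```javascript
// Hypothetical check: does a token stream match a list of expected
// terminal symbols, one token per terminal?
function matchesTerminals(terminals, tokenTexts) {
  return terminals.length === tokenTexts.length &&
    terminals.every((terminal, i) => terminal === tokenTexts[i]);
}

const asWritten = ["this", ";"];              // what the parser looks for today
const spelledOut = ["t", "h", "i", "s", ";"]; // what it would look for after the change

const fromThis = ["this", ";"];               // tokens for the source: this;
const fromSpaced = ["t", "h", "i", "s", ";"]; // tokens for the source: t h i/*text*/s;

console.log(matchesTerminals(asWritten, fromThis));    // true
console.log(matchesTerminals(asWritten, fromSpaced));  // false
console.log(matchesTerminals(spelledOut, fromThis));   // false
console.log(matchesTerminals(spelledOut, fromSpaced)); // true
```

With the spelled-out terminals, the ordinary source `this;` stops matching and the spaced-out source starts matching, which is the breakage described above.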

I don't think I can make it any plainer. If that doesn't make sense to you, you either need to do a better job of explaining exactly what part you don't understand, or maybe just give up for now.

jmdyck on 22 Apr 2019

@jmdyck
Well, let me go through the parts of this sentence.

Why does the word "alternative" appear at the very beginning? Does it have any particular meaning here? Which example would be an alternative, and which would not?

In addition to the lexical grammar, the numeric string grammar is mentioned. Why? For example, why isn't the RegExp grammar mentioned?

Is a multi-code point token what we talked about, i.e. IdentifierName (this)?

dSalieri on 23 Apr 2019

Why does the word "alternative" appear at the very beginning? Does it have any particular meaning here? Which example would be an alternative, and which would not?

Presumably, "alternative" means the same here as it does in the rest of the "Grammar Notation" clause: one of possibly several right-hand-sides of a production.

E.g. the lexical production:

DivPunctuator ::
    /
    /=

has two alternatives.

(Note that a "one of" production like:

Keyword :: one of
    await break case catch

is just an abbreviation for:

Keyword ::
    await
    break
    case
    catch

so it has multiple alternatives too.)
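
The "one of" abbreviation can be sketched as a simple expansion. The object shape used here (`lhs`, `oneOf`, `alternatives`) and the function `expandOneOf` are hypothetical, chosen only to illustrate the idea; the spec does not define any such data structure.

```javascript
// Hypothetical representation of a production: a left-hand side plus
// either a list of alternatives or a "one of" list of terminal runs.
function expandOneOf(production) {
  if (!production.oneOf) return production;
  return {
    lhs: production.lhs,
    // Each listed item becomes its own single-symbol alternative.
    alternatives: production.oneOf.map((item) => [item]),
  };
}

const keyword = { lhs: "Keyword", oneOf: ["await", "break", "case", "catch"] };
const expanded = expandOneOf(keyword);

console.log(expanded.alternatives.length); // 4
console.log(expanded.alternatives[1]);     // [ 'break' ]
```

Each resulting alternative is then subject to the sentence in question like any other alternative of a lexical production.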

In addition to the lexical grammar, the numeric string grammar is mentioned. Why? For example, why isn't the RegExp grammar mentioned?

The RegExp grammar should be included. (The sentence dates back to the first edition, which didn't have RegExps, and wasn't updated when RegExps were added.) In general, the sentence applies to any grammar whose terminal symbols are individual Unicode code points, i.e. every grammar except the syntactic grammar.

Is a multi-code point token what we talked about, i.e. IdentifierName (this)?

The latter is talking about a token in the normal, precise sense. However, the phrase "appears to be a multi-code point token" doesn't care about that sense. Like I said a week ago, what it's talking about is a run of multiple characters rendered in a fixed-width font in a production of any of the designated grammars. (Whether that run of characters happens to conform to the syntax for an ECMAScript token is irrelevant.)

jmdyck on 23 Apr 2019

@jmdyck
As for "alternative": it was difficult to guess exactly what the word meant in this context. Thanks for the detailed explanation.

The latter is talking about a token in the normal, precise sense.

I assume that we are talking about IdentifierName (this).

However, the phrase "appears to be a multi-code point token" doesn't care about that sense.

The thing is that no matter what token is obtained?

Like I said a week ago, what it's talking about is a run of multiple characters rendered in a fixed-width font in a production of any of the designated grammars. (Whether that run of characters happens to conform to the syntax for an ECMAScript token is irrelevant.)

I understood everything except the sentence in parentheses. "Syntax for an ECMAScript token": does that refer to the syntactic grammar for ECMAScript tokens?

By the way, I still do not understand this part of your very first answer:

E.g., it's saying that the lexical production:

NullLiteral :: null

is equivalent to:

NullLiteral :: n u l l

Why are they equivalent? What are they in your example?

Hence the question: why can't this statement be applied to the syntactic grammar, given that neither white space nor comments appear among the input elements for the syntactic grammar?

The syntactic grammar is not included in this sentence because it describes the correct construction of ECMAScript tokens, but not their formation. Right?

dSalieri on 24 Apr 2019

However, the phrase "appears to be a multi-code point token" doesn't care about that sense.

The thing is that no matter what token is obtained?

I don't understand that question.

Like I said a week ago, what it's talking about is a run of multiple characters rendered in a fixed-width font in a production of any of the designated grammars. (Whether that run of characters happens to conform to the syntax for an ECMAScript token is irrelevant.)

I understood everything except the sentence in parentheses. "Syntax for an ECMAScript token": does that refer to the syntactic grammar for ECMAScript tokens?

Yes, the precise syntax for tokens defined by ECMAScript's lexical grammar is what's irrelevant in this particular context.

By the way, I still do not understand this part of your very first answer:

You should have said so back then.

E.g., it's saying that the lexical production:

NullLiteral :: null

is equivalent to:

NullLiteral :: n u l l

Why are they equivalent?

They're equivalent because the sentence in question says they are. You asked what it means and I gave you an example of what it means.

What are they in your example?

I'm not sure what you're asking, but I'll try this:

NullLiteral :: null is a production of the lexical grammar.
The null there is a run of 4 characters rendered in fixed-width font.
So it is an alternative in a production of the lexical grammar or the numeric string grammar that "appears to be a multi-code point token".
The code points that make it up are n u l l.
So, according to the sentence in question, its occurrence in that production represents n u l l.

The syntactic grammar is not included in this sentence because it describes the correct construction of ECMAScript tokens, but not their formation. Right?

You're making a distinction between the words "construction" and "formation", but for me they're basically synonyms. The syntactic grammar is not included in the sentence in question because its terminal symbols are not individual Unicode code points.

jmdyck on 24 Apr 2019

@jmdyck

However, the phrase "appears to be a multi-code point token" doesn't care about that sense.

The thing is that no matter what token is obtained?

I don't understand that question.

Then I did not understand what your first phrase means (in quoted text).

They're equivalent because the sentence in question says they are.

Which sentence in question? I don't understand.

What does "multi-code point token" mean? An example would help.

Is this one?

NullLiteral :: null is a production of the lexical grammar.

The null there is a run of 4 characters rendered in fixed-width font.

So it is an alternative in a production of the lexical grammar or the numeric string grammar that "appears to be a multi-code point token".

The code points that make it up are n u l l.

So, according to the sentence in question, its occurrence in that production represents n u l l.

No, I do understand that NullLiteral :: null is a production of the lexical grammar.
But why would NullLiteral :: null be equivalent to NullLiteral :: n u l l? Either you are explaining something strange to me or I don't understand something. We established earlier that BreakStatement : break ; and BreakStatement : b r e a k ; are completely different things.

The syntactic grammar is not included in the sentence in question because its terminal symbols are not individual Unicode code points.

Yes, this is exactly what I wanted to say.

dSalieri on 25 Apr 2019

They're equivalent because the sentence in question says they are.

Which sentence in question? I don't understand.

Whenever I've said "the sentence in question" in this discussion, I've meant the sentence you quoted in your first post ("When an alternative in a production [etc]"). I'll abbreviate it as TSIQ from now on.

NullLiteral :: null is a production of the lexical grammar.

The null there is a run of 4 characters rendered in fixed-width font.

So it is an alternative in a production of the lexical grammar or the numeric string grammar that "appears to be a multi-code point token".

The code points that make it up are n u l l.

So, according to TSIQ, its occurrence in that production represents n u l l.

No, I do understand that NullLiteral :: null is a production of the lexical grammar.

Good.

But why would NullLiteral :: null be equivalent to NullLiteral :: n u l l?

Well, I could say "because TSIQ says it is!", but clearly that doesn't work for you. So I'll try this: I can't think of anything else reasonable that it could mean. (i.e., I'd go so far as to say it goes without saying, and TSIQ could probably be deleted without incident.) If you're having trouble understanding that those 2 productions are equivalent, then presumably they mean something different to you. If so, say what that is.

We established earlier that BreakStatement : break ; and BreakStatement : b r e a k ; are completely different things.

Yup, because that production is in the syntactic grammar.

jmdyck on 26 Apr 2019

@jmdyck Well, I can paraphrase, this is not a problem.

We established earlier that BreakStatement : break ; and BreakStatement : b r e a k ; are completely different things.

Yup, because that production is in the syntactic grammar.

Well, I heard what I wanted to hear, although I could have written it myself in advance (what mattered was your answer).

Since BreakStatement is a syntactic production, break ; and b r e a k ; are incompatible for it (because they produce different tokens).
But as you stated and confirmed, according to the sentence I am asking about, the productions null and n u l l are equivalent.

NullLiteral :: null

is equivalent to:

NullLiteral :: n u l l

Then why do you distinguish them by putting gaps in the second production (while saying that they are the same)? What do the gaps show? This is understandable in the syntactic grammar, but confusing in the lexical grammar.

dSalieri on 26 Apr 2019

NullLiteral :: null
is equivalent to:
NullLiteral :: n u l l

Then why do you distinguish them by putting gaps in the second production (while saying that they are the same)? What do the gaps show? This is understandable in the syntactic grammar, but confusing in the lexical grammar.

The gaps in the second production make it obvious that each letter is a separate terminal symbol. With the first production, the hypothetical reader might mistakenly think that null is a terminal symbol, or wonder what it means, so TSIQ is there to reassure that reader that it's just a convenient shorthand for the second production.
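
One way to see that the gaps are only notation is a small matcher for a lexical alternative. This is a sketch under the assumption that an alternative is modeled as a plain array of single-code-point terminals; `matchesLexicalAlternative` is a made-up name. The terminals must match consecutive code points of the source text, so the spaces in `n u l l` as written in the production do not permit spaces in the source.

```javascript
// Hypothetical matcher for one alternative of a lexical production.
// `terminals` is the list of single-code-point terminal symbols.
function matchesLexicalAlternative(terminals, sourceText) {
  const codePoints = Array.from(sourceText);
  return codePoints.length === terminals.length &&
    terminals.every((terminal, i) => terminal === codePoints[i]);
}

// The shorthand `null` and the spelled-out `n u l l` give the same terminals.
const nullLiteral = ["n", "u", "l", "l"];

console.log(matchesLexicalAlternative(nullLiteral, "null"));    // true
console.log(matchesLexicalAlternative(nullLiteral, "n u l l")); // false
```

At the lexical level a space in the source is itself a code point that some production has to account for, which is why the spelled-out form is safe there but not in the syntactic grammar.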

jmdyck on 26 Apr 2019

@jmdyck now everything is clear. Thank you very much for your efforts.

P.S. I wish there were more people like @jmdyck here.

dSalieri on 29 Apr 2019

Closing as answered.

ljharb on 29 Apr 2019
