Html: Why does valid email definition mention Unicode?

Created on 13 Feb 2017 · 7 Comments · Source: whatwg/html

Relevant excerpt from specification:

A valid e-mail address is a string that matches the email production of the following ABNF, the character set for which is Unicode.

RFC 5322 disallows non-Latin letters in the local-part. The regex implementation of the ABNF also disallows non-Latin letters in the local-part.
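To make the point about the regex concrete, here is a minimal sketch in Python. The pattern is my transcription of the regular expression the HTML Standard offers as an equivalent of the ABNF, so treat it as an assumption and check it against the spec before relying on it. The helper name `is_valid_email` is made up for illustration.

```python
import re

# Assumed transcription of the spec's regex; every character class is ASCII-only.
VALID_EMAIL = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"                       # local-part
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"         # first domain label
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"   # further labels
)


def is_valid_email(address: str) -> bool:
    """Return True if the whole string matches the ASCII-only pattern."""
    return VALID_EMAIL.fullmatch(address) is not None


print(is_valid_email("user@example.com"))          # ASCII local-part
print(is_valid_email("пользователь@example.com"))  # Cyrillic local-part
```

Under this pattern the Cyrillic local-part is rejected, because nothing outside the listed ASCII characters can appear before the "@".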

I have read the thread "[whatwg] Comments on the definition of a valid e-mail address" but found no relevant information.

All 7 comments

It says that the character set is Unicode, but the actual code points mentioned are within ASCII limits.

Could you explain what the main takeaway is for a reader of the specification? Why is it not "...the character set for which is ASCII"?

It just means that if you see, e.g., "@" in the grammar, it means COMMERCIAL AT U+0040 as defined in Unicode.

There is an open issue for non-ASCII email addresses here: https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489.

It's important to remember "Unicode" does not mean "non-ASCII", it means "the modern character set used by all specifications that aren't stuck in a pre-international era" :).
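A small Python sketch of that point, using only the standard library: "@" has the same code point whether you call the character set ASCII or Unicode, and it is Unicode that names it COMMERCIAL AT. The helper names `describe` and `is_in_ascii_range` are hypothetical and exist only for this illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format a single character as 'U+XXXX NAME' using the Unicode database."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def is_in_ascii_range(ch: str) -> bool:
    """True if the character's Unicode code point is below 0x80."""
    return ord(ch) < 0x80


print(describe("@"))           # U+0040 COMMERCIAL AT
print(is_in_ascii_range("@"))  # True

# For code points in the ASCII range, the ASCII and UTF-8 encodings agree byte for byte.
print("@".encode("ascii") == "@".encode("utf-8"))  # True
```

So saying "the character set is Unicode" does not by itself admit any non-ASCII character; the grammar still decides which code points are allowed.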

Am I correct that, in terms of the ABNF specification (RFC 5234), you can call "@" a Unicode-encoded terminal value?

2.3. Terminal Values
Rules resolve into a string of terminal values, sometimes called
characters. In ABNF, a character is merely a non-negative integer.
In certain contexts, a specific mapping (encoding) of values into a
character set (such as ASCII) will be specified.

Yeah, that's probably roughly correct.
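For what it's worth, the "a character is merely a non-negative integer" idea from RFC 5234 can be sketched like this in Python. The function `terminal_value` is a hypothetical helper that handles only a single numeric terminal such as `%x40`, not ranges (`%x30-39`) or concatenations (`%x40.41`). Mapping the integer to a character with `chr` is the step where Unicode is assumed as the character set.

```python
def terminal_value(num_val: str) -> int:
    """Parse a single ABNF numeric terminal (%b..., %d..., %x...) into its integer."""
    bases = {"b": 2, "d": 10, "x": 16}
    if len(num_val) < 3 or num_val[0] != "%" or num_val[1].lower() not in bases:
        raise ValueError(f"not a single ABNF numeric terminal: {num_val!r}")
    digits = num_val[2:]
    if not digits.isalnum():
        # Rejects ranges, concatenations, signs, and whitespace.
        raise ValueError(f"unsupported digits in terminal: {num_val!r}")
    return int(digits, bases[num_val[1].lower()])


value = terminal_value("%x40")
print(value)       # 64, the bare integer ABNF works with
print(chr(value))  # "@", once the integer is mapped through Unicode
```

The grammar itself only sees the integer 64; it is the spec's statement about the character set that ties that integer to "@".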
