html 🚀 - Validating internationalized mail addresses in <input type="email">

Hey, coming here from this chrome bug.
If I understand correctly, this means that we would send the email user@ß.com to the server as user@ß.com instead of the punycoded version [email protected] like we do today, and we would also allow ß@ß.com to pass validation and send it as ß@ß.com.
After reading the concern in this comment, I have a hard time believing that we wouldn't break some servers somewhere. Just because mail servers tend to accept more unicode doesn't mean that every mail server everywhere does now, right?

josepharhar on 6 Jun 2020

@josepharhar I agree that some servers can break (old ones but f.x. in Poland most popular e-mail providers are ... not working as they should) but please remember that we are still saying about client-side e-mail field validation.

RFC 6532 was not supported for a long time in many software apps (f.x. Thunderbird makes really strange things when receives non-encoded UTF-8 mail compilant with RFC 6532 - it's still open in Bugzilla) but up-to-date mail servers allow to create such accounts and send such mails (Postfix has support for it since ~2015). It's complex problem as f.x. delivery of UTF-8 mail to old mailbox can lead to some problems but what can we do else other than progressively upgrade used technologies to support it? :)

Anyway, I don't think that it's browser responsibility to "protect backend from problematic e-mail addresses" so if RFC allows it and up-to-date software supports it, we should allow it.

rutek on 6 Jun 2020

👍1

It's more complex than that, and it's not about ß which is an odd special case.

EAI (internationalized) mail can handle addresses like пример@Бориса.РФ. While the domain part can turn into ASCII A-labels xn--80abvxkh.xn--p1ai (sometimes called punycode), the mailbox cannot, and only an EAI mail system can handle that address.
Common MTAs like postfix and exim have EAI support but it's not turned on by default, and there is no way a browser can tell what kind of MTA a remote server has or how it is configured. That's why we need a new input type="eaimail" that accepts EAI addresses, which web sites can use if their MTA handles EAI.

The treatment of ß has nothing to do with this. The obsolete IDN2003 and current IDN2008 internationalized domain names are almost the same but one of the few differences is that 2003 normalizes (not punycodes) ß to ss while 2008 makes it a valid character. An address with an ASCII mailbox like user@ß.com could turn into [email protected] but ß@ß.com is EAI only. This turns out to matter because there are German domain names with ß in them that your browser cannot reach if it uses the obsolete rules. See my page https://fuß.standcore.com to see what your browser does.

jrlevine on 6 Jun 2020

A few tiny additions and clarifications to John Levine;'s note (we do not disagree about the situation in any important way; the issues are just a bit more complex, with potentially broader implications, that one might infer from his message and they may call part of his suggestion into question). In particular, "eaimaill" or something like it may be the wrong solution to the problem and may dig us in even deeper. For those who lack the time or inclination to read a fairly long analysis and explanation, skip to the last paragraph.

First, while his explanation of the difficulty with ß is correct, it is perhaps useful to also note that the ß -> ss transformation is often brought about by the improper or premature application of NFKC, which may have been the source of the recent dust-up about phishing attacks using Mathematical special characters. In the latter case, IDNA2008 imposes a requirement on "lookup applications" (including browsers) to check for and reject such things but they obviously cannot do so if the characters the IDNA interface sees are already transformed to something valid. The current version of Charmod-norm discusses, and recommends against, general application of compatibility mappings. It is perhaps also worth noting that UTS #46 is still recommending the use for NFKC (as part of NFKC_Casefold and its associated tables (see Section 5 of that document)) but also calls out the problem of reaching some IDNA2008-conformant domain names if the IDNA2003 rules are followed. Because, from observation, some (perhaps many or most) browsers look to UTS #46 for authority in interpreting domain names in, e.g., URLs while most or all SMTPUTF8 implementations (incorrectly, but commonly, known as "EAI") are strictly conformant to IDNA2008, the differences between the two introduces additional complications .

John mentions that a browser cannot tell what the MTA and configuration a remote server might have, but it is even worse than that. In general, the browser is unlikely to know very much about the precise capabilities of the local MTA or Message Submission Server (MSA() unless those functions are actually built into the browser. The web page designer is even less likely to know and is in big trouble if different browsers behave differently. If the browser does not know, or cannot be configured to know, the distinction between an input type="email" and one of ""eaimail" (which I hope would be called something else, perhaps "i18nemail") would not be as useful as his message implies.

Thinking about these issues in terms of what mail systems do with the addresses my miss an important issue. In many cases, web pages are trying to accept and validate something that looks like an email address but is not headed immediately into a mail system. Instead, it is destined for insertion into a database or comparison with something already there, validation by some other process entirely, or is actually an email address (or something that looks like one) used as a personal identifier such as a user ID. For the latter case, conversion of the part of the string following the "@" via the Punycode algorithm may not produce a useful result whether IDNA2008, IDNA2003, or UTS #46 rules are used. I would think it would be dumb, but if someone wanted to allow 3!!!\@#$%^&.ØØØ as a user ID and some system wants to allow that, we should probably stay out of their way (perhaps by insisting they use a type that does not imply an email address). However, the other side of that example is probably relevant to the discussion. The operator or administration of a mail server, or the administrator of a system that uses email addresses as IDs, gets to pick the addresses they will allow. Especially in the ID case, if they use a set of rules narrower than what RFC 5821 allows (and that are allowed in addresses on many mail systems), then they open themselves up to many frustrations and complaints from from users whose email addresses are valid according to the standards and work perfectly well on most of the Internet but that are rejected by their systems. Internationalized addresses open up a different problem. As an example, I don't know many mail servers identified by domains subsidiary to the 公益 TLD have allowed registration of local parts in Tamil or Syriac scripts, but I suspect that "zero" wouldn't be a bad guess. Someone designing a web site for users in China might know that and, for the best quality user experience, might want to reject or produce messages about non-Chinese local parts for that domain or perhaps even for any Chinese-script and China-based TLD. Similar rules might be applied in other places to tie the syntax of the local part to the script of the TLD but, for example in countries where multiple scripts are in use and "official", such rules might be a disaster. And, because almost anyone can set up an email server and there are clearly people on the Internet who prioritize being clever or cute or exhibiting a maximum of their freedom of expression over what others might consider sensible or rational, most of us who have been around email for many years have seen some truly bizarre (but valid) local parts of all-ASCII addresses and see no reason to believe we won't see even worse excesses as the Internet becomes increasingly internationalized.

This leads me to a conclusion that is a bit different from when this was discussed at length over a year ago. As we have seen when web sites reject legitimate ASCII local parts because people somehow got in into their heads that most non-alphanumeric characters were forbidden or were stand-ins for something else and, more broadly, because it is generally impossible to know what a remote MTA with email accounts on it will allow in those accounts, trying to validate email addresses by syntax alone is hard and may not be productive. When one starts considering email addresses (or things that look like them) that contain non-ASCII characters, things get much more difficult. IDNA2008, IDNA2003, and UTS#46 (in either profile) each have slightly different ideas about what they consider valid. Whatever any of them allow is going to be a superset of what any sensible domain or mail administrator or will allow in practice. In general, a browser does not know what conventions back-end systems or a mail system at the far end of the Internet are following, much less whether they will be doing the same thing next month. So my suggestion would be that Input type="email" be interpreted and tested only as "sort of looks like an all-ASCII email address", that a new input type="i18nmail" be introduced as "looks like 'email' but with some non-ASCII characters strewn around", and that the notion of validating beyond those really general rules be left to the back-end systems, the remote "delivery" MTAs, and so on. In addition, to the extent to which one cares about the quality of the user experience, it may be time to start redesigning the APIs associated with various libraries and interfaces to that they can report back real information about why putative email addresses didn't work for them more precise than "failed" or "invalid address".

good luck to us all,
john

klensin on 8 Jun 2020

❤1

FYI, new installs of Postfix get EAI enabled by default.

My take is that a new input type is not required. An attribute by which to reject EAI is fair (e.g., because the site's MTAs don't support EAI on outbound.

nicowilliams on 10 Jun 2020

s/reject/accept/ and I agree

jrlevine on 10 Jun 2020

Validation on the front-end creates more ways to lose rather than more ways to win, and doesn't really protect the backend from vulnerabilities.

So I'm just not very keen on the browser doing much validation here. If the site operator has / does not have a limitation as to outbound email, I'm fine with stating it, but I'm also fine with allowing whatever, and making it the backend's job (or any scripts' on the page) to do any validation.

My take is that the default should be permissive. This should be how it is in general. Consider what happens otherwise. You might have a page and site that can handle EAI just fine but a developer forgot to update their email inputs on their pages to say so: now you have a latent bug to be found by the first user who tries to enter an internationalized address. This might mean losing user engagement, and you might never find out because why would the users tell you? But, really, why do we need the input to do so much validation? The input has to be plausibly an email address -- a subset of RFC5322, [email protected] is plenty good enough for 99.999% of users, and there is no good validation to apply to the mailbox part. This is how users get upset that they can't have [email protected]. We should stop that kind of foot self-shooting.

nicowilliams on 11 Jun 2020

👍1

The user should able to enter an email address verbatim, with no second-guessing by input forms. If that address is known to be a-priori unworkable by the server's backend system, it can be rejected with an appropriate error message on the initial POST. Otherwise, if the address vaguely resembles mailbox syntax, it should be accepted and used verbatim. It may not be deliverable, but that's also true of many addresses that are syntactically boring [email protected] may bounce while виктор1βετα@духовный.org may well be deliverable...

vdukhovni on 11 Jun 2020

👍2

https://html.spec.whatwg.org/multipage/input.html#e-mail-state-(type=email) defines
he value attribute, if specified and not empty, must have a value that is a single valid e-mail address.

The value sanitization algorithm is as follows: Strip newlines from the value, then strip leading and trailing ASCII whitespace from the value.
This should be retained if not expanded (other whitespace?) NFC shouldn't be necessary for user typed data, but wouldn't hurt.

masinter on 11 Jun 2020

https://html.spec.whatwg.org/multipage/input.html#e-mail-state-(type=email) defines
he value attribute, if specified and not empty, must have a value that is a single valid e-mail address.

The value sanitization algorithm is as follows: Strip newlines from the value, then strip leading and trailing ASCII whitespace from the value.
This should be retained if not expanded (other whitespace?) NFC shouldn't be necessary for user typed data, but wouldn't hurt.

Keep reading and in another paragraph or two you'll find the Javascript pattern they tell you to use to validate e-mail addresses.

jrlevine on 11 Jun 2020

https://html.spec.whatwg.org/multipage/input.html#e-mail-state-(type=email) defines
he value attribute, if specified and not empty, must have a value that is a single valid e-mail address.
The value sanitization algorithm is as follows: Strip newlines from the value, then strip leading and trailing ASCII whitespace from the value.
This should be retained if not expanded (other whitespace?) NFC shouldn't be necessary for user typed data, but wouldn't hurt.

Keep reading and in another paragraph or two you'll find the Javascript pattern they tell you to use to validate e-mail addresses.

The PCRE pattern behind the link is rather busted. It fails to properly validate dot-atoms, allowing multiple consecutive periods in unquoted local-parts (invalid addresses), while disallowing quoted local-parts (valid addresses). EAI-aside, this sort of fuzzy approximation of the actual requirements is harmful.

vdukhovni on 11 Jun 2020

👍1

Hil Maybe it would be helpful to back up a little bit an look at this from the perspective of a fairly common use case. Suppose I have a web site that sets up or uses user accounts and that I've decided to use email addresses as user IDs (there are lots of reasons why that isn't a good idea, but the horse has left the barn and vanished over the horizon). Now, while it would probably not be a good practice, there is no inherent requirement that my system ever send email to that address -- it can be, as far as I'm concerned, just a funny-looking user ID. On the other hand, if I tell a user who has been successfully using a particular email address for a long time that their address is invalid, I am going to have one very annoyed user on my hands. If I am operating in an environment in which "user" is spelled "customer", and I don't have a better reason for rejecting that address than "W3C and WHATWG said it was ok to reject it" I may also be able to have various sales types, managers, and executives in my face.

The fact that email address is being used as a user ID probably answers another question. Suppose the user registers with an email address using native Unicode characters in both the local part and the domain part. Now suppose they come back a few weeks later and try to sign in using the same local part but a domain part that contains A-labels. Should the two be considered to match? Remembering that this is a user ID that has the syntax of an email address, not something that is going to be used exclusively in an email context, I'd say that is a business decision and not some HTML (or browsers, or similar tools) should get into the middle of. There is one exception. One of the key differences between IDNA2003 and IDNA2008 is that, in the latter, U-labels and A-labels are guaranteed to be duals of each other. If the browser or the back-end database system are stuck in IDNA2003 or most interpretations of UTR#46, then the fact that multiple source labels can map to a single punycode-encoded form opens the door to a variety of attacks and anyone deciding that the two are interchangeable in that environment has best be quite careful about what user names they allow and how they are treated.

It may also be a reasonable business decision in some cases for a site to say "we don't accept non-ASCII email addresses as user IDs/ account identifiers" or even "we accept addresses that uses these characters, or characters from a particular set of scripts, and not others". But nothing in the HTML rules about the valid syntax for email address should be in the middle of that decision.

Beyond that, as others have suggested, one just can't know whether an email address is valid without somehow asking the server that hosts the relevant mailbox (or its front end). It may not be possible to ask that question in real time and, even if it is, doing so is likely to require significantly more time (user-visible delay) than browser implementers have typically wanted to invest. So let's stick to syntax

That scenario by itself argues strongly for what I think John, Nico, and others are suggesting: the only validation HTML should be performing on something that is claimed to be an email address is conformity to the syntax restrictions in RFC 6531. Could one be even more liberal than that? Yes, but why bother.

klensin on 7 Jul 2020

👍2

I was actioned by the W3C I18N WG with replying to this thread with a sense of the group.

Generally, we concur with @kleinsin's comment just above ⬆️.

We think that type=email should accept non-ASCII addresses the better to permit adoption of EAI and IDNA. One reason for low adoption of these are barriers to using them across the Web/Internet. Removing these types of artificial barriers will not only encourage adoption, but will support those users who are already using these.

Users of this feature in HTML expect that the input value follow the structural requirements of an email address but don't expect the value to be validated to be an actual valid address. At best this amounts to ensuring that there is an @ sign and maybe some other structure that can be found with a regex. Users who want to impose an ASCII restriction or do additional validation are free to do so and mostly have to do this anyway. In our opinion, HTML would thus be best off to provide minimal validation. User agents can use type=input as a hint for additional features (such as prompting the user with their own email address or providing access to the user's address book), but this is outside the realm of HTML itself.

aphillips on 15 Jul 2020

I played with this a bit and it seems the current state is rather subpar, though that also leaves more room for changes. Example input: x@ñ. Firefox submits as-is (percent-encoded). Chrome submits x@xn--ida. Safari rejects (asks me to enter an email address). If you use ñ before the @ all reject (as expected).

One thing that would help here is a precise definition of the validation browsers would be expected to perform if we changed the current definition as well as tests for that. I can't really commit for Mozilla though if we can make this a bit more concrete I'd be happy to advocate for change.

annevk on 15 Jul 2020

@aphillips @annevk just about the only thing worth validating here is the RHS of the @ -- everything else should be left to either the backend (which does or does not support internationalized mailbox names) or the MXes ultimately identified by the RHS of the @, or any MTAs in the path (which might not support internationalized mailbox names, but damn it, should).

What is the most minimal mailbox validation? Certainly: that it's not empty. Validating that the mailbox is not some garbage like just ASCII periods, and so on, _might_ help, but getting that right is probably difficult.

So that's my advice: validate that the given address is of any RFC 5322 form that is ultimately of the form ${lhs}@${rhs}, that the RHS is a domainname, supporting U-labels because this is a UI element, as well as A-labels, and validate that the LHS is not empty, and keep any further LHS validation to the utter minimum, in particular not rejecting non-ASCII Unicode.

nicowilliams on 15 Jul 2020

@annevk, I think your examples actually point out the problem. In order: it would be rare, but not impossible (details on request but I want to keep this relatively short) to see on on the RHS of the "@", and % is prohibited by the syntax in RFC 5321 , but I'd generally recommend the use of percent-encoding in any part of email addresses. Pushing a domain-part through Punycode is prohibited by IDNA unless the labels it contains are validated to be U-labels. I can't tell from your example but if, e.g., the domain -part of the mailbox was \u1D7AA\u1D7C2 then it should be rejected, not encoded with punycode: doing otherwise invites errors down the line, errors for which the user get obscure and/or misleading messages.

The problem is that email addresses with non-ASCII characters in the local-part and/or domain part are now valid and increasing numbers of people who can use them for email are expecting to use them through web interfaces.
Keeping in mind that a browser cannot ever fully "validate" an email address (something that would require knowing that the mailbox [email protected] exists but [email protected] does not) I suggest:

(1) If a mailbox consists of a string of between 1 and 64 octets, an "@", and at least 2 and up to 255 more octets, treat it as acceptable and move on, understanding that all sorts of things may apply additional restrictions in actual email handling.

(2) In addition, if you wanted to and the domain-part contained non-ASCII characters, you could verify that any labels were valid ISDNA2008 U-labels and reject the name if they were not ("invalid domain name in email address:" would be a much better message than "invalid email address") AND, optionally iff the local-part was entirely ASCII, convert those U-labels to A-labels. The SMTPUTF8 ("EAI") specs strongly recommend against making that conversion if the local-part is all-ASCII. When the local part is all-ASCII, the conversion will allow some valid cases to go through but, over time, it seems likely that those cases will become, percentagewise, less frequent so whether it is worth the effort is somewhat questionable.

FWIW, the above was written in parallel with @nicowilliams's comment rather than after studying it, but that his recommendation and mine are not significantly different except for that one marginal case of an ASCII local-part and a non-ASCII (but IDNA2008-valid) domain part.

klensin on 15 Jul 2020

I should have added, as @vdukhovni more or less points out, if one is going to try to validate the syntax of the local-part (even all-ASCII local-parts) if it important to actually get it right. As he shows, getting it right is a moderately complicated process, perhaps best left to email systems that are doing those checks anyway (which is what @nicowilliams and I essentially suggest above). But, if one is going to try to do it, it should be done right because halfway attempts (fuzzy approximations) are harmful, including letting some local-parts with invalid syntax through and prohibiting some valid ones.

klensin on 15 Jul 2020

👍1

@klensin I'm not sure what you're trying to convince me of. I was offering to help. (Percent-encoding is just part of the MIME type form submission uses by default, it's immaterial. Chrome's Punycode handling is what is encouraged by HTML today. That browsers do incompatible things suggests it might be possible to change the current handling.)

annevk on 16 Jul 2020

@annevk I drew an action item (during part of I18N's meeting when @klensin was not available) to propose changes and I'd appreciate your thoughts on how to approach this. Looking at the current text, I guess a question is whether we should attempt to preserve the current behavior for ASCII email addresses (or their LHS/RHS parts) while simultaneously allowing labels in that use non-ASCII Unicode? I18N WG participants seem to agree that we don't want to get into deep validation of the address's validity and limit ourselves to "structurally valid" addresses.

aphillips on 16 Jul 2020

Right, e.g., at a minimum we should probably require that the string contains a @ and no surrogates. But currently we also prohibit various types of ASCII labels, e.g., quoted ones, and allowing those to now go through might not be great either.

annevk on 16 Jul 2020

It certainly has to be valid Unicode (e.g., no unpaired UTF-16 surrogates, no invalid UTF-8 bytes), and follow the rules like no unpaired quotes. Restricting it more than that is not likely to help.

jrlevine on 16 Jul 2020

👍1

Even if people are just using things that look like email addresses for purposes other than sending email, do you really want to allow unnormalized Unicode or leading or trailing white space in the LHS?
for sites that use email addresses as user IDs, changing HTML validation to allow entry of different sequences that are visually identical opens up new security concerns.

masinter on 16 Jul 2020

@masinter Absolutely this must allow unnormalized Unicode because users cannot be counted to produce normalized Unicode. Regarding whitespace, trimming it is fine. I don't think there are any security concerns regarding client-side validation -- if there is a site where relaxing client-side validation of email addresses creates a security concern, then the site is already vulnerable.

nicowilliams on 16 Jul 2020

Mailbox names are pretty much arbitrary UTF-8. It doesn't have to be normalized, for that matter, it can be a sequence of ZWJ and Arabic combining marks. While I agree that no sensible mail provider would use names like that, we don't get to tell people to be sensible.
White space has to be quoted so unquoted trailing whitespace isn't valid, although unquoted NBSP and NNBSP is.

jrlevine on 16 Jul 2020

👍1

@aphillips @annevk See above. Do less validation. Validate only:

Unicode well-formedness
balanced quotes
that there's an @
that the RHS of the @ is a ~~domainname~~ hostname, allowing both, U-labels and A-labels (well, that's hard enough to do -- basically, there has to be at least one .)

In all cases allow Unicode throughout.

Trim whitespace, sure.

Anything else?

nicowilliams on 16 Jul 2020

the RHS has to be a hostname, which limits the characters to the ones valid in U-labels

jrlevine on 16 Jul 2020

👍1

Validating internationalized mail addresses in
Last time around the consensus seemed to be that EAI input fields should be marked as unicode or eai or the like, since it'll be a while since all mail systems handle EAI.

I think there are likely a large number of sites that use and aren't prepared to deal with spoofing, normalization or untypable addresses injected . Rather than introduce that kind of vulnerability by changing what type="email" means for them, make adding EAI support an explicit step.

masinter on 16 Jul 2020

@masinter User's don't distinguish between entering [email protected] and персон@еџампле.ру when using email. If we create indistinguishable input boxes for this, users and content authors will be confused by the difference. It creates another barrier to more-widespread adoption of IDN and SMTPUTF8. The end-to-end folks have been pestering us (I18N) for years about this. Since browsers are inconsistent anyway and users need to process the values they are sent (which already have spoofing or other garbage injection possibilities), this is an opportunity to be done with the problem.

Would an alternative be to add a "legacy" attribute?

@nicowilliams foo@localhost doesn't have a dot. That's one reason (among several) that the current regex makes * ('.' label) on RHS optional.

aphillips on 16 Jul 2020

👍1

@aphillips Really, users input foo@localhost into these elements? Fine.

I agree with you regarding not wanting to type EAI vs. not-EAI. Users don't and shouldn't have to know.

@masinter

I think there are likely a large number of sites that use and aren't prepared to deal with spoofing, normalization or untypable addresses injected . Rather than introduce that kind of vulnerability by changing what type="email" means for them, make adding EAI support an explicit step.

Again, if relaxing client-side validation "causes" a security problem, then the security problem already exists. Relaxing client-side validation _cannot_ cause a security problem on the server side!

Also, the server-side that gets a form with email address inputs should NOT normalize the mailbox part. Leave that to mail software, specifically the last hop MTA should normalize the mailbox part _if at all_ (it could use form-insensitive matching of mailbox names). The mailbox part is for all intents and purposes opaque to all relays.

nicowilliams on 16 Jul 2020

I forgot -- form fields (including those with type="email") are encoded using the charset of the form, not utf8.
so anyone trying to enter an EAI into an input-field in a (non utf8) form will have trouble because there is no way to represent the characters.

masinter on 17 Jul 2020

Well, certainly you can (and should) set the charset to be UTF-8. If the charset is something other than UTF-8, well, I'm not sure I care what happens then to non-ASCII input that can't be represented as whatever the chosen charset was, but certainly EAI addresses that use only characters that can be represented in whatever that charset is will survive the POSTing of the form, and then the server can convert to UTF-8 or UTF-16 as needed.

The fact that you could set the form's charset to anything other than a Unicode encoding does not mean we can't internationalize form inputs.

nicowilliams on 17 Jul 2020

I find it hard to care about people who expect EAI addresses but use an encoding other than UTF-8 or (for backward compatibility only) UTF-16.

jrlevine on 17 Jul 2020

Browsers encode characters not supported by the charset of the form as decimal NCRs (i.e. Ӓ)--appropriately percent encoded as needs be. Note that the accept charset of the form does not need to match the page's encoding. Actual user interaction with a page is always in Unicode--charset is just a wire encoding phenomenon. I can't quite find the reference in the html spec where form submission does this, but you can test it for yourself easily enough :-).

aphillips on 17 Jul 2020

on unicode normalization. Let's suppose there are two systems (one for a phone and another for a desktop) that handle the encoding differently, one produces unnormalized unicode and the other produces normalized unicode on entry. (mac and windows with Vietnamese?)
The two forms are visually completely indistinguishable.
The server accepts the form data and displays a confirmation
"Is this the email address you meant?" and displays it in a font that distinguishes between I and l and 1 and |.
The problem is that even if downstream mail software handles the equivalence, the end user will be unhappy if they subscribe on one device and try to unsubscribe with the other.
In this case there is no particular "security" problem, but it's a usability problem that the form and server software wasn't prepared to deal with back when type="email" implied ASCII.
The URL standard starts with normalization, why not for EAI?

masinter on 17 Jul 2020

In reply to @aphillips "Browsers encode characters not supported by the charset of the form as decimal NCRs (i.e. Ӓ)--appropriately percent encoded as needs be. "
not true; browsers may accept NCRs with unicode code-points but they don't generate them when POSTing form data:

https://url.spec.whatwg.org/#application/x-www-form-urlencoded
https://github.com/whatwg/url/issues/452#issuecomment-658639752

masinter on 17 Jul 2020

Because there is an EAI mail standard in RFC 6531 and that's not what it says. Surely this is not a surprise.

See Klensin's comment about using addresses as account identifiers.

jrlevine on 17 Jul 2020

👍1

The problem is that even if downstream mail software handles the equivalence, the end user will be unhappy if they subscribe on one device and try to unsubscribe with the other.

There is an easy answer to this: normalize for comparison (form-insensitive comparison) but store as given if you store at all. I.e., be form-insensitive, but form-preserving. Just as one typically does with case in case-insensitive systems.

Form equivalence issues are very similar to case equivalence in case-insensitive systems!

When you design a case-insensitive system, the simplest thing to do is to: "normalize" case (i.e., case-fold) during string comparison and for indexing tables, but otherwise storing with the CaSe aS gIvEn.

The problem you mention happens as to case with all-ASCII email addresses today because even though mailbox names _are_ case-sensitive, often they are implemented as case-insensitive, such that [email protected] == [email protected] == [email protected] == ... But that problem doesn't have to happen at all as to form because where it matters the comparisons/lookups really have to be form-insensitive, and IMO normalizing at the UI is not a good answer. Though I won't be upset if browsers do normalize mailbox names, I don't think they should have to, and I would much prefer that they not normalize mailbox names at all.

nicowilliams on 17 Jul 2020

@masinter: Larry, as John Levine points out, that is just not what the specs say. Could the trade-offs have been evaluated differently circa nine years ago, leading to a set of rules you would like better now? Yes, probably. But they weren't and I note that, IIR, you did not participate significantly in the EAI WG nor raise these issues on IETF Last call. There are several things in the specs the EAI WG produced that ended up that way because no one considered alternatives; this is not one of them. If you think we got it wrong, you know how to proceed: create an I-D explaining what was wrong about it, propose a change, and see what traction it gets.

I don't see what arguing for a different treatment here accomplishes. I do think that, for this particular effort, the principal consideration should be that, if users have email addresses that conform to the relevant standards and work well in the Internet mail system, HTML should neither tell those users that those addresses are invalid nor map them into something that the mail system might consider different. If you disagree with that as a principle, let's discuss it, not whether the specs should be different (or counterfactual ideas about how they work).

As Nico points out, if someone wants or needs to do a back-end comparison, that may be entirely reasonable (with normalization before comparison being an obvious possibility) as long as it is remembered (if it might be relevant) that, as far as the mail system is concerned, such comparisons are a bit fuzzy and might produce false matches.

I could write much more about this and some of the details and trade-offs (and actually did but decided to not send it). I hope I don't have to.

klensin on 17 Jul 2020

👍1

@klensin , I want to be sure I understand your 15. July comment, but there are a couple of phrases I am struggling to parse:

it would be rare… to see on on the RHS of the "@"…

I don't understand the repeated words "on on". To see what on the RHS? Did the system eat a word or a character which you intended to be there?

… but I'd generally recommend the use of percent-encoding in any part of email addresses.

I don't understand if you are recommending for or against percent-encoding. "I'd generally recommend… in any part" seems to mean you are in favour, but the wording "in any part" fits better with a negative, and I read the context to imply you recommend [against] "the use of percent-encoding in any part…".

This is a fascinating discussion, and I am learning a lot. I want to be sure I am understanding. My apologies if I am being dense.

JDLH on 17 Jul 2020

The only thing I can find is this presentation which wasn't helpful.
It's fine with me to introduce a new feature that it now accepts IDN and Unicode strings where it didn't before. Usually the browsers like to warn people when they might suddenly get form values they weren't expecting (especially if they used multipart/form-data with text/plain;charset="utf8" ).

Getting the form to accept Unicode in email addresses is just opening the front door to making the rest of the infrastructure actually work. People who maintain those web sites will have to test, and testing isn't easy.

Usually this kind of thing is staged, people are at least warned (like with dropping ftp:). Better would be to define new and deprecate old.

masinter on 17 Jul 2020

@JDLH: First, you are not being dense. The comment suffers from two problems: (1) I'm tired, short on time, preoccupied with other things, and a tad frustrated with aspects of the conversation including feeling like we've had parts of it more than once before. (2) Sadly, some of the specs involved and their provisions that bear on this subject are more complicated then one might wish and, in a more perfect world, probably one in which everything had been developed at the same time, all the pieces might fit together in a much more simple and elegant way. The combination of the two, and really not wanting to explain the whole history of non-ASCII email addresses and headers and non-ASCII domain name s here results in my writing too rapidly and making silly typographical or pasting errors. To try to answer your questions.

The part of the phrase with the double "on" should have read something more like "... to see one (a "%") on the RHS of the "@", and % is prohibited...". That was intended to be a short way to explain that, while many of us would use a sequence of rude words to describe the wisdom of taking advantage of it, foo%bar.example.com is a perfectly valid name as far as the DNS specifications are concerned. Because RFC 5321 (the SMTP spec, originally RFC 821) will not allow it, the DNS specs consider someone who sets up a name like "foo%bar" imprudent, but, again, it is not invalid.

"but I'd generally recommend the use of percent-encoding in any part of email addresses" should, as you surmised, have been "but I'd generally _not_ recommend..." This is one of those places where the pieces don't quite fit together. In URIs and related contexts, "%", as you know, introduces two hex digits to represent an octet. Historically, many mail systems have interpreted an email address like xyz%example.[email protected] or, more to the point, user%example.[email protected]" as an indication that the message be delivered, using SMTP to example.com or mitivma.mit.edu (or wherever their MX records point) with the expectation that they will figure out how to deliver to xyz at example.net or user at example.earn, whatever those "hosts" mean to that system However, because those delivery hosts (more or less the RHS of the "@" are free to interpret local parts any way they like, xyz%example.com could be treated as an atomic mailbox name on the local host, example.earn could be mapped into something else entirely, and, indeed, abc.def%joe.[email protected] could be interpreted as Joe Smith at def.abc, whatever that might mean -- entirely up to that delivery system. While none of those examples are particularly problematic relative to the URI use of %, consider xyz%40example.[email protected] or jos%c3%a8.[email protected] and the number of different ways that can be interpreted in an environment that supports non-ASCII local-parts.

Is that a bit more clear?

john

klensin on 17 Jul 2020

👍1

@klensin Thank you for the clarification. I understand the 15. July comment much better now.

And I understand the frustration with the conversation, and the fact that the pieces don't quite fit together. I notice, for example, that Github's comment-formatting software is doing it's best to linkify email addresses, but still leaves off the prefix of an email address like "user%example.[email protected]" because of the '%' character.

JDLH on 17 Jul 2020

@masinter Do you disagree with the assertion that relaxing client-side validation cannot _cause_ a server-side vulnerability?

nicowilliams on 17 Jul 2020

Side note: One might think that normalizing to compare strings is expensive, but that is not so. There are a number of optimizations that can make comparison of mostly-ASCII and mostly-already-normalized strings very fast indeed. First off, normalize one character at a time. Second, normalize only when needed -- a character that consists of just an ASCII codepoint requires no normalization, and an ASCII codepoint cannot combine with a preceding one, thus in a sequence like ab, the second codepoint makes it clear that the first requires no normalization. Third, memcmp() equality means no normalization is needed -- normalize only when you stumble onto codepoints that might be parts of characters that require normalization, but only when the other string differs at these codepoints. Third, one can implement glibc-style optimization where if possible due to alignment, you load and compare 4 octets at a time, only here you can mask with 0x80808080 and if that's 0 == then you can take a fast path for the first three bytes and if it is not then you take a slow path. The downside is that the worst case will be somewhat slower than normalizing the full input strings would have been to begin with. Point is, if you think form-insensitivity must be slow, think again.

nicowilliams on 17 Jul 2020

@nicowilliams I disagree with the idea that there are "server side" and "client side" vulnerabilities. Most of the vulnerabilities are due to the human user in the loop, and what a person would expect and enter in the overall situation, as mediated by the user agent.
Is phishing client side or server side?

masinter on 17 Jul 2020

@masinter Earlier you were saying that relaxing client-side validation would expose server-side issues, but now you're just changing the topic. Are you saying there's a phishing issue?

nicowilliams on 17 Jul 2020

In working on a PR (so we can discuss text directly), the main change needed is in the ABNF for "valid email address". There are different choices for how to approach this, so I thought I'd seek input (this may help other parts of this discussion as well). Here's the current ANBF (notice that it doesn't allow either non-ASCII domain names or local parts):

email         = 1*( atext / "." ) "@" label *( "." label )
label         = let-dig [ [ ldh-str ] let-dig ]  ; limited to a length of 63 characters by RFC 1034 section 3.5
atext         = < as defined in RFC 5322 section 3.2.3 >
let-dig       = < as defined in RFC 1034 section 3.5 >
ldh-str       = < as defined in RFC 1034 section 3.5 >

One approach would be to fix atext by moving to the definition in RFC6532. This has the advantage of being simple to read and keeping the definitions in the RFCs (rather than extracting them), although it hides the change somewhat. Note that the rest of the text in the section would make plain what happened:

email         = 1*( atext / "." ) "@" label *( "." label )
atext         = < as defined in RFC 6532 section 3.2 >
label         = 1*63( atext )  ; limited to a length of 63 characters by RFC 1034 section 3.5

If we import all of the definitions directly, we get a fair bit of (byte-oriented) gunk:

email           = 1*( atext / "." ) "@" label *( "." label )
atext           = ALPHA / DIGIT /         ; Printable US-ASCII
                       "!" / "#" /        ;  characters not including
                       "$" / "%" /        ;  specials or a valid UTF-8 
                       "&" / "'" /        ;  non-ASCII sequence of
                       "*" / "+" /        ;  2 to 4 bytes
                       "-" / "/" /
                       "=" / "?" /
                       "^" / "_" /
                       "`" / "{" /
                       "|" / "}" /
                       "~" / UTF8-non-ascii
UTF8-non-ascii  =   UTF8-2 / UTF8-3 / UTF8-4
UTF8-2          =   <Defined in Section 4 of RFC3629>
UTF8-3          =   <Defined in Section 4 of RFC3629>
UTF8-4          =   <Defined in Section 4 of RFC3629>
label           =   1*63( atext )  ; limited to a length of 63 characters by RFC 1034 section 3.5

A cleaner solution might be to use Unicode code points like so:

email         = 1*( utext / "." ) "@" label *( "." label )
atext         = < as defined in RFC 5322 section 3.2.3 >
utext         = atext / %x80-D7FF / %E000-10FFFF ; unreserved printable ASCII characters or any non-ASCII Unicode code points
label         = 1*63( utext )  ; limited to a length of 63 characters by RFC 1034 section 3.5

I think I prefer the last one by a fraction over the first one. What do others think?

aphillips on 3 Aug 2020

After a great deal of discussion I think we have agreed that at this point we have no idea what people will actually allow in EAI addresses, and it is not easy to describe likely possibilities as REs. Useful addresses will likely be in a single script or a small set of compatible scripts, but good luck describing that. So I think you're pretty close.
The local part is limited to 64 octets and some MTAs enforce that so the first rule should be:
email = 1*64( utext / "." ) "@" label *( "." label )
Some character combinations in local parts need to be quoted, such as two dots in a row, but nobody uses addresses like that so don't bother.

The label length limit is actually the limit on the ASCII A-label which can be longer or shorter than the corresponding U-label. For example, this 63 octet A-label:
xn--fiqaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
corresponds to this U-label:
中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中
which is 57 utexts or 171 octets. That means 63 is wrong, but it's no wronger than any other number. The set of characters allowed in a U-label is much smaller than your utext but again, it's not practical to describe in an RE. You really need to run it through something like libidn2 to try to normalize and convert it, but I don't think that exists in Javascript.

jrlevine on 3 Aug 2020

@jrlevine Thanks. I knew about the 64 octet limit, but the existing ABNF didn't implement it (and UTF-8 non-ASCII code points are not octets either). I could test if browsers limit the LHS before implementing limits in the ABNF.

I'm also aware of the encoding efficiency relationship between A-labels and U-labels in terms of the 63 octet limit. As you note, we're not going to describe the actual limit using regex. For those not familiar with how punycode works, 57 code points is the upper limit for a non-ASCII-containing label and occurs when the same non-ASCII character is repeated. If confining oursleves to planes 0 through 0x3, a U-label can reach the 63 octet limit in as few as 14 code points by choosing code points that are evenly spaced apart. This would make the label illegal in other ways [crossing script boundaries mainly], although a Han or Hangul label might get close to this number. We might be able to describe the upper limit since it is structural:

label = 1*63( atext ) / 1*57( utext )

... although it's pretty weird (the 57 isn't a guarantee of anything, while the 63 would. I note that I flubbed here, since let-dig and ldh-str in the original are more restrictive than the atext production (most of that punctuation). I need to fix that.

aphillips on 4 Aug 2020

I'd just leave the labels as 1*63( utext ), since utext is a superset of atext. This pattern can only be an approximation of what's legal so I wouldn't try too hard to be clever.
I think it matches actual addresses pretty well but there are valid addresses it'll reject like "...."@example.com and invalid ones it'll accept like [email protected] or anything with a non-existent domain name.

jrlevine on 4 Aug 2020

--On Monday, August 3, 2020 17:06 -0700 John L
notifications@github.com wrote:

I'd just leave the labels as 1*63( utext ), since utext is a
superset of atext. This pattern can only be an approximation
of what's legal so I wouldn't try too hard to be clever. I
think it matches actual addresses pretty well but there are
valid addresses it'll reject like "...."@example.com and
invalid ones it'll accept like [email protected] or anything
with a non-existent domain name.

This seems reasonable to me... and, if we change the quoting
rules, both will become invalid and hence properly rejected.

Again, the principle should be that addresses that the protocols
allow and that real people, in the real world, are likely to
actually use do not get rejected by an HTML-based mechanism.
That principle will allow a certain amount of nonsense, but the
precise rules are too hard and, will, as you more or less point
out, miss the many cases in which an address has valid syntax
but just doesn't exist no matter what is done.

best,
john

klensin on 4 Aug 2020

On Mon, Aug 03, 2020 at 05:06:58PM -0700, John L wrote:

I'd just leave the labels as 1*63( utext ), since utext is a superset
of atext. This pattern can only be an approximation of what's legal so
I wouldn't try too hard to be clever. I think it matches actual
addresses pretty well but there are valid addresses it'll reject like
"...."@example.com and invalid ones it'll accept like [email protected]
or anything with a non-existent domain name.

Client-side validation is helpful to detect errors, but not to protect
servers. So how about: if an address does not validate, allow it
anyways but display in in some way as to indicate the likely error?

nicowilliams on 4 Aug 2020

Client-side validation is helpful to detect errors, but not to protect servers. So how about: if an address does not validate, allow it anyways but display in in some way as to indicate the likely error?

it is my impression that the main point of the validation RE is to reject nonsense addresses like nobody@here. We nerds know all the corner cases and how to persuade our mail servers to handle wacky addresses but for normal people, the reasonable approach is to insist that the user provides an address that validates against the RE before proceeding.

jrlevine on 4 Aug 2020

I don't really understand some of the above remarks. We're not confined to regular expressions or JavaScript. At the same time I don't think browser email validation should be stricter than URL validation when it comes to domain names (no host length checking when parsing URLs).

@aphillips how do atext and utext mix? UTF8-non-ascii are byte sequences, not code points.

annevk on 4 Aug 2020

@annevk That's the reason I prefer the last ABNF, where I used code points and not byte sequences:

utext         = atext / %x80-D7FF / %E000-10FFFF ; unreserved printable ASCII characters or any non-ASCII Unicode code points

... and then just used utext instead of atext. As noted above, I need to provide a fix for label, since atext isn't appropriate for the right hand side, so let's do that here:

email               = 1*( utext / "." ) "@" label *( "." label )
atext               = < as defined in RFC 5322 section 3.2.3 >
utext               = atext / %x80-D7FF / %E000-10FFFF
label               = label-start [ *61[label-part] label-start ]
label-start         = ALPHA / DIGIT / %80-D7FF / %E000-10FFFF
label-part          = [ label-start / "-" ]

I could also do away with atext by relisting the code points (since importing the definition in RFC6532 gets us bytes):

utext           = ALPHA / DIGIT / "!" /                    ; unreserved printable ASCII
                      "#" / "$" / "%" / "&" / "'" / "*" /  ; as defined in RFC5322 section 3.2.3
                      "+" / "-" / "/" / "=" / "?" / "^" / 
                      "_" / "`" / "{" / "|" / "}" / "~" /
                      %80-D7FF  / %E000-10FFFF             ; or any non-ASCII Unicode

If you think we shouldn't impose host length checking, we can remove the 61 from the label production. As noted in preceding comments, the length of a label might have a shorter limit if any non-ASCII are used (as few as 14 or so if supplementary characters are used). By keeping a length of 63 we're being more-or-less compatible with any existing length checks.

I also suspect I should exclude the C1 controls by changing %80 to %A0

We about ready for text? 😉

aphillips on 4 Aug 2020

I still don't understand. You're defining utext in terms of atext yet one is code points and the other is bytes. (The other thing to look into regarding length might be to check what browsers do today for type=email. I suspect they don't check it.)

annevk on 4 Aug 2020

The text above the ABNF says that the character set is Unicode and the treatment amounts to code points rather than bytes. If I just adopt the utext production from my comment above (and get rid of atext altogether), that would make the definition clear, no?

From a quick check, FF and Chrome both length check the label (right hand side). Neither appear to check the left hand side (which is consistent with the ABNF's 1*). In fact, both length check the A-label length of non-ASCII domain names, so it's more complex already.

aphillips on 4 Aug 2020

I see, that would work. And yeah, looking at https://searchfox.org/mozilla-central/source/dom/html/input/SingleLineTextInputTypes.cpp#181-243 I guess the email validation method invokes IDN differently from the URL parser. Good times. In general I hope that if we tighten this up reuse of https://url.spec.whatwg.org/#host-parsing or https://url.spec.whatwg.org/#concept-domain-to-ascii is feasible. It doesn't seem good to add more primitives just for email validation (and potentially some normalization, as Chrome appears to be doing).

annevk on 4 Aug 2020

--On Tuesday, August 4, 2020 09:53 -0700 Anne van Kesteren
notifications@github.com wrote:

I see, that would work. And yeah, looking at
https://searchfox.org/mozilla-central/source/dom/html/input/Si
ngleLineTextInputTypes.cpp#181-243 I guess the email
validation method invokes IDN differently from the URL parser.
Good times. In general I hope that if we tighten this up reuse
of https://url.spec.whatwg.org/#host-parsing or
https://url.spec.whatwg.org/#concept-domain-to-ascii is
feasible. It doesn't seem good to add more primitives just for
email validation (and potentially some normalization, as
Chrome appears to be doing).

Yes.

My other suggestion, which may have little to do with the
immediate problem, is at least to use caution in applying
https://url.spec.whatwg.org/#concept-domain-to-ascii because the
decoding it specifies is based on [Unicode toASCII] which, in
turn, depends on RFC 3490, which was a March 2003 document that
because obsolete a decade ago. I see at least two issues:

(1) The ToASCII operation in UTR #46 and the one in the
referenced WHATWG specification use different sets of flag
settings. Even if one wants to rely on Unicode specifications
(particularly UTR#46) rather than IETF ones, this is an
invitation to "works some places and not others" confusion.

(2) While I assume browsers are operating consistent with the
WHATWG spec (and hence UTR#46), many, probably most, email
systems that allow non-ASCII addresses are conformant to the
IDNA2008 specs instead. Since deliverability --an important
component of actual validity-- of email depends on the latter,
the difference may result in false negatives and unnecessary and
inappropriate rejection of names in various edge cases. I
hope we are agreed that is a bad idea.

best,
john

klensin on 4 Aug 2020

John, UTR 46 can be used in IDNA2008-compatible mode (even if not immediately apparent) and apart from Chrome browsers use it in that way. (Edit: to be clear, the URL Standard also uses it that way.)

annevk on 5 Aug 2020

This seems like a viable approach and would describe what browsers actually do (since it seems that we're catching the spec up to actual practice, not spurring browser vendors into action). The bottom part of the input type=email section needs more work that I originally thought, since items like the perl/JS regex example would need to be removed.

For the ABNF, perhaps:

email      = localpart "@" domain
localpart  = 1*( utext / "." )
utext      = ALPHA / DIGIT / "!" /                    ; unreserved printable ASCII
                 "#" / "$" / "%" / "&" / "'" / "*" /  ; as defined in RFC5322 section 3.2.3
                 "+" / "-" / "/" / "=" / "?" / "^" / 
                 "_" / "`" / "{" / "|" / "}" / "~" /
                 %80-D7FF  / %E000-10FFFF             ; or any non-ASCII Unicode
domain     = < a "valid host string", see URL section 3.4 >

The text can reference URL 3.5 (#host-parsing, as suggested). I don't have time today, but I'll work on a pull request later in the week to see what this looks like as a draft.

aphillips on 5 Aug 2020

Html: Validating internationalized mail addresses in <input type="email">

Most helpful comment

All 61 comments

Related issues