Current specification:
https://html.spec.whatwg.org/multipage/forms.html#limiting-user-input-length:-the-maxlength-attribute
... the code-unit length of the element's value is greater than the element's maximum allowed value length, then the element is suffering from being too long.
https://html.spec.whatwg.org/multipage/forms.html#the-textarea-element:concept-textarea-api-value
... The API value is the value used in the value IDL attribute. It is normalised so that line breaks use U+000A LINE FEED (LF) characters. Finally, there is the value, as used in form submission and other processing models in this specification. It is normalised so that line breaks use U+000D CARRIAGE RETURN U+000A LINE FEED (CRLF) character pairs, ...
So, per the spec, a single line break should be counted as 2 when applying maxlength. Also, a non-BMP character such as an emoji should be counted as 2, because it consists of two code units.
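For illustration, a minimal JavaScript sketch of the counting the spec text above implies (the helper name specCodeUnitLength is hypothetical; the real algorithm is defined on the element's value, not on a free-standing string):

```js
// Sketch: the submission value normalizes line breaks to CRLF, and the
// spec measures length in UTF-16 code units.
function specCodeUnitLength(apiValue) {
  // The API value uses LF; the submission value uses CRLF, so each
  // line break contributes 2 code units.
  const submissionValue = apiValue.replace(/\n/g, "\r\n");
  return submissionValue.length; // String length counts UTF-16 code units
}

specCodeUnitLength("a\nb"); // 4 (the LF becomes CRLF)
specCodeUnitLength("😀");   // 2 (a non-BMP character is a surrogate pair)
```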
Current implementations:
The following table shows how characters are counted in maxlength computation in major browsers.
| Character | Google Chrome | WebKit | Firefox | Edge, IE |
| --- | --- | --- | --- | --- |
| A linebreak | 2 | 2 | 1 | 1 |
| Non-BMP character | 2 | 1 | 2 | 1 |
| Letter + combining char | 2 | 1 | 2 | 2 |
Only Google Chrome conforms to the specification.
Issues
<input maxlength=2> and <input pattern=".{2}"> are not equivalent now, because the pattern attribute counts an emoji as 1.
Proposal
Introduce a new enumerated attribute maxlengthmode to specify the maxlength counting algorithm. Its value should be something like:
- codeunit-submission (Google Chrome)
- codeunit-api (Firefox, etc.)
- codepoint-submission (non-BMP is 1, linebreak is 2)
- codepoint-api (non-BMP is 1, linebreak is 1)
Hmm.
I think web authors need both: limiting the submission value and limiting the API value. It should be configurable.
I'd be interested in hearing from more web authors before we add complexity for this. To me it seems better to limit submission value only, like other constraints.
Non-BMP characters are getting popular because of Emoji.
I agree this is an issue. I wonder if we could just update the spec to measure code points, in the same spirit as changing pattern="" to use Unicode. I think it would be more intuitive for authors.
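For illustration, the pattern attribute is compiled as a JavaScript regular expression with the Unicode flag, which is why it counts an emoji as one "character" while maxlength counts code units; a rough sketch:

```js
// pattern=".{2}" is effectively wrapped as ^(?:.{2})$ and compiled with the
// Unicode flag, so "." matches a whole code point:
/^(?:.{2})$/u.test("😀a"); // true (the emoji counts as one)

// maxlength, per the current spec, counts UTF-16 code units:
"😀a".length; // 3 (the emoji counts as two)
```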
In total, I would weakly prefer saying that maxlength always has what you called codepoint-submission behavior (non-BMP is 1, linebreak is 2), and work to converge all browsers to that. I am eager to hear from others on this subject, both implementers and authors.
Maybe codepoint-api is better, actually...
It seems to me that maxlength is used to guard against values going over the database limit of a field, which is defined in bytes most of the time. For this reason, I believe it should reflect the number of bytes (so code units, right?) and not the number of displayed characters or glyphs.
See https://bugs.webkit.org/show_bug.cgi?id=120030 and https://lists.w3.org/Archives/Public/public-whatwg-archive/2013Aug/thread.html#msg184 for prior discussion and debate.
Safari uses grapheme clusters, which seems a little heavy-handed and probably not something anyone expects. Using code points by default seems reasonable. @rniwa any new thoughts on this in the meantime?
(The way Firefox and Edge count linebreaks seems like a bug that should just be fixed.)
The Chromium project has received multiple bug reports (fewer than 10, I think) that textarea[maxlength] miscounts characters. All of them expected a linebreak to be counted as 1.
@tkent-google none of those folks found it weird they got two line breaks on the server? Or were they all extracting the value client-side?
I don't know the details of their systems. They just mentioned they expected maxlength to be applied to the API value.
They might normalize CRLF to LF in server-side.
Let's forget about non-BMP characters and grapheme clusters, and focus on the pain of linebreak counting.
I guess if that's the main problem you see and neither Firefox nor Edge is planning on switching, it seems reasonable to change the standard for linebreaks.
FYI.
I searched bugzilla.mozilla.org and bugs.webkit.org for related issues.
https://bugzilla.mozilla.org/show_bug.cgi?id=702296
This says a linebreak should be counted as 2.
https://bugs.webkit.org/show_bug.cgi?id=154342
This says a linebreak should be counted as 1.
https://bugzilla.mozilla.org/show_bug.cgi?id=670837
https://bugzilla.mozilla.org/show_bug.cgi?id=1277820
They are codepoint counting issues.
It seems to me that maxlength is used to guard against values going over the database limit of a field, which is defined in bytes most of the time. For this reason, I believe it should reflect the number of bytes (so code units, right?) and not the number of displayed characters or glyphs.
Bytes ≠ code units. The term “code units” generally refers to UTF-16/UCS-2 blocks, where each BMP symbol consists of a single unit and each astral symbol of two code units.
Byte values can be anything, as they depend on the encoding used.
Bytes ≠ code units. The term “code units” generally refers to UTF-16/UCS-2 blocks, where each BMP symbol consists of a single unit and each astral symbol of two code units.
Byte values can be anything, as they depend on the encoding used.
So this should be bytes according to the page encoding, I guess, since this is what is transferred to the server (in regular form submissions).
Correction: Edge and IE count a non-BMP character as 1.
I am one of the web developers who would like to see this work consistently in all browsers. I work in the assessment industry. We use text areas for long-form responses. These can be in any language, as we support IMEs.
What most web developers do to get around this maxlength vs. field.value.length inconsistency is they don't use maxlength. Instead most use JavaScript to enforce a character length of the response. This introduces a maintenance burden into the software since you have to consider not only normal character entry (which can be handled via key events or input event or any of the browser specific events), but also composition events, and cut/paste events. Why do all that in your web app when the maxlength attribute should be able to do that for you?
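A rough sketch of the kind of script-based limiter described above (illustrative only; a production version would also need to handle IME composition, cut/paste edge cases, and selection-aware truncation):

```js
// Enforce a character limit by counting code points, independent of maxlength.
const field = document.querySelector("textarea");
const LIMIT = 140;

field.addEventListener("input", () => {
  const codePoints = [...field.value]; // string iteration yields code points
  if (codePoints.length > LIMIT) {
    field.value = codePoints.slice(0, LIMIT).join("");
  }
});
```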
There would be more bugs filed, but most web developers have, unfortunately, accepted that this will never be consistent and given up. Search Stack Overflow for this issue and you'll find plenty of discussion.
In the work I do it is not acceptable to have a different number of characters available to a user using one browser over another. We also display a counter to the user, using the field length, to tell them how many characters they have left. We end up having to adjust the maxlength in Chrome and WebKit to achieve this parity, which is a silly thing to have to do.
maxlength cannot be used to enforce that a certain number of bytes get sent to the server so that should not be a consideration. And introducing encoding as a factor is way more complicated than this needs to be.
It sounds like, based on Chromium bugs filed and other web developers' anecdotes, web developers prefer linebreak as 1 character. Partially because it makes sense; partially because it matches textarea.value.length. There does not seem to be a lot of people favoring Chromium's/WebKit's/the spec's behavior of linebreaks being 2 characters, to match what is sent to the server. (I also note that Twitter counts linebreaks as 1 character against the 140 limit.)
I am not sure what to do about non-BMP characters. I think users might prefer grapheme clusters. Some developers might prefer consistency with pattern="", which would be code points. Some developers might prefer consistency with textarea.value.length, which would be code units. :-/ Maybe it is best to focus on linebreaks first.
@tkent-google, it sounds like you are willing to change linebreaks to 1, which would make it 3/4 browsers. WebKit people, any thoughts on that? Would you prefer to keep your current linebreak behavior, which matches the current spec, or do you agree with @tkent-google that the current spec behavior is causing problems for users? /cc @cdumez as I feel he has been working on lots of compat and spec compliance cases like this...
I'll defer to @rniwa on this one since he is already familiar with this (based on https://bugs.webkit.org/show_bug.cgi?id=120030).
My preference is to make the behavior configurable, as I wrote in the first message. I'm not sure what the majority of web developers demand.
I don't particularly like the idea of being able to configure the behavior and have lots of modes. I think it is better to reason about what is the best design, and choose that. It seems to me most people expect maxlength to behave like codepoint-api, but it seems useful to me to match up with elm.value.length so JS code doesn't get confused. (edit: hmm, elm.value.length is code units...)
If we want to solve the "guard against values going over the database limit of a field" use case, we could add a separate attribute like maxbytelength, which counts the number of bytes that will be sent (in the encoding the form data will be sent).
Searching GitHub for maxbytelength, I find at least one (broken) JS implementation of such an attribute:
https://github.com/jiangyukun/2015July/blob/709a347f3fe9b80674dcabab7d26004626826529/src/main/webapp/helper/h2/jsPlumbDemo/js/jquery-extend/jquery.maxbytelength.js
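For illustration, a hypothetical maxbytelength would presumably count something like the following (a sketch assuming the form is submitted as UTF-8, the common case today; the actual submission encoding can differ):

```js
// Count the bytes a string would occupy when encoded as UTF-8.
function utf8ByteLength(value) {
  return new TextEncoder().encode(value).length; // TextEncoder always emits UTF-8
}

utf8ByteLength("abc"); // 3
utf8ByteLength("é");   // 2
utf8ByteLength("😀");  // 4
```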
I agree that most web developers want codepoint-api.
Web developers can re-implement their own user-input restriction by hooking keypress and paste events, right? If so, providing only codepoint-api behavior would be reasonable.
I started looking into fixing this. A couple interesting points:
My preferences:
I think these preferences are consistent with the idea that we are not trying to control the number of bytes on the wire (the element's value) but instead something closer to the user visible characters.
I changed my mind on textLength. We appear to already have interop there. So let's not change anything, even if it is a bit inconsistent. http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4305
Oops, I made a wrong reply.
I agree that most web developers want codepoint-api.
I don't think web developers want codepoint-api. I think it's codeunit-api, which is consistent with textarea.value.length.
I don't think that matches your earlier statement that non-BMP characters should be counted as one character due to the popularity of emojis.
I think what we really need is a new JavaScript API that lets author count the number of grapheme clusters.
I will reply to @rniwa's https://github.com/whatwg/html/pull/1517#issuecomment-231283996 here so we can keep the discussion in one place. Sorry for splitting it; I'm not sure why I thought that was a good idea.
I don't understand why we want to use code points at all if we're not trying to count the number of bytes. What's the point?
I don't think we're trying to count the number of bytes. That seems to match nobody's interests: not the web developer's, and not the user's. If you want to match bytes then you have to start worrying about encodings and your answer changes depending on your <meta charset> and so on. No fun.
It doesn't match what length property returns in JavaScript either since it counts code units. e.g. '😀'.length returns 2.
Yeah, this is the biggest problem. I thought we were going for consistency with pattern (which uses code points) and with some of the newer JS APIs (such as for-of over a string or Unicode regexes). But per https://github.com/whatwg/html/issues/1467#issuecomment-231284438 I guess @tkent-google thought consistency with textarea.value.length was more important.
Given that the user visible number of characters is only related to the number of grapheme clusters, I don't think it's acceptable to treat them differently. e.g. "é" and "é".
I agree there is a definite tension between what users might expect (grapheme clusters) and what developers might expect (either code points or code units depending on what other platform and JS APIs they are using). I'm not sure which you are arguing for.
I thought code points was a good medium (it at least allows emoji entry to count as 1 character, and matches Twitter at least). But it seems like both you and @tkent-google are not a big fan of that. I can't tell if you are arguing that minlength/maxlength should use code units, or grapheme clusters, or bytes.
I think what we really need is a new JavaScript API that lets author count the number of grapheme clusters.
I think ideally we'd want a whole new JS string library that deals with grapheme clusters, instead of the current one that mostly deals with code units but in some cases code points. Or maybe we just expect developers to normalize? I'm not sure. I think such wishings are a bit off-topic for discussing maxlength/minlength, though.
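As an aside, a sketch of what counting grapheme clusters from script can look like using Intl.Segmenter (which was not available at the time of this discussion); note that the result still depends on the Unicode data the engine ships:

```js
// Count extended grapheme clusters; the answer can vary with the engine's Unicode version.
function graphemeLength(value) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(value)].length;
}

graphemeLength("e\u0301");                                    // 1 (base letter plus combining accent)
graphemeLength("\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F469}"); // 1 in engines implementing current UAX#29 rules
```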
I'm unsure where this leaves us. My PR was clearly premature. If we went with "codeunit-api", that would require:
@rniwa, do you have thoughts on moving from grapheme clusters to code units for minlength/maxlength (matching the current spec), and on moving from 2 characters to 1 character for linebreaks (a spec change)?
I'm arguing for using grapheme clusters for maxlength and minlength. I can be convinced either way for line breaks.
Is grapheme cluster defined somewhere? And I guess its definition changes when Unicode is updated (e.g., the emojis composed of eleven code points or so are presumably a single cluster)?
Maybe we should try to get consensus on linebreaks first... if WebKit and Chrome are willing to move to 1 character there, I'll send a separate PR for that.
Is anyone else willing to move to grapheme clusters to align with WebKit? I think it will be surprising for developers (e.g. it does not match what Twitter's devs have chosen), but probably good for users. @tkent-google, @smaug----, @travisleithead?
I agree on the change to count a linebreak as 1. We don't have usage data for <textarea maxlength> yet, but the change would be safe.
I don't think switching from code units to grapheme clusters is reasonable. I'm afraid it would make the interoperability situation worse.
Why would that make it worse? WebKit and Edge already count a non-BMP character as 1, and that's the only unit that matches user expectations.
Could we try to clarify we're using the same terminology first?
"Counting non-BMP character as 1" only indicates that it's either code point or 32-bit code unit, it may or may not be grapheme cluster.
IIUC the most common use case for maxlength is to limit the string to match to a field in a database on the server. From that perspective, I think 16-bit code unit is still the most common. Has the use case changed since last I knew?
My mild preference atm is 16-bit code units, but I'm ok with any of 8/16/32-bit code units or code points. I'm less comfortable with grapheme clusters (it depends on which definition we pick, but I assume we're talking about UAX#29) since:
- Because they can be tailored, e.g. depending on the lang attribute, it's hard to predict how they're counted. UAX#29 says: "Grapheme clusters can be tailored to meet further requirements. Such tailoring is permitted, but the possible rules are outside of the scope of this document."
These characteristics are ok for CSS to use for rendering or line breaking, but I think it's not suitable for this purpose.
IIUC the most common use case for maxlength is to limit the string to match to a field in a database on the server. From that perspective, I think 16-bit code unit is still the most common. Has the use case changed since last I knew?
I think we need to be very clear about our use cases. If this is the case then linebreaks should count as 2, and we should leave the current spec as-is.
Earlier in the thread we said this was not the primary use case, given the many bug reports against Chrome for this behavior, and also because actual byte length will depend on encoding, and nobody thinks it's a good idea to make maxlength encoding-dependent.
There is a real tension here between users (who probably expect grapheme clusters) and developers (who probably expect 16-bit code units). There is also a tension between client-side developers (who probably expect linebreaks to be 1, to match .value.length) and server-side developers (who expect it to be 2, to match what is submitted with the form---and may have further expectations around non-ASCII characters depending on encoding).
I think the points you give against grapheme clusters are strong, and are good reasons to believe that even if we wanted to have this API favor users over developers, it would be hard to do so and probably not the best choice.
I originally thought that code points would be a better way to favor users without having to deal with those complexities; it would match what Twitter's developers have chosen, and at least handle emoji "correctly" from a user's perspective. But that did not seem popular.
But if we are to give up on users and favor developers instead, I think favoring client-side developers over server-side developers is preferable.
As for "worse interoperability", my concern is the complexity of grapheme cluster algorithm, and Unicode/UAX versions.
Also, at this moment the specification, the Firefox implementation, and the Google Chrome implementation agree on code-unit length. Changing them isn't reasonable. <textarea maxlength> is useless now, but <input maxlength> is popular.
I have never seen bug reports about the code-unit length calculation of maxlength, and maxlength has existed for 20+ years, since HTML 2.0. Changing it is very risky, and may break existing web applications.
Given Chrome didn't exist for a good chunk of that 20-year lifespan since HTML 2.0 was released, it's somewhat disingenuous to say that using code units has never been contested, given that non-BMP characters are counted as 1 in IE according to the table above.
FYI.
The same topic came up on the WHATWG mailing list three years ago, and it seems Hixie's decision was to keep code-unit length. https://lists.w3.org/Archives/Public/public-whatwg-archive/2013Aug/0322.html
Examples of confusion caused by WebKit's grapheme-cluster length counting:
https://bugs.chromium.org/p/chromium/issues/detail?id=28838#c7
http://stackoverflow.com/questions/36259949/input-elements-maxlength-attribute-works-weird-in-safari-browser-when-inputting
@domenic I don't think encoding matters; it's just for transfer. The server decodes it before storing it into the database. The ANSI standard, MS, and Oracle all using UTF-16 for Unicode data fields drives me to favor UTF-16 code units. I'm ok with 8/32 or code points, because JS can convert between them accurately, so web developers can work around it.
Your points on client vs. server make sense and I agree. But client developers can create JS to do whatever they want, and the tailorable nature of grapheme clusters is best suited to JS to handle. I think it'd make client developers much happier if JS added Unicode/grapheme libraries which web developers can tailor as needed, and left maxlength for non-JS/server developers.
I originally thought that code points would be a better way...at least handle emoji "correctly"
Emoji is much deeper than that, and IIUC that's why @rniwa wants grapheme clusters. See emoji ZWJ sequences for example; we keep adding new grapheme cluster sequences with every new version of Unicode. That part actually makes sense to me, but as above, I believe this is best handled by a framework, and the platform should provide tools to build such a framework.
I don't think we want to rely on JS to handle grapheme clusters precisely because we keep adding new ones. It's best if the platform/browser that added the support for those new emojis (and thus grapheme clusters) handled them. Otherwise, you end up having to keep updating your website whenever there's a new version of Unicode that introduces new grapheme clusters.
I also don't understand the point of using 8-, 16-, or 32-bit code units because they don't really mean anything for the end users. Using code points instead of grapheme clusters would be a terrible UX mistake in the name of developer ergonomics. Imagine we said that we'd use 8-bit code units and made all non-Latin-1 characters (e.g. the entirety of CJK) take lengths of two. That would be really confusing for end users, because then they could only type 64 characters into an input field with maxlength=128. Emojis and accented characters are no different from them.
@rniwa Your points do make sense to me, and have since the beginning actually, but the fact that changing it would harm other users makes it hard for me to agree.
I don't understand the point of using 8-, 16-, or 32-bit code units because they don't really mean anything for the end users.
It guarantees that s/he won't see a server error page on submit because of a database field overflow. That is a huge value to end users, I believe.
As @domenic said, two different worlds have different requirements. I wished to pick one and let the other be solved by a framework, but maybe we need to seek a solution that satisfies both?
@kojiishi : I don't think seeing a server error has anything to do with maxlength. There could be hundreds of other ways in which a server can yield an error, e.g. the existence of an unknown Unicode character.
If authors wanted to check for something like that, they ought to be writing their own verification code. And that doesn't need to rely on grapheme clusters or new emoji characters being introduced and supported by a random web browser, because it's paired with the backend. It's a much more well-defined problem that developers can solve easily.
As @domenic said, two different worlds have different requirements. I wished to pick one and let the other be solved by a framework, but maybe we need to seek a solution that satisfies both?
I just don't understand why backend database text encoding has anything to do with this feature. A backend could be using anything to store strings, e.g. MongoDB's default string encoding is UTF-8, not UTF-16, and Postgres supports UTF-8 but doesn't support UTF-16/32 as far as I know. Heck, a backend could be using mixed multibyte encodings based on locale. There is no point in trying to solve this problem in the browser engine.
"write their own verification code" can apply to both of us, so I don't think we can break the web with that argument. UTF-16 is defined in ANSI SQL, one product not supporting it doesn't convinced me either. You're right about other choices are available, but that's what people do in reality, and browser used to support it.
We already agreed to add. Do you want to keep discussing until we're happy to break the web?
What do you mean by "break the web"? There is no interoperable behavior here. If interoperability and backwards compatibility is an issue (please provide a list of websites on which either behavior is required for compatibility, not some stack overflow post), we should implement what IE does.
UTF-16 is defined in ANSI SQL; one product not supporting it doesn't convince me either.
Why is that even remotely relevant? Nobody implements ANSI SQL and nobody uses it. Also, with the increasing popularity of NoSQL databases, it's almost irrelevant what popular RDBMSes use as their primary text encoding. And NoSQL databases such as MongoDB and Cassandra use UTF-8 as their text encoding of choice. May I remind you that your own AppEngine uses UTF-8 as the text encoding of choice?
You're right that other choices are available, but that's what people do in reality, and browsers used to support it.
I don't follow what you're saying here. In particular, "that" in "that's what people do" is ambiguous. Could you clarify?
We already agreed to add.
We agreed to add what?
What do you mean by "break the web"?
The fact that only one browser recently started to change the behavior does not guarantee that it's safe for all browsers to change it.
If interoperability and backwards compatibility is an issue ... we should implement what IE does.
IIUC that's what tkent originally proposed, and what Hixie resolved before. Do you have data showing why it's safe to revert the previous resolution?
Why is that even remotely relevant?
I already mentioned I'm ok with other encodings if preferred. Whichever encoding is used, developers can compute the max length after conversion. Grapheme clusters are a different animal; developers cannot compute the max.
We agreed to add what?
Agreed to add a mode that handles grapheme clusters in the original proposal of this issue. We haven't seen the data that shows it's safe to change yet.
What do you mean by "break the web"?
The fact that only one browser recently started to change the behavior does not guarantee that it's safe for all browsers to change it.
Which browser changed what behavior? WebKit's behavior to use the grapheme clusters for maxlength existed as long as maxlength was supported in WebKit. Blink is the one that recently changed its behavior in August 2013 to use the number of code units for maxlength.
Why is that even remotely relevant?
I already mentioned I'm ok with other encodings if preferred. Whichever encoding is used, developers can compute the max length after conversion. Grapheme clusters are a different animal; developers cannot compute the max.
This is precisely why grapheme clusters are more useful. If an author needed to limit the number of characters using UTF-7, 8, 16, or 32, he/she could easily implement that with a dozen or so lines of JavaScript. Counting grapheme clusters is a lot harder, and it's better implemented by the UA.
Agreed to add a mode that handles grapheme clusters in the original proposal of this issue. We haven't seen the data that shows it's safe to change yet.
Well, we can make the same argument that nobody has shown us that changing our behavior to not use grapheme clusters is safe. Again, WebKit has always used the number of grapheme clusters for maxlength.
At this point, I don't see any value in continuing this discussion until someone comes back with data.
I'm getting tired of this discussion without anyone providing any new information. I don't think there is any use in continuing this discussion until someone comes back with data.
It sounds like WebKit is not interested in arriving at consensus with the other browsers on this issue :(. In other words, I guess Darin Adler's "Anyway, we should match the standard" from https://bugs.webkit.org/show_bug.cgi?id=120030#c4 is no longer the WebKit team's policy. It may be best to seek consensus among the majority for the purposes of creating a spec, instead.
Right now the best we can spec with regard to non-linebreaks is the existing spec (code units), which gets 2/4 browsers. If Edge is willing to change, we can get 3/4. If Firefox and Chrome are both willing to change to code points, that is another path to 3/4.
(Linebreaks, it sounds like there is still hope of consensus on at 1 instead of 2.)
I'm getting tired of this discussion without anyone providing any new information. I don't think there is any use in continuing this discussion until someone comes back with data.
It sounds like WebKit is not interested in arriving at consensus with the other browsers on this issue :(. In other words, I guess Darin Adler's "Anyway, we should match the standard" from https://bugs.webkit.org/show_bug.cgi?id=120030#c4 is no longer the WebKit team's policy. It may be best to seek consensus among the majority for the purposes of creating a spec, instead.
WTF? That's not what I'm saying at all. What I'm saying is that there is no point in continuing the discussion without anyone showing any compatibility data, given that @kojiishi's repeated argument is that we can't use grapheme clusters for Web compatibility, yet without any data showing as much.
I find your commentary here extremely offensive. I've spent so much of my time replying to the discussion here for the last couple of weeks. Why on Earth would I do that if I wasn't interested in coming to a consensus?
Right now the best we can spec with regard to non-linebreaks is the existing spec (code units)
Why? In what criteria is that "best"?
I also find all these arguments about what is "best" and "good" for developers extremely distracting and counterproductive. They're extremely subjective measures and of no merit to anyone who doesn't share the same idea of what's "good" and "bad" for the Web.
Here's the summary of the discussion thus far for UTF-16 code units versus grapheme clusters. Feel free to add more arguments / counter-arguments.
For UTF-16 code units:
| Argument | Counter |
| --- | --- |
| Web developers want code unit because it matches what input.value.length returns | |
| Matches what popular databases such as ANSI SQL, Microsoft SQL Server, and Oracle use | Postgres, MongoDB, Cassandra, AppEngine use UTF-8 |
| Chrome, Firefox, and spec match this behavior | Chrome changed its behavior in August 2013. WebKit always used (and still uses) grapheme clusters. |
For grapheme clusters:
| Argument | Counter |
| --- | --- |
| Emoji is getting popular, and accented characters such as "é" and "é" should have the same behavior for end users | |
| WebKit has always used (and still uses) grapheme clusters | Chrome changed its behavior in August 2013 and hasn't gotten complaints |
| Checking the number of grapheme clusters is harder than doing the same for UTF-8, 16, and 32 or any other text encodings | Web developers can't easily compute the length of text using grapheme clusters. |
WTF? That's not what I'm saying at all.
My sincere apologies then @rniwa. I must have misinterpreted. Perhaps you can see how people might interpret your statements that way. By all means, let's continue discussing; hopefully we can all move toward a compromise.
Why? In what criteria is that "best"?
In this sentence, I meant "best" as in "has the most browsers implementing it".
If a form with <input maxlength> is submitted to a server, the server should validate the submitted data length again anyway. I don't think servers built in the last 20 years count grapheme cluster length with the same algorithm as WebKit.
If a form with <input maxlength> is submitted to a server, the server should validate the submitted data length again anyway.
Right. This is why I don't think guarding against the length of some UTF encoded text is a valid use case for maxlength.
I don't think servers built in the last 20 years count grapheme cluster length with the same algorithm as WebKit.
I don't think so either. However, I don't think that's an argument for or against using grapheme clusters for maxlength, given the first point you just made. Since the server needs to do some sort of validation orthogonal or in addition to what the Web browser does, maxlength can do whatever it needs to do regardless of how the backend server counts the number of characters.
In relation to what @domenic notes
There is a real tension here between users (who probably expect grapheme clusters) and developers (who probably expect 16-bit code units). There is also a tension between client-side developers (who probably expect linebreaks to be 1, to match .value.length) and server-side developers (who expect it to be 2, to match what is submitted with the form---and may have further expectations around non-ASCII characters depending on encoding).
Would it work to experiment with some new DOM and IDL attributes on textarea and relevant input, something like:
- lengthinchars for grapheme clusters
- lengthincodes for code units
With each having two constraint setters prefixed by max and min, and also the unprefixed getter that returns the relevant view of the length?
As for adding API or spec to clarify the newline behaviour for textarea, it's not clear what could work better than what's implemented.
I agree with @rniwa and @tkent that servers will check it again anyway for errors, but it will probably give different experiences.
So from rniwa's comment above, the arguments that look reasonable to me are compat with other parts of the DOM vs. using maxlength for emoji and non-precomposed characters?
I changed my position to neutral, since my understanding of what cases authors use maxlength for is not backed by data, and I hope there will be a consensus among implementations.
@rianby64 - you just bumped into an old WebKit based bug - https://crbug.com/450544 - fixed in Chrome 56 (might still exist in Safari - see https://trac.webkit.org/browser/trunk/Source/WebCore/html/HTMLInputElement.cpp#L91 for the code).
I believe there is no restriction in the specification, it was an implementation detail of the browser.
Off topic, anyway.
I have created a pen to make it easy to check a browser against this issue:
https://codepen.io/thany/pen/zmRZKM
I feel maxlength should check characters, not bytes or whatever. Every non-BMP character is specified in Unicode as a singular character and should be treated as such. We have some legacy issues regarding non-BMP characters, but I don't see a problem in allowing maxlength to count emoji as a single character. Systems that don't 'support' emoji will break anyway. There are legacy issues with BMP characters as well, so systems that are badly designed are bad systems, and I don't think we should accommodate them with an HTML standard that is expected to greatly outlive those systems.
I feel maxlength should check characters, not bytes or whatever. Every non-BMP character is specified in Unicode as a singular character and should be treated as such.
What about emoji that are encoded as multi-character sequences (but rendered as single glyphs)? Something like 👩❤️👩 "two women with heart" appears to the user as a single emoji, but is encoded as a sequence of 6 Unicode characters (or 8 code units in UTF-16). Counting "characters" (as encoded in Unicode) won't really help here.
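For concreteness, a quick JavaScript sketch of the counts for that sequence:

```js
const s = "\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F469}"; // WOMAN, ZWJ, HEAVY BLACK HEART, VS-16, ZWJ, WOMAN
s.length;      // 8: UTF-16 code units
[...s].length; // 6: code points
```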
Let's say it should count codepoints.
It's going to be an absolute nightmare (in HTML, JavaScript, and on servers alike) to count characters with combining characters as one. Remember zalgo? That's too gnarly to deal with, if not impossible. And that goes for legitimate combining characters as well.
You might see a combined emoji like that, or just a flag, as a ligature. They're not quite ligatures afaik, but they're at least closely related. So would you count the "fi" ligature as a single character too? Of course not.
So I think a combined emoji should be counted as 6 (or however many codepoints), because it is 6 codepoints, as per the Unicode spec. This is not a great experience for the end user, who might assume them to be 1 character, but at least counting codepoints is more correct than counting UTF-16 code units (surrogate pairs).
When dealing with normal text that is so international that its characters sit both inside and outside the BMP, it's not easy to know which character is going to count as two. It's not just emoji.
When dealing with normal text that is so international that its characters sit both inside and outside the BMP, it's not easy to know which character is going to count as two. It's not just emoji.
It's not just whether characters are in the BMP or not, either; what about precomposed accented letters vs decomposed sequences? They may be indistinguishable to the user (assuming proper font support), yet their lengths will be counted differently.
Perhaps we need to back up and reconsider what the use-cases are, and whether minlength/maxlength attributes on form fields can really make sense in an internationalized world.
Due to the name (length) and the tight relationship the web already has with JavaScript, I think that maxlength should use the JavaScript definition of String.prototype.length, as awkward as it is. In addition, maxunitlength and maxcharacterlength and whatever else needs to be added for the other (reasonable) use cases. Use cases:
- … (maxcharacterlength).
- … (maxunitlength).
- … (maxlength).
(I also advocate for similar properties on String.prototype)
I realize the use cases I mentioned do not yet handle ligatures and similar, but those are just initial examples for fleshing this out.
@jfkthame
It's not just whether characters are in the BMP or not, either; what about precomposed accented letters vs decomposed sequences?
This has been addressed previously in this thread. And I personally believe maxlength should count codepoints. Otherwise it'll just get too hairy with all that zalgo text floating around. Moreover, maxlength is often used to limit the amount of data sent over the line, and to limit the number of characters written to some database or file. If we allow combining characters to be counted as one with their base character, we could make a virtually infinitely large text that is 1 character in length. I don't think it's a good idea to support that. Because abused it will be :)
Can I also say that ligatures and certain combining characters like combined emoji (flags, families, etc.) are heavily dependent on the font, and therefore on the browser & OS as well. That makes it even harder to detect whether they are supposed to look like (and be counted as) a single character or not. To me that's just one more reason to drop trying to account for such things, and simply count codepoints.
See https://github.com/whatwg/html/issues/1467#issuecomment-235100917 for the summary of discussions. Continuing to make the same anecdotal arguments won't help us sway the discussion one way or another, short of some cold hard data on web compatibility.
I mean, right now the evidence seems to suggest that any of code units, code points, or grapheme clusters are web-compatible. (Or, if not, then people are doing browser sniffing and assuming the status quo, which is problematic in a different way.) So I am not sure any web-compatibility data can help us here.
I strongly believe that the best approach would be to count code points.
It is true that some glyphs are made up of several code points and should be treated as an aggregation of a few characters, e.g. 🇫🇷 is often displayed as the French national flag, but is written with 2 Unicode characters, each 4 bytes long.
HTML should not be restricted to the JS implementation. In fact I think that this is wrong, because ECMAScript uses UTF-16 whereas HTML should adhere to the meta charset as specified by the web developers. Because of this, the behaviour is very inconsistent across browsers and web technologies.
Users of web applications where maxlength was specified can only enter half as much data into the HTML form when using CJK scripts, despite the fact that the web server would correctly recognize the length and would be able to store the full-length string. Emoji are becoming more prevalent with the use of mobile technologies. While some emoji are combinations of a few characters, many of them are just a single character, 4 bytes in size (2 code units).
HTML's maxlength turns out to be useless in the current state of the spec, as developers can't rely on it to count characters as they would be counted by MySQL's VARCHAR, for example. If a character limit needs to be applied to prevent the SQL query from failing, additional validation needs to be added.
I have compiled a table of sample characters and what is their perceived length in regards to maxlength attribute, JS string.length and MariaDB VARCHAR length restriction.
As you can see, Chrome and Firefox adhere to the HTML spec (i.e. maxlength is the same as JS string.length), Edge/IE counts as close as possible to what the user sees, and VARCHAR counts them as Unicode characters. PHP's mb_strlen returns the same results as MariaDB.
| Character | Chrome | Firefox | Edge/IE | JS code unit length | MariaDB utf8mb4 CHAR_LENGTH |
| --- | --- | --- | --- | --- | --- |
| ¢ | 1 | 1 | 1 | 1 | 1 |
| é | 1 | 1 | 1 | 1 | 1 |
| é | 2 | 2 | 2 | 2 | 2 |
| 없 | 1 | 1 | 1 | 1 | 1 |
| € | 1 | 1 | 1 | 1 | 1 |
| 𐐷 | 2 | 2 | 1 | 2 | 1 |
| 𐍈 | 2 | 2 | 1 | 2 | 1 |
| 𤭢 | 2 | 2 | 1 | 2 | 1 |
| 💩 | 2 | 2 | 1 | 2 | 1 |
| 🇫🇷 | 4 | 4 | 2 | 4 | 2 |
| 👩 | 3 | 3 | 2 | 3 | 2 |
| ❤️ | 3 | 3 | 2 | 3 | 3 |
| 👩 | 2 | 2 | 1 | 2 | 1 |
| 👩❤️👩 | 8 | 8 | 5 | 8 | 6 |
The maxlength attribute is a UX attribute, and it should be treated as such. Consider how the caret navigates through the text when typing in an input field. The behaviour varies with fonts/software. In Visual Studio Code the caret moves exactly as VARCHAR counts in MariaDB: I have to move the caret 6 times for the character 👩❤️👩, not 5 or 8 times! Edge treats this as 3 caret movements, Firefox 2 caret movements, and Chrome only 1. Which is right?
Deleting the emoji with backspace is also different: Edge 3 backspaces, FF 4 backspaces, Chrome only 1 backspace, VS Code/Notepad++ 6 backspaces.
The whole thing can be tried at this JSFiddle. (Ignore the fact the Edge/IE ignores maxlength when the initial value of the input exceeds it already.)
Despite the fact that the current spec doesn't break the internet, it should still be changed IMHO, as it is a very bad experience for the end user. In conclusion, I believe HTML should treat Unicode characters in the maxlength calculation the same as they are treated by PHP, MariaDB, VS Code or Notepad++. Treating them otherwise is incorrect behaviour.
@kamil-tekiela that rationale seems rather motivated by particular technologies. E.g., if you used Node.js, the answers would be different. We'd have to look at much more languages and databases if we wanted to match some kind of common way of counting the length of strings on servers.
(Also, HTML as delivered by the server is decoded early on into Unicode scalar values, so what <meta charset> declares is hardly relevant at this much later stage and really shouldn't influence the way we count things here.)
What should the spec say? Characters or bytes? I'm bloody sure it's not bytes. It's got to be characters then, so let's count characters.
I'm pretty much convinced we shouldn't arbitrarily count some characters as 2 or more, and some as 1. We should definitely not have some list of characters that we mustn't count as 1 character. That just doesn't make sense no matter how the issue is twisted.
Web development is weird enough as it is, so let's not have string length be counted as something arbitrarily in between character length and byte length.
What should the spec say? Characters or bytes? I'm bloody sure it's not bytes. It's got to be characters then, so let's count characters.
This doesn't really help, because it's unclear what "character" means.
Hi @annevk, Thanks for taking time to read my points.
You are right that we should consider other technologies too, but at the same time we can't be held back by their tardiness in keeping up. That is why I decided to do some more research on this topic. Here are some of my findings:
| Language | Ideone link | Counts |
| --- | --- | --- |
| Java | https://ideone.com/QoJWPK | Code units |
| Python 3.5 | https://ideone.com/EWlOHk | Code points |
| Python 2.7 | https://ideone.com/Hy61HF | Bytes |
| Perl | https://ideone.com/ZoX0da | Bytes |
| Ruby | https://ideone.com/d7cjsx | Code points |
| Go | https://ideone.com/kb1c8f | Code points |
| Rust | https://ideone.com/UIcCqL | Code points |
| Bash | https://ideone.com/Tq826k | Code points |
This might not be exhaustive testing, but I think it proves my point. Modern high-level languages report the length of strings in Unicode code points. Of the ones on my list, only Java counted code units in UTF-16, with complete disregard for surrogate pairs. I found a nice code sample on GitHub: Java and handling of UTF-8 codepoints in the supplementary range. This is insane, as the surrogate pairs can easily get garbled if not handled with care. And of course the same thing happens in JavaScript. You can easily break a surrogate pair because the language allows that, e.g. "𧺯".substr(0, 1).
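A small illustration of that hazard, and of a code-point-aware alternative (a sketch):

```js
const han = "𧺯";              // a CJK character outside the BMP
han.length;                    // 2: stored as a surrogate pair
han.substr(0, 1);              // a lone surrogate, not a valid character
[...han].length;               // 1: string iteration yields code points
[...han].slice(0, 1).join(""); // "𧺯": code-point-aware slicing keeps the pair intact
```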
So the question really is (@jfkthame): What is a character?
Historically, a character was synonymous with an octet, i.e. a byte. This can be observed in some languages; for example, C and C++ have the elementary type char. However, that was only true when we were using ASCII.
With the adoption of Unicode the definition of a character has changed. Languages which use Unicode should follow the Unicode Consortium's definition of a character. Now, I am no expert, but I was able to Google the Unicode Consortium's glossary:
or the more in-depth glossary: http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf#G2212
However, the best description of this ambiguous term that I could find is this: Code Points and Characters
My Conclusion:
Most developers are used to the UTF-8 character encoding because, as Wikipedia explains: "UTF-8, the dominant encoding on the World Wide Web (used in over 92% of websites)". It is quite obvious that in the UTF-8 encoding, if I wanted to retrieve a single character I would get the whole code point. Similarly, the length of the string must be counted in code points. I know things get tricky when it comes to UTF-16, but not so much. Like the other one, this one is also a variable-length encoding, as code points are encoded with one or two 16-bit code units. Displaying or counting half of a surrogate pair makes no sense from a semantic point of view; that single code unit is not a valid Unicode character. Whether UTF-8 or UTF-16 is used, as far as the user/developer is concerned, '𧺯' is understood to be a single valid Unicode character. There is no point in telling the user that this is 4 bytes or 2 code units, or letting them copy/select/delete half of it. It is a single character. Treating surrogate pairs otherwise is technically a bug, see the excerpts below, and reminds me more of UCS-2 (which was a fixed-length encoding) than UTF-16.
A combining character sequence should be counted according to how many Unicode characters it is made up of, e.g. len(é) = 2 or 👩 (U+1F469 & U+200D) = 2. Despite the fact that the end user might count the characters as grapheme clusters, tech-savvy users understand that in Unicode they are made up of multiple encoded characters.
Some excerpts from the document linked above:
@thany Could you tell me how many characters those sentences are made up of, without the help of a computer?
"வணக்கம் உலக!"
"!مرحبا بالعالم"
"नमस्ते दुनिया!"
@mathiasbynens I would like to link your 2 articles here, as I believe they are relevant.
https://mathiasbynens.be/notes/javascript-unicode
https://mathiasbynens.be/notes/javascript-encoding
As for the counting of grapheme clusters, which should be totally out of the question, the only language I could find which does it by default is Swift. And here is the reason why that would be wrong:
https://ideone.com/xCcYu3
FWIW, Perl 6 also counts grapheme clusters, but I'd consider this a mistake. Also, Perl 5 counts code points if you use utf8 (which you should if you have UTF-8 in string literals).
https://hsivonen.fi/string-length/ by @hsivonen may be relevant here (in particular the section "Shouldn’t the Nudge Go All the Way to Extended Grapheme Clusters?")
https://hsivonen.fi/string-length/ by @hsivonen may be relevant here (in particular the section "Shouldn’t the Nudge Go All the Way to Extended Grapheme Clusters?")
As that section explains, if consistency between implementations is a goal, using extended grapheme clusters pretty much guarantees _inconsistency_ between implementations of different ages for _some inputs_ even if we reset them all to do the same thing right now.
I think the section "Why Do We Want to Know Anyway?" is even more important, though. Do we know why we want to know? Deciding things without use cases in this space is bad and leads to counterproductive things like isdigit in C.
Looking upthread, I saw three use cases: (1) limiting input to roughly what fits visually, (2) guarding against server-side storage or database limits, and (3) limiting how much information the user can submit.
How well case 1 can work depends a lot on the script. In general, the topic of how much fits visually would implicate East Asian Width, but we can't spec anything based on that, because it would be a radical departure from how browsers currently count BMP CJK characters for maxlength purposes. Since the definition of extended grapheme cluster changes over time, using it will not lead to browsers agreeing precisely; there would always be a frontier where browsers disagree depending on the age of their Unicode data.
Due to the variety of what can exist on the server side, trying to estimate case 2 in the browser in a generic way seems hopeless, except for the simplest case of a single-byte encoding in the database and the server rejecting browser-generated numeric character references. For that case, counting UTF-16 code units works. (Changing maxlength to count UTF-8 bytes would likely break things, so we can't accommodate databases whose limits are in terms of UTF-8 bytes.)
For use case 3, extended grapheme clusters don't make much sense, since it would take into account graphical features of scripts in ways that aren't necessarily reflective of how much is being said information-wise. Since e.g. Latin-script languages vary widely in the string length (however counted) relative to amount of information conveyed, the limit needs to be set depending on what natural language the form expects to receive. In that case, if the language uses an astral script (e.g. Adlam), as long as you know whether UTF-16 code units or Unicode scalar values are being counted, you can calibrate the limit just as easily either way. For e.g. Cantonese, you might have an occasional astral Hanzi, but probably rarely enough that it doesn't matter in practice whether you count UTF-16 code units or Unicode scalar values. I don't know what to say if someone's writing is so heavy on emoji that the details of how emoji counts towards the total matters.
If we can't agree on what this feature is even for, it doesn't make sense to design anything complicated and it makes more sense to treat this as a misfeature that can't be removed. The simplest thing that doesn't deviate from existing practice too much seems to be DOMString length (UTF-16 code unit length) with line break counting as one.
IMO, Henri is right - there's no sense in making this more complicated, given that there are so many differing - and often ill-defined - "use cases" that have wildly incompatible requirements.
Perhaps the most reasonable actual use of maxlength I can think of would be as a first line of defence against a malicious user who might try pasting megabytes of text into something like a Username field, hoping to break or DoS something in the site's backend. Setting maxlength to a value much greater than anyone should realistically need, yet moderate enough to avoid risks of overflow or out-of-memory in the underlying code, might be useful so that further processing doesn't need to worry about indexes overflowing, etc.
Beyond that, if the application requires a specific limit to be imposed, it needs to implement that limit itself based on its knowledge of what kind of limitation - pixel width, graphemes, characters, code units, bytes, etc - it cares about in the particular context.
cc @whatwg/forms