[css-text-3] Segment Break Transformation Rules for East Asian Width property of A

Created on 21 Jul 2016 · 55 comments · Source: w3c/csswg-drafts

https://drafts.csswg.org/css-text-3/#line-break-transform

Otherwise, if the East Asian Width property [UAX11] of both the character before and after the line feed is F, W, or H (not A), and neither side is Hangul, then the segment break is removed.

Under this rule, common uses of quotation marks in Chinese

ā€œå¼•å·ā€
äø¤č¾¹äøåŗ”čÆ„ęœ‰ē©ŗę ¼ć€‚

will have unexpected spaces, because the quotation marks are _A_.

Ideally, we should consider the language information of the context. If the context is an East Asian language, _A_ should be treated as _W_. Even in an unknown language context, if one side of the line feed is _A_ and the other side is _F_, _W_, or _H_, the segment break should also be removed.


All 55 comments

My concerns here are:

  • Removing spaces where they currently aren't removed can break existing pages.
  • The proposed behavior is more complex to understand and more complex to implement for what is a fairly low-level operation.

I'm happy to make the change if i18n recommends it and implementors agree, but I am hesitant to do so for these reasons.

@fantasai The current segment break rule in the draft already changes the traditional behavior, and up to now no browser implements it.

Do you mean you just want to drop the rule totally?

And I don't think my proposal is much more complex than the current one.

Current rule:
if the East Asian Width property [UAX11] of both the character before and after the line feed is F, W, or H (not A)

My proposal:
if the East Asian Width property [UAX11] of both the character before and after the line feed is F, W, H, or A, except when both are A.

@r12a waiting on i18n feedback before we get this on the CSSWG agenda again

Just to clarify, the proposal is that if lang=zh|ja|yi then A->W otherwise A->N for the purpose of line-break transformations?

I think that should probably be okay. I would be against making A->W the general case.

@fantasai In fact there are two proposals:

  1. If the context is an East Asian language, A should be treated as W.
  2. If context is not available (or if proposal 1 is not accepted), modify the current Segment Break Transformation Rules for A:
    If one side of the line break is A and the other side is F/W/H (which means it's very likely an East Asian context), then treat A as W.

I think the motivation is reasonable, and some A characters should be treated as W in CJ contexts, especially quotation marks in Chinese. But I am concerned about the A+A case, especially given that there are lots of letters in A.

The safest thing to do is probably this: if the context is Chinese or Japanese, and one side of the line break is a punctuation character in A, and the other side is F/W/H, then the segment break is removed.

@upsuper @kojiishi @hax Checked in a fix, based on Xidorn's suggestion. A+A still keeps the space, but A+F or F+A will delete the space if the A's language context is Chinese/Japanese/Yi. This is more conservative than the original request, because we don't want to break existing pages and A+A is reasonably common on non-CJK pages. An interesting question is, should we be checking the language on the segment break instead of on the A?

@fantasai
In my original suggestion I also didn't think A+A should delete space if we don't know whether we are in East-Asian context.

An interesting question is, should we be checking the language on the segment break instead of on the A?

It's basically the same as my "If the context is an East-Asian language, A should be treated as W", but I believe checking the language on the segment break is much more precise and clear.

@fantasai I think our discussion concluded that we do that only for punctuation in A in that language context? It doesn't seem to me that other A characters should have that behavior.

OK, switched to checking the language of the segment break (rather than the A character), and restricted that rule to punctuation only.

More fun: Unicode decided to categorize emoji as Wide for some reason. >:[

Fixed to treat Emoji the same as an Ambiguous character: a6aa4d856a34b7c87f23a7215635a0b4f353f63b
Why it's not Ambiguous to begin with, I don't know.

Fixed to treat Emoji the same as an Ambiguous character

I think Emoji is too much; it's sometimes surprising and unexpected. The data is here:
http://unicode.org/Public/emoji/latest/emoji-data.txt

U+0023, U+002A, U+0030-0039 are probably not desired.

@kojiishi Can you explain what you think the spec should say about this? Definitely we can't rely on EAW for emoji, they are totally inconsistent. E.g. U+1F600 Grinning Face is EAW=Wide while U+263A Smiling Face is EAW=Neutral. Our rules need to treat them the same somehow, and definitely we can't treat emoji as Wide here.
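The inconsistency fantasai describes can be verified against the Unicode data shipped with Python (3.6 or later, whose tables include the Unicode 9.0 reclassification of emoji-presentation characters to Wide):

```python
import unicodedata

# U+1F600 GRINNING FACE was reclassified EAW=Wide in Unicode 9.0,
# while U+263A WHITE SMILING FACE stayed Neutral — so EAW-based
# segment break rules treat two visually similar smileys differently.
print(unicodedata.east_asian_width("\U0001F600"))  # W
print(unicodedata.east_asian_width("\u263A"))      # N
```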

I prefer not to mention it. It has historical reasons to be inconsistent, afaiu. Emoji is hard because a character is sometimes Emoji and sometimes not, depending on fonts. In this case, it only matters when the author inserted a segment break before or after. Also, even though there might be cases where it looks strange, it's interoperable, right?

To avoid the problem mentioned by @kojiishi (U+0023, U+002A, U+0030-0039), we could for this purpose treat W emoji and N emoji as A.

N emoji
W emoji

The only ones in all that that don't seem to me to really be "emoji" as commonly understood by people are:

  • U+00A9 COPYRIGHT SIGN
  • U+203C DOUBLE EXCLAMATION MARK
  • U+2049 EXCLAMATION QUESTION MARK

But even then, U+00AE REGISTERED SIGN and U+2122 TRADE MARK SIGN are A as well, so lumping COPYRIGHT SIGN with them doesn't bother me.

As for U+203C and U+2049, they are both sentence-ending punctuation. In Chinese and Japanese typesetting, spaces are generally not inserted around sentence-ending punctuation, so treating them as A and discarding the spaces seems OK too.

Thank you for thinking through how to avoid problems. While I understand there are ways to avoid problems and there are cases where it is helpful, I think the troubles win over the benefit.

The CSS Working Group just discussed Segment Break Transformation Rules for East Asian Width property of A.

The full IRC log of that discussion
<dael> Topic: Segment Break Transformation Rules for East Asian Width property of A

<dael> github: https://github.com/w3c/csswg-drafts/issues/337

<dael> fantasai: This issue started as being about using [noise issues]

<dael> fantasai: using language information for doing break transformation. As I edited that in I ran into problems around emoji. They're wildly inconsistent about which characters are which east asian width and that's what we use to determine if a line break is transformed

<dael> fantasai: To work around that we're taking a subset of emoji with a width of n or w and treating as ambig

<dael> fantasai: This issue is about if it's okay to make that change. If we don't characters like smile face has one behavior around line break and grinning face has a different behavior

<dael> fantasai: Sent email to the author of the east asian width spec to ask why it's inconsistent and they said they don't recommend this for emoji and to use the emoji property, so that's sorta what we're doing

<chris> It sounds like Unicode's spec should be fixed

<dael> astearns: I'm not clear on koji's comments. Seems he's against. His last comment about troubles win over benefit

<dael> myles: Is email to unicode guy public?

<dael> florian: No

<dael> myles: Would love to read

<dael> florian: I'll check with him before forwarding

<fantasai> https://unicode.org/cldr/utility/character.jsp?a=1F600&B1=Show vs https://unicode.org/cldr/utility/character.jsp?a=263A&B1=Show

<dael> florian: Not sure I understand koji either. We need to define somehow. East asian width of unicode is messy so I'm not going to be surprised if we come back. Pushing back means leave undefined or what?

<dael> fantasai: [reads unicode email]

<dael> myles: did we explore emoji related properties?

<dael> fantasai: That's what we did here.

<dael> astearns: Divining koji's intent. "Even though there are cases it looks strange it's interop, right" So is there interop behavior?

<dael> fantasai: Not sure how interop this set of rules is. Moz impl'd the previous line break transformation and not all impls had. For the case of line break transformation in emoji, I don't think we have a web compat problem if we make an exception. I think we get weird results either way.

<dael> fantasai: Purpose of transformation rules is to make it easier to format the source code of a doc rather than put all text that can't have spaces on one line. If rules are unpredictable that's not helpful. Should treat all smileys the same.

<dael> florian: I checked, there is no interop

<dael> florian: And that comment was about processing of line breaks entirely

<dael> astearns: "that comment" being which?

<dael> florian: Suppression of the space introduced by a line break is done by FF and not by Chrome.

<dael> astearns: Given all this I think remaining question is if anyone is interested in impl this change

<dael> Rossen: It sounds like koji is not interested in impl based on comments

<dael> florian: I don't get if he disagrees with impl this feature in this way or if he doesn't want to impl the entire feature.

<dael> florian: Given Chrome doesn't impl feature and I can't tell if he's against implementing...

<dael> myles: I won't impl until I do another pass through spec and try and understand

<dael> Rossen: Try to cover when koji is on?

<dael> florian: Poss. If everyone up to speed on this feature in the first place?

<dael> astearns: I'm not sure a quick summary on the call is right. myles is going to look through spec to get up to speed. Maybe we leave this open. It's in the spec and we have issue with intent to keep in but leave it to review and raise objections.

<dael> astearns: I know we just closed an issue where we left it in that state for 6mo but coming back in Jan might let people get up to speed

<dael> fantasai: Or next week

<dael> myles: I could have something to say by next week

<dael> astearns: Let's leave to next week. I'll ping koji to clarify his comments

Despite its classification as EAW=N, U+263A Smiling Face in Chrome (on macOS with default fonts) renders differently (see the attached screenshots) when lang=ja than when it is not. That looks like the behavior you'd expect of EAW=A, so I think we're going to have to do some overriding of EAW for emoji anyway to get things to make sense.

@kojiishi given the discussion in the working group above, can you clarify your opinion on the change that's in the editor's draft? Are you OK with this, or would you prefer to revert it?

Fwiw I would strongly disagree with leaving it as-is. Treating ☺ U+263A SMILING FACE and šŸ˜€ U+1F600 GRINNING FACE differently for white space collapsing is hostile to authors imho.

Are you OK with this, or would you prefer to revert it?

Sorry it seems I wasn't clear enough. By "revert" do you mean there was already a resolution? I guess I missed it then.

In short, I'm not interested in implementing it and rather negative, but if other impls are interested in or if there was already a resolution, I would not object.

Longer version; IIUC this issue contains two related but separate issues; EAW=A and Emoji.

Blink currently implements segment transformation rules including CJK cases in LayoutNG, and IIUC it makes Blink interoperable with Gecko. I think that it will not get updated further for the first phase of LayoutNG.

For EAW=A, I'm neutral on this. If other implementers ship it, I may consider it in the future. fantasai and I discussed this several years ago, and she was negative at that point, because it can solve some cases but never all cases, leaving inconsistency. Now it looks like she wants to try to solve some cases. It's still inconsistent and the rules are probably not clear to authors, but solving some non-rare cases may be a win.

For Emoji, I'm rather negative. I'm not interested in implementing it, probably not in the foreseeable future. I'm not sure how far we plan to go on this feature, but as I understand from the motivations, what we really want is consistency for color Emoji, correct? Then what we need is the logic to determine color Emoji, because both U+263A SMILING FACE and U+0030 DIGIT ZERO followed by VS16 should look like Emoji. We will also need to take into account the CSS Fonts properties we define to switch between text and color Emoji. Blink currently determines the used font in the shaping phase, and we don't want to shape during white-space collapsing.

If we were to give up at some point, it would be implementable, but it means we accept living with inconsistencies, so the benefit is limited. The user cost still exists: since the ASCII range is included, the logic runs for every line end on every page for every user, consuming loading time, battery, etc. The current rules can be optimized out for Latin-1 (U+0000-U+00FF) content.

So the question to me is: assuming we will accept inconsistencies at some point, is improving consistency a little more worth the user cost? It may depend on where we give up, but I prefer saving user cost, because I think authors can figure it out if the rules are interoperable across all browsers.

One more example of inconsistency, from [EastAsianWidth.txt]:

2295;A # Sm CIRCLED PLUS
2296..2298;N # Sm [3] CIRCLED MINUS..CIRCLED DIVISION SLASH

EAW is about legacy encoding. As Ken suggested, when we start using EAW=A in whitespace collapsing, or maybe even W too, we will need to live with some level of inconsistencies.

I mean, I'm still fine either way, whether we use EAW=A or not; each choice has cases it improves and each has additional inconsistencies. I haven't investigated which is the net plus. But I wish us not to try more.
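The CIRCLED PLUS / CIRCLED MINUS inconsistency quoted from EastAsianWidth.txt above can be checked directly against Python's bundled Unicode tables:

```python
import unicodedata

# From EastAsianWidth.txt: U+2295 CIRCLED PLUS is Ambiguous (it
# appeared in East Asian legacy encodings) while the adjacent
# U+2296 CIRCLED MINUS is Neutral — a legacy-encoding artifact,
# not a typographic distinction.
print(unicodedata.east_asian_width("\u2295"))  # A
print(unicodedata.east_asian_width("\u2296"))  # N
```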

As Ken suggested

Who is Ken, and where can we read their suggestion?

Who is Ken, and where can we read their suggestion?

Sorry, Ken is the editor of the Unicode East Asian Width spec, and I thought I read a comment from fantasai saying she contacted the editor, but I can't find it now. Probably I'm confusing it with some other issue; please disregard that part.

The spec says "browser implementations do not currently follow these rules". From my testing, Chrome, WebKit, and Edge don't seem to follow these rules at all, whereas Firefox seems to follow some of them.

The entire section smells like a heuristic, and given the lack of implementations, and the disagreement in this thread, I don't think we (collectively, the Web community) are ready to enshrine a particular behavior in the spec. Right now, the spec seems to describe a particular behavior that solves a collection of problems, but without more implementor experimentation, I don't have much confidence that this particular solution is the best solution.

I believe we can, however, agree on the intent of this section (to keep scripts which don't use spaces free from those harmful spaces), so perhaps the spec should be worded that way, rather than prescribing a particular solution just yet.

@litherum The spec had a definite heuristic using EAW, which is afaik, what Firefox implemented. This issue was opened asking for spaces to disappear in more cases, specifically when one side was ambiguous in cases where the content language was known to be CJY. The CSSWG discussed and resolved on that in https://lists.w3.org/Archives/Public/www-style/2016Oct/0068.html

I unfortunately tagged the Emoji discrepancy against this issue in https://github.com/w3c/csswg-drafts/issues/337#issuecomment-379776875 ; it's actually a separate issue: the various wingdings have different behavior because their EAW is not consistently assigned, and thus our transformation rules are weirdly inconsistent from an author's point of view. @frivoal proposed a solution to this by tailoring the EAW of Emoji as he describes in https://github.com/w3c/csswg-drafts/issues/337#issuecomment-444316214

I'm not entirely sure which part @kojiishi is objecting to. I'm OK if our space-collapsing heuristics are more limited than requested in the original issue, but having Emoji behave inconsistently is really weird and imho not acceptable. https://github.com/w3c/csswg-drafts/issues/337#issuecomment-446775351

I'm not entirely sure which part kojiishi is objecting to.

Because I didn't object!! Jokes aside, I really didn't. Let me try again.

  1. I'm neutral on EAW=A issue. Need more investigation and experiments.
  2. I'm rather negative on Emoji issue.
  3. Either way, I will not object if other impls are interested in, or WG has consensus.

Implementation-wise:

  1. Blink has implemented the spec one year or so ago in LayoutNG.
  2. May consider EAW=A after we ship LayoutNG.
  3. Emoji has some technical difficulties that we may not implement.

Please let me know if this is still not clear; I'll spend more time in English class.

Koji said:

I'm not sure how far we plan to go on this feature

And this is more-or-less what I’m saying, too.

  • Having a simple heuristic that doesn’t work well would be unfortunate
  • Adding more and more special cases on top of a heuristic that didn’t work well in the first place would just end up being a pile of hacks
  • Having a super complicated heuristic would be a headache for both authors and implementors

Instead of somehow trying to synthesize whitespace collapsing rules from the East_Asian_Width property, the Emoji property, the Unicode block to determine if it's Hangul, and who knows what else in the future, the spec should allow for more experimentation to try to come up with a more elegant solution (perhaps in the Unicode consortium). We have an opportunity now because there isn't compat risk.

The CSS Working Group just discussed Segment Break Transformation Rules for East Asian Width property of A.

The full IRC log of that discussion
<dael> Topic: Segment Break Transformation Rules for East Asian Width property of A

<dael> github: https://github.com/w3c/csswg-drafts/issues/337#issuecomment-446842105

<dael> Rossen: Brought back from week before if I recall

<dael> Rossen: Additional comments from koji. Wanted to have koji comment.

<dael> Rossen: Do we have koji or enough from his feedback that we can discuss?

<dael> myles: I think I understand koji's feedback

<dael> Rossen: So we can make progress and see if can resolve

<dael> florian: Context is suppressing segment breaks in source code. If you have word space word space the segment break is converted to a space, but in languages without spaces we're having a part of the spec deal with suppressing. Non-controversial part has been shipped: characters on both sides of the break are unambig CJK

<dael> florian: What do we do when one side is ambig, like "? Initial proposal was when the seg break is lang-tagged as CJ and one side is ambig and the other unambig we suppress.

<dael> florian: emoji, though, was inconsistent. Some wide, some narrow, some ambig. We proposed in the spec to treat all emoji as ambig so if you had unambig Asian on the other side you suppress the break

<dael> florian: koji pushed back and myles agrees with pushback

<dael> myles: I think we can all agree on goal. If you have Chinese text line break in the middle shouldn't turn into a space.

<dael> myles: When I was reading the spec, the whole section on how to determine whether to suppress a space looks at EA Width and then emoji and then elsewhere, and it seemed this wouldn't work in a lot of cases we haven't thought of. The more we try and fix this section the more complex it gets and the more we'll miss

<dael> myles: I think that's similar to koji where if you add a case for emoji you'll have to add a case to reduce the set of emoji b/c unicode says more is emoji than people think. Instead of spec'ing behavior only one browser impls, we should let browsers experiment and try and come up with a better way, perhaps involving the unicode consortium.

<dael> florian: Agree with part, but not all. Languages are complicated so if we want to cover all cases rules will be complicated. If we are not careful here and we add too many things we later want to remove that would be problematic.

<dael> florian: Being cautious about what to add, I would agree.

<dael> florian: On the other hand letting UA experiment that's not reliable for authors so they can't do anything. If both sides are clearly Asian there's no worry and we should do it. Ambig on one side and break is Asian and other side is Asian we're safe.

<dael> florian: Emoji we went through everything and found that we thought adding all of it was safe. I'd be okay with you double checking.

<dael> florian: I suspect there will be more areas of inconsistency. We will at some point say this is rare enough and we're not handling it. I think we should solve enough that East Asian text can have linebreaks.

<fantasai> proposal wrt emoji is in https://github.com/w3c/csswg-drafts/issues/337#issuecomment-444316214

<fantasai> you can see the entire list of affected characters

<dael> florian: I would say let's be careful with what we add. I think we have been with emoji. There is a slope here, but we can decide how far down we go

<dael> myles: Rather than going half way down and saying no more, we should investigate another approach

<dael> florian: Do you have a suggestion on another type of approach? I feel this will be about subsets of unicode things. How to do it may have strategies.

<dael> myles: I don't have a specific prop and that's why I think more room to experiment. I don't think we're at a point where we can say some should and some should not react this way. I think we're at an early phase.

<dael> fantasai: I don't think that's the case. Spec has used the EA prop and no one has said we shouldn't use that. The details of how we're using it, we're finding in some cases it needs to be tailored. Smiley face is neutral and grinning is wide, but authors won't expect that.

<dael> fantasai: I don't think it makes sense for us to have an env where they can't know that their space will get eaten by changing a smiley. The rule florian has is there are subsets of emoji where we don't know why they're wide.

<dael> florian: They're mostly classified due to what legacy encoding they came from

<dael> myles: I agree EA Width doesn't work well. A possible solution is don't use EA Width and I'd like to pursue that

<dael> fantasai: Alternative is the script + script extensions property. Other than that it's creating a custom list which we won't do.

<dael> florian: That's because of maintenance.

<dael> florian: EA Width spec says it should be tailored

<dael> koji: I agree with myles. EA Width is designed to be compatible with legacy encodings. Not designed for this purpose. We'll see lots of inconsistencies. Options are live with inconsistencies. If we don't want that, don't use EA Width

<dael> florian: My feeling is in terms of web compat if we add more cases to suppress it's safe. Removing is bad. If we find a more efficient approach later that characterizes more characters we can move to that. We should be careful to not suppress spaces that really should be there. Even if the way we reach the char set is more complicated than you wish, that's not a long term problem. If we find a better way in the future we can do that as long as we didn't include too many

<dael> florian: I think marking some of it at risk we can do that. But it's not going to do wrong behavior in a way that we can't walk back.

<dael> florian: So I propose mark as at risk, but leave it, and welcome experimentation

<dael> koji: If we find, per myles' point, that the logic on EA Width wasn't great, it's not backwards compat

<dael> florian: Suggesting there are currently characters classified as wide that shouldn't suppress spaces? Because if there isn't any we're safe

<dael> myles: One happy medium is to say there are some sets of triggers that will or won't cause suppression. Other than that it's up to the browser. Kinda like line breaking with some restrictions

<dael> fantasai: Where you break the line isn't a big deal. It doesn't look really wrong with slight differences. But if there is a space in one impl and not another that's a real problem for the typesetting. If there isn't interop the user can't check their text, it looks fine, they load in another browser and there's lots of space. We need interop and this isn't a good place for everyone-decides.

<dael> myles: We've got through to today

<dael> florian: No author uses line breaks in East Asian. We're trying to make it better

<dael> myles: Why not solve now because they're not using it

<dbaron> I think there are real web compatibility problems as a result of line breaking differences.

<dael> florian: Any solution will be a superset of what's spec'd today so I don't see why we can't spec today. I'm willing to put the part that you think is overkill at-risk

<dael> myles: As spec'd right now there is an algo where every string produces yes or no suppress.

<dael> florian: What I mean is if we say yes suppress to more things it won't cause web compat problems. If we say yes to fewer things it will. That's why I'm talking about supersets. If we add more things authors will be able to add more line breaks. So we can expand. Reducing is bad. So if there's a different solution later with a same size or larger set that's okay.

<dael> myles: I'd like to expand that set w/o going through WG

<dael> florian: I don't see how that works. Regardless of how we spec if browsers aren't interop it's not usable

<dael> myles: Already not

<dael> florian: Trying to make it usable

<dael> myles: So wait

<dael> florian: Wait until what? You say don't standardize and I say do.

<dael> Rossen: We're getting too argumentative and I'm not sure we're ready to resolve. Discussion is valuable and brings us closer to something where we can resolve. Doesn't feel we're there yet.

<dael> Rossen: Perhaps we can continue to work on this as part of the text inline focus group that will be proceeding F2F unless you feel strongly we can resolve

<dael> florian: I don't feel we can. Taking offline for now and next time we meet we keep talking sounds...not as good as resolving, but we can't resolve

<dael> Rossen: But this conversation was great and gives room for people to continue

<dael> fantasai: I want to say I insist on 2 things. 1: we have defined rules all UAs must follow. 2: We're using unicode prop of some kind and not having CSS spec create a custom list

<tantek> +1 to fantasai's two rules

<dael> Rossen: We can rec. to people what they can do, we can't require it

<dael> fantasai: Then you'll be non-compat with spec

<dael> myles: I'd like to hear what unicode consortium has to say

<dael> fantasai: Their feedback is EA Width is not something they're putting effort into maintaining

<dael> florian: That convo goes off topic, it's hard to share it all.

<dael> fantasai: And we're explaining what we're doing and they ask if we're using UAX-14 properties and that's not helpful

<dael> Rossen: Your point is valid. This won't bring us closer to resolution.

<dael> Rossen: Let's table this and work more to get to something better for interop and for the web.

That convo

I'd really like to hear what Ken has to say on this topic. If Ken is Ken Lunde, the editor of the East Asian Width property, his feedback on our use of the East Asian Width property would be very valuable.

I felt the hairs on the back of my neck tingle. 🄃

I recently conveyed to @fantasai and @frivoal in a private exchange that the UTC is extremely reluctant to make changes to EAW, and the latest substantive change was to add a note at the end of Section 2, _Scope_, in hopes that it would discourage change requests:

The East_Asian_Width property is not intended for use by modern terminal emulators without appropriate tailoring on a case-by-case basis. Such terminal emulators need a way to resolve the halfwidth/fullwidth dichotomy that is necessary for such environments, but the East_Asian_Width property does not provide an off-the-shelf solution for all situations. The growing repertoire of the Unicode Standard has long exceeded the bounds of East Asian legacy character encodings, and terminal emulations often need to be customized to support edge cases and for changes in typographical behavior over time.

What is being discussed here is sufficiently different than what is conveyed in the note that is quoted above, but the premise remains the same.

Prior to that, characters that have the _Emoji_Presentation_ property were changed to EAW=W, along with a note about treating _emoji presentation sequences_ as EAW=W (because the property deals in characters, not sequences). Keep in mind that characters that fall into _emoji presentation sequences_ are ambiguous as to their emoji presentation, and require (according to Unicode) an explicit Variation Selector, VS16 (U+FE0F), to indicate emoji presentation. Without an explicit Variation Selector, the EAW property value for such characters is ambiguous without drawing on one or more other properties, or through tailoring.

EAW is about resolving character width as a binary condition—sometimes necessarily via tailoring—in terms of whether to treat a particular character as half-width or full-width in the context of East Asian text processing. It really has nothing to do with the treatment of spaces, which makes me feel very uneasy about the use of this property in such a context. Ignoring the 800-pound gorilla that is represented by the CJK Unified Ideographs blocks, an extraordinarily large number of characters are completely outside the scope of East Asian text, either because they belong to completely unrelated scripts or because they fall outside of the half-width/full-width paradigm.

Two other statements in the same section of EAW should be considered (emphasis mine):

It does not provide rules or specifications of how this property might be used in font design or line layout, because, while a useful property for this purpose, it is only one of several character properties that would need to be considered.

Instead, the guidelines on use of this property should be considered recommendations based on a particular legacy practice that may be overridden by implementations as necessary.

Anyway, I have asked a couple of Unicode experts who are better versed than me about properties related to segmentation and line breaking in hopes that they can offer solutions that don't involve EAW. The holidays are upon us, so there is likely to be delays in getting substantive or helpful feedback.

Thank you Ken, this is really helpful and a great start, and thank you @litherum for bringing Ken into the discussion.

I think what we need to agree on first is how much accuracy we expect from this feature. No matter what we do, we will not get 100% accuracy. Even if we define a new Unicode property, the property cannot provide perfect results because of unifications, Emoji presentations, etc. Remember the discussion on the vertical orientation property: it took over 2 years to define, but we knew we needed a CSS property to override it because it would never be perfect.

The strategy I prefer is to pick one existing Unicode property, and live with the inconsistencies the property cannot distinguish well. It is easy to explain, authors can learn what the common inconsistencies are, it is fast to compute, it is easy to implement, and authors are unlikely to be troubled by implementation bugs.

It is very tempting to include a few characters, such as the quotation marks reported in this issue, but that is likely to start an endless discussion about whether to include yet another character, and the answer is often ambiguous due to the nature of Unicode.

If the strategy works for us, I still think EAW=F/W does a reasonable job, but if people disagree, other properties that are more appropriate for use in line layout can provide different combinations of consistencies and inconsistencies.

Note: given the last discussion, in order for the discussion not to be constrained by existing content, I'm thinking of disabling it in LayoutNG until we get better consensus.

I have asked a couple of Unicode experts who are better versed than me about properties related to segmentation and line breaking

This is not about line breaking though. This is about processing U+000A that are present in the source code of a document, to decide if a space should be inserted in its place (as would be appropriate for English, where words are space separated), or not (as would be appropriate for Chinese or Japanese, where they are not). This is not about line breaking in the rendered layout.

Finding a rule that would work for all languages in all situations is unrealistic, and is not needed because authors can just avoid using line breaks in their source code to guard against such space insertion. However, finding a subset where we can reliably determine that inserting a space would be the wrong thing to do enables authors of space-less languages to format their source code freely in more situations, and enjoy some of the benefits that users of space-separated languages already enjoy.

Due to the languages from which EAW=F/W/H characters come, I believe the proposed rules safely identify such a subset.

By safely, I mean "will not fail to insert a space where one is expected".
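A minimal sketch of the EAW=F/W/H rule being discussed, using Python's `unicodedata` (note: `is_hangul` here is a name-based approximation for illustration, not the spec's exact definition of Hangul):

```python
import unicodedata

def is_hangul(ch):
    # Approximation: the spec excludes Hangul from the removal rule,
    # since Korean is space-separated.
    return unicodedata.name(ch, "").startswith("HANGUL")

def remove_segment_break(before, after):
    """True if the segment break between `before` and `after` may be
    removed outright rather than replaced with a space (the second
    bullet point of section 4.1.2, roughly)."""
    wide = {"F", "W", "H"}
    return (unicodedata.east_asian_width(before) in wide
            and unicodedata.east_asian_width(after) in wide
            and not is_hangul(before)
            and not is_hangul(after))

print(remove_segment_break("日", "本"))  # True: both sides are EAW=W
print(remove_segment_break("a", "日"))   # False: 'a' is EAW=Na
print(remove_segment_break("가", "다"))  # False: Hangul is excluded
```

Under this rule, the quotation marks from the original report fall through to the space-inserting path because they are EAW=A, which is exactly the gap the third bullet point tries to close.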

I still think EAW=F/W does a reasonable job, but if people disagrees[...]

I agree that this rule (the second bullet point in section 4.1.2) does a reasonable job indeed. I won't die on this hill if the consensus is that doing more than that is overkill.

I do think we could do a better job, and that the additional rule about EAW=A (the third bullet point) brings significant benefits (see the situation discussed in the initial comment on this issue) at moderate cost. I'd prefer if we did it. I won't object if we don't.

Whether we do that part or not, Emoji will be inconsistent anyway, and I think it would be nice to fix as that's annoying, and I believe the proposed tailoring to do so is safe. But I'm ok if that's where you want to draw the line.

To me, these are the "appropriate tailoring" that UAX11 is calling for.

I would not be opposed to using some other unicode property (or combination of properties) than EAW (with the proposed tailoring) if we had a realistic candidate of another classification that works for reliably identifying a safe subset, but I don't think it is helpful to drop EAW based rules for either of the following reasons:

  • Even if it provides the information we want, it wasn't designed for that.
  • Even if none exist, we can imagine that there should be a theoretical classification that would cover a larger safe subset.

If the result is wrong (as in "inserts spaces where there shouldn't be"), or if there's an easier way to get an equally good (or better) result, then sure. But let's keep in mind the priority of constituencies:

consider [...] authors [...] over theoretical purity


Digression:

EAW is about resolving character width as a binary condition [...] half-width or full-width

Even if its goal is to classify things as narrow or wide, given that EAW is a property with 6 values (F/H/W/Na/A/N), describing it as binary undersells the amount of information it carries.

Here's an attempt at stating the problem independently of the solution:

  • We want to identify cases where the fact that a line break is allowed, as evidenced by the fact that one was used in the source code, does not imply that a space would be acceptable in the rendered document if the text was wrapped elsewhere (or not at all).

  • A rule that only finds a subset of cases where this is true is both acceptable and unavoidable. Acceptable because authors can refrain from using line breaks in the source code even if that's inconvenient, unavoidable because of too much linguistic and typographic diversity and too many corner cases.

  • On the other hand, false positives must be avoided, as it would likely break existing pages.

  • This is only useful to authors to the extent that it is consistent across browsers, as otherwise the result would be unwanted spaces in some browsers.

  • The results should be unsurprising, otherwise authors risk shying away from using it.

  • To the extent it relies on classifying characters, it must do so based on properties and classifications maintained by Unicode (or combination thereof), as the CSS-WG cannot maintain character by character classifications while Unicode continues to expand.

  • It may rely on things other than classifying characters, such as the content language.

  • It must be practical to implement performantly.


On the last point, I trust implementors' judgement better than mine. But that aside, I believe each of the proposed rules in section 4.1.2 satisfies all these criteria.

I also don't think I've heard any other proposal (except adopting a subset of these rules) that does.

@frivoal What you wrote just above is helpful. In an effort to confirm my understanding, please confirm that the following are accurate and coherent statements:

_The intent is to un-break lines that are marked with an explicit line-break (U+000A and friends), and which may have one or more intervening white-space characters (such as spaces or tabs) that serve as indentation for easier source viewing. The problem is determining whether to altogether remove the line-break and any intervening white-space characters, or to replace them with a space (U+0020)._

@kenlunde s/that are not marked/that are marked/ but otherwise yes. (If there is no line feed, then all spaces and tabs are collapsed to a space, they never disappear. If there is a line feed, then we strip any adjacent spaces and tabs and then contemplate whether to remove the line feed or leave behind a space.) The purpose of this is indeed to unbreak lines of text that are broken in the source code, but form a continuous paragraph when rendered.
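The two-phase behavior described above (strip spaces and tabs adjacent to a line feed, then decide whether the line feed itself disappears or leaves a space) can be sketched as follows; `break_removable` stands in for whichever character-class test is ultimately adopted, and the CJK-range lambda below is purely a hypothetical example of such a test:

```python
import re

def collapse_white_space(text, break_removable):
    """Collapse white-space runs per the behavior described above:
    runs without a line feed collapse to one space; runs containing a
    line feed first drop surrounding spaces/tabs, then the break is
    removed or replaced with a space per `break_removable`."""
    def repl(m):
        run = m.group(0)
        if "\n" not in run:
            return " "  # spaces/tabs only: always collapse to one space
        i, j = m.start(), m.end()
        before = text[i - 1] if i > 0 else ""
        after = text[j] if j < len(text) else ""
        return "" if break_removable(before, after) else " "
    return re.sub(r"[ \t\n]+", repl, text)

# Hypothetical test: remove breaks between CJK Unified Ideographs only.
cjk = lambda b, a: all("\u4e00" <= c <= "\u9fff" for c in (b, a) if c)
print(collapse_white_space("引号\n  两边", cjk))   # '引号两边'
print(collapse_white_space("hello\nworld", cjk))  # 'hello world'
```

The key point the sketch illustrates: intervening indentation never survives on its own; the only question is whether the break contributes a space or nothing.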

Right. I edited my statement above just now to remove "not."

Also, with regard to the Korean problem, or its special-casing, I think that you can safely assume that a "space" can be used to un-break such lines. While Korean text can break on spaces and between hangul syllables, I see no reason why such lines would break on anything other than spaces in this context. Of course, you may have already considered this, but I am conveying the above suggestion on the off chance that you didn't.

Currently, web authors writing in Chinese or Japanese just put all their source text on a single line because most browsers do the newline->space conversion unconditionally.

Also, with regard to the Korean problem, or its special-casing, I think that you can safely assume that a "space" can be used to un-break such lines. While Korean text can break on spaces and between hangul syllables, I see no reason why such lines would break on anything other than spaces in this context.

Agreed. This is why the current spec text specifically doesn't make the line-breaks-turned-into-spaces disappear when either side of the source-code line break is Hangul.

The CSS Working Group just discussed Segment Break Transformation Rules for East Asian Width property of A.

The full IRC log of that discussion
<dael> Topic: Segment Break Transformation Rules for East Asian Width property of A

<dael> github: https://github.com/w3c/csswg-drafts/issues/337

<dael> fantasai: Leftover from F2F so prob get rid of

<dael> astearns: This is def a better in-the-room discussion. Don't know if we want to wait 3 months

<dael> fantasai: I'd like to get text to CR before that but this is main issue blocking CR

<dael> astearns: How do we get to resolution

<dael> fantasai: I think current text is fine. I don't know if anyone has issues. I think myles not happy

<dael> myles: Yep

<dael> fantasai: Dunno where to go from there

<dael> florian: Support current text

<dael> astearns: myles do you have a suggestion to improve

<dael> myles: Not concretely but using a solution from unicode where experts agree it won't work well and doesn't have semantic meaning seems like the wrong direction. We would work with unicode to come up with the classes we need

<dael> florian: We're the only user of that thing b/c it relates to how css and html wrap lines of source code. So if we're not doing the effort they're not either. Solving this usefully for most cases is sufficient to be useful. I hear what you're saying but it seems like this is the perfect as the enemy of the good

<dael> myles: We're using this to determine where to unbreak lines. Knowing if char or parts of scripts use spaces is useful for any text editor. And I think that, you're right that perfect enemy of good is not the best way to create the web but I'm worried we'll come up with a busted solution and be stuck forever

<dael> florian: Not too worried because we're not trying to solve the entire problem but instead a useful chunk. I think the defined part is safe. If we look for areas of unicode we can find where we should have used more, but we're using a safe subset. If unicode does better in the future it will include what we defined

<dael> myles: If we encourage authors in cjk to add new lines in source code we need interop

<dael> florian: Oh yeah, we do. The part we have is safe to get interop on. What I mean is that given the subset we're defining I don't foresee us having to remove from it. Having to add to it maybe, but not remove.

<dael> myles: If we're considering this solved and say to authors add line breaks where you want it will be wrong in many places and those in the future will change when we have better solution

<dael> florian: That's the thing I think will not happen. If we later expand we will create new places where they can break, but the places we're offering here will stay. If you find such a thing please call it out. I think we will find other places in unicode where we can add breaks, but I don't think we'll remove from the current list. So there's no risk in that sense

<dael> astearns: Risk is limited to if the current attempt at fixing a subset is a true subset of the eventual good solution or not

<dael> astearns: I don't have a way of evaluating if what we have now is a true subset

<dael> astearns: One question - does anyone know if current prop rules are enough for a native Japanese author who has no idea of anything to do with unicode and they're just writing Japanese is there a set of rules for this author that would make sense? Can we tell them you can put line breaks where you want or is it complicated?

<dael> florian: For Japanese it's mostly straightforward when writing text. If you're doing dingbats in the middle it gets weird. In text it's fine. Dingbats isn't Japanese specific, they just get weird.

<dael> astearns: I have not read whole thread, it is long. Have we had anyone from unicode give a thumbs up that this is a safe subset?

<dael> florian: Had conversation with unicode but mostly off topic because they didn't understand what we were trying to do

<dael> myles: In that convo the expert didn't comment on the specific modification in the spec right now, but on the prop these modifications are based off of... I don't want to put words in his mouth, but he said it was fairly broken

<dael> fantasai: And they broke it even more recently

<dael> myles: Seems like the wrong thing to base on

<dael> fantasai: We're working around that by ignoring a subset of characters they decided to change

<dael> fantasai: If we're gonna do anything like this this is the set of properties. Alternative is ask unicode for a new prop which seems unlikely

<dael> myles: Not nec just this purpose. Knowing if chars are part of a script that uses space as line breaks is useful. Spec says [reads]

<dael> myles: It just seems like we're patching a broken system and we should try and solve it properly

<dael> florian: Not just special cases. A large chunk of the text is the necessary conditionals to make sure we're in the right language. This new fabled unicode thing wouldn't know what lang you're in. That part would stay in CSS. If we're keying off [missed] it could be better from unicode, but keying from language cannot

<dael> astearns: Agree with myles that I'm a bit worried about referring to a thing in unicode that they want deprecated and don't care enough to prevent it from having changes. If this is something unicode is not interested in keeping stable we shouldn't reference it

<dael> florian: Therefore we don't solve?

<dael> astearns: We work with unicode to come up with something they want to maintain and contribute expertise so we can come up with better handling that's maintained and can be relied on

<dael> florian: It's not marked as deprecated. It's in the spec, it's theoretically maintained, and they're doing a bad job of it. We can start paying attention and raise issues. Their opinion is that in its current shape it's not useful, but it's still in the spec

<dael> astearns: This will need to kick back to the issue for discussion. I'll see if we can get Ken to engage on the particular problem we're trying to resolve

<dael> fantasai: Maybe reach out to other unicode people too

<dael> astearns: Suggestions on who?

<dael> fantasai: myles had Apple's contact. We've got other people from unicode. I'll try and reach out

<dael> astearns: I think that's where we'll leave this

The CSS Working Group just discussed segment-break rules.

The full IRC log of that discussion
<TabAtkins> Topic: segment-break rules

<astearns> github: https://github.com/w3c/csswg-drafts/issues/337

<TabAtkins> fantasai: Close it somehow...?

<TabAtkins> myles: I think this is worth some discussion.

<TabAtkins> astearns: Did you find anyone at Apple to talk to?

<fantasai> Section under discussion https://drafts.csswg.org/css-text-3/#line-break-transform

<TabAtkins> myles: I started a discussion; same story happened, we tried to describe it to Ken and then he had no opinions, same thing happened with me.

<TabAtkins> myles: So in light of that I'm willing to somewhat amend my previous position

<TabAtkins> myles: The spec lists a collection of segment-break rules, writing-system rules, general category rules, and the word "hangul"...

<TabAtkins> myles: I'd like the criteria of this to be listed somewhere that isn't CSS.

<TabAtkins> myles: I'd ultimately like this to go into Unicode somehow.

<TabAtkins> myles: Ultimately I don't think browsers should be in the business of making these sorts of character decisions.

<TabAtkins> myles: If we can do that, I'm willing to accept it.

<TabAtkins> florian: I don't have a problem with that in theory.

<TabAtkins> florian: To the extent I've tried to discuss this with unicode, I didn't sense any interest on their side that this is a problem worth solving.

<TabAtkins> florian: Or maybe not even a willingness to understand the problem.

<TabAtkins> florian: If we were doing codepoint-by-codepoint I'd be concerned, but this is category based.

<TabAtkins> myles: I'm an implementor here; we have different ideas about "complicated".

<TabAtkins> myles: Also their lack of interest is a signal. We're not the only language that uses text.

<TabAtkins> fantasai: We're one of the only that takes broken lines and unbreaks them.

<TabAtkins> myles: Unicode has taken on work to describe all the linebreaking in CSS. So if they don't care about this, that's a signal!

<TabAtkins> fantasai: They have no spec for line unbreaking.

<TabAtkins> koji: We've tried to combine multiple properties, the WG rejected the idea, unicode started the spec for CSS. So I think I agree with Myles; we either convince Unicode, or stick with what we had before and not combine multiple properties.

<TabAtkins> astearns: What's the current state of the spec?

<TabAtkins> fantasai: There's a bunch of rules in the spec based around the East Asian Width property and General Category.

<TabAtkins> fantasai: Started with EAW, made an exception for Hangul because it's wide, and I think that's what's implemented in Gecko right now.

<TabAtkins> fantasai: This issue was opened on "I want you to handle ambiguous characters better"

<TabAtkins> fantasai: So in response we added "if the ambiguous character is in a context we know is wide, like Chinese, treat it as wide; otherwise as narrow".

<TabAtkins> fantasai: Then Unicode redefined some characters that were previously narrow/ambiguous into wide, because of emoji.

<TabAtkins> fantasai: Then we reopened the issue to treat emojis as ambiguous.

<TabAtkins> florian: When we complained to Unicode about that change, they said this property is for terminal rendering, nobody should use it.

<TabAtkins> koji: I agree the emoji issue is bad.

<TabAtkins> koji: So my preference is from before, take behavior based on encoded block. That might be slightly less accurate than your current proposal, but as long as it's consistent across browsers authors will be happy enough.

<TabAtkins> fantasai: So instead of using EAW/General properties (or others), we should evaluate unicode blocks and declare how to treat each?

<TabAtkins> [discussion of how unicode blocks work]

<TabAtkins> fantasai: That would probably work.

<TabAtkins> astearns: That sounds great.

<TabAtkins> myles: So somebody in this group, not me, should come up with a list of blocks. If it's very large, we can revisit, but if it's small, then ok.

<TabAtkins> myles: My criteria here is maintainability.

<TabAtkins> fantasai: I can take that action.

<TabAtkins> myles: Ok, we can discuss it then.

<TabAtkins> jfkthame: For maintainability, you'd have to recheck each version, approximately yearly.

<TabAtkins> fantasai: Sure. General criteria is like "if it's more than 80% han characters, it's on the list", easy.

<TabAtkins> koji: As the editor of UAX 50, I'm doing that every year. We will assume VerticalOrientation to U for CJK characters, you can check that.

<dbaron> $ grep ';' Blocks.txt | grep -v "^#" | wc -l

<dbaron> 291

<TabAtkins> fantasai: The set of chars we want here are pretty much exactly Chinese and Japanese characters.

<TabAtkins> jfkthame: Base it on Script, then?

<TabAtkins> fantasai: We do that today, but we have to remove punctuation, etc, thus the current complexity.

<fantasai> s/remove/add/

<TabAtkins> astearns: So proposal is fantasai looks at the blocks, and comes back later.

<TabAtkins> myles: Parting word, text started elegant, got full of exceptions over time. If that happens again, we should just cut it off.

For what it's worth we've just implemented this as currently specified, and it makes a real mess of some tests, e.g. CSS2/generated-content/content-counter-004-ref.xht - spaces between U+25FE (black square) are removed, due to the EAW property being "W".

I think basing the decision on Unicode Block rather than EAW property is certainly the way to go. In an effort to roll this forward I've had a hunt through the Unicode blocks and come up with a list that could be used as a starting point:

CJK Radicals Supplement
Kangxi Radicals
Ideographic Description Characters
CJK Symbols and Punctuation
Hiragana
Katakana
Bopomofo
Kanbun
Bopomofo Extended
CJK Strokes
Katakana Phonetic Extensions
Enclosed CJK Letters and Months
CJK Compatibility
CJK Unified Ideographs Extension A
Yijing Hexagram Symbols
CJK Unified Ideographs
Yi Syllables
Yi Radicals
CJK Compatibility Ideographs
Vertical Forms
CJK Compatibility Forms
Small Form Variants
Halfwidth and Fullwidth Forms
Kana Supplement
Kana Extended-A
Small Kana Extension
Tai Xuan Jing Symbols
Counting Rod Numerals
Enclosed Ideographic Supplement
CJK Unified Ideographs Extension B
CJK Unified Ideographs Extension C
CJK Unified Ideographs Extension D
CJK Unified Ideographs Extension E
CJK Unified Ideographs Extension F
CJK Compatibility Ideographs Supplement

However, this process leads me to think that we're still going to have to distinguish based on Script as well - some of these blocks will be used with Hangul:

CJK Symbols and Punctuation
Enclosed CJK Letters and Months
Small Form Variants
Halfwidth and Fullwidth Forms
Vertical Forms

and some, eg:

Small Form Variants
Halfwidth and Fullwidth Forms
Yijing Hexagram Symbols
Tai Xuan Jing Symbols
Counting Rod Numerals

are likely to be used in any script.

Not doing any segment break transformation _unless_ the script (edit: writing-system) is Chinese, Japanese or Yi is going to limit the impact of any side effects.
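A block-based check gated on the writing system, as suggested, could be sketched as follows. The ranges below are a small hypothetical subset of the block list above (real data would be generated from Unicode's Blocks.txt, not hand-written), and the two-letter language tags are a stand-in for proper content-language matching:

```python
# A few (start, end) code point ranges from Blocks.txt, for illustration only.
CJK_BLOCKS = [
    (0x3000, 0x303F),   # CJK Symbols and Punctuation
    (0x3040, 0x309F),   # Hiragana
    (0x30A0, 0x30FF),   # Katakana
    (0x4E00, 0x9FFF),   # CJK Unified Ideographs
    (0xF900, 0xFAFF),   # CJK Compatibility Ideographs
]

def in_cjk_block(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_BLOCKS)

def break_removable(before, after, content_language):
    # Limit side effects, per the comment above: only discard breaks
    # when the writing system is known to be space-less (Chinese,
    # Japanese; Yi is omitted in this sketch).
    if content_language not in ("zh", "ja"):
        return False
    return in_cjk_block(before) and in_cjk_block(after)

print(break_removable("本", "。", "ja"))  # True
print(break_removable("本", "。", "en"))  # False
```

Gating on the language first also keeps the hot path cheap for non-CJK pages, which addresses the performance criterion raised later in the thread.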

We need to add CJK Unified Ideographs Extension G to that list, which is in the upcoming Unicode 13.0.

I'm also told that 99.9% of Korean text that you will find on the web uses Western (aka ASCII) punctuation, so perhaps not as much as an issue as first thought, at least in practical terms.

The "ghost of christmas past", who whispered in my ear about CJK Unified Ideographs Extension G, also suggests that anything Lisu and Khitan Small Script, the latter of which is new in Unicode Version 13.0, should be on the list. šŸ‘»

Thank you Mike for the investigation, this is really helpful. This code snippet in WebKit may help in developing the list too.

Maybe we should agree on the expected accuracy first. My basic idea is:

  1. I do not expect perfect accuracy in all cases. It is a heuristic algorithm, which can't be perfect anyway. This is ok; as long as browsers are interoperable, it should not be too hard for authors to find the failures of the heuristic algorithm.
  2. I would like this algorithm to be simple and fast. This logic is a bit "hot": it runs on every source line for all pages, including English pages, and consumes battery and CPU cache space for all users.
  3. In my opinion, we should avoid excessive removal for non-CJK scripts. This means that when ambiguous, choose not to remove. Not ideal for CJK authors, and it may not look consistent in some cases, but it is still a great improvement, and authors can learn as long as implementations are interoperable.
  4. The basic philosophy we applied for text-orientation is to use one Unicode property, or ask Unicode to add a new property. If you find other properties than the Unicode Block that can produce better results, I don't insist on the Unicode Block at all, but I prefer to stick to using one property.

Do these look reasonable? Any opinions, additions, or change suggestions?

The VerticalOrientation property could also be a good data to develop the list.

A wild idea came up, maybe the VerticalOrientation property is better than the Unicode Block for this purpose, rather than just using it as a reference to develop a list of blocks?

I presume the reason we're discussing heuristics at all, and not simply adding another value to text-space-collapse in css-text-4 which says "collapse segment breaks to nothing" is because (like https://github.com/w3c/csswg-drafts/issues/4576) this behaviour is supposed to be context-dependent - i.e. it depends on the characters on either side?

And the intention is, roughly, if the segment break is between two CJK ideographs, or between a CJK ideograph and punctuation, collapse it to nothing?

I ask because - if this behaviour must remain context-dependent - I'm wondering if it would be easier to add a value to text-space-collapse to turn this behaviour on, and instead list the contexts where it _wouldn't_ apply. So make the rule "if this flag is on, collapse all segment breaks _except_ those either side of characters of class (AL|HL|SA)".

Alternatively: it looks like Gecko is currently collapsing segment breaks between two ideographs, but not between ideographs and punctuation, and Blink/Webkit are doing neither. Perhaps segment breaks could always collapse between two ideographs (an easy and unambiguous test for UAX#14 class ID), but only collapse according to the more complex heuristics if the appropriate property was set? In other words, "always collapse segment breaks between ID characters. Only collapse other (some? all?) segment breaks if text-space-collapse: collapse-break is set"

(both of these are attempts to reduce both the processing cost of evaluating the heuristic, and the cost of getting the heuristics wrong).

I'm ok to add a new value, but what are the benefits of the new value? Is it to prevent regressing non-CJK content? How is it different from choosing conservative heuristics?

A wild idea came up, maybe the VerticalOrientation property is better...

I take this back. I remember VerticalOrientation is still too aggressive.

Is it to prevent regressing non-CJK content?

Yes, exactly.

How is it different from choosing conservative heuristics?

To my non-expert eyes, this particular heuristic appears to be quite hard to get right. That's purely based on the discussion in La CoruƱa, and re-reading all the comments on this issue (from the last four years!). So I figured it's worth exploring if there's a way to remove the heuristic, or at least drastically reduce its scope.

For what it's worth we've just implemented this as currently specified, and it makes a real mess of some tests, e.g. CSS2/generated-content/content-counter-004-ref.xht - spaces between U+25FE (black square) are removed, due to the EAW property being "W".

This is because Unicode changed the EAW of a lot of characters in an effectively random and backwards-incompatible way when it introduced Emoji. The results based on e.g. Unicode 6, when these rules were written, would have been quite sensible. :/ Trying to compensate for this change is one of the reasons the rules became too complicated...

I've committed an initial draft of the Unicode block-based approach. I think the interesting questions remaining are:

  • Bopomofo
  • Yijing Hexagram Symbols / Tai Xuan Jing Symbols / Counting Rod Numerals
  • Enclosed ideographics

I'm leaning towards yes on enclosed ideographics, no on the symbols, and I don't know enough about Bopomofo when it is used as a stand-alone script to say.

Lisu and Khitan both use spaces, so segment breaks adjacent to them should not be discarded during collapsing. Small forms etc. are primarily used with Chinese and Japanese, not Korean, so I think it's reasonable to include them here. (Keep in mind also that both sides of the break need to belong to the set in order to discard, and Hangul is excluded.)

Lisu and Khitan both use spaces

Are we talking about one or both of these Khitan scripts (presumably the former, as I don't think the latter is in Unicode):

https://en.wikipedia.org/wiki/Khitan_small_script
https://en.wikipedia.org/wiki/Khitan_large_script

If yes, do they really use spaces? Where can I learn more about that?

If not, what are we talking about?

The WG resolution to switch to Unicode Blocks has been edited in. I opened up https://github.com/w3c/csswg-drafts/issues/4993 and https://github.com/w3c/csswg-drafts/issues/4993 as follow-up issues. Closing out discussion here, since we've veered pretty far off the original topic.
