Csswg-drafts: [css‑fonts‑4] Add `emoji` as a keyword to `unicode‑range`

Created on 8 Dec 2019 · 12 Comments · Source: w3c/csswg-drafts

https://drafts.csswg.org/css-fonts/#unicode-range-desc

Inspired by @Crissov’s comment in https://github.com/w3c/csswg-drafts/issues/2855#issuecomment-562926062:

emoji would indeed make more sense in a font descriptor than as a font family.

@font-face {
  font-family: Twemoji;
  unicode-range: emoji;
} 

emoji would be a new <urange> keyword equivalent to enumerating all the Unicode codepoints where emoji reside.

Needs Data Needs Design / Proposal css-fonts-4 i18n-tracker

All 12 comments

Alternatively, a list of ISO 15924 script codes could be allowed for the unicode-range font descriptor. The standard provides the special codes Zsye and 993 for emojis. Thatʼs more versatile and less spec maintenance work than keeping a custom list of keywords in CSS Fonts, but it is also less readable.

PS: #1744 was somewhat similar, requesting a language or lang descriptor, but scripts make more sense.

Adding script names or language names is a recurring request, and we do need to address it at some point. I agree with @Crissov that scripts make more sense than languages.

For example, this is both cumbersome and fragile against future additions:

@font-face {
  font-family: 'Headings';
  src: url(fonts/Japanese.woff);
  unicode-range: U+A5, U+4E00-9FFF, U+30??, U+FF00-FF9F;
  /* yen, kanji, hiragana, katakana */
}
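To make concrete what the hand-written list above covers, here is a minimal Python sketch that expands the three `unicode-range` token forms (single code point, interval, and `?` wildcard) into inclusive ranges. The function names `parse_urange` and `parse_unicode_range` are hypothetical helpers for illustration, not part of any CSS implementation, and the sketch does no validation beyond the `U+` prefix.

```python
from typing import List, Tuple


def parse_urange(token: str) -> Tuple[int, int]:
    """Turn one unicode-range token (e.g. 'U+30??') into an inclusive (start, end) pair."""
    body = token.strip().upper()
    if not body.startswith("U+"):
        raise ValueError(f"not a unicode-range token: {token!r}")
    body = body[2:]
    if "-" in body:  # interval form: U+4E00-9FFF
        lo, hi = body.split("-", 1)
        return int(lo, 16), int(hi, 16)
    if "?" in body:  # wildcard form: U+30?? stands for U+3000-30FF
        return int(body.replace("?", "0"), 16), int(body.replace("?", "F"), 16)
    value = int(body, 16)  # single code point: U+A5
    return value, value


def parse_unicode_range(descriptor: str) -> List[Tuple[int, int]]:
    """Split a comma-separated descriptor value and parse each token."""
    return [parse_urange(tok) for tok in descriptor.split(",")]


ranges = parse_unicode_range("U+A5, U+4E00-9FFF, U+30??, U+FF00-FF9F")
total = sum(end - start + 1 for start, end in ranges)
print(total)  # 21409
```

The author has to know that `U+30??` means U+3000 through U+30FF and has to keep all four entries current by hand, which is the fragility the comment describes.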

I agree with @Crissov that using an existing list, provided it is well maintained, well documented and readily available, is much better than getting into the business of script or language registries.

It isn't clear to me that ISO 15924:2004 defines which characters are included in each script. I wasn't keen to spend the CHF 68 to find out. Anyone know? Or is that all contained in the registry?

Unicode® Standard Annex #24 Unicode Script Property is online, and freely available, and appears to be a superset of ISO 15924.

I'm happy that the registry is online and that Unicode is the registration authority. That at least means that ISO and Unicode are striving to be in alignment here (with a few exceptions, like Fraktur and Gaelic being distinct in ISO 15924 and unified in Unicode UAX 24).

I plan to reach out to the maintainers of the registry to confirm the exact status.

Hmm. From the registry:

Hira;410;Hiragana;hiragana;Hiragana;1.1;2004-05-01

That says that Hiragana exists, but not which code points are covered.

The complete list is in the Scripts file. For example:

# ================================================

3041..3096    ; Hiragana # Lo  [86] HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE
309D..309E    ; Hiragana # Lm   [2] HIRAGANA ITERATION MARK..HIRAGANA VOICED ITERATION MARK
309F          ; Hiragana # Lo       HIRAGANA DIGRAPH YORI
1B001..1B11E  ; Hiragana # Lo [286] HIRAGANA LETTER ARCHAIC YE..HENTAIGANA LETTER N-MU-MO-2
1B150..1B152  ; Hiragana # Lo   [3] HIRAGANA LETTER SMALL WI..HIRAGANA LETTER SMALL WO
1F200         ; Hiragana # So       SQUARE HIRAGANA HOKA

# Total code points: 379

# ================================================
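As a rough illustration of how a script keyword could be expanded from this data, the Python sketch below parses lines in the Scripts.txt format and emits an equivalent `unicode-range` value. It embeds only the Hiragana excerpt quoted above; the helper names `script_ranges` and `to_unicode_range` are made up for this example, and nothing here reflects how a browser would actually implement such a keyword.

```python
import re
from typing import List, Tuple

# Excerpt copied from the Scripts.txt lines quoted in this thread.
SCRIPTS_EXCERPT = """\
3041..3096    ; Hiragana # Lo  [86] HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE
309D..309E    ; Hiragana # Lm   [2] HIRAGANA ITERATION MARK..HIRAGANA VOICED ITERATION MARK
309F          ; Hiragana # Lo       HIRAGANA DIGRAPH YORI
1B001..1B11E  ; Hiragana # Lo [286] HIRAGANA LETTER ARCHAIC YE..HENTAIGANA LETTER N-MU-MO-2
1B150..1B152  ; Hiragana # Lo   [3] HIRAGANA LETTER SMALL WI..HIRAGANA LETTER SMALL WO
1F200         ; Hiragana # So       SQUARE HIRAGANA HOKA
"""

# "START[..END] ; ScriptName", ignoring the trailing comment.
LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")


def script_ranges(text: str, script: str) -> List[Tuple[int, int]]:
    """Collect inclusive (start, end) ranges for one script, merging adjacent ones."""
    ranges = []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m or m.group(3) != script:
            continue  # skip comments, blank lines and other scripts
        start = int(m.group(1), 16)
        end = int(m.group(2) or m.group(1), 16)
        ranges.append((start, end))
    ranges.sort()
    merged: List[Tuple[int, int]] = []
    for start, end in ranges:
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged


def to_unicode_range(ranges: List[Tuple[int, int]]) -> str:
    """Format ranges as a CSS unicode-range descriptor value."""
    return ", ".join(f"U+{s:X}" if s == e else f"U+{s:X}-{e:X}" for s, e in ranges)


hiragana = script_ranges(SCRIPTS_EXCERPT, "Hiragana")
descriptor = to_unicode_range(hiragana)
count = sum(e - s + 1 for s, e in hiragana)
print(descriptor)  # U+3041-3096, U+309D-309F, U+1B001-1B11E, U+1B150-1B152, U+1F200
print(count)       # 379, matching "Total code points" in the excerpt
```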

Using the Unicode Script property as shorthand for ranges is a really, really good idea. More intuitive, less error-prone, less verbose, and it's a public list that's already baked into CSS implementations - the Unicode Script property is already referenced by css-text-3. Thumbs up to this whole issue.

One issue with the Unicode Script property is the characters that have Script=Inherited (generally diacritics) or Script=Common (mostly punctuation)... authors might be surprised at things that don't get included by a naïve Script code because they're actually shared by a couple of scripts and so ended up being assigned Script=Common instead of the "expected" script.

As a trivial example: Script=Devanagari would (perhaps unexpectedly) exclude the punctuation marks DEVANAGARI DANDA and DEVANAGARI DOUBLE DANDA, despite their apparently script-specific names, because Scripts.txt has

0964..0965    ; Common # Po   [2] DEVANAGARI DANDA..DEVANAGARI DOUBLE DANDA

So perhaps ranges should also take account of whatever appears in the Unicode ScriptExtensions list, which would handle this:

0964          ; Beng Deva Dogr Gong Gonm Gran Gujr Guru Knda Mahj Mlym Nand Orya Sind Sinh Sylo Takr Taml Telu Tirh # Po       DEVANAGARI DANDA
0965          ; Beng Deva Dogr Gong Gonm Gran Gujr Guru Knda Limb Mahj Mlym Nand Orya Sind Sinh Sylo Takr Taml Telu Tirh # Po       DEVANAGARI DOUBLE DANDA

This would be more useful than using just the simple Script property, IMO.
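The difference can be sketched in Python using only the three data lines quoted above. The `parse` helper is hypothetical, and the sketch sidesteps one real complication: Scripts.txt uses long names such as `Devanagari` while ScriptExtensions.txt uses four-letter codes such as `Deva`, so a fuller implementation would need an alias table to join the two.

```python
# Lines copied from the Scripts.txt and ScriptExtensions.txt excerpts quoted above.
SCRIPTS = "0964..0965    ; Common # Po   [2] DEVANAGARI DANDA..DEVANAGARI DOUBLE DANDA"
SCRIPT_EXTENSIONS = """\
0964          ; Beng Deva Dogr Gong Gonm Gran Gujr Guru Knda Mahj Mlym Nand Orya Sind Sinh Sylo Takr Taml Telu Tirh # Po       DEVANAGARI DANDA
0965          ; Beng Deva Dogr Gong Gonm Gran Gujr Guru Knda Limb Mahj Mlym Nand Orya Sind Sinh Sylo Takr Taml Telu Tirh # Po       DEVANAGARI DOUBLE DANDA
"""


def parse(text):
    """Map each code point to the set of values listed between ';' and '#'."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0]  # drop the trailing comment
        if ";" not in data:
            continue
        cps, values = (part.strip() for part in data.split(";", 1))
        first, _, last = cps.partition("..")
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            table[cp] = set(values.split())
    return table


sc = parse(SCRIPTS)             # Script property (long names)
scx = parse(SCRIPT_EXTENSIONS)  # Script_Extensions property (four-letter codes)

for cp in (0x0964, 0x0965):
    by_script = "Devanagari" in sc[cp]  # False: both dandas are Script=Common
    by_extensions = "Deva" in scx[cp]   # True: Deva is listed for both
    print(f"U+{cp:04X} sc={by_script} scx={by_extensions}")

print(scx[0x0965] - scx[0x0964])  # {'Limb'}
```

A keyword defined through Script alone would drop both dandas from a Devanagari range, while one defined through Script_Extensions would keep them.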

unicode-range: emoji; is probably not what you want, because modern emoji are combining strings that include code points which aren't actually emoji characters (like ZWJ).
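A small Python sketch of the concern: it takes one emoji ZWJ sequence (family: man, woman, girl) and checks it against a deliberately naive coverage set holding only the pictographic code points. The `pictographs_only` set is a hypothetical stand-in for whatever an `emoji` keyword might expand to; the thread has not defined it, and the sketch does not model how browsers match clusters against `unicode-range`.

```python
import unicodedata

# "Family: man, woman, girl" written as an emoji ZWJ sequence.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

for ch in family:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Hypothetical coverage: only the pictographic code points used in this sequence.
pictographs_only = {0x1F468, 0x1F469, 0x1F467}

# Code points in the sequence that the naive coverage set does not include.
missing = sorted({ord(ch) for ch in family} - pictographs_only)
print([f"U+{cp:04X}" for cp in missing])  # ['U+200D']
```

The joiner U+200D falls outside the naive set, so any definition of the keyword would have to decide whether to include such connecting code points.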

The CSS Working Group just discussed Add ISO 15924 script codes to unicode-range, and agreed to the following:

  • RESOLVED: we are going to create keywords for unicode ranges

The full IRC log of that discussion
<stantonm> topic: Add ISO 15924 script codes to unicode-range

<astearns> github: https://github.com/w3c/csswg-drafts/issues/4573

<stantonm> myles: unicode-range takes bunch of code-points

<dbaron> the addition of those two agenda items was https://wiki.csswg.org/planning/galicia-2020?do=diff&rev2%5B0%5D=1569210305&rev2%5B1%5D=1570141384&difftype=sidebyside

<stantonm> ... bad for a couple reasons, lots of numbers and not clear what they mean

<stantonm> ... also when adding some like emoji, you can list all unicode points - but it changes over time

<stantonm> ... proposal to add keyword that lets the browsers define the code points

<stantonm> florian: what are the keywords

<stantonm> myles: issue says use pull keywords from ISO

<stantonm> hober: we shouldn't define these things, reference something in unicode

<stantonm> myles: different languages use some common code points

<stantonm> ... keywords shouldn't be a partition, there will be overlaps

<stantonm> ... space character will be in most of them

<stantonm> fantasai: two factors, script extensions list - some of these are assigned to common script

<stantonm> ... we should be looking up script extensions

<stantonm> ... other case is super common things - numbers, space, etc

<stantonm> ... alot of things assigned to common script

<stantonm> ... probably makes sense to include common by default, but have opt out

<stantonm> myles: we should resolve that we would like keywords, but not resolve on the actual keywords

<stantonm> fantasai: we should rely on iso

<stantonm> faceless2: rely on existing registry

<stantonm> astearns: should we have everything in the registry

<stantonm> heycam: do the names in the registry match normal css conventions?

<stantonm> TabAtkins: looks like no?

<stantonm> fantasai: should be a list of keywords 4 chars long

<faceless2> https://www.unicode.org/Public/12.1.0/ucd/Scripts.txt

<astearns> Zsye 993: Emoji

<stantonm> TabAtkins: if we're confident they are 4 letters, we can take directly

<stantonm> fantasai: think that should be fine, they need to maintain compat

<faceless2> example values : "Hebrew", "Devanagari", "Common"

<stantonm> myles: we may get it wrong, can we tentatively resolve to try something out first

<stantonm> florian: go with 4 letter name or long name? or not deciding

<stantonm> faceless2: where did four letter name come from?

<stantonm> florian: long name has hyphens, 4 letter is defined somewhere else

<stantonm> TabAtkins: casing shouldn't be important

<dbaron> The 4 letter script codes are always letters and come from ISO15924: https://tools.ietf.org/html/rfc5646#section-2.2.3

<stantonm> astearns: leave it to the fonts editors to define what keywords we pull, don't need to resolve on that now

<stantonm> myles: I'll also contact unicode

<stantonm> jfkthame: should there also be exclusion values?

<stantonm> hober: if you could exclude a range, you could exclude common range

<stantonm> myles: be careful we don't turn this into a full language

<stantonm> chris: even if you do a good job, when unicode adds new values you may unintentionally exclude things

<stantonm> ... shift burden of defining onto external body

<dbaron> also see https://unicode.org/iso15924/iso15924-codes.html

<stantonm> RESOLVED: we are going to create keywords for unicode ranges

<dbaron> "Zsye" is for Emoji, I think :-/

<dbaron> I think that's a little unfortunate.

/cc @markusicu

Hi, I got cc'ed here...

As I think you found, ISO 15924 does not define which characters have which script. Use the Unicode properties sc=Script and scx=Script_Extensions for that. scx=Deva should be implemented as "set of code points whose Script_Extensions contain Deva", see UTS 18 (regex spec).

For emoji, there are several properties you could look at: http://www.unicode.org/reports/tr51/#Emoji_Properties
(Unicode 13 will hoist all of these into the UCD proper.)

Elsewhere in UTS 51 you can also find regexes for well-formed emoji sequences.

ICU has API to get the emoji character properties (per code point, or as a UnicodeSet).

FYI I work on Unicode/CLDR/ICU and am the current 15924 registrar.

The CSS Working Group just discussed Add ISO 15924 script codes to unicode-range, and agreed to the following:

  • RESOLVED: we are going to create keywords for unicode ranges

If this is about ranges, it may make sense to consider blocks instead of (or in addition to) scripts. Blocks don't have an ISO standard, they are directly defined by Unicode. There are some overlaps between script names and block names; some regexp engines use e.g. 'hiragana' for the hiragana script, and 'in_hiragana' for the hiragana block. In many cases, there is more than one block for a script. Block data is available in the Blocks file. There are e.g. 8 blocks with the term 'Latin' in their name. There are also cases where characters are not in a block that carries the name of their script. For example, the three blocks
1B000..1B0FF; Kana Supplement
1B100..1B12F; Kana Extended-A
1B130..1B16F; Small Kana Extension
may contain both katakana and hiragana (and other related characters).
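To see how these blocks line up with script assignments, the following Python sketch intersects the three block ranges above with the Hiragana ranges from the Scripts.txt excerpt quoted earlier in this thread. Both data sets are copied from the thread and so reflect the Unicode version quoted there; the `overlap` helper is hypothetical.

```python
# Block ranges copied from the three lines above.
BLOCKS = {
    "Kana Supplement": (0x1B000, 0x1B0FF),
    "Kana Extended-A": (0x1B100, 0x1B12F),
    "Small Kana Extension": (0x1B130, 0x1B16F),
}

# Script=Hiragana ranges copied from the Scripts.txt excerpt quoted earlier in the thread.
HIRAGANA = [
    (0x3041, 0x3096), (0x309D, 0x309E), (0x309F, 0x309F),
    (0x1B001, 0x1B11E), (0x1B150, 0x1B152), (0x1F200, 0x1F200),
]


def overlap(a, b):
    """Number of code points shared by two inclusive ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


summary = {}
for name, block in BLOCKS.items():
    size = block[1] - block[0] + 1
    hira = sum(overlap(block, r) for r in HIRAGANA)
    summary[name] = (hira, size)
    print(f"{name}: {hira} of {size} code points are Script=Hiragana")
# Kana Supplement: 255 of 256 code points are Script=Hiragana
# Kana Extended-A: 31 of 48 code points are Script=Hiragana
# Small Kana Extension: 3 of 64 code points are Script=Hiragana
```

A single Script=Hiragana run (U+1B001 to U+1B11E) straddles two blocks, and none of the three blocks is made up purely of Hiragana code points; the remainder belongs to other scripts or is unassigned in the quoted data.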

Unicode blocks are usually not very useful. They are an artifact of the character assignment process and history and are not designed to fit any other purpose. Multiple blocks for one script is one problem (and growing). Blocks also include unassigned code points, and sometimes unrelated characters. That's why the Script and Script_Extensions properties are generally recommended and used.

There are also cases where characters are not in a block that carries the name of their script.

FYI Outside of the ISO script code, "Kana" refers to both Hiragana and Katakana. https://en.wikipedia.org/wiki/Kana
