Mapbox-gl-js: Case and diacritic folding for expression operators

Created on 2 Feb 2017 · 17 Comments · Source: mapbox/mapbox-gl-js

_From @1ec5 on October 21, 2016 10:26_

As discussed in mapbox/mapbox-gl-native#6781, it can often be necessary for a filter to perform a case-insensitive or diacritic-insensitive comparison. Variations on the existing operators should be added to support these options.

Here’s an example design (but by no means the best):

[
  {
    "operator": "in",
    "case-sensitive": false,
    "diacritic-sensitive": false
  },
  "key",
  "value 1",
  "value 2"
]

Implementing case-insensitive comparisons should be trivial on all the platforms supported by Mapbox GL. On the other hand, while there are fine options for diacritic folding on the native platforms, JavaScript would have to rely on a library for diacritic-insensitive comparisons.

/cc @incanus @lucaswoj @jfirebaugh

_Copied from original issue: mapbox/mapbox-gl-style-spec#548_

cross-platform feature

lucaswoj

All 17 comments

TIL what diacritic means

mollymerp on 8 Feb 2017

👍3

_Everyone's a diacritic._

incanus on 8 Feb 2017

😄1

Over in https://github.com/mapbox/mapbox-gl-js/pull/4715#discussion_r117343669, we’re discussing a potential syntax for equality comparisons in expressions. Although this issue talks about filters, getting case and diacritic folding into expressions would address some of the main use cases for this functionality (for bilingual labeling in particular, but not for highlighting “fuzzy-matched” search results).

https://github.com/mapbox/mapbox-gl-js/pull/4715#discussion_r117343669 asks whether we should incorporate case and diacritic folding into the proposal as some sort of modifier on equality comparisons, or whether we should rely on the style author to run each operand through functions like lowercase or strip-diacritics beforehand.

Is a case- and diacritic-folded equality comparison equivalent to an equality comparison between two normalized strings, or are there aspects to folding that require both original strings as context? To put it another way, are there strings that should be considered equal after case- and diacritic-folding but would differ after each string is run through lowercase and strip-diacritics?

/cc @anandthakker @kkaefer @apendleton @jcsg

1ec5 on 18 May 2017

> are there strings that should be considered equal after case- and diacritic-folding but would differ after each string is run through lowercase and strip-diacritics?

In the absence of a fix for mapbox/mapbox-gl-js#3999, there are some problem areas caused by a lack of locale information. Should strip-diacritics add a tittle to ı (found in Turkish) in order for strip-diacritics("Kırşehir") == "Kirsehir"? Should lowercase("GROSSER STERN") == "großer stern"? The equality operator itself could be tolerant of differences in case or diacritics in these cases.
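A minimal JavaScript sketch of the two problem cases above, assuming the style operators were implemented as plain lowercasing plus Unicode decomposition. `naiveFold` is a hypothetical helper used only for illustration; it is not part of any Mapbox API.

```javascript
// Naive, locale-unaware folding: lowercase, decompose, drop combining marks.
// `naiveFold` is a hypothetical helper, not an existing Mapbox function.
function naiveFold(s) {
  return s.toLowerCase().normalize('NFD').replace(/\p{M}/gu, '');
}

// Dotless ı (U+0131) has no decomposition, so it survives the stripping step.
const kirsehir = naiveFold('Kırşehir');           // 'kırsehir'
const matchesAscii = kirsehir === 'kirsehir';     // false

// Lowercasing never turns "SS" back into "ß".
const stern = 'GROSSER STERN'.toLowerCase();      // 'grosser stern'
const matchesEszett = stern === 'großer stern';   // false

// Locale-sensitive lowercasing gives a different answer for Turkish,
// provided the runtime ships locale data for 'tr'.
const rootLower = 'I'.toLowerCase();              // 'i'
const trLower = 'I'.toLocaleLowerCase('tr');
```

Both comparisons fail under naive normalization, which is the argument for letting the equality operator itself tolerate these differences.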

1ec5 on 18 May 2017

There is a human- and machine-readable file published by Unicode here that details all the special casing operations that we could catch. It explicitly uses two-letter language codes for some of these rules, so locale-specific information that uses those codes could transparently connect to the Unicode standard.

The rules listed there cover the cases that you discussed in the previous comment, and list special cases for ligatures (Latin and Armenian), for where there is no uppercase precomposed character (all Greek and Latin), conditional mappings that depend on position within a word (all Greek), and a category of "Language-Sensitive Mappings" which contain rules for characters in Lithuanian, Turkish, and Azeri (Azerbaijani).

The comprehensive list of all other "normal" casing conventions is in:
http://unicode.org/Public/UNIDATA/UnicodeData.txt
with a guide to understanding that file format here:
ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html

The standard UnicodeData.txt file also gives information on the character decomposition mapping. I need to do more research on these decomposition mappings, but a good starting point may be to try to associate any Latin-ish precomposed character that we'd want to de-diacritize with only the strict ASCII letters that we find in the character's decomposition mappings. (As far as I could tell, it wasn't clear that unidecode actually used Unicode standards for reference, although it also does de-diacritization.)

If we implement our own lowercasing and de-diacritization functions, I think we should do so according to the current Unicode standards themselves unless something prevents that.

cc: @boblannon

jcsg on 19 May 2017

on geocoding, we've done away with unidecode altogether. we currently have a dedicated script, using manually curated rules written mostly by @apendleton: https://github.com/mapbox/carmen/blob/master/lib/util/remove-diacritics.js

also worth knowing, though: Javascript has character decomposition built in: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

boblannon on 19 May 2017

....so you could have

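// Input assumed here (the snippet does not define `str`):
// U+1E9B: LATIN SMALL LETTER LONG S WITH DOT ABOVE
// U+0323: COMBINING DOT BELOW
const str = '\u1E9B\u0323';

// Compatibly-decomposed (NFKD)
// U+0073: LATIN SMALL LETTER S
// U+0323: COMBINING DOT BELOW
// U+0307: COMBINING DOT ABOVE
str.normalize('NFKD'); // '\u0073\u0323\u0307'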

and then a separate step that stripped out the combining dot characters.
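The separate stripping step could look roughly like this. It is a sketch, not the carmen implementation: `stripCombiningMarks` is a hypothetical name, and it assumes a JavaScript engine with Unicode property escapes (`\p{M}`) in regular expressions.

```javascript
// Hypothetical helper: decompose, then drop every combining mark (\p{M}).
function stripCombiningMarks(s, form = 'NFD') {
  return s.normalize(form).replace(/\p{M}/gu, '');
}

const longS = stripCombiningMarks('\u1E9B\u0323', 'NFKD'); // 's'
const pho = stripCombiningMarks('Phở 54');                  // 'Pho 54'

// Letters without a decomposition are left alone, so this is not a
// complete transliteration: Þ stays Þ.
const thorn = stripCombiningMarks('Þingvellir');            // 'Þingvellir'
```

Because characters such as Þ, Đ and ı have no decomposition mapping, this approach only covers diacritics that Unicode models as combining marks.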

boblannon on 19 May 2017

Noting here that we're using https://bitbucket.org/alekseyt/nunicode in Mapbox GL for platforms that don't have Unicode APIs. It also supports diacritic folding ("Unaccenting"): https://bitbucket.org/alekseyt/nunicode#markdown-header-unaccenting as well as case folding.

kkaefer on 19 May 2017

The discussion above about the Unicode standard makes it clear that proper diacritic folding requires knowing which language we’re dealing with. Assuming we don’t, would we get more reliable results by making the equality operator optionally aware of diacritic folding than by composing equality with diacritic-stripping? Perhaps that would allow us multiple levels of canonicalization. For example:

["==", "Þingvellir", "Thingvellir", "diacritic-insensitive"]
["==", "Phơ 54", "Phở 54", "diacritic-insensitive"]

> Noting here that we're using https://bitbucket.org/alekseyt/nunicode in Mapbox GL for platforms that don't have Unicode APIs.

Nunicode is similar in concept to unidecode: it attempts to strip diacritics without accounting for language-specific rules.

Do we have a nunicode analogue in JavaScript? Could we use Intl.Collator to implement diacritic-insensitive equality without exposing a general-purpose diacritic-stripping function?

1ec5 on 19 May 2017

> Could we use Intl.Collator to implement diacritic-insensitive equality without exposing a general-purpose diacritic-stripping function?

Case and diacritic folding can be implemented in JavaScript (for GL JS) like this:

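// The first argument is the locale list (undefined = runtime default); options come second.
new Intl.Collator(undefined, { sensitivity: 'base', usage: 'search' }).compare('ä', 'a') === 0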

The same functionality can be implemented in Objective-C (for iOS and macOS) like this in string_nsstring.mm:

[@"ä" compare:@"a" options:NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch range:NSMakeRange(0, @"ä".length) locale:nil] == NSOrderedSame;

and in Java (for Android) like this:

Collator collator = Collator.getInstance();
collator.setStrength(Collator.PRIMARY);
collator.equals("ä", "a");

1ec5 on 26 Oct 2017

> The discussion above about the Unicode standard makes it clear that proper diacritic folding requires knowing which language we’re dealing with.

This is still a valid concern, e.g. in German, diacritic folding for umlauts typically means adding an e after the vowel (München -> Muenchen), while it doesn't in Swedish (Malmö -> Malmo).
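To see how much the answer depends on language, here is a small sketch using `Intl.Collator` with an explicit locale. Note that it shows collation equivalence rather than the transliteration conventions described above, and it assumes a runtime with full ICU locale data. `baseEqual` is a hypothetical helper.

```javascript
// Same pair of strings, different verdicts depending on the collation locale.
// Assumes the runtime ships collation data for 'de' and 'sv'.
function baseEqual(a, b, locale) {
  const collator = new Intl.Collator(locale, { sensitivity: 'base' });
  return collator.compare(a, b) === 0;
}

const german = baseEqual('ä', 'a', 'de');   // true: ä is treated as a variant of a
const swedish = baseEqual('ä', 'a', 'sv');  // false: ä is a separate letter in Swedish

// German phone-book collation is the variant that relates ä to "ae";
// the result is left unasserted here because it depends on the ICU data in use.
const phonebook = baseEqual('München', 'Muenchen', 'de-u-co-phonebk');
```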

kkaefer on 9 Nov 2017

👍1

Agreed. The assumption in https://github.com/mapbox/mapbox-gl-js/issues/4136#issuecomment-339522493 is that the strings in question are in the same language as the browser/system. But that’s a difficult assumption to make across a world map.

Supposing we implement one of the proposals in https://github.com/mapbox/mapbox-gl-js/issues/3999#issuecomment-273332504 for telling GL a given source property’s language, I think it would still be desirable to focus on enhancing the equality operator rather than composing it with a diacritic-stripping operator. (In the code examples above, all three APIs accept a language identifier.) After all, the vector tiles aren’t the only places that an expression might get its strings from; strings could also be embedded literally or come from elsewhere in runtime styling code.

1ec5 on 10 Nov 2017

I'm currently evaluating an implementation of diacritic-insensitive equality (but not diacritic stripping, or even diacritic-insensitive contains/begins-with/etc. -- will those be important?) that takes a locale as an argument (could default to current locale).

Before committing to a JS implementation, I did a survey of how we could implement this on all our supported platforms -- we want to avoid as much as possible any subtle differences that could cause map rendering to change from one platform to the next, and we also want to avoid expensive bundling of collation rules:

JS: Intl.Collator. We don't control the underlying implementation here, but it looks to me on quick inspection like both Chromium and Firefox are using ICU in their underlying implementations.
iOS/macOS: Diacritic-insensitive NSPredicate. I don't know for sure, but I suspect that under the hood the Apple implementation is based on ICU4C. Edit: @1ec5 already pointed out there's a simpler way to do this with the NSDiacriticInsensitiveSearch option of NSStringCompareOptions.
Android: java.text.Collator with PRIMARY strength (SECONDARY strength still treats accents as significant). I believe the java.text implementation is meant to use the same logic as ICU4J, which _should_ stay in sync with ICU4C.
Qt: ?? I don't see a way to do this with QCollator, @tmpsantos or @brunoabinader do you have ideas here? I think since we're already linking against a system provided ICU for BiDi, that might be the approach to follow here.
Other (linux/glfw/node/etc.): Statically linked ICU::Collator with PRIMARY strength (SECONDARY strength still treats accents as significant). The reason for not using this as the default across gl-native is that it requires bundling collation data with the app.

ChrisLoer on 2 Mar 2018

👍1

> Qt: ?? I don't see a way to do this with QCollator, @tmpsantos or @brunoabinader do you have ideas here? I think since we're already linking against a system provided ICU for BiDi, that might be the approach to follow here.

Unfortunately, Qt does not provide a mechanism for ignoring diacritics. Qt uses ICU internally e.g. for QCollator but does not expose it as part of its public APIs. Some workarounds are explored in https://stackoverflow.com/questions/14009522/how-to-remove-accents-diacritic-marks-from-a-string-in-qt but there is no silver bullet. One workaround is to add a custom lookup table for searching/replacing characters with their canonical/compatible versions.
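The lookup-table workaround is easy to sketch. The example below is JavaScript rather than Qt/C++, and the table entries are illustrative assumptions, not a curated or complete mapping; `FOLD_TABLE` and `foldWithTable` are hypothetical names.

```javascript
// Hypothetical hand-curated table; entries are illustrative, not exhaustive.
const FOLD_TABLE = new Map([
  ['Đ', 'D'], ['đ', 'd'],
  ['Þ', 'Th'], ['þ', 'th'],
  ['ß', 'ss'],
  ['ı', 'i'],
]);

function foldWithTable(s) {
  // Compose first so that precomposed table keys can match,
  // then replace code point by code point.
  return Array.from(s.normalize('NFC'), (ch) =>
    FOLD_TABLE.has(ch) ? FOLD_TABLE.get(ch) : ch
  ).join('');
}

const thingvellir = foldWithTable('Þingvellir'); // 'Thingvellir'
const strasse = foldWithTable('Straße');         // 'Strasse'
```

A table like this handles characters that have no Unicode decomposition, at the cost of having to maintain it by hand for every platform.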

brunoabinader on 12 Mar 2018

I keep coming around to the idea that what we need most is locale-aware comparisons, almost orthogonal to a way to explicitly ignore diacritics when comparing. The latter would effectively be a generic diacritic-stripping operation that gives you a Boolean instead of the stripped string, and it would suffer from the same problems:

A lack of support on some platforms (Qt)
Incorrect behavior for languages with complex rules – for example, in Vietnamese, xoa ≠ xóa = xoá and Đ ≠ D+-
Greater verbosity compared to specifying a single collation locale

The upside is that you could use a diacritic-insensitive comparison in a language-agnostic stylesheet, but in some sense a language is always involved when comparing strings, even if it’s just the default C locale.

1ec5 on 21 Mar 2018

> I keep coming around to the idea that what we need most is locale-aware comparisons

That is the direction we're heading, see https://github.com/mapbox/mapbox-gl-js/pull/6270#issuecomment-375817855 in the GL JS "Collator" PR.

ChrisLoer on 23 Mar 2018

Closing this as fixed with #6270, which adds support for diacritic-insensitive _comparisons_. We don't have support for a diacritic "stripping" transliterator, which is harder to define and hopefully not necessary for the most important use cases.
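For readers wondering what a comparison-only approach can look like in JavaScript, here is a minimal sketch that maps two boolean flags and an optional locale onto `Intl.Collator` sensitivity levels. The helper name and option names are assumptions for illustration (the flags echo the "case-sensitive" and "diacritic-sensitive" keys from the original proposal); this is not presented as the code merged in #6270.

```javascript
// Hypothetical helper: builds an equality test from case/diacritic flags.
function makeComparator({ caseSensitive = false, diacriticSensitive = false, locale } = {}) {
  // Intl.Collator sensitivity levels:
  //   'base'    ignores case and diacritics
  //   'accent'  ignores case only
  //   'case'    ignores diacritics only
  //   'variant' ignores neither
  const sensitivity = caseSensitive
    ? (diacriticSensitive ? 'variant' : 'case')
    : (diacriticSensitive ? 'accent' : 'base');
  const collator = new Intl.Collator(locale, { sensitivity, usage: 'search' });
  return (a, b) => collator.compare(a, b) === 0;
}

const loose = makeComparator({ locale: 'en' });
const strictCase = makeComparator({ caseSensitive: true, locale: 'en' });

const r1 = loose('Ä', 'a');       // true: case and diacritics ignored
const r2 = strictCase('Ä', 'a');  // false: case still matters
const r3 = strictCase('ä', 'a');  // true: only the diacritic differs
```

No stripped string is ever produced, so the hard question of what "München" should transliterate to never has to be answered.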

ChrisLoer on 16 Apr 2018
