_From @1ec5 on October 21, 2016 10:26_
As discussed in mapbox/mapbox-gl-native#6781, it can often be necessary for a filter to perform a case-insensitive or diacritic-insensitive comparison. Variations on the existing operators should be added to support these options.
Here's an example design (but by no means the best):
[
  {
    "operator": "in",
    "case-sensitive": false,
    "diacritic-sensitive": false
  },
  "key",
  "value 1",
  "value 2"
]
Implementing case-insensitive comparisons should be trivial on all the platforms supported by Mapbox GL. On the other hand, while there are fine options for diacritic folding on the native platforms, JavaScript would have to rely on a library for diacritic-insensitive comparisons.
/cc @incanus @lucaswoj @jfirebaugh
_Copied from original issue: mapbox/mapbox-gl-style-spec#548_
TIL what diacritic means
_Everyone's a diacritic._
Over in https://github.com/mapbox/mapbox-gl-js/pull/4715#discussion_r117343669, we're discussing a potential syntax for equality comparisons in expressions. Although this issue talks about filters, getting case and diacritic folding into expressions would address some of the main use cases for this functionality (for bilingual labeling in particular, but not for highlighting "fuzzy-matched" search results).
https://github.com/mapbox/mapbox-gl-js/pull/4715#discussion_r117343669 asks whether we should incorporate case and diacritic folding into the proposal as some sort of modifier on equality comparisons, or whether we should rely on the style author to run each operand through functions like lowercase or strip-diacritics beforehand.
Is a case- and diacritic-folded equality comparison equivalent to an equality comparison between two normalized strings, or are there aspects to folding that require both original strings as context? To put it another way, are there strings that should be considered equal after case- and diacritic-folding but would differ after each string is run through lowercase and strip-diacritics?
/cc @anandthakker @kkaefer @apendleton @jcsg
are there strings that should be considered equal after case- and diacritic-folding but would differ after each string is run through lowercase and strip-diacritics?
In the absence of a fix for mapbox/mapbox-gl-js#3999, there are some problem areas caused by a lack of locale information. Should strip-diacritics add a tittle to ı (found in Turkish) in order for strip-diacritics("Kırşehir") == "Kirsehir"? Should lowercase("GROSSER STERN") == "großer stern"? The equality operator itself could be tolerant of differences in case or diacritics in these cases.
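The Turkish case can be made concrete with a short sketch. It relies only on the standard `String.prototype.toLowerCase` and `String.prototype.toLocaleLowerCase` methods; the helper name `lowercaseFor` is hypothetical, and the Turkish result assumes a runtime with locale-aware case mapping (a browser, or Node.js built with ICU).

```javascript
// Hypothetical helper: lowercase `text`, optionally using a BCP 47 language tag.
function lowercaseFor(text, locale) {
  return locale ? text.toLocaleLowerCase(locale) : text.toLowerCase();
}

const name = 'KIR\u015EEH\u0130R'; // "KIRŞEHİR"

// Locale-independent mapping: I -> i, and İ -> i followed by U+0307 COMBINING DOT ABOVE.
const generic = lowercaseFor(name);

// Turkish mapping: I -> ı (dotless), and İ -> i.
const turkish = lowercaseFor(name, 'tr');

console.log(generic, turkish);
```

Without the language tag, the lowercased string contains neither the dotless ı nor a plain i in the right places, so a later byte-wise equality check against "kırşehir" would fail.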
There is a human- and machine-readable file published by Unicode here that details all the special casing operations that we could catch. They explicitly use two-letter language codes for some of these rules, so locale specific information that uses those codes could transparently connect to the Unicode standard.
The rules listed there cover the cases that you discussed in the previous comment, and list special cases for ligatures (Latin and Armenian), for where there is no uppercase precomposed character (all Greek and Latin), conditional mappings that depend on position within a word (all Greek), and a category of "Language-Sensitive Mappings" which contain rules for characters in Lithuanian, Turkish, and Azeri (Azerbaijani).
The comprehensive list of all other "normal" casing conventions is in:
http://unicode.org/Public/UNIDATA/UnicodeData.txt
with a guide to understanding that file format here:
ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html
The standard UnicodeData.txt file also gives information on the character decomposition mapping. I need to do more research on these decomposition mappings, but a good starting point may be to try to associate any Latin-ish precomposed character that we'd want to de-diacritize with only the strict ASCII letters that we find in the character's decomposition mappings. (As far as I could tell, it wasn't clear that unidecode actually used Unicode standards for reference, although it also does de-diacritization.)
If we implement our own lowercasing and de-diacritization functions, I think we should do so according to the current Unicode standards themselves unless something prevents that.
cc: @boblannon
on geocoding, we've done away with unidecode altogether. we currently have a dedicated script, using manually curated rules written mostly by @apendleton: https://github.com/mapbox/carmen/blob/master/lib/util/remove-diacritics.js
also worth knowing, though: Javascript has character decomposition built in: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
....so you could have
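// U+1E9B: LATIN SMALL LETTER LONG S WITH DOT ABOVE
// U+0323: COMBINING DOT BELOW
const str = '\u1E9B\u0323';
// Compatibly-decomposed (NFKD)
// U+0073: LATIN SMALL LETTER S
// U+0323: COMBINING DOT BELOW
// U+0307: COMBINING DOT ABOVE
str.normalize('NFKD'); // '\u0073\u0323\u0307'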
and then a separate step that stripped out the combining dot characters.
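That separate step might look like the sketch below. The helper name `stripCombiningMarks` is hypothetical, and using the Unicode property escape `\p{M}` to match combining marks is one possible choice, not something this thread settled on.

```javascript
// Hypothetical helper: decompose, then drop every combining mark (Unicode category M).
function stripCombiningMarks(text) {
  return text.normalize('NFKD').replace(/\p{M}/gu, '');
}

console.log(stripCombiningMarks('\u1E9B\u0323')); // 's'
console.log(stripCombiningMarks('Malm\u00F6'));   // 'Malmo'

// ı (U+0131) has no decomposition, so it passes through unchanged.
console.log(stripCombiningMarks('K\u0131r\u015Fehir')); // 'K\u0131rsehir'
```

The last call shows the limit of this approach: letters such as ı or Þ are not base letters plus marks, so decomposition alone never maps them to ASCII.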
Noting here that we're using https://bitbucket.org/alekseyt/nunicode in Mapbox GL for platforms that don't have Unicode APIs. It also supports diacritic folding ("Unaccenting"): https://bitbucket.org/alekseyt/nunicode#markdown-header-unaccenting as well as case folding.
The discussion above about the Unicode standard makes it clear that proper diacritic folding requires knowing which language we're dealing with. Assuming we don't, would we get more reliable results by making the equality operator optionally aware of diacritic folding than by composing equality with diacritic-stripping? Perhaps that would allow us multiple levels of canonicalization. For example:
["==", "Þingvellir", "Thingvellir", "diacritic-insensitive"]
["==", "Phơ 54", "Phở 54", "diacritic-insensitive"]
Noting here that we're using https://bitbucket.org/alekseyt/nunicode in Mapbox GL for platforms that don't have Unicode APIs.
Nunicode is similar in concept to unidecode: it attempts to strip diacritics without accounting for language-specific rules.
Do we have a nunicode analogue in JavaScript? Could we use Intl.Collator to implement diacritic-insensitive equality without exposing a general-purpose diacritic-stripping function?
Could we use Intl.Collator to implement diacritic-insensitive equality without exposing a general-purpose diacritic-stripping function?
Case and diacritic folding can be implemented in JavaScript (for GL JS) like this:
new Intl.Collator(undefined, { sensitivity: 'base', usage: 'search' }).compare('ä', 'a') === 0
The same functionality can be implemented in Objective-C (for iOS and macOS) like this in string_nsstring.mm:
[@"ä" compare:@"a" options:NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch range:NSMakeRange(0, @"ä".length) locale:nil] == NSOrderedSame;
and in Java (for Android) like this:
Collator collator = Collator.getInstance();
collator.setStrength(Collator.PRIMARY);
collator.equals("ä", "a");
The discussion above about the Unicode standard makes it clear that proper diacritic folding requires knowing which language we're dealing with.
This is still a valid concern, e.g. in German, diacritic folding for umlauts typically means adding an e after the vowel (München -> Muenchen), while it doesn't in Swedish (Malmö -> Malmo).
Agreed. The assumption in https://github.com/mapbox/mapbox-gl-js/issues/4136#issuecomment-339522493 is that the strings in question are in the same language as the browser/system. But that's a difficult assumption to make across a world map.
Supposing we implement one of the proposals in https://github.com/mapbox/mapbox-gl-js/issues/3999#issuecomment-273332504 for telling GL a given source property's language, I think it would still be desirable to focus on enhancing the equality operator rather than composing it with a diacritic-stripping operator. (In the code examples above, all three APIs accept a language identifier.) After all, the vector tiles aren't the only places that an expression might get its strings from; strings could also be embedded literally or come from elsewhere in runtime styling code.
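As a rough sketch of what a language-aware equality check could look like in GL JS, the helper below passes a language identifier to `Intl.Collator`. The name `foldedEquals` and its signature are hypothetical, not a proposed API, and the result for any given language depends on the collation data shipped with the runtime.

```javascript
// Hypothetical helper: true when `a` and `b` differ only by case or diacritics
// under the collation rules of `locale` (a BCP 47 tag; undefined = runtime default).
function foldedEquals(a, b, locale) {
  const collator = new Intl.Collator(locale, { sensitivity: 'base' });
  return collator.compare(a, b) === 0;
}

console.log(foldedEquals('Malm\u00F6', 'malmo', 'en')); // true
console.log(foldedEquals('Malm\u00F6', 'malmo', 'sv')); // depends on the Swedish collation rules
```

Under English rules ö is treated as o plus a diacritic, while Swedish collation generally treats ö as a separate base letter, so the same pair of strings can compare differently once the language is known.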
I'm currently evaluating an implementation of diacritic-insensitive equality (but not diacritic stripping, or even diacritic-insensitive contains/begins-with/etc. -- will those be important?) that takes a locale as an argument (could default to current locale).
Before committing to a JS implementation, I did a survey of how we could implement this on all our supported platforms -- we want to avoid as much as possible any subtle differences that could cause map rendering to change from one platform to the next, and we also want to avoid expensive bundling of collation rules:
- Intl.Collator. We don't control the underlying implementation here, but it looks to me on quick inspection like both Chromium and Firefox are using ICU in their underlying implementations.
- NSDiacriticInsensitiveSearch NSStringCompareOption.
- java.text.Collator with SECONDARY strength. I believe the java.text implementation is meant to use the same logic as ICU4J, which _should_ stay in sync with ICU4C.
- ICU::Collator with SECONDARY strength. The reason for not using this as the default across gl-native is that it requires bundling collation data with the app.
- Qt: ?? I don't see a way to do this with QCollator, @tmpsantos or @brunoabinader do you have ideas here? I think since we're already linking against a system provided ICU for BiDi, that might be the approach to follow here.
Unfortunately, Qt does not provide a mechanism for ignoring diacritics. Qt uses ICU internally e.g. for QCollator but does not expose it as part of its public APIs. Some workarounds are explored in https://stackoverflow.com/questions/14009522/how-to-remove-accents-diacritic-marks-from-a-string-in-qt but there is no silver bullet. One workaround is to add a custom lookup table for searching/replacing characters with their canonical/compatible versions.
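The lookup-table workaround can be illustrated with a small sketch. It is written in JavaScript to match the other snippets in this thread and would need porting for Qt; the table contents and the names `GENERIC_FOLDS`, `GERMAN_FOLDS`, and `foldWithTable` are illustrative assumptions, far from a complete mapping.

```javascript
// Illustrative tables only; real tables would need to be far more complete.
const GENERIC_FOLDS = new Map([
  ['\u00E4', 'a'], // ä
  ['\u00F6', 'o'], // ö
  ['\u00FC', 'u'], // ü
  ['\u00DF', 'ss'], // ß
]);
const GERMAN_FOLDS = new Map([
  ['\u00E4', 'ae'],
  ['\u00F6', 'oe'],
  ['\u00FC', 'ue'],
  ['\u00DF', 'ss'],
]);

// Replace each code point found in `table`; leave everything else untouched.
function foldWithTable(text, table) {
  return Array.from(text, (ch) => (table.has(ch) ? table.get(ch) : ch)).join('');
}

console.log(foldWithTable('m\u00FCnchen', GERMAN_FOLDS)); // 'muenchen'
console.log(foldWithTable('malm\u00F6', GENERIC_FOLDS));  // 'malmo'
```

A table per language keeps the behavior identical on every platform, at the cost of curating and shipping the tables.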
I keep coming around to the idea that what we need most is locale-aware comparisons, almost orthogonal to a way to explicitly ignore diacritics when comparing. The latter would effectively be a generic diacritic-stripping operation that gives you a Boolean instead of the stripped string, and it would suffer from the same problems:
The upside is that you could use a diacritic-insensitive comparison in a language-agnostic stylesheet, but in some sense a language is always involved when comparing strings, even if it's just the default C locale.
I keep coming around to the idea that what we need most is locale-aware comparisons
That is the direction we're heading, see https://github.com/mapbox/mapbox-gl-js/pull/6270#issuecomment-375817855 in the GL JS "Collator" PR.
Closing this as fixed with #6270, which adds support for diacritic-insensitive _comparisons_. We don't have support for a diacritic "stripping" transliterator, which is harder to define and hopefully not necessary for the most important use cases.