Rust: Locale/language-sensitive case mapping ("What's wrong with Turkey?")

Created on 4 Jun 2020 · 7 Comments · Source: rust-lang/rust

What's wrong with Turkey?
Please test the code below after reading the blog post above.

assert_eq!("istanbul".to_uppercase(), "ISTANBUL");

Is it OK? It will work fine, right?
Don't be so sure!
If you care a whit about localization or internationalization, force your code to run under the Turkish locale as soon as reasonably possible.
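To make the problem concrete, here is a minimal sketch of what the locale-independent mappings in std do with the dotted and dotless "i" letters today. The function name default_case_mapping_demo is made up for illustration; the assertions describe the default Unicode mappings, not any Turkish-specific behavior.

```rust
/// Illustration only: exercises the existing, locale-independent std mappings.
fn default_case_mapping_demo() {
    // Dotted lowercase "i" uppercases to dotless "I"; Turkish expects "İ".
    assert_eq!("istanbul".to_uppercase(), "ISTANBUL");

    // Dotless lowercase "ı" (U+0131) uppercases to "I", which suits Turkish.
    assert_eq!("ı".to_uppercase(), "I");

    // Dotted capital "İ" (U+0130) lowercases to "i" followed by
    // U+0307 COMBINING DOT ABOVE, so the result is two code points.
    assert_eq!("İ".to_lowercase(), "i\u{307}");

    // Dotless capital "I" lowercases to dotted "i"; Turkish expects "ı".
    assert_eq!("I".to_lowercase(), "i");
}
```

Calling default_case_mapping_demo() passes on current Rust, which is exactly the point: none of these mappings take a locale into account.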

Please add support for localized case folding to Rust, as in the other programming languages below:

JS case folding
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/toLocaleLowerCase
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/toLocaleUpperCase

Go case folding
https://golang.org/pkg/strings/#ToLowerSpecial
https://golang.org/pkg/strings/#ToUpperSpecial

Java localized case folding
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#toLowerCase(java.util.Locale)
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#toUpperCase(java.util.Locale)

.NET localized case folding
https://docs.microsoft.com/en-US/dotnet/api/system.string.tolower
https://docs.microsoft.com/en-US/dotnet/api/system.string.toupper

When it is supported, the code below should work like a charm!

assert_eq!("istanbul".to_uppercase("tr-TR"), "İSTANBUL");

or with a new function

assert_eq!("istanbul".to_localeuppercase("tr-TR"), "İSTANBUL");

C-feature-request T-libs

OzqurYalcin

Most helpful comment

@hbina It wouldn't suffice to just change those tables (in fact, they are implemented correctly for the non-conflicting cases ı->I and İ->i). There need to be specific lowercase/uppercase functions for the TR locale (as @iago-lito suggested), since in Turkish "i" maps to uppercase "İ" and "I" maps to lowercase "ı", which conflicts with the Latin versions (I->i and i->I). In my humble opinion, it would be better if it were part of the language rather than some crate; after all, even emojis seem to get first-class support.

canbakiskan on 6 Jul 2020

👍4
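The conflict described in the comment above can be sketched as a pair of helper functions. This is only an illustration under stated assumptions: the names to_uppercase_tr and to_lowercase_tr are hypothetical, the code special-cases just the "i" letters and defers to the std per-character mappings for everything else, and it ignores context-dependent rules (such as final sigma or "I" followed by a combining dot above) that a complete implementation would have to handle.

```rust
/// Hypothetical helper: uppercase using the Turkish rule for dotted "i".
fn to_uppercase_tr(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            // Turkish: dotted "i" keeps its dot when uppercased.
            'i' => out.push('İ'),
            // Everything else, including ı -> I, uses the default mapping.
            _ => out.extend(c.to_uppercase()),
        }
    }
    out
}

/// Hypothetical helper: lowercase using the Turkish rules for "I" and "İ".
fn to_lowercase_tr(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            // Turkish: dotless "I" stays dotless when lowercased.
            'I' => out.push('ı'),
            // Turkish: dotted "İ" becomes a plain "i", without U+0307.
            'İ' => out.push('i'),
            // Everything else uses the default mapping.
            _ => out.extend(c.to_lowercase()),
        }
    }
    out
}
```

With these helpers, to_uppercase_tr("istanbul") yields "İSTANBUL" and to_lowercase_tr("İSTANBUL") yields "istanbul", while the default std methods keep their current behavior.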

All 7 comments

Would this test be sufficient to indicate that this conversion is implemented correctly? https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=63caf15374a3c580bbe9c5260522d102

Edit: I assume what needs to be changed is just the tables here? https://github.com/rust-lang/rust/blob/f3fadf6abd571868d70538561a0731ddd800003a/src/libcore/unicode/unicode_data.rs#L609

Edit 2: Hmm, according to the Unicode FAQ here, it seems they suggest _against_ adding this support, claiming that it will confuse the end user.
https://unicode.org/faq/casemap_charprop.html

hbina on 4 Jun 2020

I am the end user, and I'm not confused. On the contrary, I am confused when I wonder why it is not so. How can a feature that exists in hundreds of programming languages confuse people? :)

Think about it: you are writing a titlecase function for a string in a Latin-script language such as Turkish, but it never works as it should. Wouldn't you get confused?

OzqurYalcin on 4 Jun 2020

Would a new method .to_locale_uppercase(locale) be both compliant with the Unicode recommendation (because it leaves .to_uppercase() unchanged) and useful for explicitly supporting these specific capitalization cases?

Under this view, .to_uppercase() might need to explicitly become .to_ascii_uppercase(), which ends up not much different from .to_locale_uppercase("ascii") or something in the end :\

iago-lito on 4 Jun 2020
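A rough sketch of how such a method could look as an extension trait, leaving .to_uppercase() untouched. Everything here is an assumption made for illustration: the trait name LocaleUppercase, the choice to pass the locale as a string, the decision to look only at the primary language subtag, and the fact that only "tr" is special-cased. Nothing like this exists in std.

```rust
/// Hypothetical extension trait; not part of std.
trait LocaleUppercase {
    fn to_locale_uppercase(&self, locale: &str) -> String;
}

impl LocaleUppercase for str {
    fn to_locale_uppercase(&self, locale: &str) -> String {
        // Assumption: only the primary language subtag matters ("tr" in "tr-TR").
        let lang = locale
            .split(|c: char| c == '-' || c == '_')
            .next()
            .unwrap_or("");

        if lang.eq_ignore_ascii_case("tr") {
            let mut out = String::with_capacity(self.len());
            for c in self.chars() {
                if c == 'i' {
                    // Turkish: dotted "i" uppercases to dotted "İ".
                    out.push('İ');
                } else {
                    out.extend(c.to_uppercase());
                }
            }
            out
        } else {
            // Any other locale falls back to the locale-independent mapping.
            self.to_uppercase()
        }
    }
}
```

With the trait in scope, "istanbul".to_locale_uppercase("tr-TR") gives "İSTANBUL", and any other locale string gives the same result as .to_uppercase(). Note that std already has str::to_ascii_uppercase, which only changes the letters a to z, whereas .to_uppercase() applies the full locale-independent Unicode mapping (for example, it turns "ç" into "Ç").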

I don't know which function would be more appropriate, but I can suggest a test like this for when it is supported.

#[test]
fn test_turkish_case() {
    let lowercase = "abcçdefgğhıijklmnoöprsştuüvyz".chars();
    let mut uppercase = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ".chars();
    for lower in lowercase {
        let upper = uppercase.next().unwrap();
        assert_eq!(lower.to_string().to_locale_uppercase("tr-TR"), upper.to_string());
    }
}

OzqurYalcin on 4 Jun 2020

👍2

Does this feature already exist in a crate in the ecosystem, and what shape of the API have they settled on? I can't see it in unic, but it might exist.