What's wrong with Turkey?
Please test the code below after reading the blog post above.
assert_eq!("istanbul".to_uppercase(), "ISTANBUL");
Is it OK? It will work fine, right?
Don't be so sure!
If you care a whit about localization or internationalization, force your code to run under the Turkish locale as soon as reasonably possible.
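To make the problem concrete, here is a small check of what Rust's standard library does today, as far as I understand it. The mappings are locale-independent, so the result is correct for English but not for Turkish, where "istanbul" should become "İSTANBUL". The test name is just something I made up for illustration.

```rust
#[test]
fn std_case_mapping_ignores_locale() {
    // `str::to_uppercase` has no locale parameter: 'i' always maps to 'I'.
    assert_eq!("istanbul".to_uppercase(), "ISTANBUL");
    // Likewise 'I' always maps to 'i', never to the Turkish dotless 'ı'.
    assert_eq!("ISTANBUL".to_lowercase(), "istanbul");
    // The non-conflicting direction works: dotless 'ı' (U+0131) uppercases to 'I'.
    assert_eq!("ı".to_uppercase(), "I");
    // Dotted 'İ' (U+0130) lowercases to 'i' followed by U+0307 (combining dot above).
    assert_eq!("İ".to_lowercase(), "i\u{307}");
}
```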
Please add support for localized case folding to Rust, as in the other programming languages below:
JS case folding
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/toLocaleLowerCase
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/toLocaleUpperCase
Go case folding
https://golang.org/pkg/strings/#ToLowerSpecial
https://golang.org/pkg/strings/#ToUpperSpecial
Java localized case folding
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#toLowerCase(java.util.Locale)
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#toUpperCase(java.util.Locale)
.NET localized case folding
https://docs.microsoft.com/en-US/dotnet/api/system.string.tolower
https://docs.microsoft.com/en-US/dotnet/api/system.string.toupper
Once it is supported, the code below should work like a charm!
assert_eq!("istanbul".to_uppercase("tr-TR"), "İSTANBUL");
or with a new function
assert_eq!("istanbul".to_localeuppercase("tr-TR"), "İSTANBUL");
Would this test be sufficient to indicate that this conversion is implemented correctly? https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=63caf15374a3c580bbe9c5260522d102
Edit: I assume what needs to be changed is just the tables here? https://github.com/rust-lang/rust/blob/f3fadf6abd571868d70538561a0731ddd800003a/src/libcore/unicode/unicode_data.rs#L609
Edit 2: Hmm, according to the Unicode FAQ here, they seem to recommend _against_ adding this support, claiming that it would confuse the end user.
https://unicode.org/faq/casemap_charprop.html
I am the end user, and I'm not confused. On the contrary, what confuses me is why it doesn't work this way. How can a feature that exists in hundreds of programming languages confuse people? :)
Think about it: you write a title-case function for a string in a Latin-script language such as Turkish, and it never works as it should. Wouldn't that confuse you?
Would a new method .to_locale_uppercase(locale) be both compliant with the Unicode recommendation (because it leaves .to_uppercase() unchanged) and useful for explicitly supporting these specific capitalization cases?
Under this view, .to_uppercase() might need to explicitly become .to_ascii_uppercase() .. which ends up not much different from .to_locale_uppercase("ascii") or something in the end :\
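For discussion, this is roughly how such a method could look as an extension trait outside std, so that .to_uppercase() stays untouched. Everything here is an assumption on my part: the trait name `LocaleUppercase`, the plain string locale tag, and the choice to treat only `tr` and `az` specially. It is a minimal sketch, not a proposal for the final API.

```rust
/// Hypothetical extension trait; nothing like this exists in std today.
pub trait LocaleUppercase {
    /// Uppercases `self` using the rules of `locale` (a tag such as "tr-TR").
    fn to_locale_uppercase(&self, locale: &str) -> String;
}

impl LocaleUppercase for str {
    fn to_locale_uppercase(&self, locale: &str) -> String {
        // Assumption: only the primary language subtag matters, and only
        // Turkish ("tr") and Azerbaijani ("az") need the dotted/dotless rule.
        let lang = locale.split('-').next().unwrap_or("");
        let turkic = lang.eq_ignore_ascii_case("tr") || lang.eq_ignore_ascii_case("az");
        if !turkic {
            // Every other locale falls back to the existing std behaviour.
            return self.to_uppercase();
        }
        let mut out = String::with_capacity(self.len());
        for c in self.chars() {
            if c == 'i' {
                // The one conflicting uppercase mapping: i -> İ (U+0130).
                out.push('İ');
            } else {
                // `char::to_uppercase` may yield more than one char.
                out.extend(c.to_uppercase());
            }
        }
        out
    }
}
```

With the trait in scope, "istanbul".to_locale_uppercase("tr-TR") gives "İSTANBUL", while "istanbul".to_locale_uppercase("en-US") still gives "ISTANBUL".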
I don't know which function would be more appropriate, but I can suggest a test like this for when it is supported.
#[test]
fn test_turkish_case() {
    let lowercase = "abcçdefgğhıijklmnoöprsştuüvyz".chars();
    let mut uppercase = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ".chars();
    for lower in lowercase {
        let upper = uppercase.next().unwrap();
        assert_eq!(lower.to_string().to_locale_uppercase("tr-TR"), upper.to_string());
    }
}
Does this feature already exist in a crate in the ecosystem, and what shape of the API have they settled on? I can't see it in unic, but it might exist.
https://lib.rs/internationalization
I have reviewed almost all of the libraries in this category. None of them includes a case-mapping feature for Turkish or similar languages.
@hbina It wouldn't suffice to just change those tables (in fact, they are implemented correctly for the non-conflicting cases ı->I and İ->i). There need to be specific lowercase/uppercase functions for the TR locale (as @iago-lito suggested), since in Turkish "i" maps to uppercase "İ" and "I" maps to lowercase "ı", which conflicts with the Latin mappings (I->i and i->I). In my humble opinion, it would be better if this were part of the language rather than some crate; after all, even emojis seem to get first-class support.
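To illustrate why a table change alone can't work, here is a rough sketch of Turkish-specific functions layered on top of the existing per-character mappings. The names `turkish_uppercase` and `turkish_lowercase` are made up for this example. It only overrides the conflicting letters and is not a complete implementation: it ignores context-sensitive rules (for instance, it maps character by character, so it skips the final-sigma handling that `str::to_lowercase` does, and it does not treat "I" followed by a combining dot specially).

```rust
/// Hypothetical helper: uppercase with the Turkish dotted/dotless i rule.
fn turkish_uppercase(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            // Conflicts with the default i -> I mapping.
            'i' => out.push('İ'),
            // Everything else, including ı -> I, already works in std.
            _ => out.extend(c.to_uppercase()),
        }
    }
    out
}

/// Hypothetical helper: lowercase with the Turkish dotted/dotless i rule.
fn turkish_lowercase(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            // Conflicts with the default I -> i mapping.
            'I' => out.push('ı'),
            // Map to a plain 'i' instead of std's 'i' + U+0307.
            'İ' => out.push('i'),
            _ => out.extend(c.to_lowercase()),
        }
    }
    out
}
```

The same input character needs a different output depending on the language, which is why the choice has to come from the caller (a locale argument or a dedicated function) rather than from the shared Unicode tables.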