Strings in Kotlin/Native use UTF-16 as their internal encoding, the same as Kotlin/JVM.
UTF-16 was invented twenty years ago, and by now almost everything else (except JS and the Win32 API, which date from the same time) uses UTF-8.
How much would break if K/N used UTF-8 for String internally?
It used to, originally, but it's pretty hard to implement string indexing with variable width strings, so we switched to UTF-16. An O(N) String.get() is too high a price for that.
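To illustrate why indexing becomes O(N) with a variable-width encoding, here is a minimal sketch of looking up the code point at a logical position in UTF-8 bytes. The helper name codePointAtUtf8 is hypothetical (it is not a Kotlin or K/N API), and the input is assumed to be well-formed UTF-8.

```kotlin
// Hypothetical helper, not part of any Kotlin API: returns the code point at
// logical position `index` in a well-formed UTF-8 byte sequence.
fun codePointAtUtf8(bytes: ByteArray, index: Int): Int {
    var pos = 0   // byte offset of the current code point
    var seen = 0  // number of code points skipped so far
    while (pos < bytes.size) {
        val lead = bytes[pos].toInt() and 0xFF
        // The lead byte alone determines how many bytes the sequence occupies.
        val len = when {
            lead < 0x80 -> 1
            lead < 0xE0 -> 2
            lead < 0xF0 -> 3
            else -> 4
        }
        if (seen == index) {
            // Payload bits of the lead byte, then 6 bits per continuation byte.
            var cp = if (len == 1) lead else lead and (0xFF shr (len + 1))
            for (i in 1 until len) {
                cp = (cp shl 6) or (bytes[pos + i].toInt() and 0x3F)
            }
            return cp
        }
        // No way to jump ahead: every earlier code point has to be stepped over.
        pos += len
        seen++
    }
    throw IndexOutOfBoundsException("index $index")
}
```

For the UTF-8 bytes of "a\u00E9\u20AC" (1 + 2 + 3 bytes), asking for index 2 has to step over the first two code points before it can decode 0x20AC, whereas a fixed-width layout would jump straight to the offset.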
Ok, then maybe optionally, for some targets?
And in many cases smooth C interop is much more important than String.get() speed.
Especially if CharSequence.asIterable().forEach is not O(N).
Personally, I think there's a lot to be said for the JEP 254 approach of sticking to an essentially UTF-16 encoding (for the reasons @olonho has outlined) but having a flag to indicate whether or not code-points greater than 255 are used in the string. If they are not (which is mostly the case), then the string is stored as 1 byte per character or 2 bytes per character otherwise.
This seems to me to be a reasonable compromise between staying with UTF-16 or moving to UTF-8.
Admittedly, you're still stuck with surrogate pairs for characters outside the basic multilingual plane. However, this isn't much of a problem in practice and could only really be solved by moving to UTF-32 and then having a flag for 1, 2 or 4 bytes of storage per character which probably wouldn't be worth the additional complication.
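A rough sketch of the flag idea described above, assuming a wrapper class (CompactString is a hypothetical name, and this is neither the JDK's nor K/N's actual implementation): the flag only selects the stride, so indexing stays O(1) in both layouts.

```kotlin
// Hypothetical sketch of a JEP 254-style compact string; not real JDK or K/N code.
class CompactString(source: String) {
    // True when every UTF-16 code unit fits in one byte (code points up to 255).
    val isLatin1: Boolean = source.all { it.code <= 0xFF }

    // One byte per char when isLatin1, otherwise two bytes (big-endian) per char.
    private val bytes: ByteArray =
        if (isLatin1) {
            ByteArray(source.length) { source[it].code.toByte() }
        } else {
            ByteArray(source.length * 2) { i ->
                val unit = source[i / 2].code
                (if (i % 2 == 0) unit shr 8 else unit and 0xFF).toByte()
            }
        }

    val length: Int get() = if (isLatin1) bytes.size else bytes.size / 2

    // Number of bytes actually used for character storage.
    val storageBytes: Int get() = bytes.size

    // O(1) in both layouts: the flag only changes how an index maps to bytes.
    operator fun get(index: Int): Char =
        if (isLatin1) {
            (bytes[index].toInt() and 0xFF).toChar()
        } else {
            val hi = bytes[2 * index].toInt() and 0xFF
            val lo = bytes[2 * index + 1].toInt() and 0xFF
            ((hi shl 8) or lo).toChar()
        }
}
```

With this layout a four-character string such as "caf\u00E9" takes 4 bytes of storage instead of 8, while a string containing any code unit above 255 falls back to two bytes per character.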
Well, that might be enough in most cases, if ASCII-only string literals become ASCIIZ immutableBinaryBlobs placed in a read-only section.
I still think that a full switch to UTF-8 is better, but if that's impossible, there can be compromises.
Wouldn't moving to UTF-16 break Linux compatibility? Linux uses UTF-8.
Whilst it's true that Linux APIs generally use UTF-8, Windows APIs use UTF-16.
So, unless K/N is going to use a different encoding for each platform, it still has to make a choice between the two.
Java, of course, uses UTF-16 on all supported platforms and probably always will. There might even be value in K/N sticking to UTF-16 for this reason alone, in the interests of having no inconsistencies with Kotlin/JVM in multi-platform projects.
When interoping, strings are converted to UTF-8.
Yes, into a temporarily allocated buffer, which is VERY inefficient.
And it happens every time, while accessing a string's chars by index is relatively rare.
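As a rough illustration of the cost being described, the sketch below shows what a conversion step has to do before a C function expecting a NUL-terminated UTF-8 string can be called. toCString is a hypothetical stand-in, not the actual K/N interop routine; only encodeToByteArray and copyOf are real standard library calls.

```kotlin
// Hypothetical stand-in for the conversion an interop layer performs when the
// internal representation is UTF-16 and the C side expects `const char*`.
fun toCString(s: String): ByteArray {
    // O(N) transcoding from UTF-16 to UTF-8 into a freshly allocated array.
    val utf8 = s.encodeToByteArray()
    // copyOf pads with zeros, so the extra slot acts as the terminating NUL.
    return utf8.copyOf(utf8.size + 1)
}
```

Every call allocates and fills a new buffer, even when the same string is passed repeatedly, which is the overhead a UTF-8 internal representation would avoid for pass-through strings.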
Typically strings are passed as-is.
A marker bit indicating whether a string is 1 or 2 bytes wide may be sensible in the longer term; a complete switch to UTF-8 is unlikely.
Ok, then why a marker bit rather than two classes, String16 and String8, with a common interface String?
pretty hard to implement string indexing with variable width strings
...variable char width?
so we switched to UTF-16. An O(N) String.get() is too high a price for that
UTF-16 has two-word chars (surrogate pairs), too
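A small sketch of that point: in UTF-16, String.length and String.get() work on code units, so finding the N-th code point still needs a scan once surrogate pairs are involved. countCodePoints is a hypothetical helper; Char.isHighSurrogate() and Char.isLowSurrogate() are standard library functions.

```kotlin
// Counts code points in a UTF-16 string, treating each valid surrogate pair as one.
fun countCodePoints(s: String): Int {
    var count = 0
    var i = 0
    while (i < s.length) {
        val pair = s[i].isHighSurrogate() &&
            i + 1 < s.length &&
            s[i + 1].isLowSurrogate()
        // A pair occupies two code units but represents a single code point.
        i += if (pair) 2 else 1
        count++
    }
    return count
}
```

For "a\uD83D\uDE00" (the letter a followed by one character outside the BMP), length is 3 while countCodePoints returns 2.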
Swift switched to UTF-8: https://swift.org/blog/utf8-string/
We don't have plans to switch strings to UTF-8, because this would harm the compatibility between different Kotlin backends.
See also the discussion here: https://www.reddit.com/r/Kotlin/comments/ji9z19/kotlin_team_ama_2_ask_us_anything/ga5k33c?utm_source=share&utm_medium=web2x&context=3