Kotlin-native: String and UTF-8

Created on 22 Dec 2017 · 13 Comments · Source: JetBrains/kotlin-native

Strings in Kotlin/Native use UTF-16 as their internal encoding, the same as Kotlin/JVM.
UTF-16 was invented twenty years ago, and now almost everything else (except JS and the Win32 API, which date from the same time) uses UTF-8.

How much would break if K/N used UTF-8 for String internally?

msink

👍3

Most helpful comment

Swift switched to UTF-8: https://swift.org/blog/utf8-string/

RUSshy on 21 Mar 2019

👍3

All 13 comments

It used to, originally, but it's pretty hard to implement string indexing on top of a variable-width encoding, so we switched to UTF-16. An O(N) String.get() is too high a price to pay for that.

olonho on 22 Dec 2017
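
To illustrate the indexing cost described above, here is a minimal sketch of looking up the n-th code point in a UTF-8 byte array. The helper name `utf8OffsetOfCodePoint` is hypothetical and is not part of Kotlin/Native; the sketch also assumes well-formed UTF-8 and indexes code points, whereas Kotlin's `String.get()` returns UTF-16 code units.

```kotlin
// Hypothetical helper for illustration only, not a Kotlin/Native API.
// Returns the byte offset of the code point at position `index`,
// assuming `bytes` holds well-formed UTF-8.
fun utf8OffsetOfCodePoint(bytes: ByteArray, index: Int): Int {
    var seen = 0
    for (offset in bytes.indices) {
        // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        val isLeadByte = (bytes[offset].toInt() and 0xC0) != 0x80
        if (isLeadByte) {
            if (seen == index) return offset
            seen++
        }
    }
    throw IndexOutOfBoundsException("code point index $index")
}
```

Because code points occupy one to four bytes, the loop has to walk the array from the start, so the lookup is linear in the index rather than constant time.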

Ok, then maybe optionally, for some targets?
And in many cases smooth C interop is much more important than String.get() speed.
Especially if CharSequence.asIterable().forEach is not O(N).

msink on 22 Dec 2017

Personally, I think there's a lot to be said for the JEP 254 approach of sticking to an essentially UTF-16 encoding (for the reasons @olonho has outlined) but having a flag to indicate whether or not code points greater than 255 are used in the string. If they are not (which is mostly the case), the string is stored as 1 byte per character; otherwise it is stored as 2 bytes per character.

This seems to me to be a reasonable compromise between staying with UTF-16 or moving to UTF-8.

Admittedly, you're still stuck with surrogate pairs for characters outside the Basic Multilingual Plane. However, this isn't much of a problem in practice and could only really be solved by moving to UTF-32 and then having a flag for 1, 2 or 4 bytes of storage per character, which probably wouldn't be worth the additional complication.

alanfo on 22 Dec 2017
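
The flag idea above can be sketched roughly as follows. `CompactString` and its members are hypothetical names chosen for illustration; this is neither the JDK's implementation of JEP 254 nor anything in Kotlin/Native, and the big-endian byte order for the 2-byte layout is an arbitrary assumption.

```kotlin
// Hypothetical class for illustration; not the JDK's or Kotlin/Native's implementation.
class CompactString private constructor(
    private val bytes: ByteArray,
    private val isLatin1: Boolean // the flag: true when every char fits in one byte
) {
    val length: Int get() = if (isLatin1) bytes.size else bytes.size / 2
    val storageBytes: Int get() = bytes.size

    // O(1) in both layouts, because each layout has a fixed width per code unit.
    operator fun get(index: Int): Char {
        if (index !in 0 until length) throw IndexOutOfBoundsException("index $index")
        return if (isLatin1) {
            (bytes[index].toInt() and 0xFF).toChar()
        } else {
            val hi = bytes[2 * index].toInt() and 0xFF
            val lo = bytes[2 * index + 1].toInt() and 0xFF
            ((hi shl 8) or lo).toChar()
        }
    }

    companion object {
        fun of(s: String): CompactString {
            val latin1 = s.all { it.code <= 0xFF }
            val out = if (latin1) {
                ByteArray(s.length) { s[it].code.toByte() }
            } else {
                // Two bytes per UTF-16 code unit, high byte first (an assumption).
                ByteArray(s.length * 2) { i ->
                    val unit = s[i / 2].code
                    (if (i % 2 == 0) unit shr 8 else unit and 0xFF).toByte()
                }
            }
            return CompactString(out, latin1)
        }
    }
}
```

With this layout, `CompactString.of("hello")` needs 5 bytes of storage, while a string containing a single character above 255 falls back to 2 bytes for every character.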

Well, that might be enough in most cases,
provided that ASCII-only string literals become ASCIIZ immutableBinaryBlobs placed in a read-only section.

I still think that a full switch to UTF-8 is better, but if that's impossible, there can be compromises.

msink on 22 Dec 2017

Wouldn't moving to UTF-16 break Linux compatibility? Linux uses UTF-8.

napperley on 22 Dec 2017

Whilst it's true that Linux APIs generally use UTF-8, Windows APIs use UTF-16.

So, unless K/N is going to use a different encoding for each platform, it still has to make a choice between the two.

Java, of course, uses UTF-16 for all supported platforms and probably always will. There might even be value in K/N sticking to UTF-16 for this reason alone, in the interests of having no inconsistencies with Kotlin/JVM in multi-platform projects.

alanfo on 23 Dec 2017

👍1

When interoping, strings are converted to UTF-8.

olonho on 23 Dec 2017

"When interoping, strings are converted to UTF-8."

Yes, into an allocated temporary buffer, which is VERY inefficient.
And it happens every time, while accessing a string's chars by index is relatively rare.
Typically strings are passed as-is.

msink on 23 Dec 2017
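
As a rough picture of the per-call cost discussed above, the sketch below builds a NUL-terminated UTF-8 copy of a UTF-16 string using only the standard library. `toCStringBytes` is a hypothetical stand-in; Kotlin/Native's actual interop conversion path is not shown here and may differ.

```kotlin
// Hypothetical helper: builds the NUL-terminated UTF-8 copy that a C function
// taking `const char*` would expect. Not Kotlin/Native's real interop code.
fun toCStringBytes(s: String): ByteArray {
    // Transcodes UTF-16 to UTF-8 into a freshly allocated array.
    val utf8 = s.encodeToByteArray()
    // copyOf pads with zero bytes, so the extra slot is the NUL terminator.
    return utf8.copyOf(utf8.size + 1)
}
```

Every call allocates and transcodes again, even for the same string, which is the overhead that would disappear if the internal representation were already UTF-8.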

A marker bit defining whether a string's characters are 1 or 2 bytes wide may be sensible in the longer term; a complete switch to UTF-8 is unlikely.

olonho on 25 Dec 2017

OK, then why a marker bit rather than two classes, String16 and String8, with a common interface String?

msink on 25 Dec 2017

👍2
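
The two-class alternative raised in the last comment might look something like the sketch below. All names (`Str`, `Str8`, `Str16`, `strOf`) are hypothetical and chosen to avoid clashing with `kotlin.String`, which is a final class rather than an interface.

```kotlin
// All names are hypothetical; this only illustrates the shape of the idea.
interface Str {
    val length: Int
    operator fun get(index: Int): Char
}

// One byte per char; valid only when every code unit is <= 0xFF.
class Str8(private val bytes: ByteArray) : Str {
    override val length: Int get() = bytes.size
    override fun get(index: Int): Char = (bytes[index].toInt() and 0xFF).toChar()
}

// One UTF-16 code unit per char.
class Str16(private val units: CharArray) : Str {
    override val length: Int get() = units.size
    override fun get(index: Int): Char = units[index]
}

// Picks the narrowest representation that can hold the given string.
fun strOf(s: String): Str =
    if (s.all { it.code <= 0xFF }) Str8(ByteArray(s.length) { s[it].code.toByte() })
    else Str16(s.toCharArray())
```

Compared with a marker bit inside a single class, this moves the width check from a branch in each method to dynamic dispatch on the interface, at the cost of `String` no longer being a single concrete type.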