Docs: Add conceptual documentation for System.Text.Rune

Created on 14 Nov 2019 · 26 Comments · Source: dotnet/docs

We should add conceptual documentation for System.Text.Rune (introduced in .NET Core 3) to complement the existing API docs. Such conceptual documentation would cover, briefly:

  • What is the impetus for this type? Where does it fit in the .NET ecosystem?
  • What problems does it set out to solve? What problems does it not solve?
  • Provide code samples demonstrating its usage in various scenarios.
  • Provide context for developers who are used to similar concepts in other languages like Go, Rust, and Swift.

I've posted a draft of such a document at https://gist.github.com/GrabYourPitchforks/b9dbd348b448c938497cff37a3526725. It needs to be edited for clarity and some of the contents likely need to be rearranged. But it should effectively cover all of the bullets I listed above.

This documentation should probably be inline at https://docs.microsoft.com/en-us/dotnet/api/system.text.rune. If it's too long to include inline on that page, then the likely path forward is to inline some part of it in that page, then link to a separate page containing the full contents. For comparison, the page https://docs.microsoft.com/en-us/dotnet/api/system.string contains a very large amount of text inline, so it's probably not terrible if we do the same here.

Area - .NET Guide Area - API Reference P1 doc-idea

All 26 comments

It occurs to me that thinking of Rune as "sometimes a single char, sometimes a surrogate pair" might be a useful simplification. I don't quite want to say that in the document because it's perhaps a bit _too_ simplistic and loses a lot of the nuance. But maybe even as an oversimplification it makes this type more approachable by a wider audience?

@GrabYourPitchforks I'm really interested in this. I understand very basic concepts of Unicode, mostly to avoid printing the wrong string now and again. I've wanted to understand how to use and interact with Unicode and, I guess, the new Rune stuff.

There was a lot of interest on this but I'm assigning this to @tdykstra. He'll start on this after he's done with the JSON work. @tdykstra, can you assign the appropriate milestone here?

I see a lot of these talk about what the thing is and why it was created, but not really driving home the point of why it needs to be used. I do a lot of text processing and have seen so many incorrect algorithms because of misunderstandings that I think I have a good example to consider adding.

Edit-distance algorithms like Hamming and Levenshtein are particularly problematic when implemented with char, or at the very least are much easier to implement correctly with Rune.

Consider the way Hamming edit-distance works, where only substitutions are counted. Ignoring for now that grapheme clusters even exist, since that's another issue, a naive Hamming might look like this:

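~~~csharp
// Requires: using System;
public static Int32 HammingDistance(String source, String other) {
    // Hamming distance is only defined for inputs of equal length.
    // Note that Length counts UTF-16 code units (char), not characters.
    if (source.Length != other.Length) {
        throw new ArgumentException("Must be equal length");
    }
    Int32 d = 0;
    for (Int32 i = 0; i < source.Length; i++) {
        // Compares one char at a time, so a surrogate pair counts as two positions.
        if (source[i] != other[i]) {
            d++;
        }
    }
    return d;
}
~~~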

Now anyone even remotely familiar with the concept of edit-distance algorithms can tell you the distance between "F clef" and "𝄢 clef" would be 1, because 1 substitution occurs, but the aforementioned implementation will throw an exception, because 𝄢 has to be represented in the .NET string with 2 char.
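For comparison, here is a minimal sketch of the same algorithm written over Rune instead of char. The class name `RuneEditDistance` is made up for illustration; the only framework APIs assumed are `String.EnumerateRunes()` and the equality operators on `Rune`, both available since .NET Core 3.0.

```csharp
using System;
using System.Linq;
using System.Text;

public static class RuneEditDistance
{
    // Hypothetical helper: Hamming distance counted in Unicode scalar values.
    public static int HammingDistance(string source, string other)
    {
        // EnumerateRunes decodes each surrogate pair into a single Rune.
        // Ill-formed surrogates are reported as U+FFFD rather than throwing.
        Rune[] a = source.EnumerateRunes().ToArray();
        Rune[] b = other.EnumerateRunes().ToArray();

        if (a.Length != b.Length)
        {
            throw new ArgumentException("Must contain the same number of scalar values");
        }

        int d = 0;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i])
            {
                d++;
            }
        }
        return d;
    }
}
```

With this version, `RuneEditDistance.HammingDistance("F clef", "𝄢 clef")` compares six scalar values against six scalar values and returns 1 instead of throwing.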

Similarly, even though a Levenshtein or Damerau-Levenshtein implementation naively using char will work, the result will be wrong, reporting 2 edits, for one substitution and one insertion.
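To make the difference concrete, the following sketch runs a textbook two-row Levenshtein implementation once over chars and once over Runes. `LevenshteinSketch`, `ByChar`, and `ByRune` are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class LevenshteinSketch
{
    // Textbook Levenshtein distance over any sequence of comparable units.
    public static int Distance<T>(IReadOnlyList<T> a, IReadOnlyList<T> b)
        where T : IEquatable<T>
    {
        int[] prev = new int[b.Count + 1];
        int[] curr = new int[b.Count + 1];
        for (int j = 0; j <= b.Count; j++) prev[j] = j;

        for (int i = 1; i <= a.Count; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Count; j++)
            {
                int cost = a[i - 1].Equals(b[j - 1]) ? 0 : 1;
                int insertOrDelete = Math.Min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.Min(insertOrDelete, prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev); // reuse the two rows
        }
        return prev[b.Count];
    }

    // Units are UTF-16 code units: a surrogate pair counts as two units.
    public static int ByChar(string a, string b) =>
        Distance<char>(a.ToCharArray(), b.ToCharArray());

    // Units are Unicode scalar values: a surrogate pair counts as one unit.
    public static int ByRune(string a, string b) =>
        Distance<Rune>(a.EnumerateRunes().ToArray(), b.EnumerateRunes().ToArray());
}
```

For "F clef" and "𝄢 clef", `ByChar` reports 2 (one substitution plus one insertion for the second half of the surrogate pair), while `ByRune` reports 1.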

This still ignores grapheme cluster issues, for which there are additional APIs, but it does make very clear why Rune should be used for analysis purposes.
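As a rough illustration of the grapheme cluster caveat, the sketch below counts the same string three ways. It assumes that `System.Globalization.StringInfo` is one of the APIs meant here; the `TextElementSketch` class is hypothetical.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class TextElementSketch
{
    // Counts a string in UTF-16 code units, scalar values, and text elements.
    public static (int Chars, int Runes, int TextElements) Count(string s)
    {
        int chars = s.Length;
        int runes = s.EnumerateRunes().Count();
        // StringInfo groups a base character with its combining marks.
        int textElements = new StringInfo(s).LengthInTextElements;
        return (chars, runes, textElements);
    }
}
```

For "e\u0301" (the letter e followed by a combining acute accent), this yields 2 chars, 2 Runes, and 1 text element, so Rune alone does not give you user-perceived characters.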

@GrabYourPitchforks I don't like the idea of that simplification at all. For one thing, it's conceptually tied to UTF-16, despite the fact that the Rune type itself is not. I think it also encourages the mental model of splitting Unicode into BMP and non-BMP, where char is a BMP character, when ideally BMP characters should be thought of as Runes themselves first and foremost as well.

Oh, but I should add that I'm very much in favour of adding a part specifying that when you put it into a C# string, it can become either one or two chars. I think that's essential information. I just don't like the idea of saying it can _hold_ two chars. Conceptually, I like the idea of explaining a char as being a “Rune piece” instead of a Rune being a “char bundle”.
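A small sketch of the "Rune piece" idea: given one Rune, ask it for the chars that represent it inside a string. `RunePiecesSketch` is an illustrative name; `Rune.EncodeToUtf16` is the framework method assumed.

```csharp
using System;
using System.Text;

public static class RunePiecesSketch
{
    // Returns the UTF-16 code units that represent one Rune inside a string.
    public static char[] ToChars(Rune rune)
    {
        // A Rune never needs more than two chars in UTF-16.
        Span<char> buffer = stackalloc char[2];
        int written = rune.EncodeToUtf16(buffer);
        return buffer.Slice(0, written).ToArray();
    }
}
```

`ToChars(new Rune('A'))` produces one char, while `ToChars(new Rune(0x1D122))` produces the two surrogates U+D834 and U+DD22.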

@Serentty "Rune holds the number of chars necessary to represent a single Unicode scalar value", maybe? From there the details can be elaborated.

~~@Entomy Better to stick to established Unicode terminology, which would be a code point in this case. https://unicode.org/glossary/~~

@miloush I am

> Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive. (See definition D76 in Section 3.9, Unicode Encoding Forms.)
>
> Unicode Glossary

> Represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; or [ U+E000..U+10FFFF ], inclusive).
>
> MS Docs

@miloush
It actually is a Unicode scalar value and not a code point in this case. This is the official Unicode terminology. A scalar value is any code point which is not a surrogate, which is exactly what a Rune is.
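The distinction can be checked directly against the type. This is a minimal sketch with a made-up wrapper class; `Rune.IsValid` and `Rune.TryCreate` are the framework methods it relies on.

```csharp
using System;
using System.Text;

public static class ScalarValueSketch
{
    // True when the code point is a Unicode scalar value, i.e. storable in a Rune.
    public static bool IsScalarValue(int codePoint) => Rune.IsValid(codePoint);

    // A lone surrogate char is a code point but not a scalar value,
    // so TryCreate returns false for it.
    public static bool TryMakeRune(char c, out Rune rune) => Rune.TryCreate(c, out rune);
}
```

`IsScalarValue(0xD800)` is false and `IsScalarValue(0xE000)` is true, matching the two ranges quoted above. The `Rune(int)` constructor throws `ArgumentOutOfRangeException` for the same inputs that `IsValid` rejects.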

@Entomy

I find this kind of explanation unsatisfactory for a few reasons.

  1. It's UTF-16 specific, whereas Rune isn't. It's the same type you would use when enumerating a Utf8String. Encouraging people to think of Runes as things constructed out of one or two UTF-16 code units is the wrong approach in my opinion.
  2. It encourages thinking of char as the fundamental character type, with Rune being built on top of it. I think going forward we should be doing the opposite: Rune should be thought of as the fundamental character type, with char being a way of representing a Rune in pieces.
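To illustrate the first point, Runes can be produced from UTF-8 bytes without a char ever being involved. Since Utf8String is only a proposal at this point, the sketch below assumes a plain byte span as input; `Utf8RuneSketch` and `DecodeAll` are illustrative names, while `Rune.DecodeFromUtf8` is an existing framework method.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Text;

public static class Utf8RuneSketch
{
    // Decodes well-formed UTF-8 bytes into Runes without going through char.
    public static List<Rune> DecodeAll(ReadOnlySpan<byte> utf8)
    {
        var runes = new List<Rune>();
        while (!utf8.IsEmpty)
        {
            OperationStatus status =
                Rune.DecodeFromUtf8(utf8, out Rune rune, out int bytesConsumed);
            if (status != OperationStatus.Done)
            {
                throw new ArgumentException("Input is not well-formed UTF-8");
            }
            runes.Add(rune);
            // Each Rune occupies one to four bytes in UTF-8.
            utf8 = utf8.Slice(bytesConsumed);
        }
        return runes;
    }
}
```

Decoding the bytes `46 F0 9D 84 A2` gives two Runes, U+0046 and U+1D122, the same values that `"F𝄢".EnumerateRunes()` yields from the UTF-16 string.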

@Serentty I actually completely forgot Utf8String is a thing. Good point. And yes, I agree about encouraging Rune being the default.

My bad - looks like I ironically misunderstood what Rune is, thanks for the clarification.

@Serentty from what I can gather from the proposals, it looks like Utf8String still intends to provide Char, so I think my description might still hold, since I didn't explicitly state "1 or 2". Although this entirely depends on how they implement that API.

Maybe "Rune consolidates multiple bytes or chars into a single Unicode scalar value" fits better though, since this also introduces the intent of the type.

I think an explanation like this would be good:

“A Rune is C#'s character type, and holds a single 32-bit Unicode scalar value (approximately equivalent to a code point). It might contain a whole character, a combining diacritical mark, whitespace, or a control character. Inside a C# String, it is represented as either one or two 16-bit chars in sequence. Inside of a UTF-8 string (Utf8String), it is represented as a sequence of one to four bytes.

In most cases, users should use Rune directly for indexing and iterating instead of char, as it does not require the user to consider the size of each Rune in memory.”

And then it could go on to show examples of how to iterate and index a string using Runes.
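Such examples might look roughly like the following sketch. The wrapper class and method names are invented; `String.EnumerateRunes()` and `Rune.GetRuneAt` are the framework APIs assumed. Note that `GetRuneAt` takes a char index into the string, not a Rune index.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RuneIterationSketch
{
    // Iterating: collects the scalar values of a string as "U+XXXX" labels.
    public static List<string> Describe(string s)
    {
        var labels = new List<string>();
        foreach (Rune rune in s.EnumerateRunes())
        {
            labels.Add($"U+{rune.Value:X4}");
        }
        return labels;
    }

    // Indexing: reads the Rune that starts at the given char index.
    public static Rune RuneAt(string s, int charIndex) => Rune.GetRuneAt(s, charIndex);
}
```

For the string "𝄢a", `Describe` returns "U+1D122" and "U+0061", `RuneAt(s, 0)` returns U+1D122, and the letter a sits at char index 2 because the clef occupies two chars.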

I do like the direction the explanation of @Serentty is going, but I think there are a few confusing bits:

  • 32-bit is confusing since the scalars are actually not 32 bits. From the glossary it seems it's the 'code unit' that can be 32-bit
  • 'approximately equivalent' is very vague, you just expressed it succinctly as 'a code point which is not a surrogate', why is that worse?

"A Rune is C#'s character type holding a single Unicode scalar value (i.e. a code point which is not a surrogate)."

  • The examples of what it might contain is a weird selection (and whitespace does not necessarily mean single character). Why not e.g. full-width character, variation selector or an emoji? Do we need to have the examples in the first place?

Hm... the part about being approximately equivalent to code point could be dropped. That's the part I was most unsure about in terms of help versus harm.

> From the glossary it seems it's the 'code unit' that can be 32-bit

This is only true in UTF-32, an encoding that pretty much no one uses. In UTF-8 a code unit is 8 bits, and in UTF-16 it's 16 bits. A scalar value is in the range from zero to 10FFFF (except for the surrogate range), which, rounded up to the next power of two, requires a 32-bit integer to store. This is why Rune is a 32-bit type (although of course not all possible 32-bit numbers are allowed to be stored in it).
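The arithmetic behind that claim can be spelled out in a few lines. This is only a sketch with invented names; `BitOperations.LeadingZeroCount` is the one framework call it assumes.

```csharp
using System;
using System.Numerics;

public static class ScalarRangeSketch
{
    // The largest Unicode scalar value, U+10FFFF.
    public const int MaxScalarValue = 0x10FFFF;

    // Number of bits needed to represent a non-negative value.
    public static int BitsNeeded(int value) =>
        32 - BitOperations.LeadingZeroCount((uint)value);
}
```

`BitsNeeded(ScalarRangeSketch.MaxScalarValue)` is 21, which does not fit in a 16-bit integer, and the next power-of-two width is 32 bits.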

I'm happy with dropping the explanation but it would be helpful if the 'scalar value' could be linked to the glossary.

Okay, how about "A Rune is C#'s 32-bit character type ..."?

By the way, we should probably drop the C#; is it not intended to work in VB.NET/F#/others?

I'm less happy with that phrasing because it implies that there is still such a thing as a “16-bit character”, which is also a complaint I have about the draft currently posted. So, I would prefer something along the lines of: it's .NET's character type, is 32-bit, and stores a Unicode scalar value.

> A Rune is C#'s 32-bit character type ...

This alone doesn't describe why you'd want to be using Rune, and is even problematic in that it makes it sound less efficient and wasteful. While it's true that it's 32 bits, a description needs to explain why this is a necessary change. It's not just an added data type of a different size. It solves a problem. Leading with the problem it solves sets the tone better.

@Entomy The point I was trying to make is that the 32-bit implementation detail is a property of the type, not of the scalar value.

@miloush I got that, I just wouldn't lead with that detail.

I think the documentation here has to solve two different and conflicting goals.

  1. For experienced developers, it should explain why the current situation is broken and needs to be improved, like what Pitchfork's draft does, and show examples of what _not_ to do anymore.

  2. For new developers, beginning by explaining the old way to do things before explaining what you're actually _supposed_ to do now is confusing and kind of beating around the bush. It runs the risk of failing to explain things clearly and succinctly enough. Plus, it has the unfortunate effect of portraying the old way as the standard practice, and the new way as an optional improvement for people who want to put in extra effort.

So, I think that it might make sense to split this into two. The main documentation, which should describe things from a fresh point of view that starts with Runes and then explains how they can be transformed and represented inside of string types _instead_ of presenting them as a solution to an existing problem, and then a “transition guide” for existing developers. The main documentation could then have a link to the transition guide.

@Entomy Hence my tendency to drop the detail altogether. It could be there for the existing developers' transition document. Splitting it makes sense to me. You also have developers who know how the current situation is broken and want to know how the new type fits into it. Seeing the Remarks section for the char type, it sounds like these and the new developers' documentation could go into Remarks for Rune, with the transition guide linked separately.

@Serentty I strongly agree with that approach.

Moved this to February @tdykstra.
