Runtime: Introducing System.Rune

Created on 16 Sep 2017 · 106 Comments · Source: dotnet/runtime

Inspired by the discussion here:

https://github.com/dotnet/corefxlab/issues/1751

One of the challenges that .NET faces with its Unicode support is that it is rooted in a design that is now obsolete. We represent characters in .NET with System.Char, a 16-bit value that cannot represent every Unicode code point.

.NET developers need to learn about the arcane Surrogate Pairs:

https://msdn.microsoft.com/en-us/library/xcwwfbb8(v=vs.110).aspx

Developers rarely use this support, mostly because they are not familiar enough with Unicode, let alone with what .NET offers for it.
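To make the surrogate pair problem concrete, here is a minimal sketch that uses only existing System.Char APIs. The class and member names (SurrogateDemo, Hwair, CodeUnitCount, CodePointAt) are made up for illustration and are not part of the proposal.

```csharp
using System;

static class SurrogateDemo
{
    // U+10348 GOTHIC LETTER HWAIR lies outside the Basic Multilingual Plane,
    // so UTF-16 needs two 16-bit code units (a surrogate pair) to store it.
    public const string Hwair = "\U00010348";

    // string.Length counts UTF-16 code units, not code points.
    public static int CodeUnitCount(string s) => s.Length;

    // Decodes the code point that starts at index i (one char, or a surrogate pair).
    public static int CodePointAt(string s, int i) => char.ConvertToUtf32(s, i);

    static void Main()
    {
        Console.WriteLine(CodeUnitCount(Hwair));                 // 2
        Console.WriteLine(char.IsSurrogatePair(Hwair, 0));       // True
        Console.WriteLine(CodePointAt(Hwair, 0).ToString("X"));  // 10348
    }
}
```

A single code point occupies two char values here, which is exactly the case that code indexing a string char by char tends to get wrong.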

I propose that we introduce a System.Rune that is backed by a 32-bit integer and corresponds to a code point, and that we surface in C# an equivalent rune type as an alias for it.

rune would become the preferred replacement for char and serve as the foundation for proper Unicode and string handling in .NET.

As for why the name rune, the inspiration comes from Go:

https://blog.golang.org/strings

The section "Code points, characters, and runes" provides the explanation; a short version is:

"Code point" is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as "code point", with one interesting addition.

Update: I now have an implementation of System.Rune here:

https://github.com/migueldeicaza/NStack/blob/master/NStack/unicode/Rune.cs

With the following API:

public struct Rune {

    public Rune (uint rune);
    public Rune (char ch);

    public static ValueTuple<Rune,int> DecodeLastRune (byte [] buffer, int end);
    public static ValueTuple<Rune,int> DecodeLastRune (NStack.ustring str, int end);
    public static ValueTuple<Rune,int> DecodeRune (byte [] buffer, int start, int n);
    public static ValueTuple<Rune,int> DecodeRune (NStack.ustring str, int start, int n);
    public static int EncodeRune (Rune rune, byte [] dest, int offset);
    public static bool FullRune (byte [] p);
    public static bool FullRune (NStack.ustring str);
    public static int InvalidIndex (byte [] buffer);
    public static int InvalidIndex (NStack.ustring str);
    public static bool IsControl (Rune rune);
    public static bool IsDigit (Rune rune);
    public static bool IsGraphic (Rune rune);
    public static bool IsLetter (Rune rune);
    public static bool IsLower (Rune rune);
    public static bool IsMark (Rune rune);
    public static bool IsNumber (Rune rune);
    public static bool IsPrint (Rune rune);
    public static bool IsPunctuation (Rune rune);
    public static bool IsSpace (Rune rune);
    public static bool IsSymbol (Rune rune);
    public static bool IsTitle (Rune rune);
    public static bool IsUpper (Rune rune);
    public static int RuneCount (byte [] buffer, int offset, int count);
    public static int RuneCount (NStack.ustring str);
    public static int RuneLen (Rune rune);
    public static Rune SimpleFold (Rune rune);
    public static Rune To (Case toCase, Rune rune);
    public static Rune ToLower (Rune rune);
    public static Rune ToTitle (Rune rune);
    public static Rune ToUpper (Rune rune);
    public static bool Valid (byte [] buffer);
    public static bool Valid (NStack.ustring str);
    public static bool ValidRune (Rune rune);
    public override bool Equals (object obj);

    [System.Runtime.ConstrainedExecution.ReliabilityContractAttribute((System.Runtime.ConstrainedExecution.Consistency)3, (System.Runtime.ConstrainedExecution.Cer)2)]
    protected virtual void Finalize ();
    public override int GetHashCode ();
    public Type GetType ();
    protected object MemberwiseClone ();
    public override string ToString ();

    public static implicit operator uint (Rune rune);
    public static implicit operator Rune (char ch);
    public static implicit operator Rune (uint value);

    public bool IsValid {
        get;
    }

    public static Rune Error;
    public static Rune MaxRune;
    public const byte RuneSelf = 128;
    public static Rune ReplacementChar;
    public const int Utf8Max = 4;

    public enum Case {
        Upper,
        Lower,
        Title
    }
}

Update: Known Issues

  • [x] Some APIs above take a uint, need to take a Rune.
  • [ ] Need to implement IComparable family
  • [ ] RuneCount/RuneLen need better names, see docs (they should be perhaps Utf8BytesNeeded?)
  • [ ] Above, the "ustring" APIs reference my UTF8 API, this is really not part of the API, but we should consider whether there is a gateway to System.String in some of those, or to Utf8String.
api-needs-work area-System.Runtime up-for-grabs

All 106 comments

Do you expect the in-memory representation to be strings of 32-bit objects, or translated on the fly? What about the memory doubling if the former? What's the performance impact if the latter?

Is naming a Unicode-related technology after a particular Unicode-supported script (and a technology to improve astral plane support after a BMP script, at that) a good idea?

I think the proposal (and perhaps it needs to be made more explicit) is that the in-memory representation of strings does not change at all. The Rune type merely represents a distinct individual 21-bit code point (stored as a 32-bit int). Methods referring to code points could potentially return a Rune instead. Presumably there is some functionality in string that would let you enumerate Runes.
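As a rough sketch of what "enumerate the code points without changing the string" can look like with today's APIs, the following yields Int32 values, which is what a Rune would wrap. CodePointEnumeration and CodePoints are hypothetical names, and the code assumes well-formed UTF-16 input.

```csharp
using System;
using System.Collections.Generic;

static class CodePointEnumeration
{
    // Yields one Int32 code point per iteration; the string itself stays UTF-16.
    // Assumes well-formed input: char.ConvertToUtf32 throws on a lone surrogate.
    public static IEnumerable<int> CodePoints(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            yield return char.ConvertToUtf32(s, i);
            if (char.IsHighSurrogate(s[i])) i++; // skip the low half of the pair
        }
    }

    static void Main()
    {
        // Three code points come out of a string whose Length is 4.
        foreach (int cp in CodePoints("a\U00010348b"))
            Console.WriteLine("U+" + cp.ToString("X4"));
    }
}
```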

I think there's a couple obvious points that we need to get consensus about for something like this:

  1. Is there significant value in creating a Rune type rather than using Int32 as current methods do?
  2. Is the word "rune" actually a good choice?

To answer (1), I think we need a fuller description of how Rune would be exposed, what methods would receive and return it, etc. And to determine whether that is better than having those deal with Int32 instead.

As for (2), I'm a bit hesitant myself. "Rune" is sort of an esoteric word in English, and has some unusual connotations for its use in this context. There is also the point that others are bringing up: it collides with another Unicode concept. When I do a search for "Unicode Rune", I get mainly results for the Runic Unicode block, and only a few for the Go language documentation.

char is both half a word and also a full word, and you have to inspect its surroundings to determine which; likewise it currently represents either half a letter or a full letter.

Perhaps System.character where its always a full letter... :sunglasses:

char is a fairly terrible representation, even for ASCII/Latin-only languages, because the rise of emoji still permeates them; it means char is a "check, and maybe check the next char" type.

@NickCraver on twitter

While utf8 is a variable-width encoding, it's rare (if it happens at all?) that a user wants to deal with half characters, both for utf8 and utf32.

A 32-bit type would work well for enumeration.

More difficult would be indexOf, Length etc. from a performance or memory perspective.

  1. byte array is best representation for an opaque format; e.g. keeping the format in its original format or a final format (file transfer, putting on wire etc)
  2. byte array is best representation for memory bandwidth and memory size
  3. byte array is consistent with Position and indexOf, Length etc in terms of bytes

However, when you start caring about actual characters (uppercasing, splitting on characters, understanding what a character is), byte becomes variable width. Char doesn't really make that any better; it doubles the size of the smallest characters and covers more characters, but is still variable width.

For this a 32-bit value might be very useful from a user code perspective. However, it then has issues with position, length and secondary items (indexOf etc.).

I'm very keen on an ascii only string and a utf8 string "Compact String implementation" https://github.com/dotnet/coreclr/issues/7083; for fast processing of ascii only strings

However, going against everything I was arguing there... I wonder what a 32-bit representation of utf8 would be like? Position would map to position; seeking chars would be as fast as it is in ascii; items would be in native sizes, etc. How would it stack up against processing every byte or char to determine its size?

Conversion to and from would be more expensive; so it would be more of a processing format; than a storage format.

@migueldeicaza as I understand it, you are only referring to expanding the single-character format from a 16-bit char to 32 bits, so that every representation is contained in the value (rather than possibly holding half a value), and not necessarily to the internal string format.

However, there are some things to consider (i.e. the relation to position, the cost of seeking, etc.).

Aside: Swift also deals in whole character formats

Swift provides several different ways to access Unicode representations of strings. You can iterate over the string with a for-in statement, to access its individual Character values as Unicode extended grapheme clusters. This process is described in Working with Characters.

Alternatively, access a String value in one of three other Unicode-compliant representations:

  • A collection of UTF-8 code units (accessed with the string’s utf8 property)
  • A collection of UTF-16 code units (accessed with the string’s utf16 property)
  • A collection of 21-bit Unicode scalar values, equivalent to the string’s UTF-32 encoding form (accessed with the string’s unicodeScalars property)

I said it in the original issue and will say it again. Abandoning what a standard says because you don't like the phrase will confuse more than it will solve, and, given there is a rune code page in Unicode, that just confuses it more.

The name is wrong.

@mellinoe

The Rune would provide many of the operations that today you expect on a Char, like ToLower[Invariant], ToUpper[Invariant], ToTitle, IsDigit, IsAlpha, IsGraphic, IsSymbol, IsControl.

Additionally, it would provide things like:

  • EncodeRune (encodes a rune into a byte buffer)
  • RuneUtf8Len (returns the number of bytes needed to encode the rune in UTF8),
  • IsValid (not all Int32 values are valid)

And interop to string, and Utf8string as needed.
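For readers who want to see what the three operations listed above amount to, here is a minimal sketch over plain Int32 values. It is not the NStack implementation; the names ScalarHelpers, IsValid, Utf8Length and Encode are assumptions, and treating surrogate values as invalid follows Go's behaviour rather than anything decided in this thread.

```csharp
using System;

static class ScalarHelpers
{
    // True for Unicode scalar values: 0..0x10FFFF, excluding the surrogate range.
    public static bool IsValid(int value) =>
        value >= 0 && value <= 0x10FFFF && (value < 0xD800 || value > 0xDFFF);

    // Number of bytes needed to encode the value in UTF-8, or -1 if it is not valid.
    public static int Utf8Length(int value)
    {
        if (!IsValid(value)) return -1;
        if (value < 0x80) return 1;
        if (value < 0x800) return 2;
        if (value < 0x10000) return 3;
        return 4;
    }

    // Writes the UTF-8 encoding into dest at offset; returns bytes written (or -1).
    public static int Encode(int value, byte[] dest, int offset)
    {
        int n = Utf8Length(value);
        switch (n)
        {
            case 1:
                dest[offset] = (byte)value;
                break;
            case 2:
                dest[offset] = (byte)(0xC0 | (value >> 6));
                dest[offset + 1] = (byte)(0x80 | (value & 0x3F));
                break;
            case 3:
                dest[offset] = (byte)(0xE0 | (value >> 12));
                dest[offset + 1] = (byte)(0x80 | ((value >> 6) & 0x3F));
                dest[offset + 2] = (byte)(0x80 | (value & 0x3F));
                break;
            case 4:
                dest[offset] = (byte)(0xF0 | (value >> 18));
                dest[offset + 1] = (byte)(0x80 | ((value >> 12) & 0x3F));
                dest[offset + 2] = (byte)(0x80 | ((value >> 6) & 0x3F));
                dest[offset + 3] = (byte)(0x80 | (value & 0x3F));
                break;
        }
        return n;
    }
}
```

Having these hang off a dedicated type, rather than being loose helpers over Int32, is the "place to look for valid operations" argument made for Rune.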

I ported/adjusted the Go string support to .NET, and it offers a view of what this world would look like (this is without any runtime help):

https://github.com/migueldeicaza/NStack/tree/master/NStack/unicode

@benaadams said:

I wonder what a 32bit representation of utf8 would be like? Position would map to position; seeking chars would be fast as it is in ascii, items are in native sizes etc how would it stack up against processing every byte or char to determine its size?

UTF8 is an in-memory representation, that would continue to exist and would continue to be the representation (and hopefully, this is the longer term internal encoding for future strings in .NET).

You would decode the existing UTF16 strings (System.String) or the upcoming UTF8 strings (Utf8String) not into Chars (for the reason both you and I agree on), but into Runes.

Some examples, convert a Utf8 string into runes:

https://github.com/migueldeicaza/NStack/blob/6a071ca5c026ca71c10ead4f7232e2fa0673baf9/NStack/strings/ustring.cs#L756

Does a utf8 string contain a rune:

https://github.com/migueldeicaza/NStack/blob/6a071ca5c026ca71c10ead4f7232e2fa0673baf9/NStack/strings/ustring.cs#L855

I just noticed I did not implement the indexer ("Get me the n-th rune")

The speed of access to the Nth-rune in a string is a function of the storage, not of the Rune itself. For example, if your storage is UTF32, you have direct access to every rune. This is academic, as nobody uses that. Access to the Nth element on UTF16 and UTF8 requires the proper scanning of the elements making up the string (bytes or 16-bit ints) to determine the right boundary. Not to be confused with String[int n] { get; } which just returns the n-th character, regardless of correctness.
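A small sketch of the scanning described above, for UTF-8 storage: finding where the n-th code point starts means walking the bytes and counting lead bytes. Utf8Scan and OffsetOfNth are hypothetical names, and the code assumes the buffer is well-formed UTF-8.

```csharp
static class Utf8Scan
{
    // Returns the byte offset where the n-th (0-based) code point starts, or -1 if
    // the buffer holds fewer than n + 1 code points. Assumes well-formed UTF-8.
    public static int OffsetOfNth(byte[] utf8, int n)
    {
        int seen = 0;
        for (int i = 0; i < utf8.Length; i++)
        {
            // Continuation bytes look like 10xxxxxx; every other byte starts a code point.
            bool isLead = (utf8[i] & 0xC0) != 0x80;
            if (!isLead) continue;
            if (seen == n) return i;
            seen++;
        }
        return -1;
    }
}
```

The cost is linear in the offset being sought, whatever type the element is eventually returned as, which is the point being made: the price comes from the storage format, not from Rune.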

@benaadams The Swift Character is a level higher up from a rune. Characters in Swift are "extended grapheme clusters", which are made up of one or more runes that, when combined, produce a human-readable character.

So the Swift character does not have a fixed 32-bit size, it is variable length (and we should also have that construct, but that belongs in a different data type). Here is the example from that page, but this also extends to setting the tint of an emoji:

Here’s an example. The letter é can be represented as the single Unicode scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same letter can also be represented as a pair of scalars: a standard letter e (LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE ACCENT scalar (U+0301). The COMBINING ACUTE ACCENT scalar is graphically applied to the scalar that precedes it, turning an e into an é when it’s rendered by a Unicode-aware text-rendering system.
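The same example can be checked in .NET today with string.Normalize and System.Globalization.StringInfo. This is only a sketch of the distinction between code points and grapheme clusters; GraphemeDemo and its members are illustrative names.

```csharp
using System;
using System.Globalization;
using System.Text;

static class GraphemeDemo
{
    public const string Precomposed = "\u00E9";   // LATIN SMALL LETTER E WITH ACUTE
    public const string Decomposed = "e\u0301";   // e + COMBINING ACUTE ACCENT

    // Number of user-perceived characters ("text elements") in s.
    public static int TextElementCount(string s) => new StringInfo(s).LengthInTextElements;

    static void Main()
    {
        Console.WriteLine(Decomposed.Length);            // 2 (two chars, two code points)
        Console.WriteLine(TextElementCount(Decomposed)); // 1 (one text element)
        // Canonical composition (NFC) turns the pair into the single scalar U+00E9.
        Console.WriteLine(Decomposed.Normalize(NormalizationForm.FormC) == Precomposed);
    }
}
```

So a 32-bit code point type fixes the surrogate pair problem, but a decomposed é still takes two of them; that layer needs a separate, variable-length type.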

Just for me grapheme word would be more self-describing.

My two cents on the name, quoting again the Go post on strings with emphasis:

"Code point" is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as "code point", with one interesting addition.

I 100% agree with @blowdart, calling it rune is just confusing and wrong. The Unicode standard mentions code points three times just on the first page of the introduction chapter, but the term rune appears nowhere.

If it’s a code point, then it should be named code point, simple as that.

If the term rune never appeared in the standard, it could be okay, the problem is that it appears several times in chapter 8, in relation to runes. It's not just wrong, it's actively confusing the matter with another.

Just for me grapheme word would be more self-describing.

If this is about 32-bit code-points the term grapheme would be confusing because a grapheme is something else again.

I've often wanted a code-point datatype (not in a good while, as what I've worked on has changed, but a few years ago I wanted this a lot and wrote overlapping partial solutions to parts of that need and could have done with a well-tested library). I don't see why this shouldn't be called something like CodePoint. Most people who realise they needed such a type would likely be thinking in terms of code-points anyway, not in terms of runes; or else in terms of code-points and runes as separate parts of their task. ᚱᚢᚾᚪ ᛒᛇᚩ ᛄᛁᛚᛖ ᛒᚱᚣᚳᛖᚹ/rúna béoþ stille bryceu/runes are still used. I only need to use runes about once a year, and generally with parchment and ink rather than anything digital, but there are certainly people who deal with them digitally too. (Even with 20th century data, I know of a case where they're in use in archiving WWII-era data).

Grapheme is trickier still, since one often wants to go octets → chars (nicely handled by .NET already) then chars → code-points, and then code-points → graphemes.

flagging this as up-for-grabs for now.

Next Steps: What we are looking for is: a formal proposal that will include the feedback from above (the actual naming of the type, and the advantages of using this as opposed to just using an Int32).

I have updated the issue, both with the proposed API and an initial implementation:

https://github.com/migueldeicaza/NStack/blob/master/NStack/unicode/Rune.cs

As for the naming of the type, it is both a matter of having a place where you can look for the valid operations on the type, as well as having type-specific capabilities (see the implementation for some examples).

@migueldeicaza before flagging it as ready for review, what are your thoughts regarding the concerns about the actual naming of the type? Do you think that perhaps CodePoint might be better in terms of describing what the type is?

I think the argument for using codepoint as a name is weak.

Using it is a terrible idea, in the long term, this needs to replace every single use of "char" in existing code - if we hope to get proper Unicode support.

I wish we could have used "char" like Rust does, but sadly, we already took it and we have a broken one.

Go having embraced this name is a good precedent.

I agree that code point isn't the correct term to use here. At the very least, based on the Unicode standard it does not include values above 10FFFF (http://unicode.org/glossary/#code_point).

I don't like the term rune. I think it has an existing use in Unicode and elsewhere that will only cause confusion overall. I also think it has a pretty good chance of conflicting with existing user types (especially for things like Unity, where a 'Rune' might represent a specific game object).

However, I do like the idea of a type that covers the C++ 11 char32_t type, just with a different name.

There's something to be said for Char32. It's to the point, it's analogous to the type names of the integral types. It talks at the character conceptual level, rather than the code-point level. It isn't the name of a script.

Since we are looking at having nint how about nchar?

The precedent would be in databases nchar and nvarchar

Where nchar are national char / national character and nvarchar is national char varying / national character varying; which are the field types you can store unicode to, also some ISO standard - not sure which, maybe SQL?

What is this Unicode use of rune? That is news to me.

U+16A0 to U+16F8

It is used to refer to a specific code page in the Unicode standard. It has been brought up a few times in this thread: http://unicode.org/charts/PDF/U16A0.pdf

Ah runic, not rune.

The backing name (System.Rune or System.Char32) is not as important as the label that will be projected into C#.

Firstly: yes, yes, and more of this please. I love this idea (honestly, I've had a similar idea going for a long time now). In fact we've been using a custom string class and character struct in our Git compatibility layer in Visual Studio for a while now (Git speaks in Utf-8 and transcoding everything is very slow).

On the topic of static method names, can we avoid arbitrary short-naming please? Given that Char.IsPunctuation is the current method can we please mirror that with Rune.IsPunctuation or similar?

Assuming (always dangerous) that this gets accepted, can we have an intrinsic rune or c32, or just replace char completely with the System.Rune implementation?

I suggest unichar or uchar, although uchar would look like it's an unsigned char. Whichever is chosen, though, I do hope we get a language-specific alias for it. I personally am a big fan of using the language aliases for primitive types.

Also I agree with @whoisj - Would definitely prefer full method names over short/abbreviations.

Also I agree with @whoisj - Would definitely prefer full method names over short/abbreviations.

IMO a language (and its libraries) needs to choose either full names or abbreviated names, or go whole hog on the abbreviations (like C with strcmp, memcpy, etc.)

or just replace char completely with the System.Rune implementation?

That would be a breaking change for fairly obvious reasons.

That would be a breaking change for fairly obvious reasons.

My comment was mostly tongue-in-cheek, and hopeful. A 16-bit type for character was a mistake from the start.

Good catch on the naming, will fix.

There are other small inconsistencies in the provided API, will take a look at fixing those as well.

@migueldeicaza

Ah runic, not rune.

Runic is the adjective, rune the noun. All the runic characters are runes.

_Runic_ is the adjective, _rune_ the noun. All the runic characters are runes.

Fair as it seems "Cortana: define _'rune'_" comes up with:

a letter of an ancient Germanic alphabet, related to the Roman alphabet.

Ah yes, whenever I see the word "rune", I immediately think of this obscure chapter on a spec nobody has read that talks about "The Runic Unicode Block".

😆 I think of childhood memories of reading Tolkien.

ᛁ᛫ᚦᛁᛜᚲ᛫ᛟᚠ᛫ᚱᚢᚾᛖᛋ

Yeah, I don't specifically think of the spec, but I do think of the type of characters that the spec refers to.

You say rune and I think of magic, fantasy, cryptic puzzles, ancient languages, etc.

I am glad that you do not see the word "rune" and immediately think "Ah this clearly refers to the Unicode 7.0 runic block whose value will be limited to those unique values in the range 16A0..16F8".

I know that Tanner is a single voice here, and some of you are still thinking "But Miguel, I see the word 'rune' and I immediately think of a data type that could ever only hold 88 possible values". If this is a problem you are struggling with, my brother/sister, I have news for you: you have bigger fish to fry.

I've been following this thread for a while with a mixture of excitement and hesitancy for a little over a month. I attended the Internationalization and Unicode Conference last month, and none of the presentations dealt with .NET. There is a perception problem with the .NET Framework; one that isn't necessarily unearned given the history of its globalization features. That being said, I love programming in C# and absolutely want to see new features that reinforce .NET's place in a truly global community. I think this proposal is a good step in that direction of embracing the standards that the internationalization community expects of software.

My hesitancy has mostly been over the bickering about the type name. While it is true that the designers of Go chose the name "rune", that's problematic for the reason listed above repeatedly: there are code points that are properly called runes. It is hard for me to agree with a proposal that tries to hew closely to a respected standard, and then redefines terminology that is part of the specification. Furthermore, the argument that most developers are ignorant of the term is specious given that the developers most interested in using this type correctly are more likely to understand the Unicode specification and have a good idea what a "rune" actually is. Imagine the oddity that could exist if you mixed the terminology:

Rune.IsRune(new Rune('ᛁ')); // evaluates to true
Rune.IsRune(new Rune('I')); // evaluates to false

Of course, I've taken the easy path here, critiquing without providing a new name. I think the previous suggestion of CodePoint is the most self-descriptive option (and it appears in the original issue description), but char32 would have more parity with the existing primitive types (although I hesitate, since not every code point is a character). If the goal is building better Unicode support into .NET, I'm absolutely supportive of that path, but the best way to do that is to follow the spec.

Three suggestions:

  1. The Rune class is missing the critical "IsCombining". Without that, we can't convert from a series of runes (code points) into a series of graphemes.
  2. I'd love to also have a corresponding Grapheme class. A grapheme in this context is really just a list of one or more Runes (Code Points) such that the first rune isn't combining and the rest of the runes are combining. The use case is for when a developer needs to deal with chunks of "visible characters". For example, a + GRAVE is two runes that form one grapheme.
  3. In networking we often get a hunk of bytes which we need to turn into a "string" like object where the bytes might not be complete (e.g., we get told of some bytes, but the last byte in a multi-byte sequence hasn't quite arrived yet). I don't see any obvious way of converting a stream of bytes into a stream of runes such that missing the last byte of a multi-byte sequence is considered a normal situation that will be rectified when we get the next set of bytes in.
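For the networking case in the last suggestion, the existing System.Text.Decoder already keeps an incomplete trailing sequence between calls, although it produces chars rather than runes. A minimal sketch follows; StreamingDecodeDemo and DecodeChunks are illustrative names, and a rune-based equivalent would be a new API.

```csharp
using System;
using System.Text;

static class StreamingDecodeDemo
{
    // Feeds chunks to a single stateful Decoder and returns everything decoded.
    // A multi-byte sequence split across two chunks is completed by the later chunk.
    public static string DecodeChunks(params byte[][] chunks)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var result = new StringBuilder();
        foreach (byte[] chunk in chunks)
        {
            // Worst-case buffer size for this chunk plus any bytes held over.
            char[] chars = new char[Encoding.UTF8.GetMaxCharCount(chunk.Length)];
            int written = decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
            result.Append(chars, 0, written);
        }
        return result.ToString();
    }

    static void Main()
    {
        // "é" is C3 A9 in UTF-8; the two bytes arrive in different chunks.
        string text = DecodeChunks(new byte[] { 0x61, 0xC3 }, new byte[] { 0xA9, 0x62 });
        Console.WriteLine(text == "a\u00E9b"); // True
    }
}
```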

And lastly, please use Unicode names and call this a CodePoint. Yes the Unicode consortium does a terrible job at explaining the difference. But the solution is to add clear and usable documentation; anything else confuses the issue instead of helping to clarify.

I do not know where to start on the combining request; none of Go, Rust, or Swift surfaces such an API on rune, Character or Unicode Scalar (their names for System.Rune). Please provide a proposed implementation.

On grapheme clusters, it is a good idea, but it should be tracked independently of System.Rune. For what it's worth, Swift uses Character for this, but Swift is also not a great model for handling strings.

Turning streams of bytes into a proper rune is a problem that belongs to a higher level API. That said, you can look at my ustring implementation that uses the same substrate as my System.Rune implementation to see how these buffers are mapped into utf8 strings:

https://github.com/migueldeicaza/NStack/blob/master/NStack/strings/ustring.cs

Documentation, which I have not updated yet since I introduced System.Rune into the API, but covers it:

https://migueldeicaza.github.io/NStack/api/NStack/NStack.ustring.html

As for naming, clearly Rust is the best one with char, but we messed that one up. The second best is Go with rune. Anything larger than four characters will just be a nuisance for people to do the right thing.

I'm sorry; I think CodePoint is an outstandingly good name. It's self-explanatory, memorable, and autocompletes with cp.

IsCombining would definitely be necessary, but so too is knowing the combining class and once we have that IsCombining is largely sugar as it's just IsCombining => CombiningClass != 0 or IsCombining => CombiningClass != CombiningClass.None. Grapheme clusters would indeed be outside of it again, but the starting point would be knowing the combining class for default clustering, reordering, etc.

CodePoint is a great name for a type about code points, and four characters is hardly a limit we have to deal with with other heavily used types; string is 50% larger and doesn't prevent us using it regularly. Four randomly picked letters would be a better name than repeating Go's mistake.

Since uint isn't CLS-compliant, there's no CLS-compliant ctor that covers the astral planes. int would be necessary too.

Two-way implicit conversions can lead to bad things happening with overloads, so one direction should perhaps be explicit. It's not clear which. On the one hand uint/int is wider than code-points, as values below 0 or above 10FFFF₁₆ aren't meaningful, and having that conversion implicit allows for quicker use of more existing APIs for numbers. On the other hand I can see wanting to cast from a number to a code-point more often than the other way around.
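One way to read the trade-off above is the following sketch, which picks "implicit to int, explicit from int" purely for illustration. CodePoint here is a hypothetical type, not an existing .NET API, and the choice of direction is an assumption rather than a conclusion from this thread.

```csharp
using System;

// Hypothetical type, for illustration only; not an existing .NET API.
public readonly struct CodePoint : IEquatable<CodePoint>, IComparable<CodePoint>
{
    private readonly int _value;

    // int keeps the constructor CLS-compliant while still covering 0..0x10FFFF.
    public CodePoint(int value)
    {
        if (value < 0 || value > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(value));
        _value = value;
    }

    public int Value => _value;

    // Widening to int never fails, so it can be implicit.
    public static implicit operator int(CodePoint cp) => cp._value;

    // Narrowing from int can throw, so it is explicit.
    public static explicit operator CodePoint(int value) => new CodePoint(value);

    // Every char value, including a lone surrogate, is a code point in 0..0xFFFF.
    public static implicit operator CodePoint(char ch) => new CodePoint(ch);

    public bool Equals(CodePoint other) => _value == other._value;
    public override bool Equals(object obj) => obj is CodePoint other && Equals(other);
    public override int GetHashCode() => _value;
    public int CompareTo(CodePoint other) => _value.CompareTo(other._value);
    public override string ToString() => "U+" + _value.ToString("X4");
}
```

With only one implicit direction per numeric type, a call such as Math.Max(cp, 0x41) resolves to the int overload without ambiguity, while turning an arbitrary number into a code point requires a visible cast.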

Since uint isn't CLS-compliant, there's no CLS-compliant ctor that covers the astral planes. int would be necessary too.

That is unless a new intrinsic type were introduced into the common language.

JonHanna -- do you mean that these three constructors:
public static implicit operator uint (Rune rune);
public static implicit operator Rune (char ch);
public static implicit operator Rune (uint value);

should be "int" instead of "uint". AFAICT, int easily covers the entire set of astral (non-BMP) planes.

@PeterSmithRedmond I mean that as well as the two constructors, one taking char and one taking uint, there should be one taking int, but yes there should also be an int conversion operator (just what should be implicit and what explicit is another question). There's no harm having uint too for those languages that can use it; it's quite a natural match after all.

If this is meant to replace System.Char, it should be possible to do "arithmetic" on it (that is ==, !=, >, <; unsure about +, -, *, /) and, more importantly, there should be support for literals of this type. For example, I should be able to write:

rune r = '𐍈'; // Ostrogothic character chosen on purpose, because in UTF-16 it is a surrogate pair


If not rune, only other synonym of character that could work is perhaps letter?

noun

  1. a written or printed communication addressed to a person or organization and usually transmitted by mail.
  2. a symbol or character that is conventionally used in writing and printing to represent a speech sound and that is part of an alphabet.
  3. a piece of printing type bearing such a symbol or character.

Though that would conflict with letter vs number

Letter has an even more precise meaning in Unicode (and .NET in general) than rune.

I think, if we're going to make this a Unicode character type we need to follow Unicode's naming conventions; which means _"code point"_.

Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.

Or maybe we just give up and call a duck a "duck" and refer to them as Unicode Characters (aka uchar).

Why not just solve this by using System.CodePoint instead?
IMHO it is more proper in terms of Unicode terminology, and folks in the Java world are already using it. So instead of inventing a term of our own, let's abide by Unicode terms. It makes more sense and is more universal for general character and string implementation in .NET, given that String in .NET is a collection of char, and this collection of char is Unicode-based.

I know, because I have lived in both the Java and .NET worlds.
And maybe let's start on a draft implementation of this.

Really there are two components of this and both would be required (CodeUnit in https://github.com/dotnet/corefxlab/issues/1799 by @GrabYourPitchforks)

C# keyword      Ugly Long form      Size
----------------------------------------
ubyte      <=>  System.CodeUnit    8 bit  - Assumed Utf8 in absence of encoding param
uchar      <=>  System.CodePoint  32 bit

CodeUnit/ubyte are important for representing variable width encoding and for use in Span<ubyte> to ensure text apis are available on text types but not raw bytes.

CodePoint/uchar is important for sensible processing; e.g. .IndexOf(❀) as ubyte by itself cannot be used to search for a multibyte unicode char; and enumerating over ubytes would be fraught with peril, so the enumerator should work in uchar units.
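To illustrate why a single code unit cannot be used to search for a multi-byte character, here is a sketch that searches a UTF-8 buffer for a whole code point by encoding it first. Utf8Search is a hypothetical helper working on plain byte[] and int, since ubyte and uchar do not exist; it relies on the haystack being well-formed UTF-8.

```csharp
using System;
using System.Text;

static class Utf8Search
{
    // Returns the byte (code unit) index of the first occurrence of codePoint, or -1.
    // In well-formed UTF-8 a complete encoded sequence can only match at a
    // code point boundary, so a plain byte-wise search is sufficient.
    public static int IndexOf(byte[] utf8, int codePoint)
    {
        byte[] needle = Encoding.UTF8.GetBytes(char.ConvertFromUtf32(codePoint));
        for (int i = 0; i + needle.Length <= utf8.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && utf8[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }

    static void Main()
    {
        byte[] text = Encoding.UTF8.GetBytes("I \u2764 .NET");
        Console.WriteLine(IndexOf(text, 0x2764)); // 2: a code unit index, not a code point index
    }
}
```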

Combining the two proposals it would be something like

using System;
using System.Runtime.InteropServices;

// C# Keywords
using ubyte = System.CodeUnit;
using uchar = System.CodePoint;
using uspan = System.Utf8Span;
using ustring = System.Utf8String;

namespace System
{
    public ref struct Utf8Span
    {
        private readonly ReadOnlySpan<ubyte> _buffer;

        public Utf8Span(ReadOnlySpan<ubyte> span) => _buffer = span;
        public Utf8Span(uspan span) => _buffer = span._buffer;
        public Utf8Span(ustring str) => _buffer = ((uspan)str)._buffer;
        public Utf8Span(ReadOnlyMemory<ubyte> memory) => _buffer = memory.Span;

        // Returns the CodeUnit index, not CodePoint index
        public int IndexOf(char value) => IndexOf(value, 0);
        public int IndexOf(char value, int startIndex) => IndexOf(value, startIndex, _buffer.Length - startIndex);
        public int IndexOf(char value, int startIndex, int count);
        public int IndexOf(char value, StringComparison comparisonType);

        public int IndexOf(uchar value) => IndexOf(value, 0);
        public int IndexOf(uchar value, int startIndex) => IndexOf(value, startIndex, _buffer.Length - startIndex);
        public int IndexOf(uchar value, int startIndex, int count);
        public int IndexOf(uchar value, StringComparison comparisonType);

        public uspan Substring(int codeUnitIndex);
        public uspan Substring(int codeUnitIndex, int codePointCount);

        public bool StartsWith(uchar ch) => _buffer.Length >= 1 && _buffer[0] == ch;
        public bool StartsWith(ustring str) => StartsWith((uspan)str);
        public bool StartsWith(uspan value) => _buffer.StartsWith(value._buffer);
        public bool EndsWith(uchar ch) => _buffer.Length >= 1 && _buffer[_buffer.Length - 1] == ch;
        public bool EndsWith(ustring str) => EndsWith((uspan)str);
        public bool EndsWith(uspan value) => _buffer.EndsWith(value._buffer);

        public Enumerator GetEnumerator() => new Enumerator(this);

        // Iterates in uchar steps, not ubyte steps
        public ref struct Enumerator
        {
            public Enumerator(uspan span);

            public uchar Current;
            public bool MoveNext();
            public void Dispose() { }
            public void Reset() => throw new NotSupportedException();
        }
    }

    public class Utf8String
    {
        private readonly ReadOnlyMemory<ubyte> _buffer;

        public Utf8String(ustring str) => _buffer = str._buffer;
        public Utf8String(ReadOnlyMemory<ubyte> memory) => _buffer = memory;

        public bool StartsWith(uchar ch) => ((uspan)this).StartsWith(ch);
        public bool StartsWith(ustring value) => ((uspan)this).StartsWith(value);
        public bool StartsWith(uspan value) => ((uspan)this).StartsWith(value);
        public bool EndsWith(uchar ch) => ((uspan)this).EndsWith(ch);
        public bool EndsWith(ustring value) => ((uspan)this).EndsWith(value);
        public bool EndsWith(uspan value) => ((uspan)this).EndsWith(value);

        public static implicit operator uspan(ustring value) => new uspan(value._buffer);

        // Returns the CodeUnit index, not CodePoint index
        public int IndexOf(char value) => IndexOf(value, 0);
        public int IndexOf(char value, int startIndex) => IndexOf(value, startIndex, _buffer.Length - startIndex);
        public int IndexOf(char value, int startIndex, int count);
        public int IndexOf(char value, StringComparison comparisonType);

        public int IndexOf(uchar value) => IndexOf(value, 0);
        public int IndexOf(uchar value, int startIndex) => IndexOf(value, startIndex, _buffer.Length - startIndex);
        public int IndexOf(uchar value, int startIndex, int count);
        public int IndexOf(uchar value, StringComparison comparisonType);

        public ustring Substring(int codeUnitIndex);
        public ustring Substring(int codeUnitIndex, int codePointCount);

        public uspan.Enumerator GetEnumerator() => ((uspan)this).GetEnumerator();
    }

    [StructLayout(LayoutKind.Auto, Size = 1)]
    public struct CodeUnit : IComparable<ubyte>, IEquatable<ubyte>
    {
        private readonly byte _value;

        public CodeUnit(ubyte other) => _value = other._value;
        public CodeUnit(byte b) => _value = b;

        public static bool operator ==(ubyte a, ubyte b) => a._value == b._value;
        public static bool operator !=(ubyte a, ubyte b) => a._value != b._value;
        public static bool operator <(ubyte a, ubyte b) => a._value < b._value;
        public static bool operator <=(ubyte a, ubyte b) => a._value <= b._value;
        public static bool operator >(ubyte a, ubyte b) => a._value > b._value;
        public static bool operator >=(ubyte a, ubyte b) => a._value >= b._value;

        public static implicit operator byte(ubyte value) => value._value;
        public static explicit operator ubyte(byte value) => new ubyte(value);

        // other implicit conversions go here
        // if intrinsic then casts can be properly checked or unchecked

        public int CompareTo(ubyte other) => _value.CompareTo(other._value);

        public override bool Equals(object other) => (other is ubyte cu) && (this == cu);

        public bool Equals(ubyte other) => (this == other);

        public override int GetHashCode() => _value;

        public override string ToString() => _value.ToString();
    }

    [StructLayout(LayoutKind.Auto, Size = 4)]
    public struct CodePoint : IComparable<uchar>, IEquatable<uchar>
    {
        private readonly uint _value;

        public CodePoint(uint CodePoint);
        public CodePoint(char ch);

        public static ValueTuple<uchar, int> DecodeLastCodePoint(ubyte[] buffer, int end);
        public static ValueTuple<uchar, int> DecodeLastCodePoint(ustring str, int end);
        public static ValueTuple<uchar, int> DecodeCodePoint(ubyte[] buffer, int start, int n);
        public static ValueTuple<uchar, int> DecodeCodePoint(ustring str, int start, int n);
        public static int EncodeCodePoint(uchar CodePoint, ubyte[] dest, int offset);
        public static bool FullCodePoint(ubyte[] p);
        public static bool FullCodePoint(ustring str);
        public static int InvalidIndex(ubyte[] buffer);
        public static int InvalidIndex(ustring str);
        public static bool IsControl(uchar CodePoint);
        public static bool IsDigit(uchar CodePoint);
        public static bool IsGraphic(uchar CodePoint);
        public static bool IsLetter(uchar CodePoint);
        public static bool IsLower(uchar CodePoint);
        public static bool IsMark(uchar CodePoint);
        public static bool IsNumber(uchar CodePoint);
        public static bool IsPrint(uchar CodePoint);
        public static bool IsPunctuation(uchar CodePoint);
        public static bool IsSpace(uchar CodePoint);
        public static bool IsSymbol(uchar CodePoint);
        public static bool IsTitle(uchar CodePoint);
        public static bool IsUpper(uchar CodePoint);
        public static int CodePointCount(ubyte[] buffer, int offset, int count);
        public static int CodePointCount(ustring str);
        public static int CodePointLen(uchar CodePoint);
        public static uchar SimpleFold(uchar CodePoint);
        public static uchar To(Case toCase, uchar CodePoint);
        public static uchar ToLower(uchar CodePoint);
        public static uchar ToTitle(uchar CodePoint);
        public static uchar ToUpper(uchar CodePoint);
        public static bool Valid(ubyte[] buffer);
        public static bool Valid(ustring str);
        public static bool ValidCodePoint(uchar CodePoint);

        public static bool operator ==(uchar a, uchar b) => a._value == b._value;
        public static bool operator !=(uchar a, uchar b) => a._value != b._value;
        public static bool operator <(uchar a, uchar b) => a._value < b._value;
        public static bool operator <=(uchar a, uchar b) => a._value <= b._value;
        public static bool operator >(uchar a, uchar b) => a._value > b._value;
        public static bool operator >=(uchar a, uchar b) => a._value >= b._value;

        // etc
    }
}

I've been using UnicodeScalar in my prototype implementations to refer to a Unicode scalar value (values in the range U+0000..U+10FFFF, inclusive; excluding surrogate code points) and Utf8Char to refer to the UTF-8 code unit. Seems like a lot of people prefer _Rune_ instead of _UnicodeScalar_ because it's less of a mouthful. I don't care too much, but I will point out that the term "Unicode scalar value" is the same term used by the Unicode specification. ;)

The .NET Framework also has the concept of a "text element", which is one or more scalars which when combined create a single indivisible grapheme. More info on this at MSDN. In particular, when you enumerate a string you may want to enumerate by code unit (Utf8Char or Char), scalar value (UnicodeScalar), or text element, depending on your particular scenario. Ideally we'd support all three types across both String and Utf8String.
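To make the three levels concrete, here is a small sketch. It assumes the string.EnumerateRunes and System.Text.Rune APIs that shipped after this comment was written, so it illustrates the idea rather than the prototype being described.

```csharp
using System;
using System.Globalization;
using System.Text;

// 'e' + U+0301 (combining acute accent) + U+1F600 (an emoji outside the BMP).
string text = "e\u0301\U0001F600";

// Level 1: UTF-16 code units (System.Char).
int codeUnits = text.Length;

// Level 2: Unicode scalar values.
int scalars = 0;
foreach (Rune rune in text.EnumerateRunes())
{
    scalars++;
}

// Level 3: text elements (the base letter and its combining mark count as one).
int textElements = new StringInfo(text).LengthInTextElements;

Console.WriteLine($"{codeUnits} code units, {scalars} scalars, {textElements} text elements");
```

The same string yields three different counts, so the right enumeration unit depends on what the caller is trying to measure.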

The API surface for our prototype isn't finished and is subject to rapid change, but you can see some current thinking at https://github.com/dotnet/corefxlab/tree/utf8string/src/System.Text.Utf8/System/Text and https://github.com/dotnet/corefxlab/blob/master/src/System.Text.Primitives/System/Text/Encoders/Utf8Utility.cs.

A bit off-topic:
Should the "text element" be the segmentation defined by "Grapheme Cluster Boundaries" in UAX #29?

using System;
using System.Globalization;

class Program
{
    static void Main()
    {
        var e = StringInfo.GetTextElementEnumerator("👩🏻‍👦🏼👨🏽‍👦🏾‍👦🏿👩🏼‍👨🏽‍👦🏼‍👧🏽👩🏻‍👩🏿‍👧🏼‍👧🏾");
        while (e.MoveNext())
        {
            Console.WriteLine(e.GetTextElement());
        }
    }
}

expected result:
👩🏻‍👦🏼
👨🏽‍👦🏾‍👦🏿
👩🏼‍👨🏽‍👦🏼‍👧🏽
👩🏻‍👩🏿‍👧🏼‍👧🏾

actual result:
👩
🏻
‍
👦
🏼
👨
🏽
‍
👦
🏾
‍
👦
🏿
👩
🏼
‍
👨
🏽
‍
👦
🏼
‍
👧
🏽
👩
🏻
‍
👩
🏿
‍
👧
🏼
‍
👧
🏾

UnicodeScalar is still super easy to type: u, s, c, Space (autocompletes). Since that's the correct, most self-descriptive term, I really hope we get that.

@ufcpp That is a good point. Feel free to open a new issue for that. If we cannot change the behavior for compat reasons then I'd suggest we deprecate that type and create a spec-compliant grapheme enumerator.

ubyte/uchar are confusing. They read like unsigned char/unsigned byte given convention established with ushort/uint/ulong. Perhaps char8/u8char and char32/u32char are clearer?

In any case, I think we're misaligned on whether UTF-8 code units & code points are:

  1. low-level primitive data types in .NET - like byte, int
  2. a data format to convert to/from existing primitives - like DateTime, Guid

And then, how do we expose codepoint-related APIs given that decision?

Option 1 means handling text via char8, char16, and char32 primitives (and accompanying u8string, u16string, and u32string) like C++17. Then char32 as rune is a bad name, given we already have char16 as char and need a 3rd name for char8 too.

Option 2 means byte and int/uint are 'good enough' for storing UTF code units & code points. This implies all strings remain UTF-16. CodePoint/rune solves problems of Code Point semantics rather than binary representation - and is not intended for IO.

IMO UTF-8/UTF-32 are just data formats (option 2). Treat them as data (byte/int). CodePoint is more like DateTime or Guid (another identifier*) than int to me - not a low-level primitive type, not directly supported in IO (i.e. BinaryWriter), no need for intrinsics.

@miyu The prototype we're bringing up in corefxlab is closer to Option 1. There are specific data types to represent code units, and these data types are for internal representation of textual data and cannot be used to transmit textual data across the wire. (As you point out, .NET already works like this today: System.Char is the code unit of a UTF-16 string, but System.Char cannot be sent across the wire.)

Additionally, there are APIs to convert between byte[] / Span<byte> / etc. (this is the binary representation of all data and is appropriate for I/O) and primitive types like Utf8String / String / Guid / etc. Some of these are more straightforward than others. For example, we can expose a convenience Utf8String.Bytes property which returns a ReadOnlySpan<byte> for use in I/O, and this property getter can have O(1) complexity. We would not introduce such a property on the String type, though you could imagine having a String.ToUtf8Bytes() convenience method. And even though there would exist a Utf8String.Bytes property, the elemental type of enumerating over a Utf8String instance directly wouldn't be byte. It would be Utf8CodeUnit (name TBD) or UnicodeScalar, whichever we think makes more sense for the types of applications developers want to build.

Silly off the wall idea - what about wchar (_wide char_)? Today, most C and C++ compiler environments (outside of Windows) already use wchar_t to represent the functional equivalent of a 32-bit code unit. Windows is a notable exception, where wchar_t is defined to be a 16-bit type, but developers who p/invoke on Windows today already have to be cognizant of the bit width differences between a .NET char and a C-style char.

The type / keyword wchar would violate our naming conventions, but just throwing this out there for consideration.

Silly off the wall idea - what about wchar (wide char)?

Works for me

The type / keyword wchar would violate our naming conventions, ...

Doesn't sound like we're going to get a short C# language keyword

https://github.com/dotnet/apireviews/pull/64#discussion_r196962756 it seems extremely unlikely that we'd introduce language keywords for these types as these would have to be contextual (i.e. depending on whether they can resolve to a type with the name of the keyword they'd still have to bind to that type, rather than the type represented by the keyword).

So if we want something nice... i.e. NotLotsOfCapitalFullWords...

While I normally like .NET's naming conventions, a long name is a little offensive for what is essentially an int, which will also likely be used in generics and as loop variables.

e.g. no one does

foreach (Int32 i in list)
{
    // ...
}

Do they? (Surely...)

foreach (UnicodeScalar us in str)
{
    // ...
}

Is far worse

foreach (wchar c in str)
{
    // ...
}

Seems ok...

rune, wchar, and uchar (suggested on other thread) all sound good to me. Any suggestions for a peer of string? wstring, ustring, or other?

... and why not get a C# language keyword? Sure, not having one for the first release makes sense, but if this is going to be the future of string handling, not having a keyword is not only disingenuous, but overtly hostile towards its adoption.

/CC @MadsTorgersen @jaredpar

why not get a C# language keyword?

New keywords are breaking changes 100% of the time. No matter what word you choose there is a company out there that has a type of that name which is used everywhere in their project. The only option we have are contextual keywords: var for example.

I have mixed feelings about using a contextual keyword for this. The existing type keywords (int, string, etc ...) have a concrete advantage over the actual type name (Int32, String):

  • string: this refers to the type System.String in the assembly the compiler identifies as corelib. This name has zero ambiguity associated with it.
  • String: the compiler has zero understanding of this type. It is just a type like any other and goes through all of the same lookup rules as types you define. It may be equivalent to string or it may not be.

Once we introduce contextual keywords here then rune could be either:

  • The type System.Rune inside the corelib assembly
  • The type rune that you defined two years ago when you read about Go.

The lookup of rune is just as ambiguous as String hence I don't see a firm advantage to having it as a contextual keyword.

BTW: this is why you should be using string and not String 😄

BTW: this is why you should be using string and not String

Which is 99% of the reason I think people want a language keyword. The other 1% being it just "looks better" 😏

Thumbs down for strong dislike of the "rune" keyword.

A better word is glyph, as it already represents the general concept of an elemental symbol in typography.

Rune is a specific type of glyph which is ironically defined by Unicode. Referring to Go as prior art is somewhat ridiculous. Prior art for runes is what was written back in 150 AD and actual physical rune stones. Not what someone in Redmond thinks a rune is. Trying to redefine existing concepts like this is unusual since .NET usually has a well designed API surface. This is a rare exception of very poor API naming and I want to voice my discontent.

A better word is glyph, as it already represents the general concept of an elemental symbol in typography.

Issue is "Glyph" is a used term when rendering the unicode to visible text (from: utf8everywhere.org)

Glyph

A particular shape within a font. Fonts are collections of glyphs designed by a type designer. It’s the text shaping and rendering engine responsibility to convert a sequence of code points into a sequence of glyphs within the specified font. The rules for this conversion might be complicated, locale dependent, and are beyond the scope of the Unicode standard.

Referring to Go as prior art is somewhat ridiculous.

Using the term Rob Pike and Ken Thompson used when creating UTF-8 https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

Rob Pike works on Go now, which is why it uses the original term.

Rune is a specific type of glyph which is ironically defined by Unicode.

Runic is defined by Unicode, Rune isn't

Runic is defined by Unicode, Rune isn't

I don't think this is an accurate statement, the latest unicode spec (http://www.unicode.org/versions/Unicode11.0.0/UnicodeStandard-11.0.pdf) has 37 hits for "rune" (only 36 are valid, the last is part of a larger word) and it is always used to refer to individual letters of the Runic Alphabet.

I don't think this is an accurate statement, the latest unicode spec has 37 hits for "rune"

In the body text describing motivations; not in any character name or text block name (where it's Runic and Runic character)

In the body text describing motivations; not in any character name or text block name (where it's Runic and Runic character)

Ok, fair. But then we are back to the issue that the current Unicode spec does not define the term "Rune" and when it is used, it is for informative text describing "runic characters".

What it does formally define and use for describing things is "Code Point" and "Code Unit".

  • Even if, historically, the original creator(s) used the term "Rune", the official spec does not (and I would imagine they had good reasons for not using it).

Needs to be short or its usage gets ugly

int CountCommas(string str)
{
    int i = 0;
    foreach(UnicodeCodePoint c in str.AsUnicodeCodePoints())
    {
        if (c == ',') i++;
    }
    return i;
}

string Trim(string str)
{
    int end = str.Length - 1;
    int start = 0;

    for (start = 0; start < str.Length; start++)
    {
        if (!UnicodeCodePoint.IsWhiteSpace(str.GetUnicodeCodePointAt(start)))
        {
            break;
        }
    }

    for (end = str.Length - 1; end >= start; end--)
    {
        if (!UnicodeCodePoint.IsWhiteSpace(str.GetUnicodeCodePointAt(end)))
        {
            break;
        }
    }

    return str.Substring(start, end - start + 1);
}

vs

int CountCommas(string str)
{
    int i = 0;
    foreach(Rune c in str.AsRunes())
    {
        if (c == ',') i++;
    }
    return i;
}

string Trim(string str)
{
    int end = str.Length - 1;
    int start = 0;

    for (start = 0; start < str.Length; start++)
    {
        if (!Rune.IsWhiteSpace(str.GetRuneAt(start)))
        {
            break;
        }
    }

    for (end = str.Length - 1; end >= start; end--)
    {
        if (!Rune.IsWhiteSpace(str.GetRuneAt(end)))
        {
            break;
        }
    }

    return str.Substring(start, end - start + 1);
}

For length, I would totally go for CodePoint.IsWhiteSpace and str.GetCodePointAt, but Rune is also fun and I don't mind it.

@jnm2 We wouldn't use GetCodePointAt when it comes to strings. It's too ambiguous: we don't know if you wanted the char that happened to be at that index (since all chars - even unpaired surrogates - are also valid code points) or the scalar / rune that happened to be at that index.

@GrabYourPitchforks Can GetRuneAt avoid the same problem, or are you saying neither would make sense?

@jnm2 I was just saying that CodePoint in particular is too ambiguous in this scenario. Otherwise the method name GetXyzAt should match the type name Xyz that eventually goes in.

FYI the core implementation is now checked in (see https://github.com/dotnet/coreclr/pull/20935). Give it some time to propagate to corefx, then the ref APIs will come in via https://github.com/dotnet/corefx/pull/33395. Feel free to leave this issue open or to resolve it as you see fit.

I don't expect to influence anyone or be able to change anything but just for the record:

A better word is glyph, as it already represents the general concept of an elemental symbol in typography.

Issue is "Glyph" is a used term when rendering the unicode to visible text (from: utf8everywhere.org)

That line of reasoning doesn't support rune either, because "rune" has been a used term for over a thousand years throughout history, well before Unicode or transistors or Microsoft or open source ever existed. At least it indicates that some arbitrarily apply different standards to different proposals which is obviously not consistent so maybe it is more about who was first or is loudest rather than the most coherent argument, what do I know. I am just a late comer trying to understand the process but it doesn't make sense.

Referring to Go as prior art is somewhat ridiculous.

Using the term Rob Pike and Ken Thompson used when creating UTF-8 https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

Rob Pike works on Go now, which is why it uses the original term.

Go and Rob Pike are relatively speaking newcomers to this topic. Actually their opinion is somewhat irrelevant in terms of defining what a rune is historically and in popular literature and society. Rob did not hammer any rune stones himself by hand so he has few qualifications to define what a rune is. I bet he can't even write or read rune script himself but that is my guess. At best he can capture that concept through encoding, but he can't come in and say that a Chinese character, Arabic writing or Hangul or smiley face is a rune or whatever else that is a "Code Point" is now also a Rune, or something like that. It almost seems to disrespectfully trample on the term, look, now everything can be a rune, which means, runes are nothing but a four letter wildcard term to refer to something esoteric in the domain of encoding text.

Rune is a specific type of glyph which is ironically defined by Unicode.

Runic is defined by Unicode, Rune isn't

Unicode is not supposed to redefine what a rune or runic is. If they do that, they are overstepping their mandate. They have no business telling the public what a rune is. In fact they have no business defining any new language or character system whatsoever. They can't just appropriate a word that has already been a clearly overloaded term for a thousand years and then run around cheering like they have invented a new concept. Runic writing consists of runes only, and runes are already an established concept. If you ask a random person on a street what a rune is they will not think of Unicode.

In addition to all the above problems, rune is a poor metaphor which is the worst part. It does not clarify anything. It just adds another level of confusion. Any newcomer to the topic now needs to go through a round of disambiguation explanation and reading because everyone comes in with the context that a rune is a historical writing system used in certain cultures. The explanation will have to go something like this: "A rune is a Unicode code point". "But why not call it code point?" "Well, because it is too long.", or "Somebody decided that they like rune". So basically, because someone thinks 9 letters is too much as compared with 4 (even though they have autocomplete with IntelliSense, and it is nothing compared with the Java Kingdom of Nouns), now we have to deal with this confusion and explain this to thousands of developers who may need to dabble in Unicode. Just use a using statement to shorten the term if you use it a lot in code.

It doesn't have to be UnicodeCodePoint either, it can simply be CodePoint. This is already unique. There are many API terms that are longer than "CodePoint" so that should suffice. If still too long, well just use a using statement with some abbreviation.
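The using-alias suggestion can be sketched as follows. This assumes the type that eventually shipped as System.Text.Rune and the string.EnumerateRunes method; the alias name CodePoint is just the name argued for above, not an existing .NET type.

```csharp
using System;
using CodePoint = System.Text.Rune;   // alias chosen for illustration only

static int CountCommas(string str)
{
    int count = 0;
    foreach (CodePoint c in str.EnumerateRunes())
    {
        // Value is the scalar value as an int, so it compares directly with a char.
        if (c.Value == ',') count++;
    }
    return count;
}

Console.WriteLine(CountCommas("a,b,\U0001F600,c"));
```

The alias only applies within the file that declares it, so every file that wants the shorter or preferred name has to repeat the directive.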

I foresee this becoming one of those gotcha interview questions that really do not add much value or have logical basis in anything useful. At least for the metaphor "milestone", while we are on the topic of symbolic words used in software development based on concepts derived from stone and rock, a milestone has a real descriptive meaning. It immediately communicates a concept that everyone is familiar with. Aha, a milestone, like when you are on a long journey and you pass by on the trail. It is a nice real world metaphor that actually helps to visualize something and can become managerial language instantly. I can't imagine people talking about runes in this way unless they are intimately familiar with the topic, at which point they will already know that it's just a gimmick term for code point.

A better word is glyph, as it already represents the general concept of an elemental symbol in typography.

Issue is "Glyph" is a used term when rendering the unicode to visible text (from: utf8everywhere.org)

That line of reasoning doesn't support rune either, because "rune" has been a used term for over a thousand years throughout history, well before Unicode or transistors or Microsoft or open source ever existed.

My point was the word "glyph" is problematic as it is already used as one of the concepts in rendering text; it's the graphical representation of that character in a particular font. So a character can be represented by many different glyphs.

... again with @benaadams having the 10,000 meter view of things and the correct answer 😁

Honestly, we are going to have to live with the old adage: "you can make some of the people happy all of the time, and all of the people happy some of the time; but you cannot make all of the people happy all of the time." This is very much a situation of the former.

Sigil?

Exit, pursued by a bear.

As someone who would be using this API extensively, I’m putting in a strong vote for code point. Unicode terminology is already confusing enough, and inconsistencies already abound. You will make my life a lot easier if I can just say “code point” everywhere.

I’m lying in bed right now. If I turn sideways, I face a whiteboard propped against my wall. For months, that whiteboard has been home to various scribbles and charts while I try to figure out how to deal with IDNs efficiently in C#. I treat it like a relic that I’ve summoned from the depths of hell. If I tried to explain the logic it describes, I wouldn’t be able to.

Please, don’t make my life harder. A code point is a code point. It’s not a rune, glyph, character, grapheme, or even symbol. It need not represent anything meaningful to a human—it could be a control code. It might not represent a visual symbol, as the name “rune” implies. It is just a code point.

A more concrete argument is that “rune” implies representation of a single grapheme, which is very often not the case. If I count the number of code points and the number of graphemes, I might get two very different numbers. The same sequence of graphemes could be represented by two distinct series of code points.

A better word is glyph, as it already represents the general concept of an elemental symbol in typography.

That’s even worse. A single code point could be represented by multiple glyphs, and a single glyph could represent multiple code points. The exact mapping can vary by system, program, typeface...

All of these words have very specific technical meanings. While the differences might seem insignificant in the context of this proposal, they have real consequences elsewhere, especially in languages other than English.

Just as an example of how difficult it can be to deal with text, even in a language as common as German:

  1. Convert ß to uppercase and you’ll get SS.
  2. Convert it back to lowercase and you’ll get ss.

Problems:

  • What should char.ToUpper('ß') return? (It has to return a single char.)
  • A capital version of ß which my phone can’t enter in this text box was added to Unicode 5.1. If I try to paste it, I get SS. Now upper/lower conversions are even more ambiguous.
  • Changing the casing of a string changes its length.
  • Case changes aren’t idempotent or reversible.
  • You can’t perform a case-insensitive comparison by simply lowercasing each string.

Even though this isn’t a direct example of a situation in which terminology causes problems, it demonstrates how there are all sorts of edge cases that we don’t normally think about. Giving each term a distinct, consistent meaning helps programmers communicate these issues. If I ask a teammate to write a function to count graphemes, they know exactly what they’re going to be counting and how to do it. If I ask them to count code points, again, they know exactly what to do. These definitions are independent of the languages and technologies we’re using.

If I ask a JavaScript developer to count runes, they’re going to look at me like I have three heads.

Wikipedia says

Unicode defines a codespace of 1,114,112 code points in the range 0hex to 10FFFFhex

Code point seems to be the official name. I have read this thread and have not found a forcing argument for why code point would be incorrect.

I agree that code point isn't the correct term to use here. At the very least, based on the Unicode standard it does not include values above 10FFFF (http://unicode.org/glossary/#code_point).

Maybe that sentence is just wrong? It says "any value in the code space". So it clearly means everything while at the same time getting the integer wrong.

Also, "rune" has a real world meaning that has nothing to do with Unicode. In Germany, the word "Rune" has Nazi connotations because runes have "Germanic" history which the Nazis liked to refer to.

I find "rune" to be a confusing name. Does anyone here really like "rune", or are the arguments for it based on correctness? Intuitively, it is a really bad name.

Maybe that sentence is just wrong? It says "any value in the code space". So it clearly means everything while at the same time getting the integer wrong.

That sentence is correct. The code space is from U+0000 to U+10FFFF. Unicode could theoretically be expanded beyond that someday, but it would break UTF-8 and UTF-16. We'd need new encodings.

Edit: Actually, don't quote me on the UTF-16 breakage, but I'm pretty sure it'd break UTF-8. UTF-8 definitely can't represent 0xFFFFFF (2^24 -1).

Edit 2: To clarify, Unicode states that code points cannot ever exceed U+10FFFF. That doesn't mean there are currently 0x110000 code points--most of those code points are unassigned.

@Zenexer @GSPP

This type as currently checked in to master (System.Text.Rune) maps very specifically to a "Unicode scalar value" (see glossary). The type's ctors will throw an exception if you try to construct it from the values -1, 0xD800, or 0x110000, since these are not scalar values per the Unicode specification. If you take a Rune parameter as input to your method, you don't have to perform any validation check on it. The type system has already ensured that it was constructed from a valid scalar value.
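A short sketch of that validation behavior, using the System.Text.Rune constructor together with Rune.TryCreate and Rune.IsValid. CtorAccepts is a hypothetical helper written only for this example.

```csharp
using System;
using System.Text;

// Hypothetical helper: reports whether the Rune(int) constructor accepts a value.
static bool CtorAccepts(int value)
{
    try
    {
        _ = new Rune(value);
        return true;
    }
    catch (ArgumentOutOfRangeException)
    {
        return false;
    }
}

// The three rejected inputs named above, plus one valid supplementary-plane value.
foreach (int value in new[] { -1, 0xD800, 0x110000, 0x1F600 })
{
    Console.WriteLine($"0x{value:X}: accepted = {CtorAccepts(value)}");
}

// Non-throwing alternatives for untrusted input.
bool created = Rune.TryCreate(0xD800, out Rune ignored);
bool lastCodePointIsValid = Rune.IsValid(0x10FFFF);
```

Because the check happens at construction, a method that takes a Rune parameter can skip range and surrogate checks entirely.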

Re: case conversion, all case conversion APIs in the .NET Framework _unless otherwise noted_ use a technique called simple case folding. Under the rules for simple case folding, for any input scalar value, the output lowercase, uppercase, and titlecase forms are also each guaranteed to be exactly one scalar value. (Some inputs, like the digits 0-9 or punctuation symbols, don't have entries in the case conversion map. In these cases operations like _ToUpper_ simply return the input scalar value.) Additionally, under simple case folding rules if the input is in the Basic Multilingual Plane (BMP), then the output must also be in the BMP; and if the input is in a supplementary plane, the output must also be in a supplementary plane.

There are some consequences to this. First, Rune.ToUpper and friends will always return a single _Rune_ (scalar) value. Second, String.ToUpper and friends will always return a string with the exact same length as its input. This means that a string containing 'ß' (minuscule eszett), after a case conversion operation, may end up containing 'ß' (no change) or 'ẞ' (majuscule eszett), depending on the culture being used. But it _will not_ contain "SS", because this would change the length of the string, and almost all publicly exposed .NET case conversion APIs use simple case folding rules. Third, Utf8String.ToUpper and friends (not yet checked in) are _not_ guaranteed to return a value whose _Length_ property matches the input value's _Length_ property. (The number of UTF-16 code units in a string cannot change after simple case folding, but the number of UTF-8 code units in a string can change. This is due to how BMP values are encoded by UTF-16 and UTF-8.)
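The first two consequences can be checked with a small sketch. It uses the invariant-culture overloads so that the result does not depend on the machine's culture settings; the variable names are illustrative.

```csharp
using System;
using System.Text;

string lower = "stra\u00DFe";                  // "straße"
string upper = lower.ToUpperInvariant();

// The string keeps its UTF-16 length, so 'ß' cannot have expanded to "SS".
Console.WriteLine($"{lower.Length} -> {upper.Length}");

// A single scalar always maps to a single scalar, and a BMP input stays in the BMP.
Rune eszett = new Rune('\u00DF');
Rune mapped = Rune.ToUpperInvariant(eszett);
Console.WriteLine($"U+{eszett.Value:X4} -> U+{mapped.Value:X4}");
```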

There are some .NET APIs which internally use complex case folding rules rather than simple case folding rules. String.Equals, String.IndexOf, String.Contains, and similar operations use complex case folding rules under the covers, depending on culture. So if your culture is set to _de-DE_, the one-character string "ß" and the two-character string "SS" will compare as equal if you pass _CurrentCultureIgnoreCase_.

@GrabYourPitchforks I'm primarily objecting to the choice of name. The casefolding example was purely to emphasize how complicated Unicode (and text in general) can be. As long as there's some way to handle normalization, I don't care too much how the simple operations work, as I'll be converting to NFKD for everything anyway for my use case.

That sentence is correct. The code space is from U+0000 to U+10FFFF. Unicode could theoretically be expanded beyond that someday, but it would break UTF-8 and UTF-16. We'd need new encodings.

Just to be nitpicking (or, if people are interested): In theory, the UTF-8 algorithm works for up to 42 bits (Prefix Byte 0xFF and 7 bytes of 6 bits payload), and originally, the first specifications covered the full 31 bit space of those old versions of the Universal Character Set (UCS4) - however, the current specifications (RFC 3629, Unicode Standard, Annex D of ISO/IEC 10646) all agree to restrict it to the current range of valid codepoints (U+0000 to U+10FFFF).
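
Within the range that the current specifications allow, a scalar value takes between one and four UTF-8 bytes. A minimal sketch, assuming .NET Core 3.0 or later, with one sample value per length class:

```csharp
using System;
using System.Text;

// One sample scalar value from each UTF-8 length class, plus the upper bound.
int[] samples = { 0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF };

// Four bytes is the longest sequence permitted for U+0000..U+10FFFF.
byte[] buffer = new byte[4];

foreach (int value in samples)
{
    Rune rune = new Rune(value);

    // EncodeToUtf8 returns the number of bytes written into the buffer.
    int written = rune.EncodeToUtf8(buffer);

    Console.WriteLine($"U+{value:X4}: {written} byte(s): {BitConverter.ToString(buffer, 0, written)}");
}
```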

For UTF-16, the situation is more difficult. But they could reserve code points in an upper plane as "Escapes" for 32 bit or more. As Planes 3 to 13 are currently undefined, they could reserve two of them as "low surrogate plane" and "high surrogate plane". Then a 32 bit codepoint would be split into two 16 bit values (one in each plane), and then each value would be encoded using two "classic" surrogates, effectively using 4 code units of 16 bit each to encode a 32 bit codepoint.

Btw, AFAICS, the unicode consortium has publicly stated that they will never allocate codepoints above U+10FFFF, so in practice, I hope I'll be long retired before that actually happens. :wink:

This type as currently checked in to master (System.Text.Rune) maps very specifically to a "Unicode scalar value"

@GrabYourPitchforks thanks for that clarification. This means that the struct does not represent a code point. So that name indeed would be incorrect.

I guess UnicodeScalar is too arcane as a name...

@GrabYourPitchforks, what's left to do for this issue?

@stephentoub There's no additional functionality planned for the in-box Rune type for 3.0, but @migueldeicaza had ideas for extending the reach of the type, including for things like grapheme clusters. (The closest thing we have in-box is TextElementEnumerator, which is a very outdated type.) Some of those ideas were bandied about in this thread but there's nothing yet that's concrete.

We could leave this issue open in case the community wants to discuss the scenarios further, or we could direct folks to open new issues if they want to make specific suggestions. TBH I don't have a strong preference.

Thanks. Since Rune was already introduced and the APIs outlined here (or approximations thereof) already exposed, let's close this. Additional support can be addressed via separate issues.

So is this essentially stabilized at this point? Because in all honesty this dreadful name, which doesn't line up with any information that you'll find about Unicode from good and accurate sources, and has the unfortunate nuance of implying a glyph as opposed to a nonprinting character, is only going to worsen the already dreadful understanding of Unicode by your average programmer.

I know this has been integrated by this point, but I just want to chime in on the Rune part and some people's disagreement about the name.

I first encountered Rune in Plan 9, and like others have seen it in Go and others. When the msdocs started listing Rune I knew exactly what it was before reading.

In at least two instances, Plan 9 and Go, you have the individuals responsible for UTF-8 using the name Rune. I think it's safe to say they thought about these concerns already and still thought Rune was reasonable. Runic isn't really a used writing system anymore, other than with some traditionalists. And Rune does mean the grapheme in that system, just like it essentially means the grapheme here (except in cases like control characters).

I really see little wrong with the naming. Runic is such an old writing system I highly doubt your average programmer is going to confuse it, and there's already been a several decade old de-facto standard of Rune for proper Unicode "characters".

@Entomy

just like it essentially means the grapheme here (except in cases like control characters).

This is simply not true. Unicode contains a huge number of precomposed code points that represent multiple graphemes (generally letter and diacritic combinations), and these are commonly used to write languages such as French and Spanish, and pretty much all of the computerized text in these languages will use those code points.

Conversely, even when a single code point represents one grapheme, it's very common for them to combine into a _grapheme cluster_, which is essential for the proper handling of text in most Indian languages. So, a single character as perceived by the user when moving with the arrow keys often corresponds to multiple code points in sequence. So, there can be no easy correspondence made between code points and either graphemes or grapheme clusters. Even “character” would probably be a better name, considering that programmers are used to considering characters weird and wacky at this point, while “rune” gives the impression that the issue of figuring out user-perceived character boundaries has been solved for the programmer already when it has in fact not been.

When the msdocs started listing Rune I knew exactly what it was before reading.

The fact that you thought that the name rune described graphemes well is very good evidence of the issue that I have here: the name “rune” gives programmers a false sense of security by making it easier to assume that there is such a correspondence.

In at least two instances, Plan 9 and Go, you have the individuals responsible for UTF-8 using the name Rune.

As much respect as I have for Ken Thompson and Rob Pike, their work here was essentially just devising a very clever scheme for encoding a series of variable-length integers. They are not experts on Unicode as a whole, and I disagree with them quite strongly on this issue. I admit that I'm not an expert on Unicode either, but I don't think the appeal to authority here is as strong as it might seem.

and there's already been a several decade old de-facto standard of Rune for proper Unicode "characters".

“Standard” you say? It has mostly just been these two pushing the name, and a few minor programming languages such as Nim adopting it from Go. And of course I must repeat again that a code point does not represent a single “proper Unicode character” whether that be in the sense of selection, arrow key movement, graphemes, or grapheme clusters.

...essentially means the grapheme here...

Yes, as in not exactly, but roughly close enough. Graphemes, at least as they are defined in linguistics, are the orthographic components that make up a writing system and are used to express phonemes. These aren't a 1:1 thing. In syllabaries and logosyllabaries a single grapheme can represent multiple phonemes, typically a consonant-vowel pair. Conversely, alphabetic languages often have cases of multiple graphemes representing a single phoneme, such as "th" in English being responsible for the archaic eth and thorn, depending on the specific word. Then you can't even find agreement across languages as to whether a letter like 'á' is its own unique letter, or 'a' with an accent. We can't even establish consistency in languages that are thousands of years old. We're not going to have a perfectly consistent addition on top of that, that is the encoding of these.

Since you're arguing for extremely strict semantics, what UNICODE calls a "grapheme cluster" is often in linguistics just a single grapheme. Does this invalidate UNICODE? No. Does this mean UNICODE needs to rename it? No. Why? Because context. Fields have their own lingo, and as long as there isn't conflation within a single field it's not an issue.

I don't see the name as too big of a deal. Msdocs is clear about what Rune is in the summary. If people don't read the docs that's their own problem. People aren't reacting vehemently to 'Stream' and saying nonsense like "oh but what if people think it's a small river, because that already has the same name!" No.

@Serentty @Entomy You both might also be interested in the StringInfo class, which exposes the actual Unicode concept "extended grapheme clusters". The StringInfo type is fairly ancient and as a result implements a very old version of the Unicode standard, but there's active work to update it to be compliant with UAX #29, Sec. 3.
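
To make the distinction concrete, here is a small sketch that counts the same text three ways: UTF-16 code units, scalar values, and text elements as reported by StringInfo. It assumes .NET Core 3.0 or later, and the sample string is chosen for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

// 'e' followed by U+0301 COMBINING ACUTE ACCENT: two scalar values that a
// reader perceives as one character.
string text = "e\u0301";

// UTF-16 code units.
int charCount = text.Length;

// Unicode scalar values.
int runeCount = 0;
foreach (Rune rune in text.EnumerateRunes())
{
    runeCount++;
}

// Text elements, StringInfo's approximation of user-perceived characters.
int textElementCount = new StringInfo(text).LengthInTextElements;

Console.WriteLine($"chars={charCount}, runes={runeCount}, text elements={textElementCount}");
```

For more complex sequences (emoji with modifiers, Indic conjuncts) the text element count depends on which version of the Unicode segmentation rules the runtime implements, as noted in the comment above.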

Yes, as in not exactly, but roughly close enough.

I think the issue of composed versus decomposed representations makes this untrue. If we're going by the linguistic definition of a grapheme here as opposed to any sort of computing-related definition, then 한 (U+D55C) and 한 (U+1112 U+1161 U+11AB) are the exact same sequence of graphemes (three Hangul jamo representing the syllable _han_ as the segments H-A-N), and yet the first is only one code point whereas the second is a sequence of three.
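
The composed and decomposed forms can be compared directly. The sketch below assumes .NET Core 3.0 or later with normalization data available (normalization may behave differently in globalization-invariant mode); the `CountRunes` helper is hypothetical.

```csharp
using System;
using System.Text;

string composed = "\uD55C";                 // one precomposed Hangul syllable
string decomposed = "\u1112\u1161\u11AB";   // the same syllable as three jamo

Console.WriteLine($"composed:   {CountRunes(composed)} scalar value(s)");
Console.WriteLine($"decomposed: {CountRunes(decomposed)} scalar value(s)");

// An ordinal comparison sees two different code unit sequences.
bool ordinalEqual = string.Equals(composed, decomposed, StringComparison.Ordinal);
Console.WriteLine($"ordinal equal: {ordinalEqual}");

// Normalizing both sides to the same form (NFC here) makes them comparable.
bool equalAfterNfc = composed.Normalize(NormalizationForm.FormC)
    == decomposed.Normalize(NormalizationForm.FormC);
Console.WriteLine($"equal after NFC: {equalAfterNfc}");

// Hypothetical helper: counts the Unicode scalar values in a string.
int CountRunes(string s)
{
    int count = 0;
    foreach (Rune rune in s.EnumerateRunes())
    {
        count++;
    }
    return count;
}
```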

Fields have their own lingo, and as long as there isn't conflation within a single field it's not an issue.

This is exactly my point as well. Unicode is a really complicated system with its own terminology, so why try to force some sort of half-baked “intuitive” term onto it when it doesn't line up that accurately? Code points are code points. They have no linguistic parallel, and trying to be intuitive while only 75% accurate is a recipe for the same kind of disaster that C# is still trying to recover from.

Since you're arguing for extremely strict semantics, what UNICODE calls a "grapheme cluster" is often in linguistics just a single grapheme.

In the standard, a cluster is allowed to comprise only a single grapheme. There is nothing wrong with this here. A _cluster_ is a unit of text selection and cursor movement.

I don't see the name as too big of a deal. Msdocs is clear about what Rune is in the summary. If people don't read the docs that's their own problem.

This is the “programmers need to be smarter” argument that comes up repeatedly in defence of bad design decisions. If programmers need to be reading the documentation and learning that a rune is a Unicode code point anyway, then what's the point of calling it a more “intuitive” name in the first place? The argument here seems to be that “code point” is confusing, so it makes sense to choose a more intuitive name, but then when faced with the issue of the name being misleading, the defence is that programmers should know what a code point is anyway from reading the documentation. If that's the case, why not just call the type CodePoint and make it easier for programmers to look up and learn about? This is all putting aside the issue that the .NET documentation is pretty terrible with regards to Unicode in the first place, and treats surrogate pairs as an afterthought in a world of “16-bit Unicode characters”.

This is the “programmers need to be smarter” argument that comes up repeatedly in defence of bad design decisions.

I never said this.

The argument here seems to be that “code point” is confusing

I never said this either.

People aren't reacting vehemently to 'Stream' and saying nonsense like "oh but what if people think it's a small river, because that already has the same name!" No.

I'm saying that programmers are smart enough to not think Rune is specifically a runic rune, much in the same way that they know Stream isn't a small river.

Let me repeat this

I'm saying programmers are smart enough to figure this out. You're putting words into my mouth.

I don't see the name as too big of a deal. Msdocs is clear about what Rune is in the summary. If people don't read the docs that's their own problem.

This is what I'm referring to here. The argument in favour of the name “rune” is based on intuition and the intuitive connection with the notion of a grapheme. You yourself were arguing that the two lined up closely enough that it wasn't an issue. When I pointed out all of the ways that that intuition was wrong and the correspondence could be very bad, your response was essentially that it didn't matter because programmers needed to read the documentation anyway. This is what I mean by “programmers need to be smarter.” Documentation is not an excuse for misleading names when there's no legacy reason for them.

I'm saying that programmers are smart enough to not think Rune is specifically a runic rune, much in the same way that they know Stream isn't a small river.

My argument here is not that people will confuse it with runic runes. My argument is that people will confuse it with glyphs, graphemes, and grapheme clusters, which despite your insistence all correlate very badly with code points.

I'm saying programmers are smart enough to figure this out. You're putting words into my mouth.

Smart enough to figure out that they're not actual Germanic runes, sure. But to figure out that they're not glyphs, graphemes, or grapheme clusters? My actual experience with the quality of most software's handling of Unicode says no.

If people don't read the docs that's their own problem.

Yes, and I stand by this. Not as a matter of deficiency in intelligence, but rather of tendency towards hasty assumptions.

If a programmer assumes String means a strong, thin, piece of rope, made from the twisting of threads, because, yes it does mean that, that is not considered a problem with the name String.

If a programmer assumes Char means a charred material such as charcoal, or a particular type of trout, that is not considered a problem with the name Char.

If a programmer assumes character means the portrayal of a set of mental and ethical traits used in storytelling, that is not considered a problem with the name character.

Notice these are all text/linguistic matters. They all have other meanings. And yet programmers have acclimated just fine. Those terms have become de-facto standards, because of an established convention in the field: our lingo. There's established precedent that programmers _are_ smart enough to follow along with this.

You yourself were arguing that the two lined up closely enough that it wasn't an issue.

Yes this is GitHub. On an already closed issue, where I was just adding my thoughts on why I felt Rune was fine because there was some established precedent in the name. This isn't the place nor context to write a treatise, filled with extensive definitions and carefully chosen words. For example, if I'm putting in a PR for, say, a UTF-8 decoder, I'm not going to explicitly describe why I implemented the Hoehrmann DFA over alternative approaches. I'm just gonna say "here it is, here's some proof it works, here's some benchmarks backing up why I went with this".

My argument is that people will confuse it with glyphs, graphemes, and grapheme clusters

They aren't confusing any of the aforementioned, nor Tree, Heap, Table, Key, Socket, Port...

This is an extremely disingenuous argument. A piece of thread and a string of text are not easily confused. A tall plant and a tree data structure are not easily confused. A code point on the other hand is a very poorly understood concept by most programmers, and constantly confused with all of the other concepts we've discussed. The solution to this is, as you say, reading the documentation. However, a language using its own “clever” name for code points makes it even more difficult to apply knowledge from the _actual Unicode documentation_ to that language. And that brings me to this:

Those terms have become de-facto standards, because of an established convention in the field: our lingo.

And this is the crux of it all. You seem to be claiming that either “rune” is a well-established term for a code point that is widely understood in programming, or it should be. If it's the former, then I invite you to ask an average programmer experienced in a major programming language other than Go if they have ever heard it. If it's the latter, then I would ask you the point of competing with official Unicode terminology in an already confusing and poorly-understood situation that is frequently misunderstood by even highly experienced developers.

@Entomy outsider input: your entire argument, as far as I can tell, is 'it's confusing and bad, yes, but it's not that confusing and bad'.
So? Why can't it be actually good instead? What is the problem with naming it exactly what Unicode names it?
Also, runes are not code points, or even graphemes or clusters, in the general field of computing. If you search 'Unicode runes' in Google, anything relating them to code points doesn't show up until page 2, and even then it's just godoc / Nim links. Even on DuckDuckGo, which programmers might be more comfortable with, it's still a page 2 result. So the only argument left for the name I've seen is that it's intuitive that it represents a code point, but it's not. It's intuitive that it represents a grapheme cluster, or perhaps just a grapheme.
Source: I've used Go and I thought it was a grapheme until four years later when I read this issue just now.

(and saying that it's okay that it suggests a grapheme because it's 'close enough' reminds me of the 16-bit char being close enough.)
Yes, if programmers were smarter and read more documentation we wouldn't need a meaningful name for it, or even a type at all. People would just know to pass code points in an int around instead of char. But they're not. They're as smart as they are right now, and that's not going to change just because Yet Another API was added. The goal is to increase the amount of software that correctly handles languages other than English, not just introduce new ways to do the same thing and keep the same barriers to entry as before.

Just for the sake of argument, and for scientific purposes, I'd like to point everyone here at the one programming language that does Unicode text handling best, where »best« is defined by »closest in accordance to the Unicode standard«, not by faking simplicity: Swift

  • String is a buffer of arbitrary Unicode text.
  • Character, which you iterate over and what not, is not a single Unicode Scalar Value, but an Extended Grapheme Cluster. See this example for the grapheme cluster 한: let decomposed: Character = "\u{1112}\u{1161}\u{11AB}" // ᄒ, ᅡ, ᆫ
  • If you need Unicode Scalar Values, you can iterate over them as well. Their type is called UnicodeScalar.
  • And if you really-really feel like needing it, you can also iterate over UTF-8 and UTF-16 code units, yielding UInt8s and UInt16s.

Now, I'm not here suggesting for C# to go full Swift style. While this would be amazing, it's also a damn lot of changes and work needed. I'm here to suggest picking up Swift-style naming, however, for all the reasons @Serentty pointed out, and to leave the option open to turn text strings Swift style eventually.

Some potential better names than Rune: CodeUnit32, UnicodeScalar, CodeUnit, UniScalar, UnicodeValue, UniValue, UnicodeScalarValue. I think the first two might fit neatly into C#'s naming conventions. Note that UnicodeScalar is objectively the better name, as code units are just ways to encode a Unicode Scalar Value in Unicode lingo. So CodeUnit32 implies iterating over the code units of a UTF-32-encoded text string, whereas UnicodeScalar is encoding-agnostic.

Edit: Yes, the name System.Rune is already out there. All this is just an »if we want to make it better before this thing is half a decade old«.

@pie-flavor

your entire argument, as far as I can tell, is 'it's confusing and bad, yes, but it's not that confusing and bad'.

No that's not my argument at all. I'm doing the best with the disability I have, but this isn't my intended communication.

If you search 'Unicode runes' in Google, anything relating them to code points doesn't show up until page 2, and even then it's just godoc / Nim links.

If you search 'Unicode string' in Google you won't get specifically how .NET strings work, either. This is a matter of searching for an adjacent thing. As a very strict analogy, I program in both .NET and Ada; string is not the same between them, and some slight reading for each is a good idea.

Overloaded definitions are not unusual in language, and yet we manage just fine. It might surprise you, but "run" has at least 179 formal definitions, "take" has at least 127, "break" has at least 123, and so on. [source] People are amazingly capable and can successfully navigate far more complexity than what's deemed problematic here. The concern of "rune" having at least 2 formal definitions is, in my opinion, not warranted when people can be shown to deal with over 50x the overloads.

Furthermore, this is grossly exploiting search engine behavior. With most search engines, you get results based on how many pages link to something. There are other factors as well, with each approach weighting things differently. As .NET Rune is a fairly recent concept by comparison, there's going to be far less content talking about it, and it will take more pages to get to it. But it's also using the wrong search tool. If I want to find research on string-searching algorithms, to see if anything new has come up in the past few years, I don't search Google or DDG. Semantic Scholar, Google Scholar, and others are better starting points. Similarly, if you want to understand things about .NET APIs, you search MSDocs first. If I complain that "moment of inertia", a physics/engineering term, is vague or misleading in its name, and it should be renamed because I can't find any information about it in the first few books, starting from the lowest number in a library using Dewey Decimal Classification, that isn't a problem with the naming of "moment of inertia"; I'm clearly looking in the wrong place.

Source: I've used Go and I thought it was a grapheme until four years later when I read this issue just now.

I looked through the Go docs and release notes, at least those I could find, and I have to agree with you. They are very vague about what rune is, and unfortunately are even vague about how big rune is. I suspect this vagueness will cause problems later on, as I've seen Ada be equally as vague about data type constraints and have it bite itself in the ass years later.

However I must say msdocs does a much better job with a very detailed and concise description.

Represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; or [ U+E000..U+10FFFF ], inclusive).

This being said, the remarks are somewhat lacking and some elaboration on why Rune exists and when you'd want to use it would be beneficial (and also the appropriate place for a more detailed explanation than my simplified aforementioned one). I'll put forward some improvements there.

@Evrey

Just for the sake of argument, and for scientific purposes, I'd like to point everyone here at the one programming language that does Unicode text handling best

This is an opinion. One I absolutely agree with; Swift certainly handles modern UNICODE better. But without a citation of peer-reviewed reproducible research confirming these results, this is not a scientific claim.

Now, I'm not here suggesting for C# to go full Swift style. While this would be amazing, it's also a damn lot of changes and work needed.

And would break existing software.

leave the option open to turn text strings Swift style eventually.

And would break existing software.

Yes, the name System.Rune is already out there. All this is just an »if we want to make it better before this thing is half a decade old«.

And would break existing software.

As a hypothetical if changes were to be made to the existing name, how do you propose existing software targeting .NET Core 3.0/3.1, where Rune is already in use, still be compatible, while also having it exist as a different name in later target runtimes?

And would break existing software.

As mentioned I'm just arguing from the perspective of principle and idealism. The reality of things has been mentioned plentifully. Though there is some nuance to all that:

  • Going Swift-style with strings does not necessarily break software. It's just a matter of adding more enumeration methods and types on top of the already existing String interface. I do not mean radical things like changing System.Char into a grapheme cluster type or some such thing by that.
  • If an existing type name like System.Char would be repurposed for a way different type, then yes, that would be a huge breaking change. And an irresponsible change at that. I'm with you there.
  • A hypothetical .NET Core 4.0, speaking in SemVer, can do anything it wants. Other than that, the changes until a hypothetical 4.0 are not that scary: Turn System.Rune into a deprecated type alias for System.UnicodeScalar or whatever the name would be. Software using Rune won't notice a difference, apart from a deprecation note, and new software can use the better-named actual type. And a hypothetical 4.0 then just drops Rune.
  • Similarly, System.Char could be turned into an alias for System.CodeUnit16 or something.
  • Doing it Swift-style then effectively just means adding System.GraphemeCluster into the mix.
  • The introduction of more, new keyword aliases for all these types may be problematic.

Just dropping food for thought here. I think System.Rune, while a bad type name for its purpose, does not really make the previous naming status quo any worse. I think it's great that there finally is a proper type able to encode all Unicode scalars. I do see a neat opportunity to spread a trend of more accurate Unicode handling and naming, however. An opportunity everyone here is free to put aside.

Hi all - the name System.Text.Rune is what shipped and what we're using going forward. There was significant (and heated!) earlier discussion of using the name UnicodeScalar instead of Rune, but in the end Rune won out. The team is not entertaining the idea of choosing a different name for it at this time. And while I know folks are passionate about this and we'll continue to monitor the conversation here, ultimately be aware that any energy spent continuing to litigate the naming issue will not yield dividends.

For clarification, and per the docs: the System.Text.Rune type in .NET is exactly equivalent to a Unicode scalar value. This is enforced by construction. This makes it more analogous to Swift's UnicodeScalar type than to Go's rune type.

There's an effort underway to add a section to the Rune docs detailing its use cases and how it relates to other text processing APIs in .NET and concepts in Unicode. The tracking issue is at https://github.com/dotnet/docs/issues/15845. There's also a link from that tracking issue to a current draft of the concept docs.

To me the main drawback with UnicodeScalar is the large disparity between the length of the type name and the datasize of the type. Essentially it is an int with some gaps in its domain.

However, the verbosity in usage would be extreme:

foreach (UnicodeScalar unicodeScalar in name.EnumerateUnicodeScalars())
{
     // ... unicodeScalar contains 1 int
}

vs the equivalent char over a string (and ideally people would use the new type over char as they are whole values rather than containing split values)

foreach (char c in name)
{
     // ... c contains 1 ushort
}

Rune is a compromise in type name verboseness:

foreach (Rune rune in name.EnumerateRunes())
{
     // ... rune contains 1 int
}
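
A runnable version of the comparison between iterating by char and by Rune, as a sketch assuming .NET Core 3.0 or later. The input string is chosen so that it contains one character outside the BMP.

```csharp
using System;
using System.Text;

// "a" followed by U+1F600, which UTF-16 stores as a surrogate pair.
string name = "a\U0001F600";

// Iterating by char yields UTF-16 code units, including both surrogate halves.
int charCount = 0;
foreach (char c in name)
{
    charCount++;
    Console.WriteLine($"char U+{(int)c:X4}, surrogate={char.IsSurrogate(c)}");
}

// Iterating by Rune yields whole scalar values.
int runeCount = 0;
foreach (Rune rune in name.EnumerateRunes())
{
    runeCount++;
    Console.WriteLine($"Rune U+{rune.Value:X4}, UTF-16 units={rune.Utf16SequenceLength}");
}

Console.WriteLine($"chars={charCount}, runes={runeCount}");
```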

@GrabYourPitchforks

Hello! To be honest, I got caught up in this argument not because I'm trying to convince the .NET people that the name needs to be changed, as it seems that that ship has sailed, but simply because I wanted to express my opinion to others in this thread who disagreed with it. I think it's wonderful that C# finally has a _real_ character type as opposed to the broken character type that it has had for so long, and the name is completely secondary to that. I understand that there's a huge balance to be struck between brevity and accuracy, and although I would have placed the sweet spot somewhere around CodePoint, I understand why others would disagree.

But again, I want to thank you for all the hard work in modernizing .NET's Unicode support! This is something that makes a huge difference to a lot of people around the world.
