Runtime: Proposal: Better API for StringInfo.GetTextElementEnumerator

Created on 19 Nov 2016 · 15 Comments · Source: dotnet/runtime

To get text elements from a string, you can currently use System.Globalization.StringInfo.GetTextElementEnumerator:

```c#
public class StringInfo
{
    public static TextElementEnumerator GetTextElementEnumerator(string str);
    public static TextElementEnumerator GetTextElementEnumerator(string str, int index);
}

public class TextElementEnumerator : IEnumerator
{
    public bool MoveNext();
    public object Current { get; }

    public string GetTextElement();

    public int ElementIndex { get; }

    public void Reset();
}
```

Notice that `TextElementEnumerator` is a (non-generic) enumerator, not enumerable, so it can't be used in a `foreach` or in LINQ or pretty much any other collection-related operation. To use it, you write code like:

```c#
var textElementEnumerator = StringInfo.GetTextElementEnumerator(s);

while (textElementEnumerator.MoveNext())
{
    string textElement = textElementEnumerator.GetTextElement();
    // process textElement here
}
```

This makes the API unfamiliar and inconvenient to use. I propose that a new API based on IEnumerable<T> should be added.

## Proposed API

```c#
public class StringInfo
{
    public static IEnumerable<string> GetTextElements(string str);
    public static IEnumerable<string> GetTextElements(string str, int index);
}
```

The new methods work like the old methods, except that they return a generic enumerable (`IEnumerable<string>`) instead of an enumerator.
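
As a rough illustration of that behavior, the proposed methods could be approximated today by a thin wrapper over the existing enumerator. This is only a sketch: `StringInfoSketch` is a hypothetical helper class, not part of the BCL, and the eager argument checks are an assumption about how such an API would validate its input.

```c#
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: these methods do not exist on StringInfo today.
public static class StringInfoSketch
{
    public static IEnumerable<string> GetTextElements(string str) => GetTextElements(str, 0);

    public static IEnumerable<string> GetTextElements(string str, int index)
    {
        // Validate eagerly (an assumption of this sketch) so bad arguments throw
        // at the call site rather than on the first MoveNext() of the iterator.
        if (str == null) throw new ArgumentNullException(nameof(str));
        if (index < 0 || index > str.Length) throw new ArgumentOutOfRangeException(nameof(index));

        return Iterate(str, index);
    }

    private static IEnumerable<string> Iterate(string str, int index)
    {
        // A fresh enumerator per enumeration keeps the sequence re-enumerable.
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(str, index);
        while (enumerator.MoveNext())
        {
            yield return enumerator.GetTextElement();
        }
    }
}
```

Because the wrapper returns `IEnumerable<string>`, it composes with `foreach` and LINQ, at the cost of one iterator allocation per call.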

## Usage

The new API could be used just like any other `IEnumerable<T>`, e.g.:

```c#
foreach (var textElement in StringInfo.GetTextElements(s))
{
    // process textElement here
}
```

## Open questions

- Returning `IEnumerable<string>` means information about `ElementIndex` is lost. Is that information useful enough to use something like `IEnumerable<(string textElement, int index)>` instead? (Possibly using a custom struct instead of a tuple.)
- Each text element string is a substring of the input string. Would it be worth waiting for spans and using `IEnumerable<ReadOnlyMemory<char>>` instead?
- Returning `IEnumerable<string>` requires allocating that `IEnumerable<string>`. Would it be worth returning a struct enumerable with a struct enumerator instead, which would avoid the allocation when used in `foreach`?
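
To make the first question concrete, here is a minimal sketch of the tuple-returning variant, built on the existing `ElementIndex` property. The class and method names (`IndexedTextElementsSketch`, `GetTextElementsWithIndex`) are hypothetical and exist only for illustration.

```c#
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper illustrating the tuple-based shape; not an existing API.
public static class IndexedTextElementsSketch
{
    public static IEnumerable<(string TextElement, int Index)> GetTextElementsWithIndex(string str)
    {
        if (str == null) throw new ArgumentNullException(nameof(str));
        return Iterate(str);
    }

    private static IEnumerable<(string TextElement, int Index)> Iterate(string str)
    {
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(str);
        while (enumerator.MoveNext())
        {
            // ElementIndex is the UTF-16 index at which the current text element starts.
            yield return (enumerator.GetTextElement(), enumerator.ElementIndex);
        }
    }
}
```

A caller could then deconstruct each item, for example `foreach (var (textElement, index) in IndexedTextElementsSketch.GetTextElementsWithIndex(s))`.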

api-needs-work area-System.Globalization


svick

Most helpful comment

If we do this, it should probably be married with an overall logic update to TextElementEnumerator. The type's current logic is based on a fairly old version of the Unicode Standard, and it would be great to modernize it.

GrabYourPitchforks on 26 Nov 2018

👍3

All 15 comments

I think having TextElementEnumerator implement the IEnumerable interface should be enough to make it work with foreach(...).

tarekgh on 21 Nov 2016

@tarekgh That's an interesting option. I don't like that it would mean that the method name GetTextElementEnumerator and the type name TextElementEnumerator would become inaccurate, but it's nice that it does not increase API surface by much.

svick on 23 Nov 2016

With all of the new Rune support being added, including enumerators, is this still needed?
cc: @GrabYourPitchforks

stephentoub on 23 Nov 2018

@stephentoub Rune represents Unicode scalars, which are not the same thing as text elements. For example the string "a\u0301" (where the second character is U+0301 COMBINING ACUTE ACCENT) is a single text element (and GetTextElementEnumerator does return it that way), but it's two Unicode scalars.

So, unless the Rune API also has some explicit support for enumerating text elements that I missed, I don't think it changes anything here.
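
The difference can be checked with a small sketch, assuming .NET Core 3.0 or later (where `string.EnumerateRunes()` is available). `TextElementVsRune` is a hypothetical helper name used only for this example.

```c#
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper comparing the two notions of "character".
public static class TextElementVsRune
{
    // Counts text elements using the existing enumerator-based API.
    public static int CountTextElements(string s)
    {
        int count = 0;
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(s);
        while (enumerator.MoveNext())
        {
            count++;
        }
        return count;
    }

    // Counts Unicode scalar values (runes).
    public static int CountRunes(string s)
    {
        int count = 0;
        foreach (Rune rune in s.EnumerateRunes())
        {
            count++;
        }
        return count;
    }
}
```

For `"a\u0301"`, `CountTextElements` returns 1 while `CountRunes` returns 2.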

svick on 23 Nov 2018

> So, unless the Rune API also has some explicit support for enumerating text elements that I missed

I thought I heard Levi talking about doing that, which is why I asked, but maybe I misunderstood.

stephentoub on 23 Nov 2018

I found this comment by @GrabYourPitchforks from June: https://github.com/dotnet/corefxlab/issues/2350#issuecomment-397425721. Assuming it's still current and that I understand it correctly, I think it means that Utf8String/UnicodeScalar/Rune won't have an API for enumerating text elements.

svick on 24 Nov 2018

I think @svick is right and we should keep this proposal for now.

tarekgh on 24 Nov 2018

Ok.

stephentoub on 24 Nov 2018

GrabYourPitchforks on 26 Nov 2018

👍3

> If we do this, it should probably be married with an overall logic update to TextElementEnumerator. The type's current logic is based on a fairly old version of the Unicode Standard, and it would be great to modernize it.

How breaking would it be if we updated to the latest standard regardless for 3.0?

stephentoub on 27 Nov 2018

> How breaking would it be if we updated to the latest standard regardless for 3.0?

We don't consider updating the Unicode data a breaking change. It is like any other globalization data, which can change at any time.

tarekgh on 27 Nov 2018

By the way, StringInfo is already using CharUnicodeInfo, which should be using the Unicode data we have updated to the latest release of the Unicode Standard.

tarekgh on 27 Nov 2018

Quick update: I found a way to smuggle the TR-29 grapheme break data (see https://www.unicode.org/reports/tr29/) in the existing CharUnicodeInfo class without significantly expanding the size of the RVA static data. Will experiment with this a bit more and submit a PR over the next few days.

GrabYourPitchforks on 25 Sep 2019

👍2

@ahsonkhan You had ideas for APIs which essentially took an input and returned an IEnumerable<Range>. Do you think something like that might be applicable more generally to string APIs?

APIs like this don't really need to be allocation-free the way the ROS<char> APIs do: when considering string, usability is of higher concern than performance. One could imagine returning IEnumerable<string>, but there is significant value in knowing _where_ this substring occurred in the original source string, so Range springs to mind.

But your thoughts on this would be appreciated since you have recent experience with this type of API surface. :)
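
For illustration only, a `Range`-returning shape could look roughly like the sketch below, which assumes C# 8 / .NET Core 3.0 or later and derives the boundaries from the existing `StringInfo.ParseCombiningCharacters` method. `TextElementRangesSketch` and `EnumerateTextElementRanges` are hypothetical names, not a proposed signature.

```c#
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper illustrating an IEnumerable<Range> shape.
public static class TextElementRangesSketch
{
    public static IEnumerable<Range> EnumerateTextElementRanges(string str)
    {
        if (str == null) throw new ArgumentNullException(nameof(str));
        return Iterate(str);
    }

    private static IEnumerable<Range> Iterate(string str)
    {
        // ParseCombiningCharacters returns the start index of every text element.
        int[] starts = StringInfo.ParseCombiningCharacters(str);
        for (int i = 0; i < starts.Length; i++)
        {
            // Each element ends where the next one starts, or at the end of the string.
            int end = i + 1 < starts.Length ? starts[i + 1] : str.Length;
            yield return starts[i]..end;
        }
    }
}
```

A caller that wants the substring can index the source string with the range (`s[range]`), while a caller that only needs positions can read `range.Start` and `range.End` without allocating a string per element.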

GrabYourPitchforks on 27 Sep 2019

I feel like something like this should be a method directly on System.String, similar to the new EnumerateRunes() method. This is not something that most languages have yet done, but I think Swift set a good example by making extended grapheme cluster iteration so easily accessible (Swift actually made it the default way to iterate over a string, which is obviously not feasible for C#). I think that just like EnumerateRunes(), having it be directly on the type would allow for much higher visibility and awareness, and make it easier to reach for this when it is appropriate, instead of falling back on iterating code points or (even worse) UTF-16 code units when what you really mean is user-perceived characters.