To get text elements from a string, you can currently use System.Globalization.StringInfo.GetTextElementEnumerator:
```c#
public class StringInfo
{
    public static TextElementEnumerator GetTextElementEnumerator(string str);
    public static TextElementEnumerator GetTextElementEnumerator(string str, int index);
}

public class TextElementEnumerator : IEnumerator
{
    public bool MoveNext();
    public object Current { get; }
    public string GetTextElement();
    public int ElementIndex { get; }
    public void Reset();
}
```
Notice that `TextElementEnumerator` is a (non-generic) enumerator, not enumerable, so it can't be used in a `foreach` or in LINQ or pretty much any other collection-related operation. To use it, you write code like:
```c#
var textElementEnumerator = StringInfo.GetTextElementEnumerator(s);
while (textElementEnumerator.MoveNext())
{
    string textElement = textElementEnumerator.GetTextElement();
    // process textElement here
}
```
This makes the API unfamiliar and inconvenient to use. I propose that a new API based on IEnumerable<T> should be added.
```c#
class StringInfo
{
    public static IEnumerable<string>
    public static IEnumerable<string>
}
```
The new methods work like the old methods, except that they return a generic enumerable instead of an enumerator.
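As a rough illustration of the behavior described above, the existing enumerator can be wrapped in an iterator method. This is only a sketch: `TextElementSketch` and `EnumerateTextElements` are hypothetical names chosen for this example, not the names from the proposal.

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class TextElementSketch
{
    // Hypothetical helper (not a framework API): exposes the text elements
    // of a string as IEnumerable<string> by wrapping the existing enumerator.
    public static IEnumerable<string> EnumerateTextElements(string str, int index = 0)
    {
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(str, index);
        while (enumerator.MoveNext())
        {
            yield return enumerator.GetTextElement();
        }
    }
}
```

Because this is an iterator method, any exception for an invalid `str` or `index` would only surface when enumeration starts, which a real implementation would probably want to avoid.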
## Usage
The new API could be used just like any other `IEnumerable<T>`, e.g.:
```c#
foreach (var textElement in StringInfo.GetTextElementEnumerator(s))
{
    // process textElement here
}
```
- `IEnumerable<string>` means information about `ElementIndex` is lost. Is that information useful enough to use something like `IEnumerable<(string textElement, int index)>` instead? (Possibly using a custom struct instead of a tuple.)
- `string` is a substring of the input string. Would it be worth waiting for spans and using `IEnumerable<ReadOnlyMemory<char>>` instead?
- `IEnumerable<string>` requires allocating that `IEnumerable<string>`. Would it be worth returning a struct enumerable with a struct enumerator instead, which would avoid the allocation when used in `foreach`?

I think having `TextElementEnumerator` implement the `IEnumerable` interface should be enough to have it work with `foreach`.
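For the question about keeping `ElementIndex`, a tuple-returning variant could look roughly like the following. It is a sketch only; `IndexedTextElementSketch` and `EnumerateWithIndex` are made-up names for illustration.

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class IndexedTextElementSketch
{
    // Hypothetical helper (not a framework API): yields each text element
    // together with the index in the source string where it starts.
    public static IEnumerable<(string textElement, int index)> EnumerateWithIndex(string str)
    {
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(str);
        while (enumerator.MoveNext())
        {
            yield return (enumerator.GetTextElement(), enumerator.ElementIndex);
        }
    }
}
```

A caller could then write `foreach (var (textElement, index) in IndexedTextElementSketch.EnumerateWithIndex(s))` and still know where each element came from.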
@tarekgh That's an interesting option. I don't like that it would mean that the method name GetTextElementEnumerator and the type name TextElementEnumerator would become inaccurate, but it's nice that it does not increase API surface by much.
With all of the new Rune support being added, including enumerators, is this still needed?
cc: @GrabYourPitchforks
@stephentoub Rune represents Unicode scalars, which are not the same thing as text elements. For example the string "a\u0301" (where the second character is U+0301 COMBINING ACUTE ACCENT) is a single text element (and GetTextElementEnumerator does return it that way), but it's two Unicode scalars.
So, unless the Rune API also has some explicit support for enumerating text elements that I missed, I don't think it changes anything here.
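To make the difference concrete, here is a small sketch that counts both for the same string. It assumes a runtime where `String.EnumerateRunes()` is available (.NET Core 3.0 or later); the class and method names are just for this example.

```csharp
using System.Globalization;
using System.Text;

public static class RuneVersusTextElement
{
    // Counts Unicode scalar values by walking the string's runes.
    public static int CountRunes(string s)
    {
        int count = 0;
        foreach (Rune rune in s.EnumerateRunes())
        {
            count++;
        }
        return count;
    }

    // Counts text elements using the existing StringInfo support.
    public static int CountTextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }
}
```

For `"a\u0301"`, `CountRunes` returns 2 while `CountTextElements` returns 1.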
> So, unless the Rune API also has some explicit support for enumerating text elements that I missed
I thought I heard Levi talking about doing that, which is why I asked, but maybe I misunderstood.
I found this comment by @GrabYourPitchforks from June: https://github.com/dotnet/corefxlab/issues/2350#issuecomment-397425721. Assuming it's still current and that I understand it correctly, I think it means that Utf8String/UnicodeScalar/Rune won't have API for enumerating text elements.
I think @svick is right and we should keep this proposal for now.
Ok.
If we do this, it should probably be married with an overall logic update to TextElementEnumerator. The type's current logic is based on a fairly old version of the Unicode Standard, and it would be great to modernize it.
> If we do this, it should probably be married with an overall logic update to TextElementEnumerator. The type's current logic is based on a fairly old version of the Unicode Standard, and it would be great to modernize it.
How breaking would it be if we updated to the latest standard regardless for 3.0?
> How breaking would it be if we updated to the latest standard regardless for 3.0?

We don't consider updating the Unicode data a breaking change. It is like any other globalization data, which can change at any time.
By the way, `StringInfo` is already using `CharUnicodeInfo`, which should be using the Unicode data we have updated to the latest release of the Unicode Standard.
Quick update: I found a way to smuggle the TR-29 grapheme break data (see https://www.unicode.org/reports/tr29/) in the existing CharUnicodeInfo class without significantly expanding the size of the RVA static data. Will experiment with this a bit more and submit a PR over the next few days.
@ahsonkhan You had ideas for APIs which essentially took an input and returned an IEnumerable<Range>. Do you think something like that might be applicable more generally to string APIs?
APIs like this don't really need to be allocation-free the way the `ReadOnlySpan<char>` APIs do; when working with `string`, usability is a higher concern than performance. One could imagine returning `IEnumerable<string>`, but there is significant value in knowing _where_ each substring occurred in the original source string, so `Range` springs to mind.
But your thoughts on this would be appreciated since you have recent experience with this type of API surface. :)
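A rough sketch of what an `IEnumerable<Range>` shape might look like, built on top of the existing enumerator. The names `TextElementRangeSketch` and `EnumerateTextElementRanges` are hypothetical, and this assumes C# 8 / .NET Core 3.0 or later for `System.Range`.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementRangeSketch
{
    // Hypothetical helper (not a framework API): yields the range each
    // text element occupies in the source string.
    public static IEnumerable<Range> EnumerateTextElementRanges(string str)
    {
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(str);
        while (enumerator.MoveNext())
        {
            int start = enumerator.ElementIndex;
            // GetTextElement() still allocates a string here; a real
            // implementation would presumably compute the length directly.
            int length = enumerator.GetTextElement().Length;
            yield return start..(start + length);
        }
    }
}
```

Callers who want the substring can index the original string with the range (`s[range]`), and callers who only need positions can use `range.Start` and `range.End`.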
I feel like something like this should be a method directly on System.String, similar to the new EnumerateRunes() method. This is not something that most languages have yet done, but I think Swift set a good example by making extended grapheme cluster iteration so easily accessible (Swift actually made it the default way to iterate over a string, which is obviously not feasible for C#). I think that just like EnumerateRunes(), having it be directly on the type would allow for much higher visibility and awareness, and make it easier to reach for this when it is appropriate, instead of falling back on iterating code points or (even worse) UTF-16 code units when what you really mean is user-perceived characters.