Runtime: Add 'split' support for ReadOnlySpan<char> similar to string

Created on 23 Jan 2018 · 89Comments · Source: dotnet/runtime

Latest Update:

New proposal to re-review: Enumerator returns Range instead of ReadOnlySpan<T>
```C#
partial class MemoryExtensions
{
public static SpanSplitEnumerator Split(this ReadOnlySpan span, T separator,
StringSplitOptions options = StringSplitOptions.None) where T : IEquatable {}

public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
{
    public SpanSplitEnumerator<T> GetEnumerator() { return this;  }
    public bool MoveNext();
    public Range Current { get; }
}

}

**Previously approved API Proposal**
```C#
public static SpanSplitEnumerator<T> Split(this ReadOnlySpan<T> span, T seperator,
    StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}

public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
{
    public SpanSplitEnumerator GetEnumerator() { return this;  }
    public bool MoveNext();
    public ReadOnlySpan<T> Current { get; }
}

Split off from https://github.com/dotnet/corefx/issues/21395#issuecomment-359342832

_From @Joe4evr on January 21, 2018 23:16_

Can I throw in another suggestion? I'd really like to see some ability to split a ReadOnlySpan<char>. Obviously, you can't return a collection of Spans directly, but isn't that what Memory<T> is for?

// equivalent to the overloads of 'String.Split()'
public static IReadOnlyList<ReadOnlyMemory<char>> Split(this ReadOnlySpan<char> span, char[] seperator);
public static IReadOnlyList<ReadOnlyMemory<char>> Split(this ReadOnlySpan<char> span, char[] seperator, int count);
public static IReadOnlyList<ReadOnlyMemory<char>> Split(this ReadOnlySpan<char> span, char[] seperator, int count, StringSplitOptions options);
public static IReadOnlyList<ReadOnlyMemory<char>> Split(this ReadOnlySpan<char> span, char[] seperator, StringSplitOptions options);
public static IReadOnlyList<ReadOnlyMemory<char>> Split(this ReadOnlySpan<char> span, string[] seperator, int count, StringSplitOptions options);
public static IReadOnlyList<ReadOnlyMemory<char>> Split(this ReadOnlySpan<char> span, string[] seperator, StringSplitOptions options);

> The reason for choosing `IReadOnlyList<T>` is to make the resulting collection indexable, just like `string[]`. It would be nice if the implementation is `ImmutableArray<T>`, but I'm not sure if that's a concern for distribution and such.

_From @ahsonkhan on January 22, 2018 01:36_

@Joe4evr, out of curiosity, do you have a scenario atm where these APIs would be useful? If so, can you please show the code sample?

> I would replace the char[] overloads with ReadOnlySpan\
_From @Joe4evr on January 22, 2018 08:35_

My scenario is taking a relatively big string of user input and then parsing that to populate a more complex object. So rather than take the whole string at once, I'd like to parse it in pieces at a time. It'd be pretty nice if this can be facilitated by the `Span`/`Memory<T>` APIs so that this code won't have to allocate an extra 30-40 tiny strings whenever it runs.

Admittedly, I only started on this particular case earlier today, mostly to experiment and find out how much I could get out of the Span APIs at this time.

> Maybe it was a bit naive of me to expect a collection like I did, but I'd at least like to see some API to deal with this scenario a little easier, because I'll probably not be the only one looking to split a span up into smaller chunks like this.

_From @stephentoub on January 22, 2018 08:41_

Splitting support would be good, but I don't think it would look like the proposed methods; as @ahsonkhan points out, that would result in a lot of allocation (including needing to copy the whole input string to the heap, since you can't store the span into a returned interface implementation).

I would instead expect a design more like an iterator implemented as a ref struct, e.g.
`C# public ref struct CharSpanSplitter { public CharSpanSplitter(ReadOnlySpan<char> value, char separator, StringSplitOptions options); public bool TryMoveNext(out ReadOnlySpan<char> result); }`

cc @KrzysztofCwalina, @stephentoub, @Joe4evr

api-needs-work area-System.Memory utf8-impact

Source

ahsonkhan

👍38 ❤2

Most helpful comment

A slightly different take on this. Considering that Range is now available, and the non-zero cost of Slice on potentially-many results, what if the enumerable returned Ranges instead of sliced RoS. In other words, a list of delineations. The consumer could then decide when/if to Slice(range)

foreach (var range in span.Split(','))
{
    var x = span.Slice(range);
}

Perhaps that's too abstract for a public api.

grant-d on 16 Apr 2019

👍8

All 89 comments

Presumed usage would be like this?

var splitter = span.Split(',', StringSplitOptions.RemoveEmptyEntries);
while (splitter.TryMoveNext(out var slice))
{
    //....
}

Yeah, looks good.

Joe4evr on 23 Jan 2018

Would it be possible to implement the enumerable pattern, even if it's not possible to implement IEnumerable<T>?

I think the interface would look something like:

```c#
public static SpanSplitEnumerable Split(this ReadOnlySpan span, char seperator);

public ref struct SpanSplitEnumerable
{
public SpanSplitEnumerator GetEnumerator();
}

public ref struct SpanSplitEnumerator
{
public bool MoveNext();
public ReadOnlySpan Current { get; }
}

That way, the usage could be:

```c#
foreach (var slice in span.Split(','))
{
    …
}

If indexing is important, something like ref struct SpanSplitList would probably work, but I think it couldn't guarantee zero allocations.

svick on 23 Jan 2018

👍6

Sorry, didn't see this issue before posting https://github.com/dotnet/corefx/issues/21395#issuecomment-359802015.

As mentioned, I'd love to see something like Google Guava's CharMatcher for this. It could be useful for other scenarios as well.

khellang on 23 Jan 2018

It would be great if the split worked also with Span and returned ReadOnlyMemory>, ReadOnlySpan>

grahamehorner on 4 May 2018

It would be great if the split worked also with Span and returned ReadOnlyMemory>, ReadOnlySpan>

Clarifying the comment so the generic types show up:
It would be great if the split worked also with Span<byte> and returned ReadOnlyMemory<ReadOnlyMemory<T>>, ReadOnlySpan<ReadOnlyMemory<T>>

ahsonkhan on 4 May 2018

Simple implementation for SpanSplitEnumerable<T> Split<T>(this ReadOnlySpan<T> span, T separator) interface where the return type is foreachable. Unit tests included.

https://gist.github.com/LordJZ/92b7decebe52178a445a0b82f63e585a

The sentinel is a "clever trick" to avoid having a boolean field, can be replaced with a backing field.

LordJZ on 23 May 2018

👍4

We needed features like this on top of ASP.NET's StringTokenizer type in the aspnet/WebHooks repo. See that repo's TrimmingTokenizer and have seen some interest (aspnet/WebHooks#324) in making that more broadly available.

It would be great to point customers to something in the BCL instead of (say) adding TrimmingTokenizer features to StringTokenizer. What is the expected timeframe for the "Future" milestone that would include a dotnet/corefx#26528 fix?

/cc @davidfowl

dougbu on 7 Sep 2018

@ahsonkhan if this proposal makes sense to you, perhaps you could help shepherd it to review, as it seems there would be volunteers to implement it?

danmosemsft on 10 Sep 2018

cc @JeremyKuhne who has been doing thinking about low allocation string operations.

danmosemsft on 10 Sep 2018

StringBuilder added an enumerator for the next version that we should consider when deciding on a pattern for enumerating spans. https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Text/StringBuilder.cs#L587

JeremyKuhne on 10 Sep 2018

if this proposal makes sense to you, perhaps you could help shepherd it to review, as it seems there would be volunteers to implement it?

Let me make sure the API shape is clear. Incorporating the recent feedback (to get foreach support, make it generic), here is the API:

```C#
public static SpanSplitEnumerable Split(this ReadOnlySpan span, T seperator,
StringSplitOptions options = StringSplitOptions.None) where T : IEquatable {}

public ref struct SpanSplitEnumerable where T : IEquatable
{
public SpanSplitEnumerator GetEnumerator();
}

public ref struct SpanSplitEnumerator where T : IEquatable
{
public bool MoveNext();
public ReadOnlySpan Current { get; }
}
```

ahsonkhan on 12 Sep 2018

@KrzysztofCwalina this is the enumerating spans issue we discussed yesterday.

JeremyKuhne on 12 Sep 2018

Does StringSplitOptions really make sense for a generic T?

khellang on 12 Sep 2018

Does StringSplitOptions really make sense for a generic T?

I suppose a SpanSplitOptions which would mimic StringSplitOptions makes more sense here.

candoumbe on 12 Sep 2018

Should the enumerator and the enumerable be combined? i.e. what's the reason to have two types?

KrzysztofCwalina on 12 Sep 2018

@KrzysztofCwalina I think it makes the API cleaner and easier to understand. The foreach pattern requires an enumerable and an enumerator, and that's exactly what the proposed API provides.

It there any advantage in combining them, apart from decreasing the size of the API surface?

svick on 13 Sep 2018

Video

We don't think we want these APIs:

They allocate
They aren't as convenient as String.Split

If you care about allocations, then you want a different API. And if you don't care about allocations, well, then you can just use String.Split.

So what API would we like to see? It could be a struct-based enumerator that allows the consumer to foreach the individual spans without allocations. Alternatively (or additionally) we could have a method that allows the consumer to pass in a buffer with the locations, which, for the most part, could be stackallocated.

terrajobst on 25 Sep 2018

We don't think we want these APIs:
1. They allocate

I readily concur with this. If I don't care about allocations, I'll simply be using string.Split. If I care enough to want to avoid allocating substrings, then obviously I would like to avoid _any_ allocations whatsoever.

Suggest ReadOnlySpan<T>.Split returns ReadOnlySpan<ReadOnlySpan<T>> on the stack so there is no more allocation.

If I want to use foreach, which allocations, then I can still enumerate over the ReadOnlySpan<ReadOnlySpan<T>>. Otherwise, I'll use a simple for loop to have zero allocation.

schungx on 25 Oct 2018

Can you use ReadOnlySpan<T> as a generic argument?

khellang on 25 Oct 2018

@schungx

Suggest ReadOnlySpan<T>.Split returns ReadOnlySpan<ReadOnlySpan<T>> on the stack so there is no more allocation.

Split can't return a Span that was stack allocated by itself. It could take that Span as an additional parameter, but then you have to somehow handle the case where that Span is not big enough. I believe this is what @terrajobst meant by "we could have a method that allows the consumer to pass in a buffer with the locations".

svick on 25 Oct 2018

@terrajobst, the API that @svick proposed above does not seem to allocate. Am I missing something? Or are you saying that we want both convenience and no allocations?

@svick, I wonder how important it is for this enumerator to be easy to understand. The classic enumerator pattern has two types for many reasons that are less and less applicable once you deal with by ref structs that cannot implement enumerator interfaces and get copied when passed around. I do agree we should discuss pros and cons and maybe indeed it's better to have two types, but I wanted to open the discussion as it seems such an overkill to add two by ref types just so we can split.

KrzysztofCwalina on 25 Oct 2018

@KrzysztofCwalina I agree that my argument of "this version of the API is slightly easier to understand" is fairly weak in this case. But in my opinion, the argument of "this version of the API is slightly smaller" is even weaker, which is why I prefer to have two types.

But ultimately it doesn't make that much of a difference, as long as I'll be able to have foreach (var slice in span.Split(',')) without allocations, I'll be happy.

svick on 25 Oct 2018

@svick I'm suspect a normal foreach will always allocate because the current state must be kept somewhere, even if you somehow come up with value-based iterators...

And what is the _purpose_ of this again? That's to prevent allocations in parsing code in tight loops, where all we're doing is manipulating streams of text without ever changing them.

schungx on 31 Oct 2018

@schungx There's nothing really new to come up with. For example, foreach over a List<T> already does not allocate on the heap, because the current state is kept in the enumerator struct on the stack. A very similar approach can be used here.

svick on 31 Oct 2018

❤1

They allocate

It could be a struct-based enumerator that allows the consumer to foreach the individual spans without allocations.

Should the enumerator and the enumerable be combined? i.e. what's the reason to have two types?

The APIs, as suggested here, don't allocate and are essentially what was recommended. Maybe it was missed since it wasn't at the top. Will copy it over to the top post.

Making it into a single type:
```C#
public static SpanSplitEnumerator Split(this ReadOnlySpan span, T seperator,
StringSplitOptions options = StringSplitOptions.None) where T : IEquatable {}

public ref struct SpanSplitEnumerator where T : IEquatable
{
public SpanSplitEnumerator GetEnumerator() { return this; }
public bool MoveNext();
public ReadOnlySpan Current { get; }
}
```

ahsonkhan on 13 Nov 2018

❤1 👍1

We don't think it should be constrained to just char, so leaving it as T is fine.
While we don't want to expose an overload that works on Span<T>, we should make sure we can
- We should rename SpanSplitEnumerator to ReadOnlySpanSplitEnumerator to
The enumerator should probably live on the same type as the method, which would be MemoryExtensions
Using StringSplitOptions might feel odd, but (1) we've never extended it and (2) and the operations are about how the split is performed, which applies to this API to
If we ever need to expose char[] (matches any) we'd add a new method called SplitAny so that we can also have the equivalent of string (matches the sequence)
We might want to provide a similar method for memories, but let's wait until that's required

terrajobst on 27 Nov 2018

Are we going to have an overload that takes ReadOnlySpan as the delimiter as well?

Gnbrkm41 on 11 Apr 2019

I tried implementing the APIs above, and additionally a two more split methods in my personal repository. The types and methods themselves can be found in ConsoleApp4 (lol) folder, and the tests can be found in SplitTests folder.

public static ReadOnlySpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> span, T delimiter,
    StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
public static ReadOnlySpanSplitBySequenceEnumerator<T> Split<T>(this ReadOnlySpan<T> span,
    ReadOnlySpan<T> delimiter, StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
public static ReadOnlySpanSplitByAnyEnumerator<T> SplitByAny<T>(this ReadOnlySpan<T> span,
    ReadOnlySpan<T> delimiters, StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}

The first one replaces string.Split(char), the second one replaces string.Split(string), and the third replaces string.Split(char[]). The last two are not really included in the original discussion (and would take a bit of time to go through approval process even if it does), but I personally think it is worth including to have the complementary alternatives as well.

It's not a great code perhaps, but I'd love to open a PR for this issue (with the code above). Although the last two methods are not approved, at least I can open one for the regular T version.

Gnbrkm41 on 15 Apr 2019

foreach (var range in span.Split(','))
{
    var x = span.Slice(range);
}

Perhaps that's too abstract for a public api.

grant-d on 16 Apr 2019

👍8

Did some experimenting with this as well.
Not sure if it's an optimal solution though.
https://github.com/bbartels/coreclr/blob/master/src/System.Private.CoreLib/shared/System/MemoryExtensions.Split.cs

bbartels on 17 Apr 2019

@grant-d I like that API

davidfowl on 18 Apr 2019

That seems reasonable to me.

stephentoub on 18 Apr 2019

Only problem is that it would conflict with the approved api, as only the return type differs.

bbartels on 18 Apr 2019

Correct; it would need to go back into api triage

grant-d on 19 Apr 2019

Returning ranges has some real advantages, namely we can operate on a ReadOnlySpan<T> but then apply the results to other types, in particular Span<T>, Memory<T>, and ReadOnlyMemory<T>, e.g.
C# foreach (Range r in memory.Span.Split(':')) Process(memory[r]);

stephentoub on 19 Apr 2019

👍5

Correct; it would need to go back into api triage

To clarify, can you share the proposed range based API shape and then we can reconcile it/review it?
I can update the original post if you provide the precise shape, and we can re-review.

ahsonkhan on 19 Apr 2019

I suppose it's just substituting Range for ReadOnlySpan\ I don't think overloading Slice would work as stephen suggested (Probably just a typo). A ReadOnlySpan.Slice call would be ambiguous, as it already has an overload that takes an int.

So something like:

public static SpanSplitEnumerator<T> Split(this ReadOnlySpan<T> span, T seperator,
    StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}

public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
{
    public SpanSplitEnumerator<T> GetEnumerator() { return this;  }
    public bool MoveNext();
    public Range Current { get; }
}

bbartels on 19 Apr 2019

👍1

I don't think overloading Slice would work as stephen suggested (Probably just a typo)

I didn't suggest overloading Slice. If a type has a Slice(int, int) method, the C# 8 compiler now supports passing a Range as an indexer on the type and the compiler expands it into a call to the Slice(int, int) method.

stephentoub on 19 Apr 2019

I didn't suggest overloading Slice.

memory.Span.Slice(':')

I believe this is what's causing the confusion.

ahsonkhan on 19 Apr 2019

Returning ranges has some real advantages, namely we can operate on a ReadOnlySpan<T> but then apply the results to other types, in particular Span<T>, Memory<T>, and ReadOnlyMemory<T>, e.g.
foreach (Range r in memory.Span.Slice(':')) Process(memory[r]);

I meant in the foreach, you used memory.Span.Slice(':').

bbartels on 19 Apr 2019

Ah, sorry, I meant Split, not Slice.

stephentoub on 19 Apr 2019

👍1

Maybe it would make sense to add another enumerator, which would allow for splitting by a sequence.
Just like string.Split("??") is possible.

public static SpanSplitEnumerator<T> Split(this ReadOnlySpan<T> span, T seperator,
    StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}

public static SpanSplitSequenceEnumerator<T> Split(this ReadOnlySpan<T> span, ReadOnlySpan<T> seperators,
    StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}

public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
{
    public SpanSplitEnumerator<T> GetEnumerator() { return this;  }
    public bool MoveNext();
    public Range Current { get; }
}

public ref struct SpanSplitSequenceEnumerator<T> where T : IEquatable<T>
{
    public SpanSplitSequenceEnumerator<T> GetEnumerator() { return this;  }
    public bool MoveNext();
    public Range Current { get; }
}

Could also be merged into one enumerator that just holds an ReadOnlySpan for separators. Would be a little more space efficient when T is a large struct (> 12 bytes on 64 bit systems), but also a little less space efficient on smaller structures (which I reckon is the most common use-case with char). Maybe something to debate in triage.

bbartels on 19 Apr 2019

A split api that returns ranges is pure genius @grant-d. That would be a really nice api to aim for.

NinoFloris on 29 Apr 2019

👍3

Sad to see we haven't made progress on this!

scalablecory on 10 Jul 2019

Can I suggest allowing to pass a delegate bool IsSeparatorPredicate(ReadOnlySpan<T> span)?

This will enable splitting on surrogate pairs etc.

scalablecory on 10 Jul 2019

@scalablecory, how about Rune? (and ReadOnlySpan of Rune)

Gnbrkm41 on 10 Jul 2019

Rune doesn't solve grapheme clusters, and dealing with normalization would become unwieldy. A predicate would handle all cases.

scalablecory on 10 Jul 2019

Isn't one of the main points within the Span ecosystem to be allocation free? I feel like bringing in a delegate would just go against that mentality. Wouldn't allowing to pass a ReadOnlySpan as a separator solve the graphmere cluster problem just as well, without an allocation?

bbartels on 10 Jul 2019

Can I suggest allowing to pass a delegate bool IsSeparatorPredicate(ReadOnlySpan span)?

And turn split into an O(N^2) algorithm?

I think this should be kept relatively simple, making the common case easy.

stephentoub on 10 Jul 2019

Can I suggest allowing to pass a delegate bool IsSeparatorPredicate(ReadOnlySpan span)?

And turn split into an O(N^2) algorithm?

I think this should be kept relatively simple, making the common case easy.

I don't see how this would be any different than String.Split(string[])

scalablecory on 10 Jul 2019

I don't see how this would be any different than String.Split(string[])

Because string.Split(string[]) takes an array of the exact strings to use, which means a) the implementation knows exactly how many characters need to be compared, and b) regardless string matching algorithms exist to find substrings in O(N+M) time rather than the naive O(N*M) time. Your proposed IsSeparatePredicate that takes a span would presumably need to be invoked at every character and for every length after that character until the end of the string to determine whether it represents a separator.

stephentoub on 10 Jul 2019

I don't see how this would be any different than String.Split(string[])

Because string.Split(string[]) takes an array of the exact strings to use, which means a) the implementation knows exactly how many characters need to be compared, and b) regardless string matching algorithms exist to find substrings in O(N+M) time rather than the naive O(N*M) time. Your proposed IsSeparatePredicate that takes a span would presumably need to be invoked at every character and for every length after that character until the end of the string to determine whether it represents a separator.

Oh, I see the confusion; my fault. I should not make suggestions right before going to sleep!

Imagine something more like: bool IsSeparatorPredicate(ReadOnlySpan span, out int length), which would be called only once per character. it gets passed the full remaining span, and returns the length of the separator to skip in the out param.

scalablecory on 10 Jul 2019

Imagine something more like: bool IsSeparatorPredicate(ReadOnlySpan span, out int length), which would be called only once per character. it gets passed the full remaining span, and returns the length of the separator to skip in the out param.

Don't think I would be opposed to this idea, but I definitely think there should be a separate (simpler) type that would just handle the cases of splitting on a single character or splitting on a sequence.

bbartels on 10 Jul 2019

there should be a separate (simpler) type

And cheaper. Adding O(N) delegate invocations in an API whose purpose is to reduce overheads is not the right choice. We should start with the 99% case, and only add additional complexity/cost when and where there's a real need.

stephentoub on 10 Jul 2019

👍1

Agreed, not proposing to get rid of the simple overload :)

scalablecory on 10 Jul 2019

Video

Based on the conversation, we believe we would prefer something more scoped:

```C#
partial class MemoryExtensions
{
public static SpanSplitEnumerator Split(this ReadOnlySpan span);
public static SpanSplitEnumerator Split(this ReadOnlySpan span, char separator);
public static SpanSplitEnumerator Split(this ReadOnlySpan span, string separator);

// This API seems possible, but we can defer until we see scenarios
// public static SpanSplitEnumerator<char> SplitAny(this ReadOnlySpan<char> span, ReadOnlySpan<char> separators);

// This API seems possible, but we should defer until we see scenarios. It wouldn't be good make this work for UTF8, for example.
// public static SpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> span, T separator);

}
```

The enumerator returns ranges so you cans split a memory via this API and use the span to slice the memory and to unify span and read-only spans. It's unclear whether we need different enumerator types for different Split behaviors, but we don't expect people to exchange those, which is why they are currently proposed to be nested types too.

terrajobst on 20 Aug 2019

I should have time in the next couple of weeks to tackle this

grant-d on 20 Aug 2019

👍1

Having this for bytes would be useful as well.....

davidfowl on 21 Aug 2019

👍4 😄1

Unless I am missing something, with the currently proposed SpanSplitEnumerator:

public static SpanSplitEnumerator<char> Split(this ReadOnlySpan<char> span, string separator);

would not work with the other overloads. At least not without an allocation.

If the SpanSplitEnumerator holds a single T as a separator, one could not split on string, and if the type holds a ReadOnlySpan\ Which in this case afaik would only work by allocating an array. A stack allocation wouldn't be possible, as the Enumerator is returned from the Split method.

One solution could be to have a separate type for splitting on sequences.

Threw together a quick implementation of this to showcase this issue:
https://gist.github.com/bbartels/87c7daae28d4905c60ae77724a401b20

bbartels on 21 Aug 2019

@bbartels we also decided that the enumerator type may differ depending on overload as needed.

scalablecory on 21 Aug 2019

@davidfowl we're hoping to defer anything regarding UTF-8 until we iron out Char8, so we don't have duplicate byte and Char8 APIs.

scalablecory on 21 Aug 2019

If the SpanSplitEnumerator holds a single T as a separator, one could not split on string, and if the type holds a ReadOnlySpan then you would somehow need to pass the single char separator to the type.

It could hold both and just use one of them, with a bool or enum value dictating which (or neither) to use.

stephentoub on 21 Aug 2019

Not really sure what the corefx team would value more, fewer types overall and less code duplication or a slimmer memory footprint on each individual type.
For now I went with two different types based on splitting on a single length separator and a multi length one.
But could change it to what @stephentoub suggested if deemed better!

Added some tests to verify correctness of my proposed implementation. (Tests were adapted from the official string.split tests)
https://gist.github.com/bbartels/87c7daae28d4905c60ae77724a401b20

bbartels on 21 Aug 2019

May I suggest another overload for enumerating backwards through the sequence?

Just came across a scenario today where it would have required me to check equality of a splitted sequence starting at the last entry. This way I wouldn't have to store the result somewhere before I iterate through them in reverse.
Something like:

partial class MemoryExtensions
{
    public static SpanSplitEnumerator<char> Split(this ReadOnlySpan<char> span, bool reverse = false);
    public static SpanSplitEnumerator<char> Split(this ReadOnlySpan<char> span, char separator, bool reverse = false);
    public static SpanSplitEnumerator<char> Split(this ReadOnlySpan<char> span, string separator, bool reverse = false);
}

bbartels on 22 Aug 2019

Or you could add a Reverse method to SpanSplitEnumerator<char> that returns another enumerator that goes in reverse?

khellang on 22 Aug 2019

I like the idea, but I feel like it might be a little ambiguous. As in, does it return a completely new enumerator that has no previous state, or assuming you've already called MoveNext() a couple times does the returned enumerator preserve that state and just goes backwards from there.

bbartels on 22 Aug 2019

@bbartels That could be solved by splitting SpanSplitEnumerator into enumerable and enumerator: only the enumerable would have Reverse(), so there is no confusion about what calling Reverse() in the middle of enumerating does, because it's not possible. (Note that I've unsuccessfully argued for that split before, for a different reason.)

svick on 22 Aug 2019

I'd actually like a Reverse() method on the SpanSplitEnumerator that doesn't return anything, just flips a switch to change direction. This way you could use it as a lookahead: If a certain condition is true for the current element, you could go back and do some extra processing on the previous.

Would 'require' the aforementioned overloads for the Split methods, as you'd otherwise have to manually get the enumerator and call Reverse before you could foreach over it.

bbartels on 22 Aug 2019

@bbartels you can do lookahead and reset with current spec already; just save a copy of the enumerator and restore if you need to back up. granted, you can't easily go back a variable number of times...

scalablecory on 23 Aug 2019

@grant-d I assigned this to you. thanks!

danmosemsft on 6 Sep 2019

🎉1

_[Edited]_
Question about desired behavior.
Compare the existing string idiom (assume we _keep_ empty entries):

// 012345
  "|ab|c|".Split('|'); // string[4] { "", "ab", "c", --> "" <-- }

Note the final "" on the _rhs_ boundary.

If we use a comparable Range enumerator instead (per this issue), then the results would _ostensibly_ be:

Range(0, 0) // <empty> (ok)
Range(1, 3) // ab (ok)
Range(4, 5) // c (ok)
Range(6, 6) // <empty> (oops!)

The final value is questionable since start=6 would be IndexOutOfRange.
So we have a different semantic to the existing idiom.

Do we emit a final value regardless, even though it represents an invalid index?
Or do we ignore it.
(My vote) Or maybe it should be Range(5, 5) which sorta kinda parallels Range(0, 0). Though it breaks the pattern of next.start = previous.end + 1, eg (1, _3_) -> (_4_, 5)

grant-d on 7 Sep 2019

👍1

@terrajobst?

danmosemsft on 7 Sep 2019

I think Range(6,6) is more appropriate. There shouldn't be any IndexOutOfRangeException, because anything using the range shouldn't read >= the exclusive end index.

c# "|ab|c|"[6..6].ToString() == ""

scalablecory on 7 Sep 2019

Though in the case of Range(6, 6), I think the first element should then be Range(-1, -1) to retain semantic equivalence.

bbartels on 7 Sep 2019

I am busy implementing and noticed there's this overload:
Split(this ReadOnlySpan<char> span, string separator);,
but no corresponding overload with:
Split(this ReadOnlySpan<char> span, string separator, StringComparison comparisonType);

Is that a spec miss, should I add it?
Or was it intentional, in which case is the behavior Ordinal?

grant-d on 10 Sep 2019

Please un-assign me, I am not going to get to this any time soon. Thanks

grant-d on 11 Nov 2019

Done. Thanks.

stephentoub on 11 Nov 2019

I already have a working implementation that conforms with https://github.com/dotnet/corefx/issues/26528#issuecomment-523152939: https://gist.github.com/bbartels/87c7daae28d4905c60ae77724a401b20.
I could submit a PR once it is migrated to dotnet/runtime?

bbartels on 11 Nov 2019

👍4

@bbartels please do :)

scalablecory on 11 Nov 2019

I would love to see these extensions for ReadOnlySpan with any generic equtable type, ignore-empty and definitely for ReadOnlySequence - in my experience when working with pipes sequences are more useful than spans.

shutdown256 on 26 Nov 2019

@bbartels , dotnet/runtime is open for business :)

danmosemsft on 16 Dec 2019

@danmosemsft I really should have commented here, I already submitted a PR a couple weeks ago!
See https://github.com/dotnet/runtime/pull/295

Could use some feedback over there regarding the bullet points I noted!

bbartels on 17 Dec 2019

Ran into a good use case for this just today. Would love to see this finished up.

One minor nitpick though: doesn't having a single type acting as both enumerable and enumerator create a less-than-obvious inconsistency with the rest of .NET?

var split = "a,b,c".Split(',');

foreach (var letter in split) 
{
    // Hit three times
} 

foreach (var letter in split) 
{
    // Hit three times 
}

var splitSpan = "a,b,c".AsSpan().Split(',');

foreach (var letter in splitSpan) 
{
    // Hit three times
} 

foreach (var letter in splitSpan) 
{
    // Never hit because splitSpan was already consumed by first foreach
}

jonathanmarston on 5 Feb 2020

@jonathanmarston

One minor nitpick though: doesn't having a single type acting as both enumerable and enumerator create a less-than-obvious inconsistency with the rest of .NET?

The compiler creates a copy of the struct enumerator before enumerating (SharpLab Example).

SeeminglyScience on 5 Feb 2020

The compiler creates a copy of the struct enumerator before enumerating
Ah yes. I see it now. Thanks!

jonathanmarston on 5 Feb 2020

👍1

is this already available? 🥺

wstaelens on 21 Apr 2020

@wstaelens Still waiting for the PR (https://github.com/dotnet/runtime/pull/295) to be reviewed/approved, but feel free to copy the implementation into your codebase if you need it early.

bbartels on 21 Apr 2020

Closed in https://github.com/dotnet/runtime/pull/295.