Runtime: Provide StringBuilder.ForeachChunk for efficient access to characters in a StringBuilder

Created on 20 Jun 2017 · 43Comments · Source: dotnet/runtime

Today, StringBuilder has problem in that the only way to read the characters out of the StringBuilder is to do it a character at a time [] or call the ToString APIs to convert it to a string (and then index it). Both of these are slow.

Typically what people actually do is call ToString, which forces the creation of a large string. Often the very next thing that happens is that the String is written to a file, leaving behind a large, dead string on the GC heap.

With the advent of Span there is now an efficient, safe mechanism for extracting the data in the StringBuilder.

The proposed API is
```c#
///

/// ForEachChunk is an efficient way of accessing all the characters in the StringBuilder.
/// It is an alternative to calling ToString() on the StringBuilder.
///
/// The 'callback' delegate is called 0 or more times, each time being passed a chunk of
/// the string as a ReadOnlySpan of characters (in order). This is repeated until all
/// the characters in the span have been passed back through 'callback'.
///

/// A method that is called repeatedly, being passed a chunk
/// of the Strinbuilder with each call. If 'false' is returned by the callback
/// then ForEachChunk terminates immediately.
public void ForEachChunk(Func, bool> callback)

The actual implmentation is simple: 
```c#
        public void ForEachChunk(Func<ReadOnlySpan<char>, bool> callback)
        {
            VerifyClassInvariant();
            ForEachChunk(callback, this);
        }

        private bool ForEachChunk(Func<ReadOnlySpan<char>, bool> callback, StringBuilder chunk)
        {
            // The chunks are last written first, so we need to output the others first.  
            var next = chunk.m_ChunkPrevious;
            if (next != null && !ForEachChunk(callback, next))
                return false;

            // The fields here might be changing and inaccurate if threads are racing, but the validation that Span does on 
            // construction insures that the values, even if inaccurate, are 'in bounds'.   
            return callback(new Span<char>(chunk.m_ChunkChars, chunk.m_ChunkOffset, chunk.m_ChunkLength));
        }

The intent is that this new API can be a building block to provide extension methods on Stream (or other APIs that act like streams) that operate on StringBuilders directly (so an intermediate strings need not be formed)

Question: A variation of this is to have an explicit context parameter that is passed to the callback.
c# public void ForEachChunk<T>(Func<ReadOnlySpan<char>, T, bool> callback, T context = null) where T : class
While this is not as convenient to use, it allows the user to avoid creating closures (which force the allocation of a Func), which is faster (and the goal of this API is is all about being faster)

@stephentoub @jkotas @alexperovich @AlexGhiondea @davidfowl

api-needs-work area-System.Memory

Source

vancem

👍4

Most helpful comment

How about using ReadOnlyMemory instead of ReadOnlySpan for each chunk? Then it can be used in generics, can be used with asynchronous operations, can be passed to things that take both ReadOnlyMemory and ReadOnlySpan, etc.

stephentoub on 22 Feb 2018

👍2

All 43 comments

This looks like a reasonable thing to add. Note that I'll be opening one or more issues in the coming days with a list of APIs based on Span/Buffer I believe we should be adding, and I can put these on the list; others on StringBuilder include overloads for methods like Append/Insert for working with spans.

If we add such a ForEachChunk method, we should definitely include the generic context argument; otherwise we force the closure and delegate allocations, and the most common uses of this method will be in transferring the data to something else, so invariably there's going to need to be some context, and we should enable and encourage using it efficiently.

As a nit, isn't Span as ref-like type prohibited from being used as a generic type argument? Assuming yes, we'll also need a custom delegate type. It'll be interesting to see if a few common shapes for such delegates emerge such that we can create more generically named ones, or whether we should create them for each API named to correspond to that API.

stephentoub on 20 Jun 2017

👍1

Or StringBuilder can return the mathematical opposite of callbacks - enumerator :)

omariom on 20 Jun 2017

👍1

Vance, how do you see this behaving if the callback mutates the StringBuilder in ways that change/invalidate the chunks?

stephentoub on 20 Jun 2017

👍1

Regarding:
C# public void ForEachChunk(Func<ReadOnlySpan<char>, bool> callback)

As a nit, isn't Span as ref-like type prohibited from being used as a generic type argument?

Yes, it is prohibited. Span cannot be used as a type argument since it is a stack only type.

cc @VSadov

Edit: Added response to @stephentoub's comment.

ahsonkhan on 20 Jun 2017

I'm not a fan of callback methods like this for a couple of reasons:

You need the state callback which doesn't work well for locals (you still need to allocate)
You might need an async version in case the thing you're trying to do in the callback is async.

Is there any way we can revisit the ISpanEnumerable<T> pattern that @jaredpar suggested way back when.

davidfowl on 20 Jun 2017

👍2

I would also prefer enumerable, because:

The API is easier to understand and use. (An enumerable type instead of a callback, a context and a custom delegate type.)
Using it does not have to involve any virtual calls and involves less allocations than ForEachChunk.

How come enumerable allocates less, considering that the proposed implementation of ForEachChunk uses stack-based recursion and enumerable can't do that?

Because ForEachChunk can't do it either: for large StringBuilders, it would cause StackOverflowException. For example, the following code does that for me:

```c#
var sb = new StringBuilder();

for (int i = 0; i < 20_000; i++)
sb.Append(new string('a', 8000));

sb.ForEachChunk(c => true);
```

So, how could chunk enumerable or ForEachChunk be implemented instead? I can see four options:

Use Stack<StringBuilder> as an explicit stack that's not limited by the size of the thread stack. This requires an allocation, potentially a large one.
Use a trick where you reverse the linked list on the way to the front and then reverse it back on the way back. This is probably the most efficient solution, but would require some form of "locking" to ensure no other operation (read or write) operates on the StringBuilder at the same time.
After every chunk, iterate from the start of the linked list to find the next chunk. This approach is O(n²) (all the others are O(n)), but it does not not allocate and is also not otherwise invasive.
Turn StringBuilder from a singly-linked list to a doubly-linked list. This would require adding a pointer-sized field to every StringBuilder for a niche use case.

So, all the implementations I could think of have pretty serious drawbacks. Did I miss some? Does one of these stand out as the best option?

svick on 20 Jun 2017

but would require some form of "locking" to ensure no other operation (read or write) operates on the StringBuilder at the same time.

StringBuilder currently isn't safe to be used concurrently... why would this operation need to be? Or maybe you just meant what I asked about earlier, the callback itself accessing the builder?

stephentoub on 20 Jun 2017

@stephentoub Yeah, I didn't mean making StringBuilder thread-safe. But with that option, even reading the StringBuilder from the same thread while iterating the chunks would not be safe.

For example, consider code like this (using enumerable for cleaner code, but ForEachChunk would have the same issue):

c# foreach (var chunk in sb.EnumerateChunks()) { Console.WriteLine($"{sb} {chunk}"); }

This code would have to be disallowed, because it's reading the StringBuilder, while it's being iterated. The implicit call to StringBuilder.ToString() couldn't give the correct result, because all the m_ChunkPrevious references might currently point the wrong way.

So, while the StringBuilder's chunks are being iterated, the StringBuilder has to be "locked", even from access from the same thread, if my option 2 was chosen.

svick on 20 Jun 2017

But with that option, even reading the StringBuilder from the same thread while iterating the chunks would not be safe.

Yes, this is actually why I personally prefer the callback in this situation; it makes the scoping very clear, and, at least in my eyes (though little technically different about it), discourages accessing the StringBuilder directly.

Regardless, in any of these approaches, whether callback or enumerables, unless a copy of the data is made (which defeats the purpose of the method), StringBuilder would need to detect and prevent usage while the callback/enumerables was still outstanding, which at the very least would add checks to most other operations on the StringBuilder; not pay-for-play, though I've not measured to see if it would actually show up.

I'm also not personally concerned about the reading/writing difference you mention... it's rare in my experience to read from a StringBuilder (other than ToString, which this would be used in place of in certain situations), and when you do, it's usually to check whether to write something (e.g. checking whether it ends with a slash to know whether to add one).

stephentoub on 20 Jun 2017

We do have prior art in this space https://msdn.microsoft.com/en-us/library/bwabdf9z(v=vs.110).aspx but this pattern isn't pervasive in the BCL. This method doesn't take a state parameter so it's pretty much impossible to use without allocating.

The benefit of a push model like this is that we can do 0 allocations (using the string builder's memory directly without allocating the string). If we're going to do this I suggest we do it across the board on things that allocate internally. We need a sync and async version as well:

C# sb.CopyTo(callback, state); await sb.CopyToAsync(callback, state);

davidfowl on 20 Jun 2017

More generally, maybe StringBuilder should be a TextWriter and there should have been a CopyTo and CopyToAsync on TextWriter that took a TextReader similar to Stream.CopyToAsync.

davidfowl on 20 Jun 2017

👍1

We need a sync and async version as well

Can you share examples from ASP.NET where these would be used?

stephentoub on 20 Jun 2017

there should have been a CopyTo and CopyToAsync on TextWriter that took a TextReader

I'm not following. Did you mean CopyTo{Async} on TextReader that took a TextWriter? And that StringBuilder should have been a TextReader(which would be counterintuitive)?

stephentoub on 20 Jun 2017

I suspected that this API would solicit lively feedback and that certainly seems to be the case. Lets write down some principles that hopefully will help in deciding more details of the design.

StringBuilder is not thread-safe (this is true today). Thus we don't care if it values are not well defined on concurrent access, however we would strongly prefer that even in the presence of concurrency that we don't break type/memory safety (e.g. you don't access things beyond the bounds of the array that holds them).
This new API is not essential (after all we have lived without it for > 10 years), and the primary motivation for it is efficiency. Thus we should be catering to a more advanced user who is willing to sacrifice some ease of use to get that efficiency.
Implementations should be biased toward small strings (e.g. likely statistics are 90% < 1K (1 chunk), 99% of strings are < 1M), but of course should handle strings of any size up to 2GB characters (the limit).
This API is meant to NOT mutate. We may not be able to enforce that completely, but the intent is for it to be non-mutating. I would also argue that doing mutation in a non-mutating call (e.g. reversing pointers and unreversing them), is not a good idea.
It is likely that a common scenario for this API is to perform I/O on the string (write it out to a stream) and these APIs are likely to be Async (they may block). This argues that the proposed API above not sufficient (we need to work through that scenario).

For what it is worth, issues about reversing the chunks that were brought up is definitely solvable. Most likely by having a stack-based structure that works for cases < some number of chunks, and it bails to either an N*N or heap allocated case for very large number of chunks. Thus I would not worry about this. The big issue in my mind is the Enumerable vs Callback (which).

We should explore the Enumerable design, but it is not clear to me, how this interacts with Span and async. I need to think about it a bit...

vancem on 20 Jun 2017

Can you share examples from ASP.NET where these would be used?

I don't know of anywhere we use a StringBuilder today in this way. In fact we wrote our own string struct based StringBuilder ( https://github.com/aspnet/Common/blob/530513107e37ab25d8a6182ce2cceb3ade369533/src/Microsoft.Extensions.Primitives/InplaceStringBuilder.cs).

I'm not following. Did you mean CopyTo{Async} on TextReader that took a TextWriter? And that StringBuilder should have been a TextReader(which would be counterintuitive)?

StringBuilder should be a TextWriter and there should be an AsTextReader() method to get at the internal buffers without reallocation to consume them. What I'm really trying to get at is the fact that we should have a push based API handing buffers up stack without allocating. We already do this with Stream so lets do it in more places.

IMO this really isn't a discussion about StringBuilder per se but it's an example of what sorts of APIs we could expose in the BCL to reduce allocations and potentially copies in general. It's a push vs pull model and all the tradeoffs @vancem mentions apply.

I'd also like to see us focus a bit more on abstractions rather than just fixing individual APIs.

davidfowl on 20 Jun 2017

👍1

What I'm really trying to get at is the fact that we should have a push based API handing buffers up stack without allocating.

Sure. I was just asking about your specific comment on how things should have been. From an abstraction perspective, I don't see how the base TextWriter implementation could provide a meaningful CopyTo{Async} or AsTextReader implementation. I'm thinking about it in terms of what we could add now rather than what could have been defined 15 years ago.

In any event, I think we are talking about abstractions and individual APIs; this issue just happens to be more focused on the latter. I agree with you that in general we need both discussions.

stephentoub on 20 Jun 2017

I agree that there is a larger discussion about what text writing abstraction we should be using, and I agree that TextWriter is the abstraction that people would prefer to use, not StrinBuilder.Append.

We should be thinking in these bigger terms (in terms of new services using ASP.NET 5).

That being said, I do think there is room for making 'easy' tactical improvements to StringBuilder for those who for whatever reason are tied to the current API.

vancem on 20 Jun 2017

👍1

There could be low level, very fast, with no dynamic dispatch API
and a more traditional one - with objects, abstractions etc.

omariom on 20 Jun 2017

API review: This should come together with larger Span-ification of BCL (@stephentoub drives that).
It does not make sense to approve the APIs one by one. We would likely end up also with inconsistency.

karelz on 20 Jun 2017

I am OK with considering both together. However please note that typically the 'spanification' of the BCL is just providing Span overloads for things that used to take arrays. This is different in the sense that we the use of Span was incidental, our goal was to expose the chunks of data inside the StringBuilder and Span just happened to be a way of doing that.

vancem on 20 Jun 2017

I mention StringBuilder in https://github.com/dotnet/corefx/issues/21281 and the few core methods I think we should add to it, but I agree with Vance that this one is unique enough that it's worth discussing as a separate entity.

stephentoub on 21 Jun 2017

Can you share examples from ASP.NET where these would be used?

@davidfowl @stephentoub

We definitely had a perf issue at one point in the view engine copying character-by-character out of a StringBuilder. I believe we optimized it by writing something custom 😲

This isn't the original issue, but another place we hit it. https://github.com/aspnet/Mvc/commit/43071319aa0c347a1d2dfcc8f14042807ddebe51#diff-d04c49a8be708ed4bfb0d8231f71d4e8L199

Exposing the spans of the underlying buffers would be super useful 👍

Bonus points if we can use a memory pool to create a string builder.

Regarding StringBuilder + TextWriter - StringWriter already does it's job. I don't need or want any change to these types other than getting access to the buffers.

rynowak on 21 Jun 2017

@vancem is this something that you're still considering for the 2.1 milestone?

joshfree on 26 Jan 2018

I have changed the milestone to Future.

vancem on 26 Jan 2018

@stephentoub you had a preference for callback, how do you expect the callback to look given that we can't use the original form as Span can't be used as a type argument?

StringBuilder would need to detect and prevent usage while the callback/enumerables was still outstanding

Given it's not threadsafe, is that strictly necessary, or more a "debugging helper" like the version number on our collection enumerables is?

danmosemsft on 21 Feb 2018

how do you expect the callback to look given that we can't use the original form as Span can't be used as a type argument?

Taking a ReadOnlySpanAction instead of an Action:
https://github.com/dotnet/coreclr/blob/master/src/mscorlib/shared/System/Action.cs#L37
or something like that.

I also don't have a problem with an enumeration scheme if someone comes up with a good design for one.

stephentoub on 21 Feb 2018

BTW. I wonder if restrictions on type arguments could be relaxed for delegates. We know that delegates do not store their arguments/results to the heap.
Perhaps ref structs would be ok with Funcs/Actions?

VSadov on 21 Feb 2018

It likely needs more thinking to conclude there are no safety holes around things like Func<Span<int>>, but at first glance it feels like it could be safe.

VSadov on 21 Feb 2018

👍2

I have not been pushing this forward like I should.

I have created a design for the IEnumerable pattern. Note that this is not a true IEnumerable (since it returns a Span, which can't be a T in a generic type). However it does follow the same pattern (and thus foreach works).

partial class StringBuilder { 
        public StringBuilderEnumerator GetEnumerator() => new StringBuilderEnumerator(this);

        // This has to be public since we can't use the IEnumerable<T> interface for it.   
        public struct StringBuilderEnumerator
        {
            internal StringBuilderEnumerator(StringBuilder stringBuilder)
            {
                _firstChunk = stringBuilder;
                _currentChunk = null;
                _manyChunks = null;

                // When we have a modest number of chunks avoid allocation and simply 
                // computing the chunk as you go.  For stringbuilders with many chunks
                // allocate an array to avoid N*N behavior.  
                int chunkCount = ChunkCount(stringBuilder);
                if (8 < chunkCount)
                    _manyChunks = new ManyChunkInfo(stringBuilder, chunkCount);
            }

            public bool MoveNext()
            {
                if (_currentChunk == _firstChunk)
                    return false;

                if (_manyChunks != null)
                    return _manyChunks.MoveNext(ref _currentChunk);

                StringBuilder next = _firstChunk;
                while (next.m_ChunkPrevious != _currentChunk)
                    next = next.m_ChunkPrevious;
                _currentChunk = next;
                return true;
            }

            public ReadOnlySpan<char> Current => new Span<char>(_currentChunk.m_ChunkChars, 0, _currentChunk.m_ChunkLength);

            #region private

            private static int ChunkCount(StringBuilder stringBuilder)
            {
                int ret = 0;
                while(stringBuilder != null)
                {
                    ret++;
                    stringBuilder = stringBuilder.m_ChunkPrevious;
                }
                return ret;
            }

            /// <summary>
            /// Used to hold all the chunks indexes when you have many chunks. 
            /// </summary>
            private class ManyChunkInfo
            {
                public bool MoveNext(ref StringBuilder current)
                {
                    int pos = ++_chunkPos;
                    if (_chunks.Length <= pos)
                        return false; 
                    current = _chunks[pos];
                    return true;
                }

                public ManyChunkInfo(StringBuilder stringBuilder, int chunkCount)
                {
                    _chunks = new StringBuilder[chunkCount];
                    while(0 < --chunkCount )
                    {
                        _chunks[chunkCount] = stringBuilder;
                        stringBuilder = stringBuilder.m_ChunkPrevious;
                    }
                    _chunkPos = -1;
                }
                StringBuilder[] _chunks;    // These are in normal order (first chunk first) 
                int _chunkPos;
            }

            StringBuilder _firstChunk;
            StringBuilder _currentChunk;
            ManyChunkInfo _manyChunks;        // Only used for long string builders with many chunks
            #endregion
        }
}

It is all pretty standard, a GenEnumerator() which returns a object with a MoveNext and Current. Thus the C# 'foreach' pattern works.

       void Test(StringBuilder sb, Action<char> action) {
            foreach (ReadOnlySpan<char> chunk in sb)
            {
                for (int i = 0; i < chunk.Length; i++)
                    action(chunk[i]);     // Do something with each character
            }
     }

The implementation is pretty straightforward except that you have to reverse the chain of chunks. We don't do this for small number of chunks (in the code above 8) but after that we allocate an array and iterate over that.

Ultimately I also prefer the enumerator model over the callback model, and while the implementation is more complex, it is not that bad (it is conceptually reasonably simple).

If there are no objections, I will submit it for design review. Note that I am currently NOT expecting this for 2.1 (no one really needs it desperately).

Comments?

vancem on 21 Feb 2018

Oh, to answer @stephentoub s question about semantics of changing the StringBuilder while you are enumerating it, the answer (like all enumerators) is that is illegal (undefined). The code above does not check for this. I am reluctant to add checking logic because this API is only meant for advanced users. Same goes for thread safety (behavior is undefined if it is sequential access is not maintained).

vancem on 21 Feb 2018

@vancem I like it. One thing I would change is to have a method that lets you enumerate the chunks, e.g. foreach (var chunk in sb.EnumerateChunks(), instead of using the StringBuilder in a foreach directly. The reasons for that are:

It's an API for advanced users, so it should be a little bit hidden.
I think that most users are not going to expect that iterating a StringBuilder will give you some weird internal structure.
I think that logically, StringBuilder is not a collection of chunks, so it should not behave like one.

svick on 22 Feb 2018

+1 for the EnumerateChunks() or something like that.

I was a bit surprised that StringBuilder does not implement foreach already. Perhaps one day we will want that and expect the default iteration element to be char...

VSadov on 22 Feb 2018

stephentoub on 22 Feb 2018

👍2

I agree with both @svick and @stephentoub 's feedback. I will update the proposal.

vancem on 22 Feb 2018

StringBuilderEnumerator should probably be ref struct

omariom on 22 Feb 2018

👎1

StringBuilderEnumerator should probably be ref struct

That would be very limiting, e.g. you couldn't enumerate it in an iterator or async method. If we're exposing something like this, we should accept that it's advanced and you need to know what you're doing, as is the case any time a data structure's internals are exposed.

stephentoub on 22 Feb 2018

👍1

~~@stephentoub~~

How about using ReadOnlyMemory instead of ReadOnlySpan for each chunk?

Do I understand it correctly that the existing and proposed options to read the whole StringBuilder are:

| | Allocations per chunk | Notes |
|-|-|-|
| ToString() | roughly 8000 B | very convenient |
| ReadOnlyMemory<char>-based EnumerateChunks() | 40 B (32 for ROM + 8 for _chunks) | fairly convenient |
| ReadOnlySpan<char>-based EnumerateChunks() | 8 B (for _chunks) | fairly inconvenient |
| char indexer | 0 | terrible performance |

I'm assuming that the number of bytes allocated is the most important metric, because that decides how often the GC runs.

So, compared with ToString(), both ReadOnlyMemory<char> and ReadOnlySpan<char> vastly reduce allocations, so both sound good. At the same time, the version based on ReadOnlySpan<char> allocates 5 times less than the one based on ReadOnlyMemory<char>, which is a lot.

I haven't personally used Span enough to gauge how painful its limitations are (especially considering that trading some pain for performance is often worth it in performance critical code). So I'm asking: do you have use cases for this API where ReadOnlyMemory<char> would be better than ReadOnlySpan<char>? Without those, I think I would lean towards using ReadOnlySpan<char>.

svick on 22 Feb 2018

At the same time, the version based on ReadOnlySpan allocates 5 times less than the one based on ReadOnlyMemory, which is a lot.

@svick, why would there be any heap allocation difference between ROM and ROS? Nothing beyond the underlying chunks array would be allocated on the heap for either of them. In @vancem's pseudocode, Current is creating the ROS ref struct around the chunk on demand; the same thing would be done for the ROM struct. They're both value types.

So I'm asking: do you have use cases for this API where ReadOnlyMemory would be better than ReadOnlySpan?

Yes, e.g.
C# foreach (ReadOnlyMemory<char> chunk in sb.EnumerateChunks()) { await textWriter.WriteAsync(chunk); }
You can't do that with span.

stephentoub on 22 Feb 2018

👍1

@stephentoub I'm sorry, I am very confused. I assumed that ReadOnlyMemory<T> is a class, but it's not. (I think what confused me is that in earlier versions, using ReadOnlyMemory<T> always required an allocation of OwnedMemory<T>.) So, disregard what I said.

svick on 22 Feb 2018

@svick, no problem :smile:

stephentoub on 22 Feb 2018

Here is the revised code (not tested, but compiles)

class String Builder {
        public ChunkEnumerable EnumerateChunks() => new ChunkEnumerable(this);

        public struct ChunkEnumerable : IEnumerable<ReadOnlyMemory<char>>
        {
            internal ChunkEnumerable(StringBuilder stringBuilder) { _stringBuilder = stringBuilder; }
            public ChunkEnumerator GetEnumerator() => new ChunkEnumerator(this._stringBuilder);
            IEnumerator<ReadOnlyMemory<char>> IEnumerable<ReadOnlyMemory<char>>.GetEnumerator() => GetEnumerator();
            IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
            private StringBuilder _stringBuilder;
        }

        public struct ChunkEnumerator : IEnumerator<ReadOnlyMemory<char>>
        {
            public bool MoveNext()
            {
                if (_currentChunk == _firstChunk)
                    return false;

                if (_manyChunks != null)
                    return _manyChunks.MoveNext(ref _currentChunk);

                StringBuilder next = _firstChunk;
                while (next.m_ChunkPrevious != _currentChunk)
                    next = next.m_ChunkPrevious;
                _currentChunk = next;
                return true;
            }
            public ReadOnlyMemory<char> Current => new ReadOnlyMemory<char>(_currentChunk.m_ChunkChars, 0, _currentChunk.m_ChunkLength);
            object IEnumerator.Current => Current;
            public void Dispose() { }
            public void Reset() => throw new NotImplementedException();

            #region private
            internal ChunkEnumerator(StringBuilder stringBuilder)
            {
                _firstChunk = stringBuilder;
                _currentChunk = null;
                _manyChunks = null;

                // When we have a modest number of chunks avoid allocation and simply 
                // computing the chunk as you go.  For stringbuilders with many chunks
                // allocate an array to avoid N*N behavior.  
                int chunkCount = ChunkCount(stringBuilder);
                if (8 < chunkCount)
                    _manyChunks = new ManyChunkInfo(stringBuilder, chunkCount);
            }

            private static int ChunkCount(StringBuilder stringBuilder)
            {
                int ret = 0;
                while (stringBuilder != null)
                {
                    ret++;
                    stringBuilder = stringBuilder.m_ChunkPrevious;
                }
                return ret;
            }

            /// <summary>
            /// Used to hold all the chunks indexes when you have many chunks. 
            /// </summary>
            private class ManyChunkInfo
            {
                public bool MoveNext(ref StringBuilder current)
                {
                    int pos = ++_chunkPos;
                    if (_chunks.Length <= pos)
                        return false;
                    current = _chunks[pos];
                    return true;
                }

                public ManyChunkInfo(StringBuilder stringBuilder, int chunkCount)
                {
                    _chunks = new StringBuilder[chunkCount];
                    while (0 < --chunkCount)
                    {
                        _chunks[chunkCount] = stringBuilder;
                        stringBuilder = stringBuilder.m_ChunkPrevious;
                    }
                    _chunkPos = -1;
                }
                StringBuilder[] _chunks;    // These are in normal order (first chunk first) 
                int _chunkPos;
            }

            StringBuilder _firstChunk;
            StringBuilder _currentChunk;
            ManyChunkInfo _manyChunks;        // Only used for long string builders with many chunks
            #endregion
        }
}

You now use it like this

        static void Test(StringBuilder sb, Action<char> action)
        {
            foreach (ReadOnlyMemory<char> chunk in sb.EnumerateChunks())
            {
                var span = chunk.Span;
                for (int i = 0; i < span.Length; i++)
                    action(span[i]);
            }
        }

The code above should execute without object allocation if the StringBuilder is of reasonable size (< 64K characters), otherwise there will be one allocation (for an array). We can tune the exact cutoff.

It is not clear there is alot of value in actually being an IEnumerable>. I just don't see this being passed as a 'generic' enumerable. Thus I think I would recommend the following stripped down version which removes stuff we did just to make it an IEnumerable. (we can add back in IEnumerable in the future if we wish).

        public ChunkEnumerable EnumerateChunks() => new ChunkEnumerable(this);

        public struct ChunkEnumerable
        {
            internal ChunkEnumerable(StringBuilder stringBuilder) { _stringBuilder = stringBuilder; }
            public ChunkEnumerator GetEnumerator() => new ChunkEnumerator(this._stringBuilder);
            private StringBuilder _stringBuilder;
        }

        public struct ChunkEnumerator
        {
            public bool MoveNext()
            {
                 // MoveNext body unchanged ...
            }
            public ReadOnlyMemory<char> Current => new ReadOnlyMemory<char>(_currentChunk.m_ChunkChars, 0, _currentChunk.m_ChunkLength);
            #region private
            // private region unchanged ...
            #endregion 
        }

Comments? I will certainly do some perf measurments, but the idea is to make this as small and lean as possible.

vancem on 22 Feb 2018

Please see https://github.com/dotnet/coreclr/pull/17530 for concrete proposal.

vancem on 12 Apr 2018

Closing this issue, It is being addressed by issue dotnet/corefx#29770

vancem on 17 May 2018

Was this page helpful?

0 / 5 - 0 ratings