Runtime: Proposal: using ArrayPool<byte> in MemoryStream with new APIs

Created on 22 Jun 2017  路  25Comments  路  Source: dotnet/runtime

When not constructed with a specific byte[], MemoryStream allocates byte[]s every time it needs to grow. It would be tempting to just change the implementation to use ArrayPool<byte>.Shared.Rent to get that array, but this is problematic for a few reasons, mostly to do with existing code:

  1. It's fairly common to not dispose MemoryStreams, as currently the Dispose is effectively a nop. But with wrapping buffers from ArrayPool<byte>, it's important to release the currently used buffer back to the pool when the stream is disposed. And it would be expensive to make MemoryStream finalizable to deal with this.
  2. MemoryStream by default allows the buffers it creates to be exposed through its TryGetBuffer method. That means folks today could be taking and using these buffers, potentially even after the MemoryStream is disposed, which could lead to code using the array after the array is returned to the pool and potentially used by someone else.

One solution would just be to introduce a new Stream type, e.g.
```C#
namespace System.IO
{
public sealed class ArrayPoolStream : Stream
{
public ArrayPoolStream() : this(ArrayPool.Shared) { }
public ArrayPoolStream(ArrayPool pool);
~ArrayPoolStream();

    public ArraySegment<byte> DangerousGetArray();

    ... // override all the relevant members on Stream
}

}
```

That has its own downsides, though. In particular, there's something nice about this just being a part of MemoryStream, being easily used in existing code that uses MemoryStreams, etc. Another option would be to just plumb this into MemoryStream, but in an opt-in manner, and accept some of the resulting limitations because they're opt-in:

  • Add a ctor: public MemoryStream(ArrayPool<byte> pool).
  • When that ctor is used, buffers come from the pool rather than being new'd up.
  • If you don't Dispose, you leak the buffer, but since this only happens when you use the new ctor, we're at least not impacting existing code, and it's little different than just using the pool directly and not returning a buffer.
  • We either make TryGetBuffer return false, or as with Dispose we accept that you had to opt-in to this by using the new ctor.

I'm leaning towards the second option: just add the new MemoryStream(ArrayPool<byte>) ctor, and allow TryGetBuffer to work; devs just need to know that when they create the stream with this ctor, they should dispose the stream, and they shouldn't use the array retrieved from TryGetBuffer after doing something else to the stream.

api-needs-work area-System.IO

Most helpful comment

@Timovzl Are you saying you Dispose() all Tasks in your async code?

@svick I uh... always use ValueTask! :P Just kidding, you had me there.

Still, I maintain my position, exceptions notwithstanding. As this very issue demonstrates, allowing our class documentation to say, "hey, you may use _this_ class without dispose, just for laziness", puts us in this awkward position when the implementation can no longer support that liberty.

Even if disposing is not necessary, why share this in the documentation at all? Why let people skip dispose on _some_ Stream types? Just have them disposed as usual, even where it was not strictly necessary.

And then we can keep the exceptions minimal, such as for Task. The fewer exceptions, the less to remember. The fewer exceptions, the fewer questions raised by code reviewers, such as when they spot undisposed Streams. And our implementations are free to change.

All 25 comments

It's fairly common to not dispose MemoryStreams, as currently the Dispose is effectively a nop.

BTW, I have recently added a note to the documentation of MemoryStream that says that it doesn't have to be disposed. So changing that now would be breaking a promise the documentation makes. (And if it's not an appropriate promise, the documentation should be changed back.)

Would it be less dangerous to return ReadOnlySpan in another overload of TryGetBuffer ?

Would it be less dangerous to return ReadOnlySpan in another overload of TryGetBuffer ?

A little maybe, but many of the same problems apply, e.g.
```C#
stream.TryGetBuffer(out ArraySegment buffer);
stream.Write(...);
Use(buffer);

```C#
stream.TryGetBuffer(out ReadOnlySpan<byte> buffer);
stream.Write(...);
Use(buffer);

In both cases, the stream may have replaced the array referenced by buffer, in which case buffer is now pointing to something old.

Plus, many of the cases for which TryGetBuffer really need access to the array.

We use MemoryStreamall the time, it would be great if bytes array come from ArrayPoolinstead of allocating . Please plan this feature for your next milestone.

@svick If a particular class that implements IDisposable does not _need_ to be disposed, I would always consider that an implementation detail, never to be part of the public documentation. Clients should honor the IDisposable interface, or ignore it at their (our) own peril. The implementation must be able to change, in my opinion.

How about having a MemoryStreamPool that provides a MemoryStream backed by a pooled array, and having very clear Return() or Dispose() requirements? The MemoryStreamPool.Shared instance could use arrays from ArrayPool<byte>.Shared, and constructed instances could create their own pool.

@Timovzl Are you saying you Dispose() all Tasks in your async code?

I think that in the cases where Dispose does not do anything, it's just unnecessary ceremony. But since it's hard to find out whether that's the case, and whether it will stay that way, I think documentation should make this clear.

Just adding another voice to this, given the widespread use of MemoryStream in our apps the ability to reduce (generally quite large) allocs would be great. I'm comfortable with an opt-in approach if required, although the 2.1 release showed how awesome it is when 'things just get faster'.

re: Disposing memory streams: There's so much tooling, helpers, VS Add-Ins, guidance etc that all falls to a pit of 'always dispose your disposables'. We treat any undisposed (especially I/O) class as a red flag in code reviews, and feel dirty when forced to hand such things off with no concrete confirmation of cleanup ( MVC's FileResult() is a great example there)

@Timovzl Are you saying you Dispose() all Tasks in your async code?

@svick I uh... always use ValueTask! :P Just kidding, you had me there.

Still, I maintain my position, exceptions notwithstanding. As this very issue demonstrates, allowing our class documentation to say, "hey, you may use _this_ class without dispose, just for laziness", puts us in this awkward position when the implementation can no longer support that liberty.

Even if disposing is not necessary, why share this in the documentation at all? Why let people skip dispose on _some_ Stream types? Just have them disposed as usual, even where it was not strictly necessary.

And then we can keep the exceptions minimal, such as for Task. The fewer exceptions, the less to remember. The fewer exceptions, the fewer questions raised by code reviewers, such as when they spot undisposed Streams. And our implementations are free to change.

@markrendle , if you need MemoryStreamPool - you can just use
Microsoft.IO.RecyclableMemoryStream.

Looks like it has all that you want:

  • It uses chunks internally (which is nice to re-send huge files)
  • It re-uses byte arrays created

@imanushin RecyclableMemoryStream looks interesting. Note that its GetBuffer() method returns the actual (current) buffer. It matches the docs' current documentation and gives you the power to do things - to do things wrong, too. Of course, in part, this was already the case with MemoryStream, but now you can retain access to a buffer that will be returned to the pool and used by something else. I'm not too fond of this potential for misuse. Granted, the alternative isn't great...

It uses chunks internally (which is nice to re-send huge files)

Maybe something that could be considered here also.

It uses chunks internally (which is nice to re-send huge files)

Maybe something that could be considered here also.

Using chunks internally seems inconsistent with supporting GetBuffer().

I do think using chunks internally is interesting to consider, but I'm not sure how to square this with the GetBuffer() API; seems like we might need a separate type for this.

It gets a new larger contiguous buffer if necessary to present a single buffer with all content:

https://github.com/microsoft/Microsoft.IO.RecyclableMemoryStream/blob/master/src/RecyclableMemoryStream.cs#L54-L57

The only problem there is that the buffer might be from the pool, and now you have a second potential owner of said buffer. Would be safer to just treat GetBuffer() like ToArray() with a fresh non-pooled array.

Would be safer to just treat GetBuffer() like ToArray() with a fresh non-pooled array.

In that case, at the very least an array from the array pool could be returned. That leaves the user with the option of returning it, making the code mostly allocation-free. As for users unaware of this... renting an array from the array pool and not returning it has no downsides over just allocating a new array, right?

Good point, as long as the MemoryStream doesn't return it when Disposed in that case. That's the tricky part of it... the buffer can't be owned by both the MemoryStream and the consumer because it's fine if neither returns it to the pool but bad if one returns it and the other continues using it.

Good point, as long as the MemoryStream doesn't return it when Disposed in that case [...] because it's fine if neither returns it to the pool but bad if one returns it and the other continues using it.

Agreed. Once you call GetBuffer(), you become the owner of the array. Either you return it neatly with ArrayPool<byte>.Return() (best reuse) or you let it be garbage collected (nothing goes wrong, at least).

This does require some attention to the single-block case too. Do we implement a special case to relinquish ownership of the single block? That seems hard to keep track of. We might need to switch to "large buffers" as soon as GetBuffer() is invoked.

Any clever ideas? It would be so nice if the single-block case could stay copy-free...

Would be safer to just treat GetBuffer() like ToArray() with a fresh non-pooled array.

GetBuffer allows you to also change the data that is in the MemoryStream, which would not be the case if a fresh non-pooled array was returned.

Using chunks internally seems inconsistent with supporting GetBuffer().

  • When backed by a chunks, GetBuffer could be disallowed, similar as when MemoryStream is constructed with publiclyVisible to false using:
public MemoryStream (byte[] buffer, int index, int count, bool writable, bool publiclyVisible);
  • Or, the same parameter could be added to the new constructor:
public MemoryStream(ArrayPool<byte> pool, bool publiclyVisible)

When publiclyVisible is set to false, the implementation can use chunks.

  • Or, when someone calls GetBuffer, the implementation could automagically switch to non-chunked mode.

The more I've thought about this, the more I think it needs to be done separately from MemoryStream:

  • It's very common to not Dispose a MemoryStream; not doing so with pooled buffers will "leak" / drain the pool.
  • MemoryStream.ToArray() is explicitly usable after MemoryStream.Dispose(); that's called out in the docs and a non-trivial amount of code relies on that behavior. That breaks badly if Dispose returns the buffer to the pool, and there's no way for the MemoryStream to know whether ToArray will be called. Using a different constructor could change the behavior of ToArray to disallow that, but that breaks the MemoryStream contract.
  • Existing misuses of MemoryStream, such as concurrent accesses, could result in spooky-action-at-a-distance bugs, with arrays being returned to the pool while still being used, resulting in new very hard to diagnose failure modes. When we've introduced ArrayPool usage into other existing streams, we've generally only done so when there was either existing synchronization we could guard usage behind or where the overhead of adding such synchronization wasn't prohibitive, but I fear it would be in MemoryStream.
  • Changing the MemoryStream behavior from growing an array to maintaining a linked list of arrays has observable performance impact, both positive for some cases (avoiding copying for append-only writes) and negative for others (e.g. random access via seeking, consuming more buffers of varying sizes from the pool). It's hard to predict whether the impact will be net positive or negative across the vast usage of MemoryStream.

Changing the MemoryStream behavior from growing an array to maintaining a linked list of arrays has observable performance impact

Also the GetBuffer() issues mentioned above.

@tmds

GetBuffer allows you to also change the data that is in the MemoryStream, which would not be the case if a fresh non-pooled array was returned.

Actually, with the above discussion about this, the intent was that the returned array _becomes_ the stream's own buffer (at least until you write to the stream), so changing the data certainly works. Hopefully the discussion makes more sense with that in mind.

When backed by a chunks, GetBuffer could be disallowed [...]

I imagine a less ideal GetBuffer (that sometimes needs to do copying) is preferable to one that sometimes simply does not work. The caller doesn't necessarily know when it's using small vs. large buffers. Having the method throw _sometimes_, in a way that is nontransparent to the caller, would render the method a lot less usable.

Or, when someone calls GetBuffer, the implementation could automagically switch to non-chunked mode.

That would be the easiest. It is a bit disappointing that the simple case of a single chunk would then involve copying, so I'm still holding out hopes that we can come up with something more clever. But it might be acceptable.

The more I've thought about this, the more I think it needs to be done separately from MemoryStream:

I think a separate, more suitable API would certainly be worth pursuing. I'm wondering, when we wish to supply a low-allocation implementation, in how many cases we _must_ pass a MemoryStream. Can we practically always get away with passing any Stream? Or do we have example use cases where MemoryStream is a requirement?

If a MemoryStream tends to be the requirement, we should probably tackle both issues, but separately. If any Stream will do, then let's forget about MemoryStream and make something ideal.

I'm not fond of the idea of honoring API misuse. If you neglect to dispose an IDisposable object, invoke methods on a disposed object, or concurrently access an object that is not thread-safe... then there may be consequences - now, or when the implementation changes.

If a type unrelated to MemoryStream turns out to be the way to go anyway, then it's a plus that API misuse isn't punished. But should that consideration be a priority?

Admittedly, as for not disposing, the documentation says that's allowed, so that needs to be addressed. (Strange implementation detail to mention in the documentation, but I digress.) If we can make sure that _not_ disposing merely degenerates to the equivalent of using non-pooled streams (analog to not returning arrays rented from an array pool), then this is a non-issue, right? Potential gains, with no potential losses.

If we can make sure that not disposing merely degenerates to the equivalent of using non-pooled streams (analog to not returning arrays rented from an array pool), then this is a non-issue, right?

Arrays from the pool are more "valuable" than others, as they're much more likely to have been around for longer, be in higher generations, etc. There's also been discussion about using pinned/aligned arrays in the pool. So taking arrays from the pool and not returning them is generally worse on a system than allocating new ones, degrading other unrelated components using the pool.

I'm not fond of the idea of honoring API misuse. If you neglect to dispose an IDisposable object, invoke methods on a disposed object, or concurrently access an object that is not thread-safe... then there may be consequences - now, or when the implementation changes.

Not disposing a MemoryStream is not misuse, nor is using ToArray after disposal.

Concurrent use is misuse, but at the same time, MemoryStream is used by millions of libraries and applications; there is guaranteed to be misuse. And we can't afford to introduce changes into such a type that could introduce a myriad of difficult to diagnose race conditions in entirely unrelated parts of the app, e.g. misuse of a MemoryStream over here causes sensitive data to be leaked to an http response over there. We need to be cognizant of such issues for new APIs, but it's a much bigger deal for existing ones, especially ones as core s MemoryStream.

Arrays from the pool are more "valuable" than others, as they're much more likely to have been around for longer, be in higher generations, etc. There's also been discussion about using pinned/aligned arrays in the pool. So taking arrays from the pool and not returning them is generally worse on a system than allocating new ones, degrading other unrelated components using the pool.

I did not realize! That does limit our options... and support your proposition to not inherit from MemoryStream.

I'm still very curious if we even have examples where inheriting from MemoryStream poses an advantage. I don't seem to recall any APIs that take one as a parameter.

I don't seem to recall any APIs that take one as a parameter.

To my knowledge there aren't any public APIs in the core libraries strongly-typed to accept a MemoryStream. No doubt there are cases outside of the core libraries.

Was this page helpful?
0 / 5 - 0 ratings