grpc-go 🚀 - transport: reuse memory on read path

Specific areas where we would like to do this:

Codec output buffer (requires codec API extension)
Reading from transport (I believe this is #869)

dfawley on 22 Aug 2017

Is someone actively working on this? I've been playing around with different buffer pool strategies, but don't want to look too much more if I'm just duplicating work.

I've been trying bucketed []byte buffer pool where there are tiers of buffer sizes. For example, if you ask for a 700 byte buffer, it might pull a buffer from the 1024 byte pool, slice it down to 700 bytes and return it.

I'm trying a *sync.Pool implementation (one *sync.Pool per buffer size bucket) and a leaky buffer implementation (one buffered chan []byte per buffer size bucket). They certainly help, but neither one has plucked all the low-hanging garbage fruit on the first try.

muirdm on 2 Dec 2017

I would prefer that the codec object itself is in a sync pool and is
assumed to be unsafe to use concurrently. This way the codec itself can
optimize its buffer use. This codec that I would like to use with grpc is
the one that I have implemented in github.com/gogo/protobuf/codec

On Sat, 2 Dec 2017, 01:18 Muir Manders, notifications@github.com wrote:

Is someone actively working on this? I've been playing around with
different buffer pool strategies, but don't want to look too much more if
I'm just duplicating work.

I've been trying bucketed []byte buffer pool where there are tiers of
buffer sizes. For example, if you ask for a 700 byte buffer, it might pull
a buffer from the 1024 byte pool, slice it down to 700 bytes and return it.

I'm trying a *sync.Pool implementation (one *sync.Pool per buffer size
bucket) and a leaky buffer implementation (one buffered chan []byte per
buffer size bucket). They certainly help, but neither one has plucked all
the low-hanging garbage fruit on the first try.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1455#issuecomment-348649013, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABvsLVtUmZ3k_U6PNTHhMQgt9P0qZ57qks5s8JddgaJpZM4O_JA3
.

awalterschulze on 2 Dec 2017

Here is the correct link
https://github.com/gogo/protobuf/blob/master/codec/codec.go

On Sat, 2 Dec 2017, 10:42 Walter Schulze, awalterschulze@gmail.com wrote:

I would prefer that the codec object itself is in a sync pool and is
assumed to be unsafe to use concurrently. This way the codec itself can
optimize its buffer use. This codec that I would like to use with grpc is
the one that I have implemented in github.com/gogo/protobuf/codec

On Sat, 2 Dec 2017, 01:18 Muir Manders, notifications@github.com wrote:

Is someone actively working on this? I've been playing around with
different buffer pool strategies, but don't want to look too much more if
I'm just duplicating work.

I've been trying bucketed []byte buffer pool where there are tiers of
buffer sizes. For example, if you ask for a 700 byte buffer, it might pull
a buffer from the 1024 byte pool, slice it down to 700 bytes and return it.

I'm trying a *sync.Pool implementation (one *sync.Pool per buffer size
bucket) and a leaky buffer implementation (one buffered chan []byte per
buffer size bucket). They certainly help, but neither one has plucked all
the low-hanging garbage fruit on the first try.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1455#issuecomment-348649013, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABvsLVtUmZ3k_U6PNTHhMQgt9P0qZ57qks5s8JddgaJpZM4O_JA3
.

awalterschulze on 2 Dec 2017

There is notable garbage produced outside of proto marshaling in the http2 transport, so I was hoping to address that too. I was looking at more generic []byte buffer reuse since it could be used by the existing codec, and used by the http2 transport.

How is pooling the codec better than just pooling the []byte slice? The existing codec could pull its []byte from []byte pool, and also implement similar gogoproto optimized logic taking advantage of Size() and MarshalTo() as appropriate.

muirdm on 2 Dec 2017

Thats interesting. I didn't think that the http2 transport could also benefit from buffer pooling. At the previous company I worked, we had a similar buffer pooling scheme for a different purpose. The API asks for a buffer size and returns that size of byte buffer for you. It does this by looking through its slice of buffers for a big enough buffer and then returning the resized to your smaller size. When buffers are returned they are stretched back to their original capacity.
This buffer pool also had retention policies, which were a bit more complex, but implemented orthogonally with the buffer pool.

awalterschulze on 2 Dec 2017

My thinking for the codec was to add an optional method to it that, if implemented, would be called when the buffer it returns from Marshal is no longer needed by grpc. Then the codec could re-use that memory. Without something like this, it's impossible for the codec to safely reuse old memory buffers.

The http2 transport uses byte buffers to hold data it reads from the network; this can and should be re-used with a pool as well.

dfawley on 2 Dec 2017

My thought was instead of returning the buffer to the codec, return the buffer to the generic buffer pool instead. Then anyone can pull a buffer from the pool, and consumers of the buffers can put them back in the pool when they are done. It is safe to never return a buffer, or even to return a buffer that you did not get from the pool originally. On the other hand it is very unsafe to return a buffer and then still use it, or a slice of it. We would need some approach to make ourselves confident that use-after-release doesn't happen.

muirdm on 2 Dec 2017

What are other languages doing to optimize this situation? Maybe we can
learn something from c++ or even Java.

On Sat, 2 Dec 2017, 23:26 Muir Manders, notifications@github.com wrote:

My thought was instead of returning the buffer to the codec, return the
buffer to the generic buffer pool instead. Then anyone can pull a buffer
from the pool, and consumers of the buffers can put them back in the pool
when they are done. It is safe to never return a buffer, or even to return
a buffer that you did not get from the pool originally. On the other hand
it is very unsafe to return a buffer and then still use it, or a slice of
it. We would need some approach to make ourselves confident that
use-after-release doesn't happen.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1455#issuecomment-348724659, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABvsLa5exKkGKa5OEtS5Rb-3p0UBGuf_ks5s8c6GgaJpZM4O_JA3
.

awalterschulze on 3 Dec 2017

👍2 😄1

The lowest hanging fruit, IMHO, is to reuse the memory buffer allocated to copy bytes from data frame, as @dfawley suggested.
https://github.com/grpc/grpc-go/blob/cd563b81ec334abe62cda34a33b8640ea67ef5ba/transport/http2_client.go#L869

This buffer can be recycled when the application has read it and copied to its preallocated chunck to be used for unmarshling.
https://github.com/grpc/grpc-go/blob/cd563b81ec334abe62cda34a33b8640ea67ef5ba/transport/transport.go#L143

We can have a few sync.Pool with different sizes increasing in say, powers of 4, and capped to the max size of http2MaxFrameLen. Also it'd be great if the API is such that the base(4 in this example) is configurable so we can try different values with our benchmarks.

One problem with sync.Pool is that it drops all the memory during every GC cycle. In following efforts we can work with the Go team to resolve that. Using a leaky buffer comes with the onus of doing our own GC cycles for that memory. I don't think that's a good idea.

Things that C++ and Java do don't necessarily apply to Go.
C-core takes a user configurable parameter to cap the max amount of memory to be used by the binary. They start releasing memory as pressure builds up. So, it's essentially a library specific GC.

Java, IIRC, allocates memory from the OS space and keeps reusing that without ever freeing it again. This memory is not freed by the GC either because it's outside of JVM's jurisdiction.

I personally think, in gRPC-Go, we shouldn't go about allocating a big a chunk and doing our memory management on that. We should at least give sync.Pool a chance and see if that can be optimized to fit our use case.

MakMukhi on 4 Dec 2017

Fair enough. It would be great to compare sync.Pool with classic buffer reuse.

awalterschulze on 4 Dec 2017

sync.Pool is not enough as follow issue say, https://github.com/golang/go/issues/23199 . Unbound capacity buffer will cause worse GC or memory overhead . Maybe chunked capacity pool/buffer would be better

Zeymo on 21 Dec 2017

@Zeymo Thanks for the suggestions. We are aware of sync.Pool's behavior to not return the most fitting buffer. In the suggestion above I mention using buckets of sync.Pool with different sizes.
However, as an exercise it may be interesting to know how would a single sync.Pool perform for the transport layer since we know no single buffer can be greater than http2MaxFrameLen which is 16K for now.

MakMukhi on 21 Dec 2017

@MakMukhi

I've got another idea to reuse the buffer for stream.

Add a buf field in parser

type parser struct {
    r io.Reader
    header [5]byte
        buf []byte
}

Make the buf if the length is greater than buf length

if length > cap(p.buf) {
        p.buf = make([]byte, int(length))
}
msg = p.buf[:length]

What do you think?

coocood on 23 Dec 2017

@coocood

For your second code snippet, I think you'd actually want to compare against the buffer's capacity rather than length. You can actually slice up to the capacity of a buffer, not just the length. The gob package does something very similar:

https://github.com/golang/go/blob/f3f507b2d86bfb38c7e466a465d5b6463cfd4184/src/encoding/gob/decode.go#L64

That way, if you need buffers of size X, X-1, and then X, you aren't allocating twice.

havoc-io on 24 Dec 2017

@havoc-io
snippet updated, thanks.

coocood on 24 Dec 2017

@coocood This actually seems like a good idea to reuse buffer on the gRPC layer (perhaps a variation of this might work on the transport layer as well). It's simple, doesn't require global shared pools. However, the only thing that worries me is the case when we have long lived streams with intermittent reading/writing activity of large messages (say >=1MB). In such a case, this stream will carry a large buffer for its life. Now if there were thousands of such streams we're in serious trouble. Perhaps, this problem can be solved by freeing this buffer every once in a while from the stream monitor loop here:
https://github.com/grpc/grpc-go/blob/7f2472bbc6ac269af51a9c47ae8ef8c625383e1f/stream.go#L283

MakMukhi on 27 Dec 2017

@MakMukhi
How about just adding a CallOption to disable the buffer reuse for the extreme use case?
Freeing a buffer by another goroutine adds a lot of complexity.

coocood on 28 Dec 2017

@coocood I like that idea too. You're right accessing the buffer from two different goroutines would require use of locks which is an additional performance overhead. CallOption seems like a good way to go about it.
I'm not actively working on memory reuse currently, though it's high priority on my list of things to do. If you want to take a stab at it, go ahead. Otherwise, hopefully, I'll get to it Q1 2018.
Also, thanks for these great ideas. :)

MakMukhi on 28 Dec 2017

@MakMukhi
Great!
I'll send a PR in a few days.

coocood on 28 Dec 2017

What about pooled transport.Options and transport.CallHdr ?

Zeymo on 5 Jan 2018

@MakMukhi
any update on the reworked stream flow to reduce buffer use as mentioned here? we've noticed an issue where bombarding a stream endpoint with small messages causes lots of memory to be allocated on recvBuffer. backlog by HandleStreams(). since our handler function cannot pull messages off the backlog quickly enough, memory usage balloons until the server is killed.

it'd be nice if the buffer were at least some fixed length so that we could block the client's sends until there was room in the buffer. as it stands now, the server doesn't have any way of throttling the client and we have to rely on the client to send at a manageable rate.

scotthew1 on 7 Mar 2018

Hey, @scotthew1
Sorry about the late response. I have written a design doc for this optimization but I haven't started its implementation yet because we are working on a major refactoring of our transport.
This change will follow that refactor and I anticipate about 4 more weeks for this change to be merged.

MakMukhi on 19 Mar 2018

no problem @MakMukhi, thanks for the response. i look forward to the change!

scotthew1 on 20 Mar 2018

Following the aforementioned design, I have created #1987. Once this is merged, we will experiment with reusing the buffer.

MakMukhi on 12 Apr 2018

@MakMukhi It looks like #1987 was reverted by #2049. Are there plans for a different strategy?

havoc-io on 8 May 2018

@havoc-io Yes soon after we finished the implementation we identified potential security issues with that approach.
Currently, we have following ideas to get rid of extraneous memory and our efforts are around identifying which one is most feasible in short term.

Update the framer library to not reuse a frame's buffer until told to.
Update proto deserialization to be able to work on multiple buffers.
Use our own framing implementation to be able to use scatter/gather on underlying net.conn

MakMukhi on 8 May 2018

👍4

I've opened an issue on golang protobuf repo to provide an API to deserialize from multiple buffers proto.UnmarshalV, which is point 2 in the previous post.
I'll be working on it further to hopefully provide an implementation for it.

MakMukhi on 17 May 2018

Just as an input for discussion, this is the allocation framegraph for an API gateway in our system, as you can see almost half of the allocations are due to gRPC opening connections to the backends:

GC is using around ~10% of overall CPU time, so if gRPC caused sigificantly less allocations it would likely decrease overall CPU usage by ~4%.

CAFxX on 26 Apr 2019

@dfawley
Can we reopen my old PR https://github.com/grpc/grpc-go/pull/1774

coocood on 26 Apr 2019

@CAFxX - This is very interesting. We haven't yet considered allocations during connection creation, since that is not typically considered a hot path. Do you understand why your application is (seemingly?) creating new connections so frequently, and not sending many RPCs on them?

@coocood - It's possible. The design that @MakMukhi referenced when closing that PR was implemented but ended up not being acceptable (pre-allocating memory like this is a DoS risk).

@canguler - can you take a look at #1774 and see if you think it's still applicable / relevant after the memory reuse changes you're working on now?

dfawley on 26 Apr 2019

@dfawley the flamegraph is from an API gateway in front of ~100 microservices (and growing). I will dig a bit more but I suspect it should be something related to having a big number of backends (~300).

CAFxX on 27 Apr 2019

@dfawley I went through the code again and I can't really find anything that looks out of the ordinary: we create a ClientConn per group of backends at startup and we use google.golang.org/grpc/balancer/roundrobin to load balance between backends within a group; the ClientConns are long-lived and are closed only when the process is shutting down.

If you have any suggestions about additional potential causes I can triple-check.

CAFxX on 28 Apr 2019

Just a note for @coocood: in https://github.com/grpc/grpc-go/pull/1774#issuecomment-354602167 you wrote:

I think it's very unlikely a client keeps long-lived connections to thousands of servers.

While I agree it's rare, this is exactly the case here (the API gateway the flamegraph is from is a "client" for a few hundred backends, and it keeps long-lived connections to them)

CAFxX on 28 Apr 2019

@dfawley - Change at #1774 is still applicable. If we get an unmarshalv implementation for proto, then it will no longer be applicable.

canguler on 30 Apr 2019

I am experimenting with the change at #1774. Even if we were to get an unmarshalv implementation in near future, I think this change (#1774) is small enough. So, if it gives an improvement without any regressions, we may still go ahead and take it.

canguler on 30 Apr 2019

@canguler @dfawley

I think UnmarshalV is not an ideal solution, the biggest problem is that the generated Unmarshal code cannot be used. It will be much slower for complex proto. If we generate UnmarshalV code for each type, that is too expensive for the existing projects.

coocood on 1 May 2019

@dfawley do you need me looking into anything else? I'm not 100% sure though this issue is the best place to discuss the problem we're seeing as, although related, as you mentioned they're likely different issues.

CAFxX on 15 May 2019

@CAFxX - could you file a separate issue for the problems you're encountering, please?

In terms of additional information to gather, it would be interesting to know how often connections are being created in your setup. It seems like you have not only a large pool of connections, but new connections are also being created and destroyed frequently - is this the case? Also, I don't see any raw numbers in the flame graph - what is the total amount of allocations you are seeing (cumulative and per-minute/etc)?

dfawley on 15 May 2019

Narrowing the focus of this bug to a single item. Filed #2816 and #2817 to track other memory reuse areas.

dfawley on 15 May 2019

How about to use sync.Pool to reuse memory buffers which creates by recvMsg method of parser to read gRPC message from i/o layer. In my project this is about 5% of total memory allocations.
I've created pool request #3220 to do this with minimal changes of code.

alextomaili on 2 Dec 2019

@MakMukhi are there any plans to continue with this work ?

kevinconaway on 8 Jan 2021

@dfawley should be able to help with prioritization/resource allocation here.

MakMukhi on 8 Jan 2021

We don't have resources to focus on performance improvements at this time. This is unlikely to be prioritized for at least the next 3-6 months. If someone is interested in contributing, we would be willing to accept PRs if the author ensures that thorough performance benchmarking is done and shows clear wins.

dfawley on 8 Jan 2021

@dfawley Are you able to detail why https://github.com/grpc/grpc-go/pull/1987 was reverted? I saw a vague reference to security objections but it wasn't clear what they were and how a different approach might avoid them.

kevinconaway on 8 Jan 2021

If this is the one I'm thinking of, it allowed one side to force potentially large allocations on the other side without actually sending a similar amount of data. I.e. the sender declares it will send 1GB of data on a stream; the receiver makes the allocation; then the sender doesn't send anything further on that stream. This could very quickly OOM the peer (across many streams).

dfawley on 8 Jan 2021

we would be willing to accept PRs if the author ensures that thorough performance benchmarking is done and shows clear wins.

From reading this issue and related ones, it seems there is a tension between users who have mostly small message types and do not (or may not) use compression and those who have much larger message types. It sounds like the former group does not need, and may actively be harmed by, adding any type of pooling or memory reuse while the latter group would see a substantial benefit from it.

If that is accurate, it doesn't sound like there is a one size fits all solution here. What are your thoughts on being able to add a type via a DialOption or CallOption that allows plugging in a type to help manage the memory used in the recv workflow? Based on my limited analysis, it looks like there are two places where we could reuse memory, 1.) reading the message from the transport stream 2.) decompressing the data from 1. in to a separate buffer for codec unmarshaling.

Perhaps something like:

type RecvBuffers interface {
    GetReadBuffer(size int) *bytes.Buffer
    ReturnReadBuffer(b *bytes.Buffer)

    GetDecompressionBuffer(size int) *bytes.Buffer
    ReturnDecompressionBuffer(b *bytes.Buffer)
}

Implementers could decide to return a pooled buffer based on the requested size (or not). The default implementation could behave as the recv flow does now by allocating the requested memory each time.

Assuming this could be implemented in a way that doesn't harm the "out of the box" performance characteristics, this would allow users to control their own memory usage.

Thoughts?

kevinconaway on 11 Jan 2021

👍2

@dfawley Thoughts on the above? I'm happy to attempt this but I don't want to go off on the wrong path

kevinconaway on 21 Jan 2021

Grpc-go: transport: reuse memory on read path

Most helpful comment

All 48 comments

Related issues