Grpc-go: transport: reuse memory on read path

Created on 22 Aug 2017  Â·  48Comments  Â·  Source: grpc/grpc-go

One possible idea is to use a leaky buffer.

P2 Performance

Most helpful comment

@havoc-io Yes soon after we finished the implementation we identified potential security issues with that approach.
Currently, we have following ideas to get rid of extraneous memory and our efforts are around identifying which one is most feasible in short term.

  1. Update the framer library to not reuse a frame's buffer until told to.

  2. Update proto deserialization to be able to work on multiple buffers.

  3. Use our own framing implementation to be able to use scatter/gather on underlying net.conn

All 48 comments

Specific areas where we would like to do this:

  • Codec output buffer (requires codec API extension)
  • Reading from transport (I believe this is #869)

Is someone actively working on this? I've been playing around with different buffer pool strategies, but don't want to look too much more if I'm just duplicating work.

I've been trying bucketed []byte buffer pool where there are tiers of buffer sizes. For example, if you ask for a 700 byte buffer, it might pull a buffer from the 1024 byte pool, slice it down to 700 bytes and return it.

I'm trying a *sync.Pool implementation (one *sync.Pool per buffer size bucket) and a leaky buffer implementation (one buffered chan []byte per buffer size bucket). They certainly help, but neither one has plucked all the low-hanging garbage fruit on the first try.

I would prefer that the codec object itself is in a sync pool and is
assumed to be unsafe to use concurrently. This way the codec itself can
optimize its buffer use. This codec that I would like to use with grpc is
the one that I have implemented in github.com/gogo/protobuf/codec

On Sat, 2 Dec 2017, 01:18 Muir Manders, notifications@github.com wrote:

Is someone actively working on this? I've been playing around with
different buffer pool strategies, but don't want to look too much more if
I'm just duplicating work.

I've been trying bucketed []byte buffer pool where there are tiers of
buffer sizes. For example, if you ask for a 700 byte buffer, it might pull
a buffer from the 1024 byte pool, slice it down to 700 bytes and return it.

I'm trying a *sync.Pool implementation (one *sync.Pool per buffer size
bucket) and a leaky buffer implementation (one buffered chan []byte per
buffer size bucket). They certainly help, but neither one has plucked all
the low-hanging garbage fruit on the first try.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1455#issuecomment-348649013, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABvsLVtUmZ3k_U6PNTHhMQgt9P0qZ57qks5s8JddgaJpZM4O_JA3
.

Here is the correct link
https://github.com/gogo/protobuf/blob/master/codec/codec.go

On Sat, 2 Dec 2017, 10:42 Walter Schulze, awalterschulze@gmail.com wrote:

I would prefer that the codec object itself is in a sync pool and is
assumed to be unsafe to use concurrently. This way the codec itself can
optimize its buffer use. This codec that I would like to use with grpc is
the one that I have implemented in github.com/gogo/protobuf/codec

On Sat, 2 Dec 2017, 01:18 Muir Manders, notifications@github.com wrote:

Is someone actively working on this? I've been playing around with
different buffer pool strategies, but don't want to look too much more if
I'm just duplicating work.

I've been trying bucketed []byte buffer pool where there are tiers of
buffer sizes. For example, if you ask for a 700 byte buffer, it might pull
a buffer from the 1024 byte pool, slice it down to 700 bytes and return it.

I'm trying a *sync.Pool implementation (one *sync.Pool per buffer size
bucket) and a leaky buffer implementation (one buffered chan []byte per
buffer size bucket). They certainly help, but neither one has plucked all
the low-hanging garbage fruit on the first try.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1455#issuecomment-348649013, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABvsLVtUmZ3k_U6PNTHhMQgt9P0qZ57qks5s8JddgaJpZM4O_JA3
.

There is notable garbage produced outside of proto marshaling in the http2 transport, so I was hoping to address that too. I was looking at more generic []byte buffer reuse since it could be used by the existing codec, and used by the http2 transport.

How is pooling the codec better than just pooling the []byte slice? The existing codec could pull its []byte from []byte pool, and also implement similar gogoproto optimized logic taking advantage of Size() and MarshalTo() as appropriate.

Thats interesting. I didn't think that the http2 transport could also benefit from buffer pooling. At the previous company I worked, we had a similar buffer pooling scheme for a different purpose. The API asks for a buffer size and returns that size of byte buffer for you. It does this by looking through its slice of buffers for a big enough buffer and then returning the resized to your smaller size. When buffers are returned they are stretched back to their original capacity.
This buffer pool also had retention policies, which were a bit more complex, but implemented orthogonally with the buffer pool.

My thinking for the codec was to add an optional method to it that, if implemented, would be called when the buffer it returns from Marshal is no longer needed by grpc. Then the codec could re-use that memory. Without something like this, it's impossible for the codec to safely reuse old memory buffers.

The http2 transport uses byte buffers to hold data it reads from the network; this can and should be re-used with a pool as well.

My thought was instead of returning the buffer to the codec, return the buffer to the generic buffer pool instead. Then anyone can pull a buffer from the pool, and consumers of the buffers can put them back in the pool when they are done. It is safe to never return a buffer, or even to return a buffer that you did not get from the pool originally. On the other hand it is very unsafe to return a buffer and then still use it, or a slice of it. We would need some approach to make ourselves confident that use-after-release doesn't happen.

What are other languages doing to optimize this situation? Maybe we can
learn something from c++ or even Java.

On Sat, 2 Dec 2017, 23:26 Muir Manders, notifications@github.com wrote:

My thought was instead of returning the buffer to the codec, return the
buffer to the generic buffer pool instead. Then anyone can pull a buffer
from the pool, and consumers of the buffers can put them back in the pool
when they are done. It is safe to never return a buffer, or even to return
a buffer that you did not get from the pool originally. On the other hand
it is very unsafe to return a buffer and then still use it, or a slice of
it. We would need some approach to make ourselves confident that
use-after-release doesn't happen.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1455#issuecomment-348724659, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABvsLa5exKkGKa5OEtS5Rb-3p0UBGuf_ks5s8c6GgaJpZM4O_JA3
.

The lowest hanging fruit, IMHO, is to reuse the memory buffer allocated to copy bytes from data frame, as @dfawley suggested.
https://github.com/grpc/grpc-go/blob/cd563b81ec334abe62cda34a33b8640ea67ef5ba/transport/http2_client.go#L869

This buffer can be recycled when the application has read it and copied to its preallocated chunck to be used for unmarshling.
https://github.com/grpc/grpc-go/blob/cd563b81ec334abe62cda34a33b8640ea67ef5ba/transport/transport.go#L143

We can have a few sync.Pool with different sizes increasing in say, powers of 4, and capped to the max size of http2MaxFrameLen. Also it'd be great if the API is such that the base(4 in this example) is configurable so we can try different values with our benchmarks.

One problem with sync.Pool is that it drops all the memory during every GC cycle. In following efforts we can work with the Go team to resolve that. Using a leaky buffer comes with the onus of doing our own GC cycles for that memory. I don't think that's a good idea.

Things that C++ and Java do don't necessarily apply to Go.
C-core takes a user configurable parameter to cap the max amount of memory to be used by the binary. They start releasing memory as pressure builds up. So, it's essentially a library specific GC.

Java, IIRC, allocates memory from the OS space and keeps reusing that without ever freeing it again. This memory is not freed by the GC either because it's outside of JVM's jurisdiction.

I personally think, in gRPC-Go, we shouldn't go about allocating a big a chunk and doing our memory management on that. We should at least give sync.Pool a chance and see if that can be optimized to fit our use case.

Fair enough. It would be great to compare sync.Pool with classic buffer reuse.

sync.Pool is not enough as follow issue say, https://github.com/golang/go/issues/23199 . Unbound capacity buffer will cause worse GC or memory overhead . Maybe chunked capacity pool/buffer would be better

@Zeymo Thanks for the suggestions. We are aware of sync.Pool's behavior to not return the most fitting buffer. In the suggestion above I mention using buckets of sync.Pool with different sizes.
However, as an exercise it may be interesting to know how would a single sync.Pool perform for the transport layer since we know no single buffer can be greater than http2MaxFrameLen which is 16K for now.

@MakMukhi

I've got another idea to reuse the buffer for stream.

  1. Add a buf field in parser
type parser struct {
    r io.Reader
    header [5]byte
        buf []byte
}
  1. Make the buf if the length is greater than buf length
if length > cap(p.buf) {
        p.buf = make([]byte, int(length))
}
msg = p.buf[:length]

What do you think?

@coocood

For your second code snippet, I think you'd actually want to compare against the buffer's capacity rather than length. You can actually slice up to the capacity of a buffer, not just the length. The gob package does something very similar:

https://github.com/golang/go/blob/f3f507b2d86bfb38c7e466a465d5b6463cfd4184/src/encoding/gob/decode.go#L64

That way, if you need buffers of size X, X-1, and then X, you aren't allocating twice.

@havoc-io
snippet updated, thanks.

@coocood This actually seems like a good idea to reuse buffer on the gRPC layer (perhaps a variation of this might work on the transport layer as well). It's simple, doesn't require global shared pools. However, the only thing that worries me is the case when we have long lived streams with intermittent reading/writing activity of large messages (say >=1MB). In such a case, this stream will carry a large buffer for its life. Now if there were thousands of such streams we're in serious trouble. Perhaps, this problem can be solved by freeing this buffer every once in a while from the stream monitor loop here:
https://github.com/grpc/grpc-go/blob/7f2472bbc6ac269af51a9c47ae8ef8c625383e1f/stream.go#L283

@MakMukhi
How about just adding a CallOption to disable the buffer reuse for the extreme use case?
Freeing a buffer by another goroutine adds a lot of complexity.

@coocood I like that idea too. You're right accessing the buffer from two different goroutines would require use of locks which is an additional performance overhead. CallOption seems like a good way to go about it.
I'm not actively working on memory reuse currently, though it's high priority on my list of things to do. If you want to take a stab at it, go ahead. Otherwise, hopefully, I'll get to it Q1 2018.
Also, thanks for these great ideas. :)

@MakMukhi
Great!
I'll send a PR in a few days.

What about pooled transport.Options and transport.CallHdr ?

@MakMukhi
any update on the reworked stream flow to reduce buffer use as mentioned here? we've noticed an issue where bombarding a stream endpoint with small messages causes lots of memory to be allocated on recvBuffer. backlog by HandleStreams(). since our handler function cannot pull messages off the backlog quickly enough, memory usage balloons until the server is killed.

it'd be nice if the buffer were at least some fixed length so that we could block the client's sends until there was room in the buffer. as it stands now, the server doesn't have any way of throttling the client and we have to rely on the client to send at a manageable rate.

Hey, @scotthew1
Sorry about the late response. I have written a design doc for this optimization but I haven't started its implementation yet because we are working on a major refactoring of our transport.
This change will follow that refactor and I anticipate about 4 more weeks for this change to be merged.

no problem @MakMukhi, thanks for the response. i look forward to the change!

Following the aforementioned design, I have created #1987. Once this is merged, we will experiment with reusing the buffer.

@MakMukhi It looks like #1987 was reverted by #2049. Are there plans for a different strategy?

@havoc-io Yes soon after we finished the implementation we identified potential security issues with that approach.
Currently, we have following ideas to get rid of extraneous memory and our efforts are around identifying which one is most feasible in short term.

  1. Update the framer library to not reuse a frame's buffer until told to.

  2. Update proto deserialization to be able to work on multiple buffers.

  3. Use our own framing implementation to be able to use scatter/gather on underlying net.conn

I've opened an issue on golang protobuf repo to provide an API to deserialize from multiple buffers proto.UnmarshalV, which is point 2 in the previous post.
I'll be working on it further to hopefully provide an implementation for it.

Just as an input for discussion, this is the allocation framegraph for an API gateway in our system, as you can see almost half of the allocations are due to gRPC opening connections to the backends:

image

GC is using around ~10% of overall CPU time, so if gRPC caused sigificantly less allocations it would likely decrease overall CPU usage by ~4%.

@dfawley
Can we reopen my old PR https://github.com/grpc/grpc-go/pull/1774

@CAFxX - This is very interesting. We haven't yet considered allocations during connection creation, since that is not typically considered a hot path. Do you understand why your application is (seemingly?) creating new connections so frequently, and not sending many RPCs on them?

@coocood - It's possible. The design that @MakMukhi referenced when closing that PR was implemented but ended up not being acceptable (pre-allocating memory like this is a DoS risk).

@canguler - can you take a look at #1774 and see if you think it's still applicable / relevant after the memory reuse changes you're working on now?

@dfawley the flamegraph is from an API gateway in front of ~100 microservices (and growing). I will dig a bit more but I suspect it should be something related to having a big number of backends (~300).

@dfawley I went through the code again and I can't really find anything that looks out of the ordinary: we create a ClientConn per group of backends at startup and we use google.golang.org/grpc/balancer/roundrobin to load balance between backends within a group; the ClientConns are long-lived and are closed only when the process is shutting down.

If you have any suggestions about additional potential causes I can triple-check.

Just a note for @coocood: in https://github.com/grpc/grpc-go/pull/1774#issuecomment-354602167 you wrote:

I think it's very unlikely a client keeps long-lived connections to thousands of servers.

While I agree it's rare, this is exactly the case here (the API gateway the flamegraph is from is a "client" for a few hundred backends, and it keeps long-lived connections to them)

@dfawley - Change at #1774 is still applicable. If we get an unmarshalv implementation for proto, then it will no longer be applicable.

I am experimenting with the change at #1774. Even if we were to get an unmarshalv implementation in near future, I think this change (#1774) is small enough. So, if it gives an improvement without any regressions, we may still go ahead and take it.

@canguler @dfawley

I think UnmarshalV is not an ideal solution, the biggest problem is that the generated Unmarshal code cannot be used. It will be much slower for complex proto. If we generate UnmarshalV code for each type, that is too expensive for the existing projects.

@dfawley do you need me looking into anything else? I'm not 100% sure though this issue is the best place to discuss the problem we're seeing as, although related, as you mentioned they're likely different issues.

@CAFxX - could you file a separate issue for the problems you're encountering, please?

In terms of additional information to gather, it would be interesting to know how often connections are being created in your setup. It seems like you have not only a large pool of connections, but new connections are also being created and destroyed frequently - is this the case? Also, I don't see any raw numbers in the flame graph - what is the total amount of allocations you are seeing (cumulative and per-minute/etc)?

Narrowing the focus of this bug to a single item. Filed #2816 and #2817 to track other memory reuse areas.

How about to use sync.Pool to reuse memory buffers which creates by recvMsg method of parser to read gRPC message from i/o layer. In my project this is about 5% of total memory allocations.
I've created pool request #3220 to do this with minimal changes of code.

@MakMukhi are there any plans to continue with this work ?

@dfawley should be able to help with prioritization/resource allocation here.

We don't have resources to focus on performance improvements at this time. This is unlikely to be prioritized for at least the next 3-6 months. If someone is interested in contributing, we would be willing to accept PRs if the author ensures that thorough performance benchmarking is done and shows clear wins.

@dfawley Are you able to detail why https://github.com/grpc/grpc-go/pull/1987 was reverted? I saw a vague reference to security objections but it wasn't clear what they were and how a different approach might avoid them.

If this is the one I'm thinking of, it allowed one side to force potentially large allocations on the other side without actually sending a similar amount of data. I.e. the sender declares it will send 1GB of data on a stream; the receiver makes the allocation; then the sender doesn't send anything further on that stream. This could very quickly OOM the peer (across many streams).

we would be willing to accept PRs if the author ensures that thorough performance benchmarking is done and shows clear wins.

From reading this issue and related ones, it seems there is a tension between users who have mostly small message types and do not (or may not) use compression and those who have much larger message types. It sounds like the former group does not need, and may actively be harmed by, adding any type of pooling or memory reuse while the latter group would see a substantial benefit from it.

If that is accurate, it doesn't sound like there is a one size fits all solution here. What are your thoughts on being able to add a type via a DialOption or CallOption that allows plugging in a type to help manage the memory used in the recv workflow? Based on my limited analysis, it looks like there are two places where we could reuse memory, 1.) reading the message from the transport stream 2.) decompressing the data from 1. in to a separate buffer for codec unmarshaling.

Perhaps something like:

type RecvBuffers interface {
    GetReadBuffer(size int) *bytes.Buffer
    ReturnReadBuffer(b *bytes.Buffer)

    GetDecompressionBuffer(size int) *bytes.Buffer
    ReturnDecompressionBuffer(b *bytes.Buffer)
}

Implementers could decide to return a pooled buffer based on the requested size (or not). The default implementation could behave as the recv flow does now by allocating the requested memory each time.

Assuming this could be implemented in a way that doesn't harm the "out of the box" performance characteristics, this would allow users to control their own memory usage.

Thoughts?

@dfawley Thoughts on the above? I'm happy to attempt this but I don't want to go off on the wrong path

Was this page helpful?
0 / 5 - 0 ratings