Hi, I'm not a Go expert, so please excuse me if there is another way to accomplish my objective.
I'm mainly looking for a way to use the Unmarshal method with an io.Reader as input, specifically the io.ReadCloser wrapped in the Response type of the net/http library.
My problem is that the current Unmarshal method of the proto library only accepts bytes, so I have to use ioutil.ReadAll(...) to get them, which allocates unnecessary memory; this happens because there is no way to know the body's content length for sure. Even if I get the content length from net/http, the API warns me about a potential runtime error.
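Roughly, the pattern I'm describing looks like this (fetchFoo, foopb.Foo, and the example.com/foo import are just placeholders for my actual handler and generated message type):
import (
	"io/ioutil"
	"net/http"

	"github.com/golang/protobuf/proto"

	foopb "example.com/foo" // placeholder for my generated package
)

func fetchFoo(url string) (*foopb.Foo, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	data, err := ioutil.ReadAll(resp.Body) // allocates and regrows a buffer, since Content-Length cannot be trusted
	if err != nil {
		return nil, err
	}
	msg := new(foopb.Foo)
	if err := proto.Unmarshal(data, msg); err != nil {
		return nil, err
	}
	return msg, nil
}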
I already know that there are HTTP libraries based on slices that avoid the memory allocation, like fasthttp, but they don't implement the net/http standard, so I can't use them.
Is there a way for the proto library to implement Unmarshal/Marshal methods that accept an io.Reader instead of bytes, with the same, or better, performance than the encoding/json or encoding/xml equivalents?
Thanks for your patience and time.
Hi,
Is there a way for the proto library to implement Unmarshal/Marshal methods that accept an io.Reader
It is certainly possible to make the proto API support io.Reader and io.Writer, but it's not actually going to help performance, since the problem of when some underlying []byte is allocated is simply shifted into the proto package. For this reason, it is generally more flexible, performance-wise, for the API to deal directly with []byte.
In order to achieve better performance, ioutil.ReadAll is not a great choice as you mentioned since it always allocates. I recommend that you create a cached pool of buffers on the server side to read the stream into. You could use sync.Pool or a buffered chan []byte to act as the pool. If you do choose to do buffer caching, you should make sure there is an upper limit on the size of buffers that may be returned to the pool (see golang/go#23199).
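A rough sketch of the sync.Pool variant could look like this (readAndUnmarshal and the 64 KiB cap are only illustrative choices, not a prescribed API):
import (
	"bytes"
	"io"
	"sync"

	"github.com/golang/protobuf/proto"
)

var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

const maxPooledSize = 64 << 10 // arbitrary cap on buffers kept in the pool; tune for your workload

func readAndUnmarshal(r io.Reader, m proto.Message) error {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	if _, err := buf.ReadFrom(r); err != nil { // read the whole stream into the pooled buffer
		return err
	}
	err := proto.Unmarshal(buf.Bytes(), m)
	if buf.Cap() <= maxPooledSize { // only return reasonably sized buffers to the pool (see golang/go#23199)
		bufPool.Put(buf)
	}
	return err
}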
@dsnet: I'm curious about your reasoning here - if I understand you correctly, the performance you're talking about concerns small (kilobyte-sized) protobuf messages, where the number of malloc-equivalent calls is the bottleneck? And the cached pool of buffers would eliminate the allocation/deallocation of that memory?
I'm thinking that for applications that handle larger protobuf messages, API support for io.Reader and io.Writer would provide some benefit. The problem right now is that if I have a .pb file with, say, 100 MB of data, it would first need to be read into memory as a []byte, and then that byte slice would be read and parsed by protobuf to create the structs. Before the initial []byte is deallocated, we'd be using at least 2 * 100 MB of memory.
If we could stream data from the file in smaller chunks, we would only need at most 100 MB + X kB of memory allocated, where X is the chunk size. Whether this is better for performance depends on how you define performance, I guess, but it'd at least use less memory.
Perhaps large protobuf messages in a file aren't a very common use case, and that's why this API doesn't exist/isn't asked for? Another option is that I have misunderstood something, and that's why it doesn't exist 🙂
And right after I wrote the above I found https://developers.google.com/protocol-buffers/docs/techniques (read the section "Large data sets"), which states:
Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.
I guess io.Reader/io.Writer could still be convenient to stream through gzip and such though?
A deserialized proto message uses memory proportional to its encoded wire data. Thus, even if the input were an io.Reader, the proto package would still read it all into memory in order to deserialize it. If we wanted to avoid allocations and re-use buffers, something like #609 seems to be more of a step in the right direction.
I guess io.Reader/io.Writer could still be convenient to stream through gzip and such though?
Let's suppose you have a stream of proto messages of type Foo, where all the individual Foo messages together are really large, but each one is fairly small. You could encode and decode the data in such a way that it is wire-compatible with a proto message like this:
message FooList {
  repeated Foo foo = 1;
}
When encoding, this would be written as:
var ms []*foopb.Foo = ...
var b bytes.Buffer // or some other io.Writer
for _, m := range ms {
	b.WriteByte(1<<3 | 2) // write the tag for field 1 with the bytes (length-delimited) wire type
	b2, err := proto.Marshal(m)
	if err != nil {
		... // handle the error
	}
	var b3 [binary.MaxVarintLen64]byte
	b.Write(b3[:binary.PutUvarint(b3[:], uint64(len(b2)))]) // write the length prefix
	b.Write(b2) // write the message data itself
}
Encoding in this form has the benefit that the []*foopb.Foo stream is wire-equivalent to the FooList shown above. That being said, you're welcome to use whatever framing format you desire, even something as simple as prefixing each message with a fixed-width uint64.
See https://developers.google.com/protocol-buffers/docs/encoding.
@dsnet: Thanks for taking the time to answer!
Regarding memory allocation, I understand protobuf needs to read everything into memory (after all, it deserializes into structs). I was only reasoning about the intermediate memory used when deserializing a huge message - we first need to read the full message into a []byte before even handing it to protobuf, which then allocates roughly the same amount of memory again for the actual structs.
If we could deserialize in chunks of, say, 50 kB at a time, we wouldn't need to first allocate a []byte for the full message and then pass it into protobuf; the only extra memory we would need is the temporary 50 kB chunk. Given the constraint mentioned under "Large data sets", I can see why this isn't a concern though - proto messages are not meant to be several hundred MB of data.
Thanks for the example regarding the repeated field vs. just encoding the []*Foo slice directly!
What about reading a stream and knowing when to stop, so that unmarshaling works?
The protobuf binary encoding doesn't include any end-of-record marker, so any framing needs to be added externally.
That is, there's no way for the proto decoder to know when to stop. You need to tell it.
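For example, a reader for the tag-plus-length framing from the encoding example above might look roughly like this (readMessage is a made-up helper, not part of the proto API):
import (
	"bufio"
	"encoding/binary"
	"io"

	"github.com/golang/protobuf/proto"
)

// readMessage reads back one message written with the tag-plus-length framing
// from the encoding example earlier in this thread.
func readMessage(r *bufio.Reader, m proto.Message) error {
	if _, err := r.ReadByte(); err != nil { // consume the tag byte; io.EOF here means the stream ended cleanly
		return err
	}
	n, err := binary.ReadUvarint(r) // read the length prefix
	if err != nil {
		return err
	}
	buf := make([]byte, n)
	if _, err := io.ReadFull(r, buf); err != nil { // read exactly one message's worth of bytes
		return err
	}
	return proto.Unmarshal(buf, m)
}
Calling readMessage in a loop until it returns io.EOF recovers the same sequence of Foo messages that were written.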
Alright, that's what I figured. Does gRPC prepend a length field or something?
I'm not personally familiar with the gRPC protocol, but probably.
indeed. Thanks!
I think this should be reopened. It's a fairly straightforward use-case and would improve usability substantially.
I think this should be reopened. It's a fairly straightforward use-case and would improve usability substantially.
See #912.
This is a thing I'd like to do, but getting the design right is tricky and it's a substantial amount of work.
Ah, yeah, if it's still under consideration that's good enough for me. This was just the bug that showed up from a search.