IMO C# protobuf implementation could strongly benefit performance-wise from allowing some unsafe code.
First of all I am not saying that unsafe should be enabled in every build/every platform, having 100% managed library is always a nice feature. All I am suggesting is again some conditional feature on some platform.
However there is a trend to move towards unsafe mixing we can observe in .NET Core.
For example a lot of .NET Core libraries use unsafe code due to need to interop with different kind of unmanaged libraries or just to handle AOT compiled code.
Today CodedInputStream works in two modes: streaming with byte[] buffer and with a fixed buffer.
It would be really nice to be able to provide byte * and length to CodedInputStream constructor.
Alternatively we could create an UnmanagedCodedInputStream which would work on unmanaged memory.
This would require to abstracting CodedInputStream as an interface
Async API #3166 makes absolutely no sense for UnmanagedCodedInputStream because it assumes everything have been already read.
User is using native NoSQL database (for example RocksDB/LevelDB) and is using protobuf for data persistence. Database returns native allocated pointer with buffer. Instead of copying entire buffer to the memory user can deserialize record straight from the returned memory native memory and then free the pointer. Very little GC is involved and there is practically no overhead.
For performance reasons to decrease heap allocations and allocate buffer on stack. This scenario can be alternatively handled by Buffer Pooling, however stackalloc seems to be more efficient.
https://github.com/dotnet/corefxlab/blob/master/docs/specs/span.md - this work in progress improvement can allow to achieve the same goals in a managed way by providing some platform improvements.
Came here to say the same thing. Byte array copying inside Protobuf3 is one of the biggest sources of GC pressure in Akka.NET; we would like to be able to use pooled byte arrays (or Spans when they become part of the framework) and not have ByteString do a deep clone of them each time we use it.
Yes, we get it - it's dangerous. Understood, but let the end-users take the risk in order to put Protobuf to work in high performance contexts.
Protobuf Span API could look like this
```C#
namespace Google.Protobuf
{
public class CodedInputStream
{
public CodedInputStream(ReadOnlySpan
β¦
}
public class CodedOutputStream
{
public CodedOutputStream(Span<byte> buffer);
β¦
}
public class MessageParser
{
public IMessage ParseFrom(ReadOnlySpan<byte> input);
β¦
}
public static class MessageExtensions
{
public static void MergeFrom(this IMessage message, ReadOnlySpan<byte> data);
public static void WriteTo(this IMessage message, Span<byte> output);
β¦
}
}
```
@Aaronontheweb At least this class should be public
https://github.com/google/protobuf/blob/master/csharp/src/Google.Protobuf/ByteString.cs#L60
Thanks for backing me up :). If this proposal is accepted I am happy to create a proper PR. Problem with Span/ReadOnlySpan is that AFAIK it is not available yet and also that @jskeet will not be very keen on accepting PR requiring very-new compiler support and runtime support.
So I would be probably rather for the unsafe approach which can be widely used today.
@mkosieradzki if they don't want to go forward with just exposing the Unsafe methods they have today, adding some overloads which use ArraySegment<byte> would also be a workable option in lieu of Span
@mkosieradzki System.Memory package supports netstandard1.0
@alexvaluyskiy Thanks. I have just checked that this package has been released in preview 3 months ago.
Are you aware whether it requires some specific compiler version? Or are you aware of any roadmap/official documentation?
@Aaronontheweb ArraySegment is not comparable to Span, because ArraySegment requires byte[] (does not support byte *) and Span can work on everything including byte[] and byte* or any other generic kind of memory.
I'm in two minds about this. I do see the point - but I'm definitely concerned about the level of complexity we end up with, in terms of lots of different builds for different scenarios. (Adding a dependency for Span also makes me nervous, but I'm gradually warming to the idea of ValueTask for the async part...) The async aspect adds another layer of complexity here - by the time we've got a matrix of managed/unmanaged, async/non-async, multiple versions of .NET supported, I think it's going to get tricky. It makes it really easy to break customers with what may seem like a trivial change.
There's also the aspect of time commitment from Google to maintain all of this. I'm not on the protobuf team, and have a full plate already - so @anandolee would need to be very comfortable with it all.
In terms of ByteString - no, the Unsafe class should not be made public, in the same way that I would never expect Microsoft to release a public String.Unsafe class that made it trivial to mutate strings. ByteStrings are intended to be immutable; ByteString.Unsafe exists to allow us to be more efficient where the library can guarantee that sharing a byte array will not exhibit mutability.
One option you may want to consider is forking the library entirely, creating an OptimizedAndVeryUnsafe.Protobuf library which is wire compatible but exposes everything you've ever wanted to, pays less attention to being massively backward compatible all the time, takes whatever dependencies it wants and generally gives the user more of a loaded shotgun. I've always tried to optimize within the bounds of making the code easy to use safely and hard to shoot yourself in the foot. If you want to remove those restrictions, that's fine - but it shouldn't be in the same library.
@jskeet Thanks a lot for your response. My opinion would be to take the System.Memory path (without explicit unsafe). Span<byte> and ReadOnlySpan<byte> are incredibly great and safe wrappers for both byte[] and byte *.
If we could use System.Memory it would do all the heavy lifting (without requiring any UnmanagedCodedInputStream). We could just replace internal buffer with Span
So I would suggest waiting for System.Memory release or I can try to create a fork using the preview version and we can merge it into the main library when ready.
Knowing that System.Memory is going to be released in a predictable future I think that special handling for byte * makes no sense.
I also think it's very similiar case as with ValueTask... (and System.Buffers for a shared buffer pool) - it's yet another dependency required to achieve optimal performance.
BTW. For more distant future I am also researching different ways to achieve Arena allocations for C#. There is a promising project called Snowflake https://www.microsoft.com/en-us/research/publication/project-snowflake-non-blocking-safe-manual-memory-management-net/# .
Yes, I'm definitely happier with Span than anything to do with unsafe code. However, I wouldn't want to do anything with it until it's had a full (non-beta) release. We could create PRs before then of course, but we shouldn't merge them.
@jskeet Thanks! I will try to create a PR in August so we can preview this.
@mkosieradzki @jskeet I think the new Span API should be similar to java's protobuf ByteBuffer Api
Exactly as I was afraid - Span requires new compiler version: https://github.com/aspnet/Announcements/issues/268 ...
Let's see where it lands, in terms of requirements. It may be that it'll be harmless to expose it so long as Google.Protobuf is compiled with C# 7.2, which I'd be comfortable with (after that's released).
Tooling support for the C# 7.2 Span framework types was added with VS2017 15.5, which was RTMed on Dec 4. Span is now well documented and there are a couple of good collateral pieces, including one by Stephen Toub, linked here.
https://blogs.msdn.microsoft.com/dotnet/2017/11/15/welcome-to-c-7-2-and-span/
A couple of issues for discussion:
I'd be willing to work on this if there were to be a clear interest (or acceptance criteria) with respect to number two. and we're within about six weeks on number one (@anandolee & @jskeet).
I think it's reasonable to require that anyone developing the library uses VS2017 15.5 or the equivalent .NET Core SDK. There's a chunk of work required in order to update the SDK for continuous integration though.
Whether the Span methods are included in all output libraries is a different matter though. We'll need to look at the dependencies and whether there are issues consuming Span from older versions of VS/C#. (For example, if VS2015 users could easily end up using Span in a dangerous way, we may want to add a netstandard2.0 target and only expose Span there. We'll have to see.)
I think the first port of call should be some prototyping and benchmarking.
I think there's a good potential for using Span optimizations in gRPC C# (general idea: there's the grpc_csharp_ext native library and the messages need to be moved between managed and native layer and copying can be expensive. Being able to transparently address both managed/unmanaged memory with the same code seems useful.) - but we haven't really looked into these optimizations in detail (and there's complications so it's not possible to just say if it is going to be worth it without proper analysis and some experimenting).
I have started some experimentation in #3530 . I need to revisit my experiments since 15.5 is out. Before proper tooling support Span was super-slow (as expected).
Also it's important to remember that Span is a stack-only type - what has a lot of serious implications:
For example we might need to:
ref struct (doesn't seem right)Memory instead of SpanThis is especially difficult in context of #3166 - I have a lot of doubts related to the proper async-support.
I would really love to start a high level design discussion.
Perhaps we should first discuss where we believe the value for C# protobuf could originate with the new framework types and then try to align on a preliminary approach to assess whether that value actually exists. I'm mindful that this is already a well-crafted code-base and we also face the barrier/cost @jskeet noted of changing the SDK. No one wishes to waste their time developing PRs that never see the light of day, so we should try to fail sooner rather than later in the case that there isn't much value.
The rationale for the first wave of these framework types can be boiled down to a few major categories, based on what I'm seeing. Feel free to add or subtract from this picture:
I did a little preliminary mechanical code analysis/triage and here is what I found. Outside of the context of unit tests, there are about 150 instances in the code of obvious allocations as indicated from the following starter list of search terms: "new byte["; ".Copy"; ".BlockCopy"; ".ToArray("; ".Substring("; "new MemoryStream"; ".Clone()." If you are interested, you can see the list in the attached spreadsheet.
Of these, probably only a small subset would be worth attacking, at least initially, and specifically those that:
(As we know, there is no unsafe code presently in the code base.)
Purely for the purposes of initial discussion, a generic plan for the first couple of iterations in this type of situation, where we're skeptical of the benefits and want to proceed cautiously to avoid time wastage, might be as follows:
Let's have a good round of discussion on these details, alternative approaches, and any other thoughts. If something like the above turns out to make sense, I've also included a list of logistical questions in the attached file, which someone from Google could address. There are also some resources on this type of code-optimization effort.
@mkosieradzki can you clarify the backwards compatibility of Span:
@jtattermusch
That's topic worth researching - I don't know exactly, and the answer might be very difficult, because it consist of 3 elements:
System.Memory requires .NET Standard 1.1ref struct concept and safety checks)According to: https://github.com/dotnet/corefxlab/blob/master/docs/specs/span.md
Runtimes that support by-ref fields and returns will get the fast ref-field
Span<T>. Other runtimes will get the slower three-fieldSpan<T>.
From benchmarks at http://adamsitnik.com/Span/ we can see 5-10% performance degradation of Spans vs Arrays on reads (on pre .NET Core 2.0 runtimes), however I believe that introducing safe stackallocs might help regain the performance even on older runtimes.
The biggest benefit from Span<T> is ability to have a single API for unmanaged and managed. However it will definitely cost us a huge API redesign as Span<T> cannot be member of non stack-only type. Let me emphasize this: You cannot make Span
To be able to benefit from using Span<T> we need to talk about designing a completely different API.
And having this in mind. We have 2 possible approaches:
1) Let's go fully unmanaged - Use pointers
pros: fastest, reverse compatible, we can consider keeping buffer as a CodedInputStream class member, compatible with async approach
cons: will not work on partial trust environments (this is a legacy concept), require more serious code reviews, when using unmanaged memory we need to do managed pointer pinning.
Span<T>I believe that Span<T> is a lot about using call stack to guarantee lifetime of the pinned memory pointer.
Also to address problems about async support:
- Despite I am the author of async support (see #3173 ), I am not 100% convinced this is the way to go. I believe it can be a reasonable approach to use protocol buffers in a following manner:
1. Pre-fetch entire buffer (handling delimiters)
2. Schedule parsing using fast path only on a pre-fetched buffer.
This manner should be also compatible with gRPC streaming approach (and this is a level where I believe asynchrony fits the best).
This minimizes the time buffer needs to be pinned and strongly simplifies the code.
As I have mentioned before I have created a prototype for unsafe version of protobuf introducing even arena allocator approach to protobuf (see #3530 ).
I believe that introducing arena allocation together with unsafe approach will bring more and more benefits.
The last but not least: using unmanaged memory can allow implementing faster parsing by unsafe casting buggers into primitive types like int or long. Both protobuf and x86 are little-endian what should significantly improve parsing speed.
On more point about the arena allocations support:
It could be reasonable to use the memory from the original buffer in that case it makes parsing even faster. The most important thing to remember about this parsing approach is that arena-allocated + buffer-sourced data should be short-lived and the API should provide a way to transform it into a long-living heap-allocated C# objects if required by the developer.
In many scenarios long-living heap-allocated objects are not required or helpful at all.
The fact that we are using C# should not force us to downgrade to a Java-like approach with zero control over the allocations.
I'm thinking of Span as useful primarily in the narrow case of discrete, single threaded, synchronous functions and their helpers. In that case, the ability to potentially put a short array on the stack or to encapsulate neatly a reference to a slice or substring that can be passed to a helper without incurring an allocation or muddying the water with offsets is a nice (but limited) gain.
As far as the (different) topic of a public interface for core memory-encapsulating buffer objects, I'll defer to the prior participants in that conversation except to note that in my own work, which heavily depends on these, I almost always use such objects now in the context of various (multi-threaded and often parallelized) pipeline patterns in which exposing (heap-based) arrays as public properties is indeed indispensable. I try to avoid GC-pressure from a ton of small buffer or string allocations in those patterns; size buffers based on what's found to be optimal for a machine architecture (not individual document/message/cell size); and I reuse managed arrays in rotating, multi-segment buffers (with coarse-grained write vs. read gating across stages). For what it's worth, my own benchmarking has not demonstrated performance benefits for sequential access scenarios from using pointers with or without native memory regions in that particular context (though I do use pointers, unsafe code, and native memory/file handles for other use cases). All the same, a reusable core buffer strategy requires considerable bookkeeping. A preference for managed core buffer objects with ordinary indexed access may come down simply to managed lifetime as has been noted and cross-platform portability.
If folks ultimately conclude that Spans vs. core buffer refactoring are different topics, perhaps we should spin off the Spans to a separate issue?
@mkosieradzki I looked a bit more into what would the dependency on Span<> mean for Protobuf and gRPC:
System.Memory depends on netstandard1.0 (I believe it will use the unoptimized "fallback" version unless you target netstandard2.0), which means you can use it in net45 projects, but you will need a newer version of nuget and a newer IDE to be able to build those projects (they need to have knowledge of what "netstandard1.0" is)
gRPC currently targets net45 and netstandard1.5, Protobuf targets net45 and netstandard1.0.
As both gRPC and Protobuf currently target net45 explicitly, adding a dependency on System.Memory would have this effect:
Also see: http://adamsitnik.com/Span/#how-to-use-it-today
@jtattermusch I might be wrong, but my understanding is:
Span or not.Span as a stack only type and to be able to use ref struct concept. If you are using older compiler you can create incorrect library: I mean something that will throw an exception when an optimized version of Span is used. It also allows you to use type-safe stackallocs etc without using /unsafe. It might also introduce some compile-time optimizations, but I am not sure about this one.System.Memory package.Coming to this discussion late.
Having gathered a little bit experience working on a Span
From a user persective the simplest things would be to add an overload of WriteTo like this:
public void WriteTo(Span<byte> output)
and one for parsing, like this:
public T ParseFrom(ReadOnlySpan<byte> data)
Benefits are two-fold:
@kwaclaw the API you are suggesting sounds reasonable (but would also mean that Google.Protobuf would need to depend on System.Memory nuget and we should think twice before adding any new dependencies)
There's one more thing to consider for interop with native libraries: for RPC libraries that are using a native layer (such as gRPC), it won't be uncommon that the native buffers delivered on the wire will be fragmented - there's no guarantee that a single protobuf message delivered by gRPC will be present in memory in a single continguous area (gRPC has a concept of byte buffer which consists of N "slices" that represent continguous spans in memory) - so being able to parse a single message from a collection of Span
@jtattermusch Good point! (on non-contiguous memory slices)
Regarding CodedInputStream we need to wait for @jskeet to validate my points about it.
IMO CodedInputStream cannot be used together with ReadOnlySpan<byte>.
Regarding non-contiguous memory slices from native library maybe it's a good idea to transition from pull approach and CodedInputStream and switch to the push approach and use Pipelines instead?
See: https://github.com/dotnet/corefxlab/blob/master/docs/roadmap.md
and:
https://github.com/dotnet/corefxlab/blob/master/docs/specs/pipelines.md
and:
https://github.com/dotnet/corefxlab/blob/master/docs/specs/pipelines-io.md
@jtattermusch regarding non-contiguous memory slices: .NET Core 2.1 preview 1 contains new class ReadOnlySequence<T> (https://github.com/dotnet/corefx/blob/master/src/System.Memory/src/System/Buffers/ReadOnlySequence.cs) which should be helpful in this scenario.
@mkosieradzki I've been recently looking into some gRPC/Protobuf optimizations and the ReadOnlySequence<T> type is looking pretty good. here are some advantages that caught my eye:
System.Buffers which supports a large range of target frameworks.Overall, I think experimenting with adding a CodedInputStream constructor that consumes ReadOnlySequence would definitely be worth it.
Btw, since Span<> and Memory<> now exist (which allows accessing both managed and unmanaged memory), I think providing support for parsing from byte* not relevant anymore (and we should be discussing Span<> and Memory<> support instead).
@jtattermusch Awesome, thanks for having a look. I have now some spare time (I think a week or two) and I would love to spend it on contributing to protobuf/gRPC.
I have done a lot of experimentation with ReadOnlySequence<T>, Span<T> 10 weeks ago and even created some protoypes - please see: https://github.com/google/protobuf/issues/3166#issuecomment-386865048
@mkosieradzki very interesting prototypes - thanks for sharing!
I had a few questions:
@jtattermusch
The code I did most of the experimentation (in the other thread) uses PipeReader (actually very well described in the blog post you have mentioned) instead of Stream.
I think that idea of providing a CodedInputStream constructor with fixed ReadOnlySequence<T> is one of the solutions however it has some limitations.
ReadOnlySequence<T> then we do not use streaming at all - so it's a pretty neat idea and makes it easy to retain compatibility.But honestly from my perspective if we assume that for gRPC all messages can be prefetched into non-contiguous buffer it's a perfectly good solution (assuming 4MB limit). It also helps to avoid uncontrolled working set. This decision is up to you. I am OK with both solutions prefeteched and async streamed (without gRPC streaming)..
@jtattermusch
Is it because of the async or because of the non-contiguous? It seems to me that non-continguous buffers as input should have virtually no extra overhead (assuming the buffer segments have reasonable size - e.g. sth in the ~8KB ballbark)
Conceptually you are right. But the problem is that Span<T> can live only on stack and as a consequence you cannot keep it alive inside CodedInputStream unless you make CodedInputStream a stack-only type⦠as well breaking a lot of compatibility. So the pain is that you can optimize ReadOnlySequence<T> -> ReadOnlyMemory<T>, but you cannot optimize ReadOnlyMemory<T> -> ReadOnlySpan<T> which must be done every call to CodedInputStream⦠and that was somehing I was trying to overcome with ultra-fast void MergeFrom(ReadOnlySpan<T>) so the span was always on stack.
Unfortunately I don't have figures for this and I apologize for that⦠I am pretty sure I had a prototype but it wasn't very fast. But also not as slow as the async version.
could the CodedInputStream here take a ReadOnlySequence directly instead of a PipeReader? ReadOnlySequence is from System.Buffers, while PipeReader is one level abstraction higher and requires System.IO.Pipelines (which doesn't have such good platform support as System.Buffers)
I see your point here, but I really can't promise that ReadOnlySequence<T> variant is feasible without adding the generated void MergeFrom(ReadOnlySpan<T>) and I honestly regret I have not saved the numbers to prove this.
@mkosieradzki
I see your point and the tradeoffs, but right now I'm driven mostly by what could be made useful in the near future (I believe incremental improvements are the only way forward for big problems like this).
If you want to take a look at my (very much in progress) code: https://github.com/grpc/grpc/compare/master...jtattermusch:performance_slicing (the point is basically to retrieve the buffer segments for a received message without any copying and make them available to the C# layer - right now the grpc_byte_buffer gets extracted to what's called ReadOnlySliceBuffer but it's basically the same idea as ReadOnlySequence
On Thu, Jul 12, 2018 at 2:34 PM mkosieradzki notifications@github.com
wrote:
@jtattermusch https://github.com/jtattermusch
Is it because of the async or because of the non-contiguous? It seems to
me that non-continguous buffers as input should have virtually no extra
overhead (assuming the buffer segments have reasonable size - e.g. sth in
the ~8KB ballbark)Conceptually you are right. But the problem is that Span
can live only
on stack and as a consequence you cannot keep it alive inside
CodedInputStream unless you make CodedInputStream a stack-only type⦠as
well breaking a lot of compatibility. So the pain is that you can optimize ReadOnlySequence
-> ReadOnlyMemory, but you cannot optimize ReadOnlyMemory ->
ReadOnlySpanwhich must be done every call to CodedInputStream⦠and
that was somehing I was trying to overcome with ultra-fast void
MergeFrom(ReadOnlySpan) so the span was always on stack. That's a very good point - thanks for bringing that up.
So is ReadOnlyMemory-> ReadOnlySpan slow? I assume it's pretty
fast, but I understand that with the way CodedInputStream is now written,
we would be performing the conversion practically for each byte of input.
Unfortunately I don't have figures for this and I apologize for that⦠I am
pretty sure I had a prototype but it wasn't very fast. But also not as slow
as the async version.could the CodedInputStream here take a ReadOnlySequence directly instead
of a PipeReader? ReadOnlySequence is from System.Buffers, while PipeReader
is one level abstraction higher and requires System.IO.Pipelines (which
doesn't have such good platform support as System.Buffers)I see your point here, but I really can't promise that ReadOnlySequence
variant is feasible without adding the generated void
MergeFrom(ReadOnlySpan) and I honestly regret I have not saved the
numbers to prove this.Do you still have the MergeFrom(ReadOnlySpan
) prototype?
β
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/google/protobuf/issues/3431#issuecomment-404496467,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AJeq5D3daUceQn_om5vh_UB2FlEVc-Ivks5uF0JegaJpZM4OoqYG
.
--
Jan
The benchmarks are basically here:
https://github.com/mkosieradzki/protobuf/blob/spans/csharp/src/TestProtoPiper/Benchmarks/ParseAddressBook.cs
Here is the hand-crafted MergeFrom(ReadOnlySpan<T> - prototype
https://github.com/mkosieradzki/protobuf/blob/spans/csharp/src/TestProtoPiper/Addressbook.cs#L336
That gives me an idea that it might be a good idea to compare MergeFrom(ReadOnlySpan<T> with MergeFrom(ReadOnlyMemory<T> - I think I will give it a shot. That result should be somewhat conclusive :).
@jtattermusch I have created two new benchmarks to capture the overhead of ReadOnlyMemory<T> -> ReadOnlySpan<T> and as I expected the overhead is signifant (because it involves memory pinning etc)β¦
Method | Mean | Error | StdDev | Rank |
------------------------------------------------ |---------:|----------:|----------:|-----:|
ParseUsingCodedInputReader | 37.56 us | 0.1789 us | 0.1674 us | 5 |
ParseUsingClassic | 18.13 us | 0.0575 us | 0.0509 us | 3 |
ParseUsingCodedInputParser | 72.76 us | 0.4153 us | 0.3884 us | 6 |
ParseUsingCodedInputSpanParser | 15.40 us | 0.0761 us | 0.0675 us | 2 |
ParseUsingCodedInputSpanPosParser | 15.11 us | 0.0513 us | 0.0480 us | 1 |
ParseUsingCodedInputSpanPosParserReadOnlyMemory | 18.81 us | 0.0824 us | 0.0688 us | 4 |
ParseUsingCodedInputSpanPosParser is passing using pos instead of ref⦠and is the winner today ;) I will push my current benchmarks to the repo so you can reproduce it on your side.
The comparison you want to look at is ParseUsingCodedInputSpanPosParserReadOnlyMemory vs ParseUsingCodedInputSpanPosParser - the only difference is conversion between ReadOnlyMemory<T> and ReadOnlySpan<T> on each call - what simulates what you wanted to do initially.
@jtattermusch
I have pushed my latest benchmarks to the repo. Please give it a try. I hope I have proven my point that void MergeFrom(ref ReadOnlySpan<T>) or void MergeFrom(in ReadOnlySpan<T>, ref int pos) MUST be generated to ensure the optimal performance.
Next point to learn is whether we want to add async Pipes support OR stick to synchronous parsing.
I am perfectly OK with synchronous parsing.
If that's the case, we need to figure out what is the size threshold when copying non-contiguous memory to contiguous (and pinned memory) for parsing time (can be stack memory for small buffers or buffer pool for huge buffers) is efficient and depending on the result we need to decide whether it makes sense to create a void MergeFrom(in ReadOnlySequence<T>, ref int pos) or not i.e. whether threshold is above or significantly below 4MB.
Alternatively we can start with your idea of ReadOnlySequence
void MergeFrom(in ReadOnlySpan<T>, ref int pos).Does it make sense to you?
Honestly, I don't think it makes any sense to start parsing before the entire message has arrived. because it makes the RPC server more prone to a slow-sender class of DoS attacks while giving no real performance benefits except slightly decreased latency, but I really trust in your opinion here. What do you think?
@jtattermusch I have started work on branch https://github.com/mkosieradzki/protobuf/tree/read-only-sequence - this is basically implementation of CodedInputStream using ReadOnlySequence<byte> instead of byte[] as buffer. Let's see what comes out of this.
@jtattermusch I have restarted my work; on branch https://github.com/mkosieradzki/protobuf/tree/netcore2_1
Currently I have added the least intrusive version that adds support for .NET Core 2.1 (to make place for special optimizations), creates a place for passing ReadOnlySequence<byte> and ReadOnlyMemory<byte>, but does not break compatibility.
Now I am going to make sure I do not introduce any performance regressions by switching to the ReadOnlyMemory<T> abstraction instead of array of T.
If I won't be able to ensure this (especially on the pre-.NET Core 2.1 platforms) this I will need to add a fast path using compatBuffer which I am already using on pre-.NET Core 2.1 for Stream.Read.
I have run a bunch of further tests and I am... a bit disappointed, but have some important conclusions,
ValueTask-based variant of ParseUsingCodedInputSpanPosParser just by wrapping results into ValueTask<T> - it's slow beyond all recognition. So async as the only variant is NOT an option.ReadOnlyMemory<T> and T[] in context of CodedInputStream think it's unfeasible to implement a ReadOnlyMemory<T>-based variant without significant performance regressions. VarInt and Tag parsing is so slow⦠despite multiple optimization attempts including pinning the memory once and creating span from the already pinned pointer.We are talking about performance regressions orders of magnitude larger than just copying entire non-contiguous memory to a single array.
Maybe I am missing something.
Just to sum up. IMO the array-based buffer should stay in the CodedInputParser.
I think it would be worth productizing ParseUsingCodedInputSpanPosParser for its simplicity, efficiency and elasticity (you can use it with any contiguous piece of memory). I would also recommend adding codegen for void MergeFrom(in ReadOnlySpan<T>, ref int pos) - because without this codegen ParseUsingCodedInputSpanPosParser is useless.
Also this codegen should be conditionally used by the standard MergeFrom(CodedInputStream) if the entire message is already either in contiguous buffer or available from a ReadOnlySequence<byte> and is reasonably small.
@jtattermusch Please let me know whether I should create a PR (I will need to productize my PoC) as described, or if you need to do more investigation or you simply disagree.
I was playing a bit with aggressive inlining⦠and I went from:
Method | Mean | Error | StdDev | Rank |
---------------------------------- |---------:|----------:|----------:|-----:|
ParseUsingClassic | 18.39 us | 0.0996 us | 0.0932 us | 2 |
ParseUsingCodedInputSpanPosParser | 15.19 us | 0.0998 us | 0.0934 us | 1 |
down to:
Method | Mean | Error | StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
ParseUsingClassic | 16.75 us | 0.1096 us | 0.1025 us | 4 |
ParseUsingClassicInline | 15.97 us | 0.0528 us | 0.0494 us | 3 |
ParseUsingCodedInputSpanPosParser | 12.47 us | 0.0864 us | 0.0808 us | 2 |
ParseUsingCodedInputSpanPosParserInline | 12.31 us | 0.0304 us | 0.0254 us | 1 |
only by adding [MethodImpl(MethodImplOptions.AggressiveInlining)] on some methods and improving constructors (to avoid using Func<T> factories)...
I think that this optimization alone deserves a PR. In general I see a huge potential in improved codegen. I mean: what's the point of generating (de-)serialization code if it still uses indirect calls (interfaces, callbacks, etc. on the hot path) and does not excercise inlining? IMO it's a good idea to generate a predictable code and help the compiler/jitter as much as you can...,
So this is my high level optimization plan proposal:
FieldCoded abstraction).ReadOnlySequence<byte> as an alternative to Stream in CodedInputStream without introducing any significant changes (just refill array-based buffer from the sequence instead of Stream if sequence is available) - but to be honest I don't see any huge advantages over implementing a ReadOnlySequence<T>-based Stream⦠I really tried.ReadOnlySpan<byte> and use this path conditionally if the entire message fits the buffer.ArrayPool/MemoryPool where applicable,Utf8String to avoid UTF8 decoding which is super slow and accounts for around 5us in my example.All of this together should result in a drop on my benchmark from 18us to around 7.5us in parsing time.
@mkosieradzki
Ad the other optimizations you are suggesting (AggresiveInlining, using ArrayPool/MemoryPool and Utf8String) - I think they are orthogonal changes and we should handle them separately (e.g. the agressive inlining can be done for both existing CodedInputStream as for any of its future variants regardless what exactly they are).
@mkosieradzki I spent some time looking at your code and also experimenting myself. Some results:
ref ReadOnlySpan<T> - you need to be able to update the position the the current buffer. That's definitely one of the possible approaches, but my impression was that the CodedInputStream actually maintains several extra fields of state that have important security consequences (e.g. the tracking max recursion depth etc.) and it might be difficult to pass all of these along as function parameters (and if we need to add a few more context variables in the future, we might be in trouble).someMessage.MergeFrom(ParsingContext context, ReadOnlySpan<byte> immediateBuffer), where ParsingContext is not necessarily a class called like that, but it is something that maintains the parsing state as immediateBuffer is processed (it tracks the position, recursion depth and other context variables). It also should have the ability to provide the next buffer segment in form of a ReadOnlySpanParsingContext concept from above basically corresponds to the existing class CodedInputStream (where the immediate buffer would be used instead of the existing private field byte[] buffer). So the new more general parsing API could even look like someMessage.MergeFrom(CodedInputStream cis, ReadOnlySpan<byte> immediateBuffer) and the older backward-compatible API would be pretty simple to implement in terms of the newer (hopefully faster) parsing API.ParsingContext (open to better naming) explicitly as an actual class (or a struct which would reference CodedInputStream internally)What do you think?
Btw, thanks again for spending time prototyping various approaches, it is useful to have an actual code to look at.
Because the changes with the approach I'm suggesting would be relatively significant I'd suggest creating a new experimental nuget package with this new implementation of parsing first (and see how that goes), rather than trying to merge these changes to upstream.
@jtattermusch Dropping the async requirement enables us to create a ref struct to be used as ParsingContext.
So, in that case we can declare ref struct CodedInputParsingContext which is a stack-only type and is allowed to contain ReadOnlySpan<T> etc.
This means we can probably create an implicit conversion between CodedInputStream and CodedInputParsingContext and use a single code path for CodedInputParsingContext.
Do you like the idea of stack-only parsing context? I think I can try to prototype it.
Regarding the ref ReadOnlySpan<T>: the version with in ReadOnlySpan<T> and ref int pos is slightly faster - it is called ParseUsingCodedInputSpanPosParser .
@mkosieradzki I've thought about the ref struct and I'm not sure if it's a good idea. The good thing is the ReadOnlySpan can be part of this struct, but you'll need a reference to some class with context anyway (as non-immutable struct seems like a bad idea in general). I'm also unsure about the backward compatibility of ref struct (which is a C# 7.2 concept) - how would the code look like to someone using an older runtime (e.g. net45)?
In a way, the difference between using ref struct CodedInputParsingContext { SomeContextClass context, ReadOnlySpan buffer} and using two separate parsing parameters SomeContextClass context, ReadOnlySpan buffer in the API seems only like syntactic sugar - both implementations would probably look almost exactly the same, so I don't quite see the added value of using ref struct and potentially risking backward compatibility issues with older runtimes.
@jtattermusch I agree.
@mkosieradzki to prevent doing unnecessary changes, I'd say a prototype that parses from 2 arguments:
CodedInputStream context, ReadOnlySpan<byte> immediateBuffer would be useful - that should be a sufficient proof of concept that minimizes the amount of changes needed. (Basically every ReadXXX method on CodedInputStream would get an overload that accepts those two args).
@jtattermusch I was rather thinking about ref ReadOnlySpan<T> immediateBuffer so while iterating over ReadOnlySequence we can move pointer to the next buffer.
When iterating over a Stream or an array based buffer it's enough to use immutable reference ReadOnlySpan<T>, but our goal is to handle ReadOnlySequence<T> efficiently.
BTW. My preliminary results
Method | Mean | Error | StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
ParseUsingClassic | 16.97 us | 0.1073 us | 0.1004 us | 5 |
ParseUsingClassicInline | 15.97 us | 0.1102 us | 0.0977 us | 4 |
ParseUsingNew | 14.97 us | 0.0939 us | 0.0878 us | 3 |
ParseUsingCodedInputSpanPosParser | 13.06 us | 0.0493 us | 0.0461 us | 2 |
ParseUsingCodedInputSpanPosParserInline | 12.49 us | 0.0803 us | 0.0751 us | 1 |
ParseUsingNew is a heavy-inlined version of classic (classic already benefits from some inlinining so it is not a true baseline) version optimized with an additional parameter ref ReadOnlySpan<T> (as you suggested). I need to track down where the 2 us are lost.
The current experimental code is available: https://github.com/mkosieradzki/protobuf/tree/spans/csharp/src/TestProtoPiper/Benchmarks - it looks pretty bad, but it's only to prototype.
Found 0.7us ;) and accidentally boosted classic parser
Method | Mean | Error | StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
ParseUsingClassic | 16.19 us | 0.0571 us | 0.0534 us | 5 |
ParseUsingClassicInline | 14.86 us | 0.0341 us | 0.0319 us | 4 |
ParseUsingNew | 14.24 us | 0.0361 us | 0.0338 us | 3 |
ParseUsingCodedInputSpanPosParser | 12.84 us | 0.0471 us | 0.0441 us | 2 |
ParseUsingCodedInputSpanPosParserInline | 12.23 us | 0.0505 us | 0.0473 us | 1 |
It was hidden in an ineffient IsAtEnd implementation which didn't take the fast track if Stream was null. Still ~1.4us to find.
Found another 0.2us in toxic inlining of slow-path, ~1.2 to go :)
Method | Mean | Error | StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
ParseUsingClassic | 16.31 us | 0.0957 us | 0.0896 us | 5 |
ParseUsingClassicInline | 14.73 us | 0.0480 us | 0.0449 us | 4 |
ParseUsingNew | 14.05 us | 0.0543 us | 0.0481 us | 3 |
ParseUsingCodedInputSpanPosParser | 12.78 us | 0.0624 us | 0.0583 us | 2 |
ParseUsingCodedInputSpanPosParserInline | 12.24 us | 0.0527 us | 0.0493 us | 1 |
Picked-up the low-hanging fruit of indirect interface call. I think that the rest of the overhead resides inside the PushLimit / PopLimit / RecomputeBufferSizeAfterLimit - I think it can be optimized if the current message fits the current buffer - but let's leave this for later.
Method | Mean | Error | StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
ParseUsingClassic | 16.27 us | 0.1190 us | 0.1055 us | 5 |
ParseUsingClassicInline | 14.85 us | 0.0533 us | 0.0499 us | 4 |
ParseUsingNew | 13.79 us | 0.0602 us | 0.0563 us | 3 |
ParseUsingCodedInputSpanPosParser | 12.67 us | 0.0683 us | 0.0605 us | 2 |
ParseUsingCodedInputSpanPosParserInline | 12.55 us | 0.0490 us | 0.0459 us | 1 |
@jtattermusch
Long story short - I think that the concept of adding ref ReadOnlySpan<T> immediateBuffer is feasible (without generating dedicated code just for Span<T>) and worth the effort. Let's create a separate branch and create an experimental implementation.
@jtattermusch I have created a branch https://github.com/mkosieradzki/protobuf/tree/spans-pr with added spans support. (also there is branch when I do most of active development https://github.com/mkosieradzki/protobuf/tree/spans-pr-workspace)
It does not have all of the aforementioned (but significant amount) optimizations, however it is feature complete, passes all tests, has Code Access Security attributes.
I have split my changes into 4 commits: SDK-change, protoc-change, regenerated code, updated tests
This branch should be a nice starting point to add ReadOnlySequence<byte> support as no code in CodedInputStream depends on byte[] buffer. Everything utilizes ReadOnlySpan<byte> immediateBuffer except of the stream reading. I am anticipating you can add ReadOnlySequence<byte> in a similar way to Stream. As a source for immediate buffers (I can do this if you like).
As a small bonus I have added ReadOnlySpan<byte> support to ByteString what gives a secure access to the ByteString. I would like to suggest to change string to ByteString (or at least add such an option) as this is the best UTF-8 representation available today. One can use the exposed span together with the System.Buffers.Text.Utf8Parser.
How should we distribute/name/etc this package before it lands in the upsteam repo?
Progress report for https://github.com/mkosieradzki/protobuf/tree/spans-pr-workspace
ReadOnlySpan<T> when deserializingSpan<T> when serializingMemory<T>-based buffersIBufferWriter<byte>)ReadOnlySequence<byte>)Future work:
Pipelines (should be extremely easy - PipeReader returns ReadOnlySequence<byte> and PipeWriter implements IBufferWriter<byte>) - so I am not even sure if we should implement this or it would be a trivial element of gRPC Server I am planning to implement basing on .NET Core 2.2 HTTP/2 support (it's based on Pipelines) when .NET Core 2.2 Preview 1 will be public.@jtattermusch
@jtattermusch code in branch https://github.com/mkosieradzki/protobuf/tree/spans-pr-workspace is now entirely Span-compatible, it is also hard-generated instead of the inefficient FieldCodec-abstraction. I expect a huge performance boost. (btw. the best effects will be observable when used from netcoreapp2.1).
@jtattermusch I'm assigning this issue to you since you have been following up with the proposal.
@jtattermusch It's ready. I don't plan any significant work on this repo now. I have also merged spans-pr-workspace to span-pr.
This was a great discussion to read and I'm happy to share that we went through a very similar exercise when we were writing the HTTP/1.1 parser for Kestrel. The conclusion was that for optimal performance when using ReadOnlySequence
Simple example here -https://github.com/aspnet/KestrelHttpServer/blob/612fcca7291e5fad753fccbd8b15279c21e2e288/src/Kestrel.Core/Internal/Http/HttpParser.cs#L41-L62
More complex example here - https://github.com/aspnet/KestrelHttpServer/blob/612fcca7291e5fad753fccbd8b15279c21e2e288/src/Kestrel.Core/Internal/Http/HttpParser.cs#L198-L209
We're looking at adding a ref struct BufferReader/BufferWriter for the next version to make things a a bit easier to use and more efficient in the common case. We ended up doing this in Kestrel and work is happening now to make it part of the BCL (https://github.com/dotnet/corefxlab/blob/08a1874a0ca4a1a889b2045801c309a1e8575458/src/System.Buffers.ReaderWriter/System/Buffers/Reader/BufferReader.cs).
The other interesting thing in this discussion was about async parsing vs sync parsing. I believe that for optimal performance, the best thing to do is to write low level synchronous parsers and build asynchronous parsing on top of that. It looks like you guys have come to a similar conclusion here. In my mind there are 2 patters:
The first makes the parser very simple but consumers to worry about the max message size to avoid a DoS.
The second parser is much harder to write and is where async helps (it basically automatically stores the state on the heap and resumes on your behalf). This state now needs to be managed manually by the parser itself.
Using pipelines though the unconsumed buffer is managed managed on your behalf so the parser just needs to know where it left off and needs to be able to resume based on the parser context.
Here's how I think each of these could look:
```C#
async Task StatelessParsing()
{
var input = connection.Input;
while (true)
{
var result = await input.ReadAsync();
var buffer = result.Buffer;
try
{
if (result.IsCanceled)
{
break;
}
if (!buffer.IsEmpty)
{
while (Protocol.TryParseMessage(ref buffer, out var message))
{
await _dispatcher.DispatchMessageAsync(connection, message);
}
}
if (result.IsCompleted)
{
if (!buffer.IsEmpty)
{
throw new InvalidDataException("Connection terminated while reading a message.");
}
break;
}
}
finally
{
// The buffer was sliced up to where it was consumed, so we can just advance to the start.
// We mark examined as buffer.End so that if we didn't receive a full frame, we'll wait for more data
// before yielding the read again.
input.AdvanceTo(buffer.Start, buffer.End);
}
}
}
### Stateful Parsing
```C#
async Task StatefulParsing()
{
var input = connection.Input;
ParserContext parserContext;
while (true)
{
var result = await input.ReadAsync();
var buffer = result.Buffer;
try
{
if (result.IsCanceled)
{
break;
}
if (!buffer.IsEmpty)
{
while (Protocol.TryParseMessage(ref buffer, ref parserContext, out var message))
{
await _dispatcher.DispatchMessageAsync(connection, message);
}
}
if (result.IsCompleted)
{
if (!buffer.IsEmpty)
{
throw new InvalidDataException("Connection terminated while reading a message.");
}
break;
}
}
finally
{
// The buffer was sliced up to where it was consumed, so we can just advance to the start.
// We mark examined as buffer.End so that if we didn't receive a full frame, we'll wait for more data
// before yielding the read again.
input.AdvanceTo(buffer.Start, buffer.End);
}
}
}
Of course this is just the outer code, the more complicated code exists in the parser itself as each of the methods needs to flow the context in order to resume parsing.
Glad to see all of the energy and progress here!
@davidfowl Thank you for all your valuable points. I will take a closer look on the suggestions, but I really think we want to sacrifice async parsing for simplicity and performance.
I was considering stateful parsing as an option before, but I think it would make the codegen much more complicated - and especially after removing this FieldCodec abstraction responsible for virtcalls - current codegen is so incredibly fast. There also security considerations with gRPC - you don't want to start message parsing before it is already there.
I have created a new branch: https://github.com/mkosieradzki/protobuf/tree/spans-pr-rebased
It is a rebased core code to make code-review, and further development easier.
It does not contain generated code to make rebasing to the latest version much easier - learned the hard-way when creating async PR.
Also I code on Windows (VS2017) and previously had problems with line endings in the generated code - so the code should be generated on Linux (or Bash on Ubuntu on Windows :) ).
On Sat, Jul 21, 2018 at 9:37 PM David Fowler notifications@github.com
wrote:
This was a great discussion to read and I'm happy to share that we went
through a very similar exercise when we were writing the HTTP/1.1 parser
for Kestrel. The conclusion was that for optimal performance when using
ReadOnlySequence, you need to parse a Span at a time, and only use the
higher level multi buffer operations when crossing buffer boundaries.Simple example here -
https://github.com/aspnet/KestrelHttpServer/blob/612fcca7291e5fad753fccbd8b15279c21e2e288/src/Kestrel.Core/Internal/Http/HttpParser.cs#L41-L62More complex example here -
https://github.com/aspnet/KestrelHttpServer/blob/612fcca7291e5fad753fccbd8b15279c21e2e288/src/Kestrel.Core/Internal/Http/HttpParser.cs#L198-L209We're looking at adding a ref struct BufferReader/BufferWriter for the
next version to make things a a bit easier to use and more efficient in the
common case. We ended up doing this in Kestrel and work is happening now to
make it part of the BCL (
https://github.com/dotnet/corefxlab/blob/08a1874a0ca4a1a889b2045801c309a1e8575458/src/System.Buffers.ReaderWriter/System/Buffers/Reader/BufferReader.cs
).The other interesting thing in this discussion was about async parsing vs
sync parsing. I believe that for optimal performance, the best thing to do
is to write low level synchronous parsers and build asynchronous parsing on
top of that. It looks like you guys have come to a similar conclusion here.
In my mind there are 2 patters:
- The parser is stateless and works on complete messages
- The parsers is stateful and resumable
3.
- The state can be passed into the parser
4.
- The state can be kept by the parser
Thanks for the good points, to add my 50cents about the stateless/stateful
parsing:
Currently for Google.Protobuf the CodedInputStream is the storage for
parsing state - it keeps track of where we are in the buffer, tracks
recursion depth, max message size and other things important for security.
So in a way, we are in the situation, where our current parser is "stateful
and resumable"+"state is kept by the parser". Changing that would require a
significant revamp, so unless that's turns out to be a blocker, we should
go this route.
The first makes the parser very simple but consumers to worry about the
max message size to avoid a DoS.The second parser is much harder to write and is where async helps (it
basically automatically stores the state on the heap and resumes on your
behalf). This state now needs to be managed manually by the parser itself.Using pipelines though the unconsumed buffer is managed managed on your
behalf so the parser just needs to know where it left off and needs to be
able to resume based on the parser context.Here's how I think each of these could look:
Stateless Parsingasync Task StatelessParsing()
{
var input = connection.Input;while (true) { var result = await input.ReadAsync(); var buffer = result.Buffer; try { if (result.IsCanceled) { break; } if (!buffer.IsEmpty) { while (Protocol.TryParseMessage(ref buffer, out var message)) { await _dispatcher.DispatchMessageAsync(connection, message); } } if (result.IsCompleted) { if (!buffer.IsEmpty) { throw new InvalidDataException("Connection terminated while reading a message."); } break; } } finally { // The buffer was sliced up to where it was consumed, so we can just advance to the start. // We mark examined as buffer.End so that if we didn't receive a full frame, we'll wait for more data // before yielding the read again. input.AdvanceTo(buffer.Start, buffer.End); } }}
Stateful Parsing
async Task StatefulParsing()
{
var input = connection.Input;
ParserContext parserContext;while (true) { var result = await input.ReadAsync(); var buffer = result.Buffer; try { if (result.IsCanceled) { break; } if (!buffer.IsEmpty) { while (Protocol.TryParseMessage(ref buffer, ref parserContext, out var message)) { await _dispatcher.DispatchMessageAsync(connection, message); } } if (result.IsCompleted) { if (!buffer.IsEmpty) { throw new InvalidDataException("Connection terminated while reading a message."); } break; } } finally { // The buffer was sliced up to where it was consumed, so we can just advance to the start. // We mark examined as buffer.End so that if we didn't receive a full frame, we'll wait for more data // before yielding the read again. input.AdvanceTo(buffer.Start, buffer.End); } }}
Of course this is just the outer code, the more complicated code exists in
the parser itself as each of the methods needs to flow the context in order
to resume parsing.Glad to see all of the energy and progress here!
β
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/google/protobuf/issues/3431#issuecomment-406818931,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AJeq5HIiXbimws8G3zCTTee0-dTC5A3Rks5uI4LxgaJpZM4OoqYG
.
@mkosieradzki I started reviewing your code, but haven't noticed the spans-pr-rebased branch initially, which made the review more complicated. Here are some initial comments:
β I noticed there's a change from netstandard1.0 -> nestandard1.1, is that necessary? Making changes like this is theoretically possible but can slow down the review process quite a bit.
β (this point might be not applicable for spans-pr-rebased branch). It seemed that the generated .cs files were out of date when starting the branch - we should make sure C# photos are regenerated (letβs create a PR for that) before starting, otherwise the diffs for generated files are confusing (it's hard to tell new apart the changed parsing logic from the fields added to .proto files) Ideally, regenerating the protos would only happen in a single commit so that the rest of the code can reviewed commit-by-commit without needing to scroll through hundreds of lines of diffs in generated files.
β Providing the public ReadOnlySpan<byte> Span => bytes is a good idea and we should do it once the dependency on System.Memory is added. But it seems to be pretty independent from the rest of the parsing logic and should come as a separate PR. Same for ByteString CopyFrom(ReadOnlySpan<byte> bytes).
β aggressive inlining optimization (and other optimizations) should go in as a separate PR (1 PR per one optimization technique).
β what's the difference between Wrapped vs no-suffix when parsing? Explaining this concept would simplify reading the code review a lot. This seems to be the most interesting piece when it comes to parsing from Spans so we should make sure we are on the same page here.
β I've notices some refactoring in CodedOutputStream.ComputeSize.cs (and possibly elsewhere), but I think combining refactoring and changes in logic makes the change very hard to review and greatly increases the risk we will miss something important.
β I've noticed CodedOutputStream.cs changes -> are they necessary? IMHO we should address the serialization and deserialization pieces separately.
I'm currently trying to review spans-pr-rebased
@mkosieradzki you've come up with a great amount of good changes, but I'm now thinking of how to make the review process efficient and how we can integrate this. Here's what I suggest:
I created a branch in my fork: https://github.com/jtattermusch/protobuf/tree/span_support_experimental (currently in sync with protobuf master). I think we could review the code piece-by-piece by creating smaller PRs against that branch. That way we have control of what's being accepted and a good way to handle review comments before things get merged. The very first PR can add a dependency on System.Memory and then we can take it from there (lot of the optimizations you're suggesting are logically independent). I think the most important item is to support the non-contiguous parsing in it's simplest possible form.
I also created a myget.org feed where I can upload the experimental version of Google.Protobuf nuget at any time without disturbing the official version - anyone who want's to experiment with that package can just add the feed to their dev environment and use the experimental packages instead of the official vanilla one. (I think this is better than creating a nuget.org package under fake name).
What do you think?
β I noticed there's a change from netstandard1.0 -> nestandard1.1, is that necessary? Making changes like this is theoretically possible but can slow down the review process quite a bit.
System.Memory requires 1.1 - so it simplified the PR a lot. If you take a look on the compatibility .NET Standard compatbility table the only different is Windows Phone Silverlight. So I think this sounds like a reasonable compromise between amount of #if and compatibility.
β (this point might be not applicable for spans-pr-rebased branch). It seemed that the generated .cs files were out of date when starting the branch - we should make sure C# photos are regenerated (letβs create a PR for that) before starting, otherwise the diffs for generated files are confusing (it's hard to tell new apart the changed parsing logic from the fields added to .proto files) Ideally, regenerating the protos would only happen in a single commit so that the rest of the code can reviewed commit-by-commit without needing to scroll through hundreds of lines of diffs in generated files.
Yup this is exaclty why I have created spans-pr-rebased.
β Providing the public ReadOnlySpan
Span => bytes is a good idea and we should do it once the dependency on System.Memory is added. But it seems to be pretty independent from the rest of the parsing logic and should come as a separate PR. Same for ByteString CopyFrom(ReadOnlySpan bytes).
Fair enough. Not sure if the second is not used somewhere in the process...
β aggressive inlining optimization (and other optimizations) should go in as a separate PR (1 PR per one optimization technique).
Makes sense.
β what's the difference between Wrapped vs no-suffix when parsing? Explaining this concept would simplify reading the code review a lot. This seems to be the most interesting piece when it comes to parsing from Spans so we should make sure we are on the same page here.
Wrapped is dedicated to handle the wrapped messages like ValueInt32. Previously Wrapped messages were being handled by codecs with a lot of abstraction. The wrapped message structure is:
β I've notices some refactoring in CodedOutputStream.ComputeSize.cs (and possibly elsewhere), but I think combining refactoring and changes in logic makes the change very hard to review and greatly increases the risk we will miss something important.
I hope I didn't introduce any logic changes there! Please let me know where. I have only added support for wrapped messages.
β I've noticed CodedOutputStream.cs changes -> are they necessary? IMHO we should address the serialization and deserialization pieces separately.
I'm ok with both approaches.
@jtattermusch
Please give the spans-rebased-pr branch a try. I really tried to make this specific branch very clean and split into separate meaningful commits. I think that because we are dropping compatbility with 3.x branch by deleting FieldCodec abstraction the entire PR should be treated as a major change.
If you want to play with what I have built, I have created yet another branch spans-pr-rebased-build (with custom build process) and connected it to the AppVeyor. You can access the package from the AppVeyor nuget feed.
My apologies, but creating this PR as is has consumed most of my free time (it was a considerable amount of work, most in experimentation), I can't promise I will be able to split the PR into smaller pieces very soon
https://ci.appveyor.com/project/mkosieradzki/protobuf/build/1.0.30 - this is the best build I have currently.. There is one artifact with protoc for windows and one artifact with SDK nuget package. It is also available through this feed: as 4.0.0-pre-30 https://ci.appveyor.com/nuget/protobuf-spans
@mkosieradzki I plan to take a closer look at spans-rebased-pr tomorrow.
I followed the trail of related issues all the way to this PR.
I'm running a locally-built version of Protobuf where I have implemented some of these same optimizations. This issue and related PRs look dead. @mkosieradzki has done so much work on this I hate to put in effort and submit PRs of my own, but I don't see this issue moving otherwise.
What's the best path forward to get these optimizations in Protobuf?
@prat0088 we still plan to move ahead with the Span-based parsing but it's unclear which exact approach we're going to take (design discussions are ongoing). IMHO the PR that's the closest to the ideal design is https://github.com/protocolbuffers/protobuf/pull/5888.
@mkosieradzki definitely deserves recognition for prototyping the approach, his code is really useful as a reference (regardless of which exact approach eventually end up landing).
This has been added in https://github.com/protocolbuffers/protobuf/pull/7351
Most helpful comment
@prat0088 we still plan to move ahead with the Span-based parsing but it's unclear which exact approach we're going to take (design discussions are ongoing). IMHO the PR that's the closest to the ideal design is https://github.com/protocolbuffers/protobuf/pull/5888.
@mkosieradzki definitely deserves recognition for prototyping the approach, his code is really useful as a reference (regardless of which exact approach eventually end up landing).