Protobuf: [CSharp] Allow Span<byte>-based parsing in CodedInputStream

Created on 31 Jul 2017  ·  70 Comments  ·  Source: protocolbuffers/protobuf

IMO the C# protobuf implementation could benefit strongly, performance-wise, from allowing some unsafe code.

Background

First of all, I am not saying that unsafe should be enabled in every build or on every platform; having a 100% managed library is always a nice feature. All I am suggesting is an optional, conditional feature on some platforms.

However, there is a trend towards mixing in unsafe code that we can observe in .NET Core.
For example, a lot of .NET Core libraries use unsafe code due to the need to interop with different kinds of unmanaged libraries, or just to handle AOT-compiled code.

API change

Today CodedInputStream works in two modes: streaming with a byte[] buffer, and over a fixed buffer.

It would be really nice to be able to provide a byte * and a length to the CodedInputStream constructor.

Alternatively we could create an UnmanagedCodedInputStream which would work on unmanaged memory.

This would require abstracting CodedInputStream into an interface.

Affected APIs

The async API (#3166) makes absolutely no sense for an UnmanagedCodedInputStream because it assumes everything has already been read.

Benefits

  1. It's extremely useful for interop (See scenario 1)
  2. It can work with stackalloc (See scenario 2)
  3. It allows even quicker deserialization: we can deserialize from a byte * to, for example, fixed-size primitive types by pointer dereference. (Endianness might be a problem here.)
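To make benefit 3 concrete, here is a minimal hypothetical sketch (not actual protobuf code) of reading a fixed32 field with a single pointer dereference. It assumes a little-endian host (the protobuf wire format is little-endian) and a build compiled with /unsafe; note the wire data may be unaligned, which x86/x64 tolerates but some platforms do not.

```C#
using System;

class Program
{
    // Read a protobuf fixed32 field with one dereference instead of
    // assembling it byte by byte. Little-endian hosts only.
    public static unsafe uint ReadFixed32(byte* p) => *(uint*)p;

    static unsafe void Main()
    {
        byte[] wire = { 0x2A, 0x00, 0x00, 0x00 }; // fixed32 encoding of 42
        fixed (byte* p = wire)                    // pin the managed array
        {
            Console.WriteLine(ReadFixed32(p));    // prints 42
        }
    }
}
```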

Scenarios

Scenario 1. NoSQL Database

A user is using a native NoSQL database (for example RocksDB/LevelDB) and protobuf for data persistence. The database returns a pointer to a natively allocated buffer. Instead of copying the entire buffer into managed memory, the user can deserialize the record straight from the returned native memory and then free the pointer. Very little GC is involved and there is practically no overhead.

Scenario 2. Stackalloc-ated buffers

For performance reasons, the user wants to decrease heap allocations and allocate the buffer on the stack. This scenario could alternatively be handled by buffer pooling; however, stackalloc seems to be more efficient.
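As an illustration of this scenario, here is a self-contained sketch (hypothetical helpers, not part of the protobuf API) that writes and then reads a protobuf varint on a stackalloc'd Span&lt;byte&gt; with no heap allocation; it assumes C# 7.2+, where stackalloc into a Span needs no /unsafe.

```C#
using System;

class StackallocDemo
{
    // Encode 'value' as a protobuf base-128 varint; returns bytes written.
    public static int WriteVarint(Span<byte> buf, ulong value)
    {
        int i = 0;
        while (value >= 0x80)
        {
            buf[i++] = (byte)(value | 0x80); // set continuation bit
            value >>= 7;
        }
        buf[i++] = (byte)value;
        return i;
    }

    // Decode a varint from the start of the span.
    public static ulong ReadVarint(ReadOnlySpan<byte> buf)
    {
        ulong result = 0;
        int shift = 0, i = 0;
        byte b;
        do
        {
            b = buf[i++];
            result |= (ulong)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }

    static void Main()
    {
        Span<byte> buf = stackalloc byte[10];       // max varint64 size, on the stack
        int len = WriteVarint(buf, 300);
        Console.WriteLine(ReadVarint(buf.Slice(0, len))); // prints 300
    }
}
```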

Affected by

https://github.com/dotnet/corefxlab/blob/master/docs/specs/span.md - this work-in-progress improvement could achieve the same goals in a managed way by providing some platform improvements.

P3 c# enhancement

Most helpful comment

@prat0088 we still plan to move ahead with the Span-based parsing but it's unclear which exact approach we're going to take (design discussions are ongoing). IMHO the PR that's the closest to the ideal design is https://github.com/protocolbuffers/protobuf/pull/5888.
@mkosieradzki definitely deserves recognition for prototyping the approach; his code is really useful as a reference (regardless of which exact approach eventually ends up landing).

All 70 comments

Came here to say the same thing. Byte array copying inside Protobuf3 is one of the biggest sources of GC pressure in Akka.NET; we would like to be able to use pooled byte arrays (or Spans when they become part of the framework) and not have ByteString do a deep clone of them each time we use it.

Yes, we get it - it's dangerous. Understood, but let the end-users take the risk in order to put Protobuf to work in high performance contexts.

A protobuf Span API could look like this:
```C#
namespace Google.Protobuf
{
    public class CodedInputStream
    {
        public CodedInputStream(ReadOnlySpan<byte> buffer);
        …
    }

    public class CodedOutputStream
    {
        public CodedOutputStream(Span<byte> buffer);
        …
    }

    public class MessageParser
    {
        public IMessage ParseFrom(ReadOnlySpan<byte> input);
        …
    }

    public static class MessageExtensions
    {
        public static void MergeFrom(this IMessage message, ReadOnlySpan<byte> data);
        public static void WriteTo(this IMessage message, Span<byte> output);
        …
    }
}
```

Thanks for backing me up :). If this proposal is accepted I am happy to create a proper PR. The problem with Span/ReadOnlySpan is that AFAIK it is not available yet, and also that @jskeet will not be very keen on accepting a PR requiring very new compiler and runtime support.

So I would be probably rather for the unsafe approach which can be widely used today.

@mkosieradzki if they don't want to go forward with just exposing the Unsafe methods they have today, adding some overloads which use ArraySegment<byte> would also be a workable option in lieu of Span

@mkosieradzki System.Memory package supports netstandard1.0

@alexvaluyskiy Thanks. I have just checked that this package has been released in preview 3 months ago.

Are you aware whether it requires some specific compiler version? Or are you aware of any roadmap/official documentation?

@Aaronontheweb ArraySegment is not comparable to Span, because ArraySegment requires a byte[] (it does not support byte *), while Span can work on everything, including byte[], byte *, or any other kind of memory.

I'm in two minds about this. I do see the point - but I'm definitely concerned about the level of complexity we end up with, in terms of lots of different builds for different scenarios. (Adding a dependency for Span also makes me nervous, but I'm gradually warming to the idea of ValueTask for the async part...) The async aspect adds another layer of complexity here - by the time we've got a matrix of managed/unmanaged, async/non-async, multiple versions of .NET supported, I think it's going to get tricky. It makes it really easy to break customers with what may seem like a trivial change.

There's also the aspect of time commitment from Google to maintain all of this. I'm not on the protobuf team, and have a full plate already - so @anandolee would need to be very comfortable with it all.

In terms of ByteString - no, the Unsafe class should not be made public, in the same way that I would never expect Microsoft to release a public String.Unsafe class that made it trivial to mutate strings. ByteStrings are intended to be immutable; ByteString.Unsafe exists to allow us to be more efficient where the library can guarantee that sharing a byte array will not exhibit mutability.

One option you may want to consider is forking the library entirely, creating an OptimizedAndVeryUnsafe.Protobuf library which is wire compatible but exposes everything you've ever wanted to, pays less attention to being massively backward compatible all the time, takes whatever dependencies it wants and generally gives the user more of a loaded shotgun. I've always tried to optimize within the bounds of making the code easy to use safely and hard to shoot yourself in the foot. If you want to remove those restrictions, that's fine - but it shouldn't be in the same library.

@jskeet Thanks a lot for your response. My opinion would be to take the System.Memory path (without explicit unsafe). Span<byte> and ReadOnlySpan<byte> are incredibly great and safe wrappers for both byte[] and byte *.

If we could use System.Memory it would do all the heavy lifting (without requiring any UnmanagedCodedInputStream). We could just replace internal buffer with Span probably without changing a lot of code...

So I would suggest waiting for System.Memory release or I can try to create a fork using the preview version and we can merge it into the main library when ready.

Knowing that System.Memory is going to be released in a predictable future I think that special handling for byte * makes no sense.

I also think it's a very similar case to ValueTask (and System.Buffers for a shared buffer pool) - it's yet another dependency required to achieve optimal performance.

BTW. For more distant future I am also researching different ways to achieve Arena allocations for C#. There is a promising project called Snowflake https://www.microsoft.com/en-us/research/publication/project-snowflake-non-blocking-safe-manual-memory-management-net/# .

Yes, I'm definitely happier with Span than anything to do with unsafe code. However, I wouldn't want to do anything with it until it's had a full (non-beta) release. We could create PRs before then of course, but we shouldn't merge them.

@jskeet Thanks! I will try to create a PR in August so we can preview this.

@mkosieradzki @jskeet I think the new Span API should be similar to Java protobuf's ByteBuffer API.

Exactly as I was afraid - Span requires new compiler version: https://github.com/aspnet/Announcements/issues/268 ...

Let's see where it lands, in terms of requirements. It may be that it'll be harmless to expose it so long as Google.Protobuf is compiled with C# 7.2, which I'd be comfortable with (after that's released).

Tooling support for the C# 7.2 Span framework types was added with VS2017 15.5, which was RTMed on Dec 4. Span is now well documented and there are a couple of good collateral pieces, including one by Stephen Toub, linked here.

https://blogs.msdn.microsoft.com/dotnet/2017/11/15/welcome-to-c-7-2-and-span/

A couple of issues for discussion:

  1. At what point would it make sense to introduce the requirement for C# Google.Protobuf developers to be on VS2017 15.5+?
  2. For the project owner: would adding this to the mix be seen as an enhancement or rather as an "optimization" seeking a problem?

I'd be willing to work on this if there were a clear interest (or acceptance criteria) with respect to number 2, and we're within about six weeks on number 1 (@anandolee & @jskeet).

I think it's reasonable to require that anyone developing the library uses VS2017 15.5 or the equivalent .NET Core SDK. There's a chunk of work required in order to update the SDK for continuous integration though.

Whether the Span methods are included in all output libraries is a different matter though. We'll need to look at the dependencies and whether there are issues consuming Span from older versions of VS/C#. (For example, if VS2015 users could easily end up using Span in a dangerous way, we may want to add a netstandard2.0 target and only expose Span there. We'll have to see.)

I think the first port of call should be some prototyping and benchmarking.

I think there's a good potential for using Span optimizations in gRPC C# (general idea: there's the grpc_csharp_ext native library and the messages need to be moved between managed and native layer and copying can be expensive. Being able to transparently address both managed/unmanaged memory with the same code seems useful.) - but we haven't really looked into these optimizations in detail (and there's complications so it's not possible to just say if it is going to be worth it without proper analysis and some experimenting).

I have started some experimentation in #3530 . I need to revisit my experiments since 15.5 is out. Before proper tooling support Span was super-slow (as expected).

Also it's important to remember that Span is a stack-only type, which has a lot of serious implications.
For example we might need to:

  1. make CodedInputStream a ref struct (doesn't seem right)
  2. or use heap-friendly Memory instead of Span
  3. or something else

This is especially difficult in context of #3166 - I have a lot of doubts related to the proper async-support.

I would really love to start a high level design discussion.

Perhaps we should first discuss where we believe the value for C# protobuf could originate with the new framework types and then try to align on a preliminary approach to assess whether that value actually exists. I'm mindful that this is already a well-crafted code-base and we also face the barrier/cost @jskeet noted of changing the SDK. No one wishes to waste their time developing PRs that never see the light of day, so we should try to fail sooner rather than later in the case that there isn't much value.

The rationale for the first wave of these framework types can be boiled down to a few major categories, based on what I'm seeing. Feel free to add or subtract from this picture:

  • Reducing cost of allocations

    • Allocations resulting from slicing/substring operations

    • Allocations that could now conditionally be put on the stack and in a safe/managed manner (when they are local arrays of primitives that turn out to be small)

  • Increasing managed code goodness

    • Reducing the unsafe footprint in the use and passing around of unsafe pointers in our library code

I did a little preliminary mechanical code analysis/triage and here is what I found. Outside of the context of unit tests, there are about 150 instances in the code of obvious allocations as indicated from the following starter list of search terms: "new byte["; ".Copy"; ".BlockCopy"; ".ToArray("; ".Substring("; "new MemoryStream"; ".Clone()." If you are interested, you can see the list in the attached spreadsheet.

Of these, probably only a small subset would be worth attacking, at least initially, and specifically those that:

  • can be avoided with the new types/overloads;
  • are in hot code regions (i.e. can deliver significant improvement in the benchmarks)

(As we know, there is no unsafe code presently in the code base.)

Purely for the purposes of initial discussion, a generic plan for the first couple of iterations in this type of situation, where we're skeptical of the benefits and want to proceed cautiously to avoid time wastage, might be as follows:

  1. Establish benchmark details (procedures, platform(s) and artifacts)
  2. Generate benchmarks for the current C# release (tagged as 3.5.1).
  3. Seek code sites for proof of concept ("POC") interventions by looking to the intersection of: (a) hot code regions (as identified by a tool such as PerfView while running the benchmark cases); and, (b) code regions that have buffer-oriented allocations that might benefit from the new framework types
  4. Select the most promising 2 - 8 code sites for preliminary interventions with span etc.
  5. Within a fork of the same release used above, complete a quick-and-dirty POC iteration.
  6. Obtain preliminary comparative benchmarks for the POC and come back up for air to the discussion in order to evaluate the value of proceeding with a more systematic/disciplined effort.
  7. Decide whether to proceed, on what timing, and who to assign as reviewer(s) etc.
  8. If it's a "Go", begin working on the actual PR(s)...

Let's have a good round of discussion on these details, alternative approaches, and any other thoughts. If something like the above turns out to make sense, I've also included a list of logistical questions in the attached file, which someone from Google could address. There are also some resources on this type of code-optimization effort.

Allocations Inventory, Questions and Resources.xlsx

@mkosieradzki can you clarify the backwards compatibility of Span:

  • What is the lowest version of .NET where Span can be used in the code? (not necessarily with seeing performance gains, not making things slower is enough - my understanding is that Span has a fallback implementation in case it's not supported by the runtime). Google.Protobuf needs to support at least net45+
  • What is the oldest runtime where one can see performance gains from Span (= where Span is properly supported)?

@jtattermusch
That's a topic worth researching - I don't know exactly, and the answer might be very complicated, because it consists of 3 elements:

  • library (compatibility) - I believe System.Memory requires .NET Standard 1.1
  • compiler (ref struct concept and safety checks)
  • runtime optimizations, which are distributed among different versions of the runtime starting with .NET Core 2.0 (http://adamsitnik.com/Span/), but I would not be surprised if there were another bunch of significant improvements in .NET Core 2.1 and later.

According to: https://github.com/dotnet/corefxlab/blob/master/docs/specs/span.md

Runtimes that support by-ref fields and returns will get the fast ref-field Span<T>. Other runtimes will get the slower three-field Span<T>.

From the benchmarks at http://adamsitnik.com/Span/ we can see a 5-10% performance degradation of Spans vs. arrays on reads (on pre-.NET Core 2.0 runtimes); however, I believe that introducing safe stackallocs might help regain the performance even on older runtimes.

The biggest benefit from Span<T> is the ability to have a single API for unmanaged and managed memory. However, it will definitely cost us a huge API redesign, as Span<T> cannot be a member of a non-stack-only type. Let me emphasize this: you cannot make Span a member of CodedInputStream, and this defeats all simple refactorings.

To be able to benefit from using Span<T> we need to talk about designing a completely different API.
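A small sketch of the constraint being discussed - the reader type below is hypothetical, but it shows why Span&lt;byte&gt; cannot simply replace the internal buffer field of a class like CodedInputStream, and how a heap-friendly ReadOnlyMemory&lt;byte&gt; can be stored instead, at the cost of a Memory-to-Span conversion inside each call.

```C#
using System;

class MemoryBackedReader
{
    // ReadOnlySpan<byte> _buffer;   // does not compile: a ref struct
    //                               // cannot be a field of a class (CS8345)
    private ReadOnlyMemory<byte> _buffer; // heap-friendly alternative
    private int _pos;

    public MemoryBackedReader(ReadOnlyMemory<byte> buffer) => _buffer = buffer;

    public byte ReadByte()
    {
        // The Memory -> Span conversion happens on every call; this is
        // the per-call cost of not being a stack-only type.
        ReadOnlySpan<byte> span = _buffer.Span;
        return span[_pos++];
    }

    static void Main()
    {
        var reader = new MemoryBackedReader(new byte[] { 1, 2, 3 });
        Console.WriteLine(reader.ReadByte()); // prints 1
        Console.WriteLine(reader.ReadByte()); // prints 2
    }
}
```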

And having this in mind, we have 2 possible approaches:

  1. Let's go fully unmanaged - use pointers
    pros: fastest, backward compatible, we can consider keeping the buffer as a CodedInputStream class member, compatible with the async approach
    cons: will not work in partial-trust environments (a legacy concept), requires more serious code reviews, and when using managed memory we need to do pointer pinning

  2. Let's go Span<T>
    pros: safe, modern, no pointer pinning required, might work in some fully managed environments (I am not sure)
    cons: slower on legacy runtimes, needs a really serious redesign, will not work with async

I believe that Span<T> is largely about using the call stack to guarantee the lifetime of the pinned memory pointer.

Also to address problems about async support:
- Although I am the author of the async support (see #3173), I am not 100% convinced this is the way to go. I believe it can be a reasonable approach to use protocol buffers in the following manner:
1. Pre-fetch entire buffer (handling delimiters)
2. Schedule parsing using fast path only on a pre-fetched buffer.

This manner should also be compatible with the gRPC streaming approach (and this is the level where I believe asynchrony fits best).

This minimizes the time buffer needs to be pinned and strongly simplifies the code.
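The prefetch-then-parse idea above could be sketched roughly like this. The helper below is hypothetical: it asynchronously reads one length-prefixed (varint-delimited) message into a byte[], which would then be handed to the synchronous fast-path parser.

```C#
using System;
using System.IO;
using System.Threading.Tasks;

class PrefetchingReader
{
    // Step 1: prefetch one varint-length-prefixed message from the stream.
    public static async Task<byte[]> PrefetchMessageAsync(Stream stream)
    {
        int length = 0, shift = 0, b;
        do
        {
            b = stream.ReadByte();               // cheap for buffered streams
            if (b < 0) throw new EndOfStreamException();
            length |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);

        var buffer = new byte[length];
        int read = 0;
        while (read < length)
        {
            int n = await stream.ReadAsync(buffer, read, length - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }
        return buffer; // step 2: hand off to the synchronous parser
    }

    static async Task Main()
    {
        var ms = new MemoryStream(new byte[] { 3, 10, 20, 30 }); // len=3 + payload
        byte[] msg = await PrefetchMessageAsync(ms);
        Console.WriteLine(msg.Length); // prints 3
    }
}
```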

As I have mentioned before, I have created a prototype for an unsafe version of protobuf, even introducing an arena allocator approach (see #3530).

I believe that introducing arena allocation together with unsafe approach will bring more and more benefits.

Last but not least: using unmanaged memory can allow implementing faster parsing by unsafe-casting buffers into primitive types like int or long. Both protobuf and x86 are little-endian, which should significantly improve parsing speed.
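For completeness, a safe counterpart of the cast-based idea exists if System.Memory is available: BinaryPrimitives (in System.Buffers.Binary) reads little-endian fixed-width values from a span on any host, and compiles down to a single load on little-endian hardware. A minimal sketch:

```C#
using System;
using System.Buffers.Binary;

class Program
{
    static void Main()
    {
        // fixed32 encoding of 42 on the wire (protobuf is little-endian)
        ReadOnlySpan<byte> wire = new byte[] { 0x2A, 0x00, 0x00, 0x00 };
        uint value = BinaryPrimitives.ReadUInt32LittleEndian(wire);
        Console.WriteLine(value); // prints 42 on any host endianness
    }
}
```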

One more point about the arena allocation support:
It could be reasonable to use the memory from the original buffer; in that case it makes parsing even faster. The most important thing to remember about this parsing approach is that arena-allocated, buffer-sourced data should be short-lived, and the API should provide a way to transform it into long-living heap-allocated C# objects if required by the developer.

In many scenarios long-living heap-allocated objects are not required or helpful at all.

The fact that we are using C# should not force us to downgrade to a Java-like approach with zero control over the allocations.

I'm thinking of Span as useful primarily in the narrow case of discrete, single threaded, synchronous functions and their helpers. In that case, the ability to potentially put a short array on the stack or to encapsulate neatly a reference to a slice or substring that can be passed to a helper without incurring an allocation or muddying the water with offsets is a nice (but limited) gain.

As far as the (different) topic of a public interface for core memory-encapsulating buffer objects, I'll defer to the prior participants in that conversation except to note that in my own work, which heavily depends on these, I almost always use such objects now in the context of various (multi-threaded and often parallelized) pipeline patterns in which exposing (heap-based) arrays as public properties is indeed indispensable. I try to avoid GC-pressure from a ton of small buffer or string allocations in those patterns; size buffers based on what's found to be optimal for a machine architecture (not individual document/message/cell size); and I reuse managed arrays in rotating, multi-segment buffers (with coarse-grained write vs. read gating across stages). For what it's worth, my own benchmarking has not demonstrated performance benefits for sequential access scenarios from using pointers with or without native memory regions in that particular context (though I do use pointers, unsafe code, and native memory/file handles for other use cases). All the same, a reusable core buffer strategy requires considerable bookkeeping. A preference for managed core buffer objects with ordinary indexed access may come down simply to managed lifetime as has been noted and cross-platform portability.

If folks ultimately conclude that Spans vs. core buffer refactoring are different topics, perhaps we should spin off the Spans to a separate issue?

@mkosieradzki I looked a bit more into what a dependency on Span<> would mean for Protobuf and gRPC:

System.Memory depends on netstandard1.0 (I believe it will use the unoptimized "fallback" version unless you target netstandard2.0), which means you can use it in net45 projects, but you will need a newer version of nuget and a newer IDE to be able to build those projects (they need to have knowledge of what "netstandard1.0" is)

gRPC currently targets net45 and netstandard1.5, Protobuf targets net45 and netstandard1.0.

As both gRPC and Protobuf currently target net45 explicitly, adding a dependency on System.Memory would have this effect:

  • users using older versions of Visual Studio (e.g. 2013 and, I think, some older versions of 2015) and Mono would be unable to build their projects (as they wouldn't be able to resolve the System.Memory dependency). Users would be able to build their legacy projects if they upgraded their IDE/toolchain.

Also see: http://adamsitnik.com/Span/#how-to-use-it-today

@jtattermusch I might be wrong, but my understanding is:

  1. The framework target has no impact on performance at compile time - neither the package's target framework version nor the current compilation target version. Specifically, if you target for example .NET Standard 1.1, you have a single assembly that does not need any recompilation - no matter what the compile target of the entire application is, and no matter whether the target runtime supports the in-box Span or not.
  2. If I understand correctly, the fast version of Span is provided by the runtime as an in-box type (a newer .NET Framework version installed on the machine, or a newer .NET Core runtime). My understanding is that if a compatible runtime is used, then bindings are redirected to the in-box version of Span.
  3. A newer version of the compiler is required to enforce the safety checks for Span as a stack-only type and to be able to use the ref struct concept. If you are using an older compiler you can create an incorrect library: I mean something that will throw an exception when the optimized version of Span is used. The newer compiler also allows you to use type-safe stackallocs etc. without using /unsafe. It might also introduce some compile-time optimizations, but I am not sure about that one.
  4. https://www.nuget.org/packages/System.Memory/4.4.0-preview1-25305-02 - if you look at the dependencies for this package, it has a backward-compatible target for net45, so you should still be able to compile with double targets net45;netstandard15 and net45;netstandard10 - and use the System.Memory package.

Coming to this discussion late.

Having gathered a little bit of experience working on a Span-based API for LMDB (https://github.com/LMDB/lmdb), I would agree that an entirely new API is necessary.

From a user perspective the simplest thing would be to add an overload of WriteTo like this:
public void WriteTo(Span<byte> output)

and one for parsing, like this:
public T ParseFrom(ReadOnlySpan<byte> data)

Benefits are two-fold:

  1. Span can wrap native pointers, it really becomes a universal buffer API.
  2. This makes it possible to have zero-copy interop with native libraries.

@kwaclaw the API you are suggesting sounds reasonable (but would also mean that Google.Protobuf would need to depend on System.Memory nuget and we should think twice before adding any new dependencies)

There's one more thing to consider for interop with native libraries: for RPC libraries that use a native layer (such as gRPC), it won't be uncommon for the native buffers delivered on the wire to be fragmented - there's no guarantee that a single protobuf message delivered by gRPC will be present in memory in a single contiguous area (gRPC has a concept of a byte buffer which consists of N "slices" that represent contiguous spans in memory) - so being able to parse a single message from a collection of Span objects might be useful (which would probably result in some new APIs added to the CodedInputStream object).

@jtattermusch Good point! (on non-contiguous memory slices)

Regarding CodedInputStream we need to wait for @jskeet to validate my points about it.
IMO CodedInputStream cannot be used together with ReadOnlySpan<byte>.

Regarding non-contiguous memory slices from a native library: maybe it's a good idea to transition from the pull approach (CodedInputStream) to the push approach and use Pipelines instead?

See: https://github.com/dotnet/corefxlab/blob/master/docs/roadmap.md

and:

https://github.com/dotnet/corefxlab/blob/master/docs/specs/pipelines.md

and:

https://github.com/dotnet/corefxlab/blob/master/docs/specs/pipelines-io.md

@jtattermusch regarding non-contiguous memory slices: .NET Core 2.1 preview 1 contains new class ReadOnlySequence<T> (https://github.com/dotnet/corefx/blob/master/src/System.Memory/src/System/Buffers/ReadOnlySequence.cs) which should be helpful in this scenario.

@mkosieradzki I've been recently looking into some gRPC/Protobuf optimizations and the ReadOnlySequence<T> type is looking pretty good. here are some advantages that caught my eye:

  1. It allows parsing non-contiguous buffers, which is very useful, because most IO-related operations that don't do copying will probably end up with buffers that are not contiguous (file and network access tends to read data in chunks).
  2. ReadOnlySequence is a standard API used by System.IO.Pipelines (and they seem to be the way to go for high-performance IO in .NET Core - see a very well written blog here: https://blogs.msdn.microsoft.com/dotnet/2018/07/09/system-io-pipelines-high-performance-io-in-net/)
  3. gRPC internally uses non-contiguous buffers as well
  4. ReadOnlySequence itself is a struct (= no allocations) and it has allocation-free constructors from more basic types like Memory or arrays - so by supporting parsing from ReadOnlySequence, parsing from an array or a contiguous memory buffer becomes trivial. For multi-segment buffers, the ReadOnlySequenceSegment instances are classes, but the class is abstract, so custom implementations that use pooling are possible IMHO.
  5. Supporting parsing from ReadOnlySequence gives us support for parsing from Memory for free (see 4).
  6. ReadOnlySequence is in the NuGet package System.Buffers, which supports a large range of target frameworks.

Overall, I think experimenting with adding a CodedInputStream constructor that consumes ReadOnlySequence would definitely be worth it.
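A minimal sketch of how such a multi-segment ReadOnlySequence&lt;byte&gt; can be constructed; the Segment class below is a hypothetical helper, and the code assumes .NET Core 2.1+ or the System.Memory/System.Buffers packages.

```C#
using System;
using System.Buffers;

// Linked-list segment; ReadOnlySequenceSegment<T> exposes protected
// setters for Memory, Next and RunningIndex to derived classes.
class Segment : ReadOnlySequenceSegment<byte>
{
    public Segment(ReadOnlyMemory<byte> memory) => Memory = memory;

    public Segment Append(ReadOnlyMemory<byte> memory)
    {
        var next = new Segment(memory) { RunningIndex = RunningIndex + Memory.Length };
        Next = next;
        return next;
    }
}

class Program
{
    static void Main()
    {
        // Two non-contiguous chunks, e.g. two gRPC slices.
        var first = new Segment(new byte[] { 1, 2 });
        var last = first.Append(new byte[] { 3, 4, 5 });
        var seq = new ReadOnlySequence<byte>(first, 0, last, last.Memory.Length);

        Console.WriteLine(seq.Length);          // prints 5
        Console.WriteLine(seq.IsSingleSegment); // prints False
    }
}
```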

Btw, since Span<> and Memory<> now exist (which allow accessing both managed and unmanaged memory), I think providing support for parsing from byte* is not relevant anymore (and we should be discussing Span<> and Memory<> support instead).

@jtattermusch Awesome, thanks for having a look. I have now some spare time (I think a week or two) and I would love to spend it on contributing to protobuf/gRPC.

I have done a lot of experimentation with ReadOnlySequence<T> and Span<T> 10 weeks ago and even created some prototypes - please see: https://github.com/google/protobuf/issues/3166#issuecomment-386865048

@mkosieradzki very interesting prototypes - thanks for sharing!

I had a few questions:

  • You're mentioning that "Asynchronous parser using non-contiguous memory from Pipe is roughly 3x slower than the original parser. It is also zero alloc." 3x slower seems prohibitively high. Do we understand why that is - is it because of the async or because of the non-contiguous input? It seems to me that non-contiguous buffers as input should have virtually no extra overhead (assuming the buffer segments have reasonable size - e.g. something in the ~8KB ballpark)
  • Could the CodedInputStream here take a ReadOnlySequence directly instead of a PipeReader? ReadOnlySequence is from System.Buffers, while PipeReader is one abstraction level higher and requires System.IO.Pipelines (which doesn't have as good platform support as System.Buffers)

@jtattermusch
The code where I did most of the experimentation (in the other thread) uses PipeReader (actually very well described in the blog post you mentioned) instead of Stream.

I think the idea of providing a CodedInputStream constructor with a fixed ReadOnlySequence<T> is one of the solutions; however, it has some limitations.

  1. We don't need to use Stream. My understanding is that if we use ReadOnlySequence<T> then we do not use streaming at all - so it's a pretty neat idea and makes it easy to retain compatibility.
  2. The only limitation is that it cannot be used efficiently with streams, which might be the case with large message parsing. I was thinking about the 1-4MB range of messages, which might be considered a bit too large to buffer, or some customers using 4MB-100MB gRPC messages because they haven't moved to streaming yet.

But honestly, from my perspective, if we assume that for gRPC all messages can be prefetched into a non-contiguous buffer, it's a perfectly good solution (assuming the 4MB limit). It also helps to avoid an uncontrolled working set. This decision is up to you. I am OK with both solutions: prefetched and async streamed (without gRPC streaming).

@jtattermusch

Is it because of the async or because of the non-contiguous input? It seems to me that non-contiguous buffers as input should have virtually no extra overhead (assuming the buffer segments have reasonable size - e.g. something in the ~8KB ballpark)

Conceptually you are right. But the problem is that Span<T> can live only on the stack, and as a consequence you cannot keep it alive inside CodedInputStream unless you make CodedInputStream a stack-only type as well, breaking a lot of compatibility. So the pain is that you can optimize ReadOnlySequence<T> -> ReadOnlyMemory<T>, but you cannot optimize ReadOnlyMemory<T> -> ReadOnlySpan<T>, which must be done on every call into CodedInputStream… and that was something I was trying to overcome with the ultra-fast void MergeFrom(ReadOnlySpan<T>), so the span was always on the stack.

Unfortunately I don't have figures for this and I apologize for that… I am pretty sure I had a prototype but it wasn't very fast. But also not as slow as the async version.

could the CodedInputStream here take a ReadOnlySequence directly instead of a PipeReader? ReadOnlySequence is from System.Buffers, while PipeReader is one level abstraction higher and requires System.IO.Pipelines (which doesn't have such good platform support as System.Buffers)

I see your point here, but I really can't promise that the ReadOnlySequence<T> variant is feasible without adding the generated void MergeFrom(ReadOnlySpan<T>), and I honestly regret that I have not saved the numbers to prove this.

@mkosieradzki
I see your point and the tradeoffs, but right now I'm driven mostly by what could be made useful in the near future (I believe incremental improvements are the only way forward for big problems like this).

  • for gRPC, when you receive a message you basically get a segmented buffer (in the native code it's grpc_byte_buffer, which can be read segment by segment by grpc_byte_buffer_reader) that maps relatively well to the concept of C#'s ReadOnlySequence. By the time the message is received (when the C# layer gets notified about receiving a message), the number of segments is already known.
  • Looks like taking a dependency on System.IO.Pipelines is more problematic than System.Buffers

If you want to take a look at my (very much in progress) code: https://github.com/grpc/grpc/compare/master...jtattermusch:performance_slicing (the point is basically to retrieve the buffer segments for a received message without any copying and make them available to the C# layer - right now the grpc_byte_buffer gets extracted to what's called ReadOnlySliceBuffer but it's basically the same idea as ReadOnlySequence). Another thing I'm trying to accomplish is to avoid allocating so the ReadOnlySliceBuffer is getting reused (I can replace my custom ReadOnlySliceBuffer by ReadOnlySequence if I'll be able to avoid allocations).

@mkosieradzki

That's a very good point - thanks for bringing that up.
So is the ReadOnlyMemory -> ReadOnlySpan conversion slow? I assume it's pretty fast, but I understand that with the way CodedInputStream is now written, we would be performing the conversion practically for each byte of input.

Do you still have the MergeFrom(ReadOnlySpan) prototype?
The benchmarks are basically here:
https://github.com/mkosieradzki/protobuf/blob/spans/csharp/src/TestProtoPiper/Benchmarks/ParseAddressBook.cs

Here is the hand-crafted MergeFrom(ReadOnlySpan<T>) prototype:
https://github.com/mkosieradzki/protobuf/blob/spans/csharp/src/TestProtoPiper/Addressbook.cs#L336

That suggests it would be worth comparing MergeFrom(ReadOnlySpan<T>) with MergeFrom(ReadOnlyMemory<T>) - I think I will give it a shot. That result should be somewhat conclusive :).

@jtattermusch I have created two new benchmarks to capture the overhead of ReadOnlyMemory<T> -> ReadOnlySpan<T> and as I expected the overhead is significant (because it involves memory pinning etc.)…

                                          Method |     Mean |     Error |    StdDev | Rank |
------------------------------------------------ |---------:|----------:|----------:|-----:|
                      ParseUsingCodedInputReader | 37.56 us | 0.1789 us | 0.1674 us |    5 |
                               ParseUsingClassic | 18.13 us | 0.0575 us | 0.0509 us |    3 |
                      ParseUsingCodedInputParser | 72.76 us | 0.4153 us | 0.3884 us |    6 |
                  ParseUsingCodedInputSpanParser | 15.40 us | 0.0761 us | 0.0675 us |    2 |
               ParseUsingCodedInputSpanPosParser | 15.11 us | 0.0513 us | 0.0480 us |    1 |
 ParseUsingCodedInputSpanPosParserReadOnlyMemory | 18.81 us | 0.0824 us | 0.0688 us |    4 |

ParseUsingCodedInputSpanPosParser passes in ReadOnlySpan<T> with a ref int pos instead of ref ReadOnlySpan<T>… and is the winner today ;) I will push my current benchmarks to the repo so you can reproduce it on your side.

The comparison you want to look at is ParseUsingCodedInputSpanPosParserReadOnlyMemory vs ParseUsingCodedInputSpanPosParser - the only difference is conversion between ReadOnlyMemory<T> and ReadOnlySpan<T> on each call - what simulates what you wanted to do initially.

@jtattermusch

I have pushed my latest benchmarks to the repo. Please give it a try. I hope I have proven my point that void MergeFrom(ref ReadOnlySpan<T>) or void MergeFrom(in ReadOnlySpan<T>, ref int pos) MUST be generated to ensure the optimal performance.

Next point to learn is whether we want to add async Pipes support OR stick to synchronous parsing.
I am perfectly OK with synchronous parsing.

If that's the case, we need to figure out the size threshold at which copying non-contiguous memory into contiguous (and pinned) memory for the duration of parsing (stack memory for small buffers, a buffer pool for huge ones) is efficient, and depending on the result decide whether it makes sense to create a void MergeFrom(in ReadOnlySequence<T>, ref int pos) or not, i.e. whether the threshold is above or significantly below 4MB.

Alternatively we can start with your idea of ReadOnlySequence in CodedInputStream and add the following optimizations:

  1. If message is in contiguous memory use void MergeFrom(in ReadOnlySpan<T>, ref int pos).
  2. If message is in non-contiguous memory but is below memory size limit - stackalloc a buffer and use solution from point 1
  3. Otherwise use classic parsing (CodedInputStream); if there is a submessage small enough, it will fall into case 1 or 2.
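The three cases above could be dispatched roughly like this (a sketch with a placeholder threshold; the class and method names are illustrative, not real protobuf APIs):

```C#
using System;
using System.Buffers;

// Sketch of the dispatch from points 1-3 above; the threshold and the
// names ParseSpan/ParseStreaming are placeholders, not measured values
// or real protobuf APIs.
class ParserDispatch
{
    public const int StackallocThreshold = 256;
    public string LastPath; // recorded only to make the example observable

    public void Parse(ReadOnlySequence<byte> data)
    {
        if (data.IsSingleSegment)
        {
            // Case 1: contiguous memory - use the span-based fast path directly.
            ParseSpan(data.First.Span);
            LastPath = "span";
        }
        else if (data.Length <= StackallocThreshold)
        {
            // Case 2: small non-contiguous buffer - copy into stack memory first.
            Span<byte> tmp = stackalloc byte[(int)data.Length];
            data.CopyTo(tmp);
            ParseSpan(tmp);
            LastPath = "stackalloc";
        }
        else
        {
            // Case 3: large non-contiguous buffer - classic streaming parsing.
            ParseStreaming(data);
            LastPath = "stream";
        }
    }

    void ParseSpan(ReadOnlySpan<byte> span) { /* span-based parser would run here */ }
    void ParseStreaming(ReadOnlySequence<byte> data) { /* CodedInputStream path */ }
}
```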

Does it make sense to you?

Honestly, I don't think it makes any sense to start parsing before the entire message has arrived, because it makes the RPC server more prone to a slow-sender class of DoS attacks while giving no real performance benefit except slightly decreased latency - but I really trust your opinion here. What do you think?

@jtattermusch I have started work on branch https://github.com/mkosieradzki/protobuf/tree/read-only-sequence - this is basically an implementation of CodedInputStream using ReadOnlySequence<byte> instead of byte[] as the buffer. Let's see what comes out of this.

@jtattermusch I have restarted my work on branch https://github.com/mkosieradzki/protobuf/tree/netcore2_1
Currently I have added the least intrusive version that adds support for .NET Core 2.1 (to make room for special optimizations), creates a place for passing ReadOnlySequence<byte> and ReadOnlyMemory<byte>, but does not break compatibility.

Now I am going to make sure I do not introduce any performance regressions by switching to the ReadOnlyMemory<T> abstraction instead of array of T.

If I can't ensure this (especially on the pre-.NET Core 2.1 platforms), I will need to add a fast path using compatBuffer, which I am already using on pre-.NET Core 2.1 for Stream.Read.

I have run a bunch of further tests and I am... a bit disappointed, but I have some important conclusions:

  1. I have implemented a yet another alternative ValueTask-based variant of ParseUsingCodedInputSpanPosParser just by wrapping results into ValueTask<T> - it's slow beyond all recognition. So async as the only variant is NOT an option.
  2. After experimenting and making a direct comparison between ReadOnlyMemory<T> and T[] in the context of CodedInputStream, I think it's unfeasible to implement a ReadOnlyMemory<T>-based variant without significant performance regressions. VarInt and tag parsing is just too slow… despite multiple optimization attempts, including pinning the memory once and creating the span from the already-pinned pointer.

We are talking about performance regressions orders of magnitude larger than just copying entire non-contiguous memory to a single array.

Maybe I am missing something.

Just to sum up. IMO the array-based buffer should stay in the CodedInputParser.

I think it would be worth productizing ParseUsingCodedInputSpanPosParser for its simplicity, efficiency and flexibility (you can use it with any contiguous piece of memory). I would also recommend adding codegen for void MergeFrom(in ReadOnlySpan<T>, ref int pos) - because without this codegen ParseUsingCodedInputSpanPosParser is useless.

Also, this codegen should be used conditionally by the standard MergeFrom(CodedInputStream) if the entire message is already either in a contiguous buffer or available from a ReadOnlySequence<byte> and is reasonably small.

@jtattermusch Please let me know whether I should create a PR (I will need to productize my PoC) as described, or if you need to do more investigation or you simply disagree.

I was playing a bit with aggressive inlining… and I went from:

                            Method |     Mean |     Error |    StdDev | Rank |
---------------------------------- |---------:|----------:|----------:|-----:|
                 ParseUsingClassic | 18.39 us | 0.0996 us | 0.0932 us |    2 |
 ParseUsingCodedInputSpanPosParser | 15.19 us | 0.0998 us | 0.0934 us |    1 |

down to:

                                  Method |     Mean |     Error |    StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
                       ParseUsingClassic | 16.75 us | 0.1096 us | 0.1025 us |    4 |
                 ParseUsingClassicInline | 15.97 us | 0.0528 us | 0.0494 us |    3 |
       ParseUsingCodedInputSpanPosParser | 12.47 us | 0.0864 us | 0.0808 us |    2 |
 ParseUsingCodedInputSpanPosParserInline | 12.31 us | 0.0304 us | 0.0254 us |    1 |

only by adding [MethodImpl(MethodImplOptions.AggressiveInlining)] on some methods and improving constructors (to avoid using Func<T> factories)...
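As an illustration of the kind of change being measured (a hypothetical hot-path method, not the actual Google.Protobuf internals), a varint reader can be marked for inlining like this:

```C#
using System;
using System.Runtime.CompilerServices;

// Hypothetical sketch of the change discussed above: marking the
// hot-path varint reader with AggressiveInlining so the JIT can inline
// it into the generated parsing code. Names are illustrative only.
static class VarintReader
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static uint ReadRawVarint32(ReadOnlySpan<byte> buffer, ref int pos)
    {
        uint result = 0;
        int shift = 0;
        while (true)
        {
            byte b = buffer[pos++];
            result |= (uint)(b & 0x7F) << shift; // accumulate 7 payload bits
            if ((b & 0x80) == 0)                 // high bit clear = last byte
            {
                return result;
            }
            shift += 7;
        }
    }
}
```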

I think that this optimization alone deserves a PR. In general I see a huge potential in improved codegen. I mean: what's the point of generating (de-)serialization code if it still uses indirect calls (interfaces, callbacks, etc. on the hot path) and does not exercise inlining? IMO it's a good idea to generate predictable code and help the compiler/jitter as much as you can.

So this is my high level optimization plan proposal:

  1. Improve the codegen and utilize more inlining (this might mean getting rid of the entire FieldCodec abstraction).
  2. Consider adding support for ReadOnlySequence<byte> as an alternative to Stream in CodedInputStream without introducing any significant changes (just refill array-based buffer from the sequence instead of Stream if sequence is available) - but to be honest I don't see any huge advantages over implementing a ReadOnlySequence<T>-based Stream… I really tried.
  3. Add codegen for ReadOnlySpan<byte> and use this path conditionally if the entire message fits the buffer.
  4. Avoid allocating temporary buffers and use ArrayPool/MemoryPool where applicable.
  5. Eventually in .NET Core 2.2/3.0 consider using the upcoming Utf8String to avoid UTF8 decoding which is super slow and accounts for around 5us in my example.

All of this together should result in a drop on my benchmark from 18us to around 7.5us in parsing time.

@mkosieradzki

  1. I'd leave async aside for now as it makes things more complicated than they need to be
  2. I'd be generally open to adding logic that allows parsing from Spans, but IMHO we do want to prevent having two almost identical generated methods (MergeFrom(CodedInputStream) and MergeFrom(ReadOnlySpan)), so I'd still want to think about some trick to express both variants using a single generated method.

As for the other optimizations you are suggesting (AggressiveInlining, using ArrayPool/MemoryPool and Utf8String) - I think they are orthogonal changes and we should handle them separately (e.g. the aggressive inlining can be done both for the existing CodedInputStream and for any of its future variants, regardless of what exactly they are).

@mkosieradzki I spent some time looking at your code and also experimenting myself. Some results:

  • I finally understood why for the ReadOnlySpan-based methods you are using ref ReadOnlySpan<T> - you need to be able to update the position in the current buffer. That's definitely one of the possible approaches, but my impression was that CodedInputStream actually maintains several extra fields of state that have important security consequences (e.g. tracking the max recursion depth, etc.) and it might be difficult to pass all of these along as function parameters (and if we need to add a few more context variables in the future, we might be in trouble).
  • The ReadOnlySpan-based approach can only achieve parsing from a single ReadOnlySpan (if I have a non-contiguous buffer in which message boundaries don't match the buffer segment boundaries, I can't really use that approach).
  • The abstract API that is the most powerful is something like someMessage.MergeFrom(ParsingContext context, ReadOnlySpan<byte> immediateBuffer), where ParsingContext is not necessarily a class called like that, but it is something that maintains the parsing state as immediateBuffer is processed (it tracks the position, recursion depth and other context variables). It should also have the ability to provide the next buffer segment in the form of a ReadOnlySpan (which would be used as a new immediateBuffer once the current immediateBuffer becomes empty).
  • As you can notice, the ParsingContext concept from above basically corresponds to the existing class CodedInputStream (where the immediate buffer would be used instead of the existing private field byte[] buffer). So the new more general parsing API could even look like someMessage.MergeFrom(CodedInputStream cis, ReadOnlySpan<byte> immediateBuffer) and the older backward-compatible API would be pretty simple to implement in terms of the newer (hopefully faster) parsing API.
  • Another option is to define the ParsingContext (open to better naming) explicitly as an actual class (or a struct which would reference CodedInputStream internally)
  • In any case, there would only be one implementation of methods doing the actual parsing and only one generated implementation of MergeFrom and the others variants/overloads would be implemented via this single implementation.
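A self-contained sketch of that shape (all names hypothetical, and the wire format reduced to a single little-endian int32 so the example stays short): a heap-allocated context carries the security state, the span is passed alongside it, and the backward-compatible overload forwards to the single real implementation.

```C#
using System;

// Hypothetical names throughout - this only illustrates the proposed split
// between heap-allocated parsing state and a stack-only immediate buffer.
class ParsingContext
{
    public int RecursionDepth;
    public const int RecursionLimit = 100;
}

class FakeMessage
{
    public int Value;

    // Backward-compatible entry point, implemented via the span-based one.
    public void MergeFrom(byte[] data)
    {
        var context = new ParsingContext();
        MergeFrom(context, data);
    }

    // The single "real" implementation that all overloads funnel into.
    public void MergeFrom(ParsingContext context, ReadOnlySpan<byte> immediateBuffer)
    {
        if (++context.RecursionDepth > ParsingContext.RecursionLimit)
            throw new InvalidOperationException("Recursion limit exceeded.");

        // Toy wire format: a single little-endian int32 payload.
        Value = immediateBuffer[0]
              | immediateBuffer[1] << 8
              | immediateBuffer[2] << 16
              | immediateBuffer[3] << 24;

        context.RecursionDepth--;
    }
}
```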

What do you think?
Btw, thanks again for spending time prototyping various approaches, it is useful to have an actual code to look at.

Because the changes with the approach I'm suggesting would be relatively significant I'd suggest creating a new experimental nuget package with this new implementation of parsing first (and see how that goes), rather than trying to merge these changes to upstream.

@jtattermusch Dropping the async requirement enables us to create a ref struct to be used as ParsingContext.

So, in that case we can declare ref struct CodedInputParsingContext which is a stack-only type and is allowed to contain ReadOnlySpan<T> etc.

This means we can probably create an implicit conversion between CodedInputStream and CodedInputParsingContext and use a single code path for CodedInputParsingContext.

Do you like the idea of stack-only parsing context? I think I can try to prototype it.

Regarding the ref ReadOnlySpan<T>: the version with in ReadOnlySpan<T> and ref int pos is slightly faster - it is called ParseUsingCodedInputSpanPosParser .

@mkosieradzki I've thought about the ref struct and I'm not sure it's a good idea. The good thing is the ReadOnlySpan can be part of this struct, but you'll need a reference to some class with context anyway (as a non-immutable struct seems like a bad idea in general). I'm also unsure about the backward compatibility of ref struct (which is a C# 7.2 concept) - what would the code look like to someone using an older runtime (e.g. net45)?
In a way, the difference between using ref struct CodedInputParsingContext { SomeContextClass context, ReadOnlySpan buffer} and using two separate parsing parameters SomeContextClass context, ReadOnlySpan buffer in the API seems only like syntactic sugar - both implementations would probably look almost exactly the same, so I don't quite see the added value of using ref struct and potentially risking backward compatibility issues with older runtimes.

@jtattermusch I agree.

@mkosieradzki to prevent doing unnecessary changes, I'd say a prototype that parses from 2 arguments:
CodedInputStream context, ReadOnlySpan<byte> immediateBuffer would be useful - that should be a sufficient proof of concept that minimizes the amount of changes needed. (Basically every ReadXXX method on CodedInputStream would get an overload that accepts those two args).

@jtattermusch I was rather thinking about ref ReadOnlySpan<T> immediateBuffer so that while iterating over a ReadOnlySequence we can move the pointer to the next buffer.

When iterating over a Stream or an array-based buffer it's enough to use an immutable ReadOnlySpan<T> reference, but our goal is to handle ReadOnlySequence<T> efficiently.
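A minimal sketch of this idea, with assumed names rather than the real API: the parser works one span at a time, and when the current span is exhausted the `ref` parameters let the helper swap in the next segment of the ReadOnlySequence for the caller.

```C#
using System;
using System.Buffers;

// Assumed shape, not the actual protobuf API: refilling the immediate
// buffer from the next ReadOnlySequence segment via a ref parameter.
static class SegmentedReader
{
    public static byte ReadByteAcrossSegments(
        ref ReadOnlySequence<byte>.Enumerator segments,
        ref ReadOnlySpan<byte> immediateBuffer,
        ref int pos)
    {
        // Skip empty segments and refill the span when it runs out.
        while (pos == immediateBuffer.Length)
        {
            if (!segments.MoveNext())
                throw new InvalidOperationException("Unexpected end of data.");
            immediateBuffer = segments.Current.Span;
            pos = 0;
        }
        return immediateBuffer[pos++];
    }
}
```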

BTW. My preliminary results

                                  Method |     Mean |     Error |    StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
                       ParseUsingClassic | 16.97 us | 0.1073 us | 0.1004 us |    5 |
                 ParseUsingClassicInline | 15.97 us | 0.1102 us | 0.0977 us |    4 |
                           ParseUsingNew | 14.97 us | 0.0939 us | 0.0878 us |    3 |
       ParseUsingCodedInputSpanPosParser | 13.06 us | 0.0493 us | 0.0461 us |    2 |
 ParseUsingCodedInputSpanPosParserInline | 12.49 us | 0.0803 us | 0.0751 us |    1 |

ParseUsingNew is a heavily inlined version of the classic version (classic already benefits from some inlining, so it is not a true baseline), optimized with an additional ref ReadOnlySpan<T> parameter (as you suggested). I need to track down where the 2 us are lost.

The current experimental code is available: https://github.com/mkosieradzki/protobuf/tree/spans/csharp/src/TestProtoPiper/Benchmarks - it looks pretty bad, but it's only a prototype.

Found 0.7us ;) and accidentally boosted classic parser

                                  Method |     Mean |     Error |    StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
                       ParseUsingClassic | 16.19 us | 0.0571 us | 0.0534 us |    5 |
                 ParseUsingClassicInline | 14.86 us | 0.0341 us | 0.0319 us |    4 |
                           ParseUsingNew | 14.24 us | 0.0361 us | 0.0338 us |    3 |
       ParseUsingCodedInputSpanPosParser | 12.84 us | 0.0471 us | 0.0441 us |    2 |
 ParseUsingCodedInputSpanPosParserInline | 12.23 us | 0.0505 us | 0.0473 us |    1 |

It was hidden in an inefficient IsAtEnd implementation which didn't take the fast path if Stream was null. Still ~1.4us to find.

Found another 0.2us in toxic inlining of slow-path, ~1.2 to go :)

                                  Method |     Mean |     Error |    StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
                       ParseUsingClassic | 16.31 us | 0.0957 us | 0.0896 us |    5 |
                 ParseUsingClassicInline | 14.73 us | 0.0480 us | 0.0449 us |    4 |
                           ParseUsingNew | 14.05 us | 0.0543 us | 0.0481 us |    3 |
       ParseUsingCodedInputSpanPosParser | 12.78 us | 0.0624 us | 0.0583 us |    2 |
 ParseUsingCodedInputSpanPosParserInline | 12.24 us | 0.0527 us | 0.0493 us |    1 |

Picked up the low-hanging fruit of the indirect interface call. I think that the rest of the overhead resides inside PushLimit / PopLimit / RecomputeBufferSizeAfterLimit - I think it can be optimized if the current message fits the current buffer - but let's leave this for later.

                                  Method |     Mean |     Error |    StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
                       ParseUsingClassic | 16.27 us | 0.1190 us | 0.1055 us |    5 |
                 ParseUsingClassicInline | 14.85 us | 0.0533 us | 0.0499 us |    4 |
                           ParseUsingNew | 13.79 us | 0.0602 us | 0.0563 us |    3 |
       ParseUsingCodedInputSpanPosParser | 12.67 us | 0.0683 us | 0.0605 us |    2 |
 ParseUsingCodedInputSpanPosParserInline | 12.55 us | 0.0490 us | 0.0459 us |    1 |

@jtattermusch

Long story short - I think that the concept of adding ref ReadOnlySpan<T> immediateBuffer is feasible (without generating dedicated code just for Span<T>) and worth the effort. Let's create a separate branch and create an experimental implementation.

@jtattermusch I have created a branch https://github.com/mkosieradzki/protobuf/tree/spans-pr with added spans support. (also there is branch when I do most of active development https://github.com/mkosieradzki/protobuf/tree/spans-pr-workspace)

It does not have all of the aforementioned optimizations (but a significant amount of them), however it is feature complete, passes all tests, and has Code Access Security attributes.

I have split my changes into 4 commits: SDK-change, protoc-change, regenerated code, updated tests

This branch should be a nice starting point for adding ReadOnlySequence<byte> support, as no code in CodedInputStream depends on the byte[] buffer. Everything utilizes ReadOnlySpan<byte> immediateBuffer except the stream reading. I anticipate you can add ReadOnlySequence<byte> in a similar way to Stream, as a source for immediate buffers (I can do this if you like).

As a small bonus I have added ReadOnlySpan<byte> support to ByteString, which gives secure access to the ByteString. I would also like to suggest changing string to ByteString (or at least adding such an option), as this is the best UTF-8 representation available today. One can use the exposed span together with System.Buffers.Text.Utf8Parser.
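For example, Utf8Parser can read a value straight out of a UTF-8 span (such as one exposed by ByteString) without first decoding it to a UTF-16 string:

```C#
using System;
using System.Buffers.Text;

// System.Buffers.Text.Utf8Parser usage as mentioned above: parsing a
// number directly from UTF-8 bytes with no intermediate string allocation.
ReadOnlySpan<byte> utf8 = new byte[] { (byte)'1', (byte)'2', (byte)'3' };
if (Utf8Parser.TryParse(utf8, out int value, out int bytesConsumed))
{
    // value is now 123 and bytesConsumed is 3
    Console.WriteLine(value);
}
```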

How should we distribute/name/etc. this package before it lands in the upstream repo?

Progress report for https://github.com/mkosieradzki/protobuf/tree/spans-pr-workspace

  • [x] Basic support for ReadOnlySpan<T> when deserializing
  • [x] Basic support for Span<T> when serializing
  • [x] Improve codegen to avoid using the inefficient FieldCodec abstractions:

    • [x] Deserialization

    • [x] Size computation

    • [x] Serialization

    • [x] Other?

  • [x] Add inlining hints when applicable
  • [x] Hide all public internal-use APIs from end-users
  • [x] Add Code Access Security attributes where applicable
  • [x] Add/Update XML Comments to all non-hidden APIs
  • [x] Change XML Comments for hidden APIs as internal-use only
  • [x] Get rid of the entire FieldCodec abstraction
  • [x] Add nice API extensions to retain code-level compatibility (no binary compatibility is planned) with existing SDK
  • [x] Add support for non-contiguous Memory<T>-based buffers

    • [x] Serialization (IBufferWriter<byte>)

    • [x] Deserialization (ReadOnlySequence<byte>)

Future work:

  • [ ] Try to avoid UTF8 -> UTF16 conversion when possible - we should wait for Utf8String
  • [ ] Integrate with Pipelines (should be extremely easy - PipeReader returns ReadOnlySequence<byte> and PipeWriter implements IBufferWriter<byte>) - I am not even sure whether we should implement this, or whether it would be a trivial element of the gRPC server I am planning to implement based on .NET Core 2.2 HTTP/2 support (it's based on Pipelines) when .NET Core 2.2 Preview 1 becomes public.

@jtattermusch

  • [ ] Figure out what to do with this PR/package
  • [ ] Integrate with gRPC
  • [ ] Prepare independent benchmarks
  • [ ] Google paperwork (formal proposal, etc)

@jtattermusch code in branch https://github.com/mkosieradzki/protobuf/tree/spans-pr-workspace is now entirely Span-compatible; it is also hard-generated instead of using the inefficient FieldCodec abstraction. I expect a huge performance boost. (btw. the best effects will be observable when used from netcoreapp2.1)

@jtattermusch I'm assigning this issue to you since you have been following up with the proposal.

@jtattermusch It's ready. I don't plan any significant work on this repo now. I have also merged spans-pr-workspace to span-pr.

This was a great discussion to read and I'm happy to share that we went through a very similar exercise when we were writing the HTTP/1.1 parser for Kestrel. The conclusion was that for optimal performance when using ReadOnlySequence, you need to parse a Span at a time, and only use the higher level multi buffer operations when crossing buffer boundaries.

Simple example here -https://github.com/aspnet/KestrelHttpServer/blob/612fcca7291e5fad753fccbd8b15279c21e2e288/src/Kestrel.Core/Internal/Http/HttpParser.cs#L41-L62

More complex example here - https://github.com/aspnet/KestrelHttpServer/blob/612fcca7291e5fad753fccbd8b15279c21e2e288/src/Kestrel.Core/Internal/Http/HttpParser.cs#L198-L209

We're looking at adding a ref struct BufferReader/BufferWriter for the next version to make things a bit easier to use and more efficient in the common case. We ended up doing this in Kestrel and work is happening now to make it part of the BCL (https://github.com/dotnet/corefxlab/blob/08a1874a0ca4a1a889b2045801c309a1e8575458/src/System.Buffers.ReaderWriter/System/Buffers/Reader/BufferReader.cs).

The other interesting thing in this discussion was about async parsing vs sync parsing. I believe that for optimal performance, the best thing to do is to write low-level synchronous parsers and build asynchronous parsing on top of that. It looks like you guys have come to a similar conclusion here. In my mind there are 2 patterns:

  1. The parser is stateless and works on complete messages
  2. The parser is stateful and resumable

    • The state can be passed into the parser
    • The state can be kept by the parser

The first makes the parser very simple but consumers need to worry about the max message size to avoid a DoS.

The second parser is much harder to write and is where async helps (it basically automatically stores the state on the heap and resumes on your behalf). Without async, this state needs to be managed manually by the parser itself.

Using pipelines, though, the unconsumed buffer is managed on your behalf, so the parser just needs to know where it left off and needs to be able to resume based on the parser context.

Here's how I think each of these could look:

### Stateless Parsing

```C#
async Task StatelessParsing()
{
    var input = connection.Input;

    while (true)
    {
        var result = await input.ReadAsync();
        var buffer = result.Buffer;

        try
        {
            if (result.IsCanceled)
            {
                break;
            }

            if (!buffer.IsEmpty)
            {
                while (Protocol.TryParseMessage(ref buffer, out var message))
                {
                    await _dispatcher.DispatchMessageAsync(connection, message);
                }
            }

            if (result.IsCompleted)
            {
                if (!buffer.IsEmpty)
                {
                    throw new InvalidDataException("Connection terminated while reading a message.");
                }
                break;
            }
        }
        finally
        {
            // The buffer was sliced up to where it was consumed, so we can just advance to the start.
            // We mark examined as buffer.End so that if we didn't receive a full frame, we'll wait for more data
            // before yielding the read again.
            input.AdvanceTo(buffer.Start, buffer.End);
        }
    }
}
```

### Stateful Parsing

```C#
async Task StatefulParsing()
{
    var input = connection.Input;
    ParserContext parserContext;

    while (true)
    {
        var result = await input.ReadAsync();
        var buffer = result.Buffer;

        try
        {
            if (result.IsCanceled)
            {
                break;
            }

            if (!buffer.IsEmpty)
            {
                while (Protocol.TryParseMessage(ref buffer, ref parserContext, out var message))
                {
                    await _dispatcher.DispatchMessageAsync(connection, message);
                }
            }

            if (result.IsCompleted)
            {
                if (!buffer.IsEmpty)
                {
                    throw new InvalidDataException("Connection terminated while reading a message.");
                }
                break;
            }
        }
        finally
        {
            // The buffer was sliced up to where it was consumed, so we can just advance to the start.
            // We mark examined as buffer.End so that if we didn't receive a full frame, we'll wait for more data
            // before yielding the read again.
            input.AdvanceTo(buffer.Start, buffer.End);
        }
    }
}
```

Of course this is just the outer code, the more complicated code exists in the parser itself as each of the methods needs to flow the context in order to resume parsing.

Glad to see all of the energy and progress here!

@davidfowl Thank you for all your valuable points. I will take a closer look on the suggestions, but I really think we want to sacrifice async parsing for simplicity and performance.

I was considering stateful parsing as an option before, but I think it would make the codegen much more complicated - and especially after removing the FieldCodec abstraction responsible for virtcalls, the current codegen is incredibly fast. There are also security considerations with gRPC - you don't want to start parsing a message before it has fully arrived.

I have created a new branch: https://github.com/mkosieradzki/protobuf/tree/spans-pr-rebased

It is a rebased core change to make code review and further development easier.
It does not contain generated code, to make rebasing onto the latest version much easier - learned the hard way when creating the async PR.

Also I code on Windows (VS2017) and previously had problems with line endings in the generated code - so the code should be generated on Linux (or Bash on Ubuntu on Windows :) ).

On Sat, Jul 21, 2018 at 9:37 PM David Fowler notifications@github.com
wrote:

This was a great discussion to read and I'm happy to share that we went
through a very similar exercise when we were writing the HTTP/1.1 parser
for Kestrel. The conclusion was that for optimal performance when using
ReadOnlySequence, you need to parse a Span at a time, and only use the
higher level multi buffer operations when crossing buffer boundaries.

Simple example here -
https://github.com/aspnet/KestrelHttpServer/blob/612fcca7291e5fad753fccbd8b15279c21e2e288/src/Kestrel.Core/Internal/Http/HttpParser.cs#L41-L62

More complex example here -
https://github.com/aspnet/KestrelHttpServer/blob/612fcca7291e5fad753fccbd8b15279c21e2e288/src/Kestrel.Core/Internal/Http/HttpParser.cs#L198-L209

We're looking at adding a ref struct BufferReader/BufferWriter for the
next version to make things a a bit easier to use and more efficient in the
common case. We ended up doing this in Kestrel and work is happening now to
make it part of the BCL (
https://github.com/dotnet/corefxlab/blob/08a1874a0ca4a1a889b2045801c309a1e8575458/src/System.Buffers.ReaderWriter/System/Buffers/Reader/BufferReader.cs
).

The other interesting thing in this discussion was about async parsing vs
sync parsing. I believe that for optimal performance, the best thing to do
is to write low level synchronous parsers and build asynchronous parsing on
top of that. It looks like you guys have come to a similar conclusion here.
In my mind there are 2 patterns:

  1. The parser is stateless and works on complete messages
  2. The parser is stateful and resumable

    • The state can be passed into the parser
    • The state can be kept by the parser

Thanks for the good points. To add my 50 cents about stateless/stateful
parsing: currently, for Google.Protobuf, the CodedInputStream is the storage
for parsing state - it keeps track of where we are in the buffer, tracks
recursion depth, max message size and other things important for security.
So in a way, our current parser is already "stateful
and resumable" + "state is kept by the parser". Changing that would require a
significant revamp, so unless that turns out to be a blocker, we should
go this route.

The first makes the parser very simple but forces consumers to worry about
the max message size to avoid a DoS.

The second parser is much harder to write and is where async helps (it
basically automatically stores the state on the heap and resumes on your
behalf). Without async, that state needs to be managed manually by the parser itself.

Using pipelines, though, the unconsumed buffer is managed on your
behalf, so the parser just needs to know where it left off and needs to be
able to resume based on the parser context.

Here's how I think each of these could look:
Stateless Parsing

async Task StatelessParsing()
{
    var input = connection.Input;

    while (true)
    {
        var result = await input.ReadAsync();
        var buffer = result.Buffer;

        try
        {
            if (result.IsCanceled)
            {
                break;
            }

            if (!buffer.IsEmpty)
            {
                while (Protocol.TryParseMessage(ref buffer, out var message))
                {
                    await _dispatcher.DispatchMessageAsync(connection, message);
                }
            }

            if (result.IsCompleted)
            {
                if (!buffer.IsEmpty)
                {
                    throw new InvalidDataException("Connection terminated while reading a message.");
                }
                break;
            }
        }
        finally
        {
            // The buffer was sliced up to where it was consumed, so we can just advance to the start.
            // We mark examined as buffer.End so that if we didn't receive a full frame, we'll wait for more data
            // before yielding the read again.
            input.AdvanceTo(buffer.Start, buffer.End);
        }
    }
}

Stateful Parsing

async Task StatefulParsing()
{
    var input = connection.Input;
    ParserContext parserContext = default;

    while (true)
    {
        var result = await input.ReadAsync();
        var buffer = result.Buffer;

        try
        {
            if (result.IsCanceled)
            {
                break;
            }

            if (!buffer.IsEmpty)
            {
                while (Protocol.TryParseMessage(ref buffer, ref parserContext, out var message))
                {
                    await _dispatcher.DispatchMessageAsync(connection, message);
                }
            }

            if (result.IsCompleted)
            {
                if (!buffer.IsEmpty)
                {
                    throw new InvalidDataException("Connection terminated while reading a message.");
                }
                break;
            }
        }
        finally
        {
            // The buffer was sliced up to where it was consumed, so we can just advance to the start.
            // We mark examined as buffer.End so that if we didn't receive a full frame, we'll wait for more data
            // before yielding the read again.
            input.AdvanceTo(buffer.Start, buffer.End);
        }
    }
}

Of course this is just the outer code, the more complicated code exists in
the parser itself as each of the methods needs to flow the context in order
to resume parsing.

Glad to see all of the energy and progress here!


@mkosieradzki I started reviewing your code, but initially hadn't noticed the spans-pr-rebased branch, which made the review more complicated. Here are some initial comments:

- I noticed there's a change from netstandard1.0 -> netstandard1.1; is that necessary? Making changes like this is theoretically possible but can slow down the review process quite a bit.

- (this point might not be applicable for the spans-pr-rebased branch) It seemed that the generated .cs files were out of date when starting the branch - we should make sure the C# protos are regenerated (let's create a PR for that) before starting, otherwise the diffs for generated files are confusing (it's hard to tell the changed parsing logic apart from the fields newly added to the .proto files). Ideally, regenerating the protos would happen in a single commit so that the rest of the code can be reviewed commit-by-commit without needing to scroll through hundreds of lines of diffs in generated files.

- Providing the public ReadOnlySpan<byte> Span => bytes is a good idea and we should do it once the dependency on System.Memory is added. But it seems to be pretty independent from the rest of the parsing logic and should come as a separate PR. Same for ByteString CopyFrom(ReadOnlySpan<byte> bytes).
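
For illustration, the two members discussed above could look roughly like this. This is my own standalone sketch using a stand-in `ByteStringSketch` type (the real ByteString wraps a byte[] similarly but has many more members and a different constructor surface):

```csharp
using System;

public sealed class ByteStringSketch
{
    private readonly byte[] bytes;

    private ByteStringSketch(byte[] bytes) => this.bytes = bytes;

    // Zero-copy, read-only view over the underlying buffer.
    public ReadOnlySpan<byte> Span => bytes;

    // ByteString is immutable, so CopyFrom must copy the span's contents.
    public static ByteStringSketch CopyFrom(ReadOnlySpan<byte> source) =>
        new ByteStringSketch(source.ToArray());
}
```

The value of CopyFrom(ReadOnlySpan<byte>) is that callers holding stackalloc'ed or natively backed memory can create a ByteString without first materializing an intermediate byte[] themselves.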

- aggressive inlining (and other optimizations) should go in as a separate PR (one PR per optimization technique).

- what's the difference between the Wrapped and no-suffix variants when parsing? Explaining this concept would simplify the code review a lot. This seems to be the most interesting piece when it comes to parsing from Spans, so we should make sure we are on the same page here.

- I've noticed some refactoring in CodedOutputStream.ComputeSize.cs (and possibly elsewhere), but I think combining refactoring with changes in logic makes the change very hard to review and greatly increases the risk we will miss something important.

- I've noticed CodedOutputStream.cs changes -> are they necessary? IMHO we should address the serialization and deserialization pieces separately.

I'm currently trying to review spans-pr-rebased

@mkosieradzki you've come up with a great number of good changes, but I'm now thinking about how to make the review process efficient and how we can integrate this. Here's what I suggest:

  • I created a branch in my fork: https://github.com/jtattermusch/protobuf/tree/span_support_experimental (currently in sync with protobuf master). I think we could review the code piece-by-piece by creating smaller PRs against that branch. That way we have control over what's being accepted and a good way to handle review comments before things get merged. The very first PR can add a dependency on System.Memory and then we can take it from there (a lot of the optimizations you're suggesting are logically independent). I think the most important item is to support non-contiguous parsing in its simplest possible form.

  • I also created a myget.org feed where I can upload an experimental version of the Google.Protobuf nuget at any time without disturbing the official version - anyone who wants to experiment with that package can just add the feed to their dev environment and use the experimental packages instead of the official vanilla one. (I think this is better than creating a nuget.org package under a fake name).

What do you think?

- I noticed there's a change from netstandard1.0 -> netstandard1.1; is that necessary? Making changes like this is theoretically possible but can slow down the review process quite a bit.

System.Memory requires netstandard1.1 - so it simplified the PR a lot. If you take a look at the .NET Standard compatibility table, the only difference is Windows Phone Silverlight. So I think this is a reasonable compromise between the amount of #if directives and compatibility.

- (this point might not be applicable for the spans-pr-rebased branch) It seemed that the generated .cs files were out of date when starting the branch - we should make sure the C# protos are regenerated (let's create a PR for that) before starting, otherwise the diffs for generated files are confusing (it's hard to tell the changed parsing logic apart from the fields newly added to the .proto files). Ideally, regenerating the protos would happen in a single commit so that the rest of the code can be reviewed commit-by-commit without needing to scroll through hundreds of lines of diffs in generated files.

Yup, this is exactly why I have created spans-pr-rebased.

- Providing the public ReadOnlySpan<byte> Span => bytes is a good idea and we should do it once the dependency on System.Memory is added. But it seems to be pretty independent from the rest of the parsing logic and should come as a separate PR. Same for ByteString CopyFrom(ReadOnlySpan<byte> bytes).

Fair enough. Though I'm not sure whether the second one isn't already used somewhere in the parsing process...

- aggressive inlining (and other optimizations) should go in as a separate PR (one PR per optimization technique).

Makes sense.

- what's the difference between the Wrapped and no-suffix variants when parsing? Explaining this concept would simplify the code review a lot. This seems to be the most interesting piece when it comes to parsing from Spans, so we should make sure we are on the same page here.

Wrapped is dedicated to handling the wrapper messages like Int32Value. Previously, wrapped messages were handled by codecs with a lot of abstraction. The wrapped message structure is:

  • length
  • predefined tag (don't remember the number)
  • the actual value (Int32/Single/Double/etc..)
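
As a worked example of that layout (my own, not from the PR): the public wire format declares the wrapper's inner field as `value = 1`, so an Int32Value holding 150, embedded as a length-delimited field, looks like this byte-for-byte:

```csharp
byte[] wrapped =
{
    0x03,       // length of the nested Int32Value payload (3 bytes)
    0x08,       // tag inside the wrapper: field number 1, wire type 0 (varint) => (1 << 3) | 0
    0x96, 0x01  // varint encoding of the value 150: low 7 bits with continuation bit, then 150 >> 7
};
```

Because the length, tag, and value all sit contiguously, a span-based parser can read a wrapper with a couple of varint reads instead of going through the codec abstraction.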

- I've noticed some refactoring in CodedOutputStream.ComputeSize.cs (and possibly elsewhere), but I think combining refactoring with changes in logic makes the change very hard to review and greatly increases the risk we will miss something important.

I hope I didn't introduce any logic changes there! Please let me know where. I have only added support for wrapped messages.

- I've noticed CodedOutputStream.cs changes -> are they necessary? IMHO we should address the serialization and deserialization pieces separately.

I'm ok with both approaches.

@jtattermusch
Please give the spans-rebased-pr branch a try. I really tried to make this specific branch very clean and split it into separate meaningful commits. I think that because we are dropping compatibility with the 3.x branch by deleting the FieldCodec abstraction, the entire PR should be treated as a major change.

If you want to play with what I have built: I have created yet another branch, spans-pr-rebased-build (with a custom build process), and connected it to AppVeyor. You can access the package from the AppVeyor nuget feed.

My apologies, but creating this PR as is has consumed most of my free time (it was a considerable amount of work, mostly experimentation), so I can't promise I will be able to split the PR into smaller pieces very soon.

https://ci.appveyor.com/project/mkosieradzki/protobuf/build/1.0.30 - this is the best build I currently have. There is one artifact with protoc for Windows and one artifact with the SDK nuget package. It is also available as 4.0.0-pre-30 through this feed: https://ci.appveyor.com/nuget/protobuf-spans

@mkosieradzki I plan to take a closer look at spans-rebased-pr tomorrow.

I followed the trail of related issues all the way to this PR.

I'm running a locally-built version of Protobuf where I have implemented some of these same optimizations. This issue and the related PRs look dead. @mkosieradzki has done so much work on this that I hate to put in effort and submit PRs of my own, but I don't see this issue moving otherwise.

What's the best path forward to get these optimizations in Protobuf?

@prat0088 we still plan to move ahead with the Span-based parsing, but it's unclear which exact approach we're going to take (design discussions are ongoing). IMHO the PR that's closest to the ideal design is https://github.com/protocolbuffers/protobuf/pull/5888.
@mkosieradzki definitely deserves recognition for prototyping the approach; his code is really useful as a reference (regardless of which exact approach eventually ends up landing).
