Runtime: API Proposal: Utf8Parser overloads for ReadOnlySequence<byte>

Created on 31 Oct 2019 · 52 Comments · Source: dotnet/runtime

These text parsers are very convenient when parsing text-based protocols (like HTTP/1.1 and RESP). It would be great to have overloads of the existing APIs that support ReadOnlySequence<byte>. I end up writing code like this:

```C#
private static bool TryParse(in ReadOnlySequence<byte> bytes, out int value, out int bytesConsumed)
{
    if (bytes.IsSingleSegment)
    {
        if (!Utf8Parser.TryParse(bytes.FirstSpan, out value, out bytesConsumed))
        {
            return false;
        }
    }
    else
    {
        // Assumes the sequence fits in 128 bytes; a robust version would check first.
        Span<byte> textSpan = stackalloc byte[128];
        bytes.CopyTo(textSpan);

        if (!Utf8Parser.TryParse(textSpan.Slice(0, (int)bytes.Length), out value, out bytesConsumed))
        {
            return false;
        }
    }

    return true;
}
```

It would be great if this was built into the existing parser.

```C#
public static partial class Utf8Parser
{
    public static bool TryParse(in ReadOnlySequence<byte> source, out bool value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out byte value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out DateTime value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out DateTimeOffset value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out decimal value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out double value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out Guid value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out short value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out int value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out long value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out sbyte value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out float value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out TimeSpan value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out ushort value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out uint value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out ulong value, char standardFormat = '\0');
}
```

cc @ahsonkhan @GrabYourPitchforks

api-needs-work area-System.Memory

Most helpful comment

@davidfowl

Instead of

bool TryParseInt(IByteEnumerator enumerator, out int value)

I think we'd effectively need

bool TryParseInt<T>(T enumerator, out int value)
  where T: ref struct, IByteEnumerator

Note: changed ByteEnumerator to IByteEnumerator to make it more clear this is meant to be an interface (at least in my sample)

The method can't be non-generic because that would imply that IByteEnumerator is a boxed value. Even if we supported ref interface instead of allowing ref struct to implement normal interfaces my expectation is a non-generic instance would always require some form of boxing and hence be illegal.

Is that aligned with everyone else's expectations here?

All 52 comments

Should we consider adding these APIs on SequenceReader instead (or potentially in addition to)?

Something like these extension methods:
https://github.com/dotnet/corefxlab/blob/94248fb2ea9881936d6338b0d2d62ca8c79cbd79/src/System.Buffers.ReaderWriter/System/Buffers/Reader/BufferReader_text.cs#L239-L275

The bytesConsumed parameter doesn't make too much sense for a SequencePosition. How about consumedPosition (or sequenceConsumed, consumedUntil, etc.)?

cc @JeremyKuhne

I originally had that, but the problem is that text-based protocols are usually delimited. For example, say you wanted to parse an HTTP/1.1 response (e.g. HTTP/1.1 200 OK\r\n). You want to write code that looks like this:

```C#
// HTTP/1.1
if (!sequenceReader.TryReadTo(out ReadOnlySpan<byte> version, (byte)' '))
{
    return;
}

// 200
if (!sequenceReader.TryReadTo(out ReadOnlySpan<byte> statusCodeText, (byte)' '))
{
    return;
}
else if (!Utf8Parser.TryParse(statusCodeText, out int statusCode, out _))
{
    return;
}

// OK
if (!sequenceReader.TryReadTo(out ReadOnlySequence<byte> statusText, NewLine))
{
    return;
}
```

If we add these parser APIs, how do you avoid parsing the status code into an integer until you have the full payload? 

```C#
// HTTP/1.1
if (!sequenceReader.TryReadTo(out ReadOnlySpan<byte> version, (byte)' '))
{
    return;
}

// 200
// This code is incorrect because it'll parse 2, 20, and 200 successfully. So we need to scan for a space before we start parsing to make sure we have the full payload.
if (!sequenceReader.TryParse(out int statusCode))
{
    return;
}

// OK
if (!sequenceReader.TryReadTo(out ReadOnlySequence<byte> statusText, NewLine))
{
    return;
}
```

(Standard grumbling that this belongs on an AsciiParser class, not a Utf8Parser class, since presumably you want these APIs to fail if they encounter non-ASCII data.)
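A quick way to see the hazard described above: the span-based TryParse happily succeeds on every prefix of a number, so a streaming caller that parses before finding the delimiter can return a truncated value. A minimal demonstration using only the existing Utf8Parser (no proposed APIs):

```csharp
using System;
using System.Buffers.Text;
using System.Text;

class PartialParseDemo
{
    static void Main()
    {
        // Each prefix of "200" parses "successfully", which is why a streaming
        // parser must find the delimiter before calling TryParse.
        foreach (var text in new[] { "2", "20", "200" })
        {
            byte[] utf8 = Encoding.UTF8.GetBytes(text);
            bool ok = Utf8Parser.TryParse(utf8, out int value, out int consumed);
            Console.WriteLine($"{text} -> ok={ok}, value={value}, consumed={consumed}");
        }
    }
}
```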

The APIs seem reasonable. We'd have to be a little careful with the implementation because each method would need to know the worst-case input length based on the out T and the char standardFormat parameters. It's likely that any introduced APIs will have bugs in edge case handling, especially for types like double where the input buffer could be arbitrarily long. That doesn't mean we shouldn't do them, but it does potentially mean we should consider carefully which overloads we want to ship with.

I'd love to find a more general way to tackle this. Adding these convenience overloads for everything parsing a Span will get tedious, and it would be tacit encouragement to parse ReadOnlySequence directly.

I don't know if I'd want it in corefx but an extension on ReadOnlySequence could make this simple:

```c#

// Return FirstSpan if single-segment, otherwise copy to storage and return storage.
ReadOnlySpan<byte> ReadOnlySequence<byte>.MaybeCopyTo(Span<byte> storage);

Utf8Parser.TryParse(byteSequence.MaybeCopyTo(stackalloc byte[128]), out int statusCode, out int bytesConsumed);
```
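A sketch of what that MaybeCopyTo extension could look like, assuming the caller sizes storage for the worst case. The name and shape are hypothetical, taken from the comment above, not an existing BCL API:

```csharp
using System;
using System.Buffers;

static class SequenceExtensions
{
    // Hypothetical helper: zero-copy when the sequence is a single segment;
    // otherwise flatten into caller-provided storage.
    public static ReadOnlySpan<byte> MaybeCopyTo(this in ReadOnlySequence<byte> sequence, Span<byte> storage)
    {
        if (sequence.IsSingleSegment)
            return sequence.FirstSpan;

        // Caller must size storage for the worst case; CopyTo throws otherwise.
        sequence.CopyTo(storage);
        return storage.Slice(0, (int)sequence.Length);
    }
}
```

Usage would then be `Utf8Parser.TryParse(byteSequence.MaybeCopyTo(stackalloc byte[128]), out int statusCode, out int bytesConsumed)`.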

I'd love to find a more general way to tackle this. Adding these convenience overloads for everything parsing a Span will get tedious, and it would be tacit encouragement to parse ReadOnlySequence directly.

Case by case seems fine, but if we're looking at language features then I'd rather not block on that to add these very simple overloads that I keep re-writing 😄.

The APIs seem reasonable. We'd have to be a little careful with the implementation because each method would need to know the worst-case input length based on the out T and the char standardFormat parameters. It's likely that any introduced APIs will have bugs in edge case handling, especially for types like double where the input buffer could be arbitrarily long. That doesn't mean we shouldn't do them, but it does potentially mean we should consider carefully which overloads we want to ship with.

We're just delegating to the Span overloads for most of these (I would assume all check for overflow). Isn't double the only special case? Doesn't it already have to be handled by the span overload?

Isn't double the only special case? Doesn't it already have to be handled by the span overload?

Not sure if it's the only special case. And the span overloads can assume that they have all the data, which means that edge case handling is simpler. As an example of something that might be error-prone: what happens if the input is a 100GB stream consisting of all zeroes and you pass such a stream to the double parser? We can't delegate to the span parser because it can't handle payloads that large.

Again, not an argument that we shouldn't do it, but more of an admission that the more we try to support the higher the odds of this being a bug farm. So we'd want to weigh carefully the tradeoffs of having an API that works directly with sequences vs. telling customers to find the delimiters themselves, slice, and use the span-based APIs.

Not sure if it's the only special case. And the span overloads can assume that they have all the data, which means that edge case handling is simpler.

Why can't these overloads assume the same thing?

As an example of something that might be error-prone: what happens if the input is a 100GB stream consisting of all zeroes and you pass such a stream to the double parser? We can't delegate to the span parser because it can't handle payloads that large.

What happens if you do that for the span overload?

What happens if you do that for the span overload?

If you pass int.MaxValue worth of zeroes to the span overload? It should handle them just fine and return 0.0.

Constrain the input size to int.MaxValue for ReadOnlySequence
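The "constrain the input" idea could look roughly like this for int: reject sequences longer than any fully-formed token could be, then flatten and delegate to the existing span overload. The limit and the helper name are illustrative only, not proposed API:

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;

static class Utf8SequenceParser
{
    // Illustrative limit: a fully-formed textual Int32 is at most a sign plus
    // ten digits, so anything longer cannot be a valid int token anyway.
    private const int MaxIntTextLength = 32;

    public static bool TryParse(in ReadOnlySequence<byte> source, out int value, out int bytesConsumed)
    {
        if (source.IsSingleSegment)
            return Utf8Parser.TryParse(source.FirstSpan, out value, out bytesConsumed);

        if (source.Length > MaxIntTextLength)
        {
            value = 0;
            bytesConsumed = 0;
            return false;
        }

        // Small, bounded copy, then delegate to the existing span parser.
        Span<byte> buffer = stackalloc byte[MaxIntTextLength];
        source.CopyTo(buffer);
        return Utf8Parser.TryParse(buffer.Slice(0, (int)source.Length), out value, out bytesConsumed);
    }
}
```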

For example, say you wanted to parse an HTTP/1.1 response (e.g. HTTP/1.1 200 OK\r\n). You want to write code that looks like this:

Is this example how you imagine writing production quality high-performance HTTP response parser, e.g. are you going to use these APIs in Kestrel if we add them?

That would be ideal if the performance drop wasn't terrible. We started to write an HTTP client for performance testing here https://github.com/sebastienros/httpsocket/blob/38f3ee6d4a357fa60fec4a8b6cf802ce7e921924/src/HttpSocket.cs#L146-L175 and it uses the above code.

So what is this code fragment going to look like with this API?

I would use the ReadOnlySequence overload where I get the data as a ReadOnlySequence instead of a ReadOnlySpan.

This isn鈥檛 the main scenario though, it鈥檚 when you have parsed some sequence of date then you need to further parse and slice to get the right section, then you want to turn it into some primitive

And would you still use SequenceReader? I think we have way too many of these parsing helper APIs and it is getting very difficult to figure out what is the right combination to use for any given situation.

And would you still use SequenceReader? I think we have way too many of these parsing helper APIs and it is getting very difficult to figure out what is the right combination to use for any given situation.

I don't think it's complicated (having written a bunch of these parsers now). The main problem is that nothing takes a ReadOnlySequence so once you've parsed and narrowed your search, you're left with the complicated logic to materialize the bytes into something else. It's very much like trying to use Span<T> on .NET Standard 2.0.

e.g. https://github.com/aspnet/AspNetCore/blob/271ebe019823e8e6a751429cd350c2db76ebab05/src/Http/WebUtilities/src/FormPipeReader.cs#L316

(we just added these thanks @GrabYourPitchforks!)

e.g. https://github.com/sebastienros/httpsocket/blob/38f3ee6d4a357fa60fec4a8b6cf802ce7e921924/src/HttpSocket.cs#L254-L260

This is the perfect example, parsing a chunked prefix in HTTP/1.1. The payload looks like this:

a\r\n
Helloworld\r\n
0\r\n
\r\n

I want to parse the hex number (a for example) after grabbing all text up to the newline.
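Once the chunk-size token has been sliced out up to the CRLF, the existing span-based parser already handles the hex part via the 'x' standard format, so the parse itself is one call:

```csharp
using System;
using System.Buffers.Text;
using System.Text;

class ChunkPrefixDemo
{
    static void Main()
    {
        // "a\r\n" — slice up to the CRLF first, then parse the hex chunk size.
        byte[] token = Encoding.UTF8.GetBytes("a");
        if (Utf8Parser.TryParse(token, out int chunkSize, out int consumed, standardFormat: 'x'))
        {
            Console.WriteLine(chunkSize); // 10
        }
    }
}
```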

So the code pattern is that the caller slices the input based on some delimiter (such as CRLF), then passes the sliced sequence to these parsing APIs? That is somewhat different than the behavior of the span-based APIs, which consume input until a delimiter is encountered.

So the code pattern is that the caller slices the input based on some delimiter (such as CRLF), then passes the sliced sequence to these parsing APIs?

In these particular examples, which handle streaming data, it has to be done this way. Parsing partial integers would succeed, so we need to get the entire "frame" first.

That is somewhat different than the behavior of the span-based APIs, which consume input until a delimiter is encountered.

Yes, I guess that's why the bytesConsumed parameter exists. Generally, though, that approach applies here as well when using these APIs the same way. I now see your original concern about overflow and edge cases, but that can be mitigated by having a default limit on the input (even though it's a ReadOnlySequence).

So what we're really talking about is a behavioral equivalent to int.TryParse, not Utf8Parser.TryParse(ROS<byte>, out int)?

@GrabYourPitchforks I think that would be reasonable, yes. That way we can safely delegate to the Span overloads. I'd prefer if we kept these implementations super cheap to implement. So far we've scoped it to:

  • Fully formed primitives (no partial data allowed)
  • Limiting the max input size (to something reasonable to prevent abuse or weird edge cases)

Do we also need to add these overloads for the ReadOnlySpan<byte> variant (maybe that could be a separate request)? I've updated the issue to reflect this.

Just to set baseline expectations, these are not going to be cheap to implement if you're looking for behavioral parity with existing TryParse methods. Any length restriction we impose would be artificial (and honestly could be done by the caller without us writing a helper method), so the likely result is that we'll end up duplicating the existing parsing logic into these methods.

Just to set baseline expectations, these are not going to be cheap to implement if you're looking for behavioral parity with existing TryParse methods. Any length restriction we impose would be artificial (and honestly could be done by the caller without us writing a helper method), so the likely result is that we'll end up duplicating the existing parsing logic into these methods.

OK, well that's unfortunate, but I don't really have any pushback. The only unfortunate thing here basically leads back to what @scalablecory mentioned: there's no way to have a byte enumerator abstraction that would be more efficient with span and less efficient with ReadOnlySequence/SequenceReader. That might be worth looking into if we end up needing to duplicate all of these methods.

@AndyAyersMS @jaredpar @jkotas Here's a question for you both:

```C#
bool TryParseInt(ByteEnumerator enumerator, out int value)
{
    int result = 0;
    while (enumerator.TryGetNext(out byte b))
    {
        var digit = b - '0';
        result = result * 10 + digit;
    }
    value = result;
    return true;
}
```

We need a language, JIT, and compiler feature that lets me write code like the above but can then turn it into the equivalent of:

```C#
bool TryParseInt(Span<byte> span, out int value)
{
    int result = 0;
    for (var i = 0; i < span.Length; i++)
    {
        var digit = span[i] - '0';
        result = result * 10 + digit;
    }
    value = result;
    return true;
}
```

if the ByteEnumerator is a Span<byte>.

Agree - having something like the ref interface proposal would definitely help here. Then we could potentially reuse the same logic for Span and ReadOnlySequence and rely on generic specialization to make things more efficient.
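For comparison, generic specialization already delivers this for ordinary structs today: with a struct constraint, the JIT compiles a dedicated copy of TryParseInt<T> per enumerator type and devirtualizes TryGetNext, with no boxing. The gap is that a Span<byte>-backed enumerator can't participate, since a plain struct cannot hold a Span field — which is exactly what a ref struct constraint would unlock. A sketch under those assumptions (all names hypothetical):

```csharp
using System;

interface IByteEnumerator
{
    bool TryGetNext(out byte b);
}

// A plain struct can wrap ReadOnlyMemory<byte>, but not Span<byte> —
// that restriction is the motivation for the ref struct constraint.
struct MemoryByteEnumerator : IByteEnumerator
{
    private readonly ReadOnlyMemory<byte> _memory;
    private int _index;
    public MemoryByteEnumerator(ReadOnlyMemory<byte> memory) { _memory = memory; _index = 0; }

    public bool TryGetNext(out byte b)
    {
        if (_index < _memory.Length) { b = _memory.Span[_index++]; return true; }
        b = 0;
        return false;
    }
}

static class Parser
{
    // The struct constraint makes the JIT specialize this method per T,
    // so the TryGetNext call inlines instead of going through the interface.
    public static bool TryParseInt<T>(T enumerator, out int value) where T : struct, IByteEnumerator
    {
        value = 0;
        bool any = false;
        while (enumerator.TryGetNext(out byte b))
        {
            int digit = b - '0';
            if ((uint)digit > 9) return false;
            value = value * 10 + digit;
            any = true;
        }
        return any;
    }
}
```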

@davidfowl @GrabYourPitchforks C++ enables you to do this with templates and iterators and I've employed it very successfully in the past for high-perf parsing. I'd love it we could get better parity there.

When I tried to do this myself in C# I hit two roadblock features I needed:

  • enable ref struct to implement interfaces.
  • add a ref struct generic constraint that enforces "you must use this generic as if it were a ref struct". Then you can do e.g.

```c#
ref struct SpanEnumerator : IByteEnumerator
{
// ...
}

struct SequenceEnumerator : IByteEnumerator
{
// ...
}

bool TryParseHelper<ByteEnumerator>(ByteEnumerator enumerator, out int value)
    where ByteEnumerator : IByteEnumerator, ref struct
{
    // ...
}

bool TryParse(ReadOnlySequence<byte> sequence, out int value)
{
    return TryParseHelper(CreateSpanEnumerator(sequence.FirstSpan), out value)
        || TryParseHelper(CreateSequenceEnumerator(sequence), out value);
}
```

I briefly touched on this in https://github.com/dotnet/csharplang/issues/1479#issuecomment-463488171

C++ also makes it really easy via traits/ADL/SFINAE/etc. to automatically select e.g. the most efficient IndexOf for an iterator type -- I'd love that for this as well, but that is significantly more complex of a feature to ask for :)

I agree ref interfaces or ref constraints would be useful.

I have doubts whether the specific helper methods discussed here are useful enough to be in the core framework. I believe that there are two mainstream categories of parsers to worry about:

  • High-performance, production-grade parsers always work against buffers for a number of different reasons: look-ahead, context to give good error messages, etc. Helper methods like these are not useful for high-performance, production-grade parsers.
  • Lower-performance, handrolled parsers that just get the job done. The building block for these is typically a reader object (like TextReader) that is optimized for convenience, discoverability, and ease of use.

The methods proposed here do not seem to fit either category.

@jkotas I agree that ReadOnlySequence<T> isn't performant for parsing. Its job was to be a representation of a single- or multi-segment buffer that would be as efficient as possible for a single buffer, but pay the cost when you needed to span buffers. While that came with some tradeoffs, we have this type in the BCL now and I'd like more ways to interact with it natively in the BCL without having to write helpers to do the most basic parsing.

Helper method like these are not useful for high-performance production-grade parsers.

That's subjective. These helpers could be very fast in the case of parsing a single segment ROS and slower when we need to cross buffers. Once you've decided to represent a buffer with this type you're already in that world so I don't think this argument holds true.

No matter how you feel about that, users have to write the above code today and I want to remove that need in the next version.

Any length restriction we impose would be artificial (and honestly could be done by the caller without us writing a helper method)

@GrabYourPitchforks prior art https://github.com/dotnet/corefxlab/blob/94248fb2ea9881936d6338b0d2d62ca8c79cbd79/src/System.Buffers.ReaderWriter/System/Buffers/Reader/BufferReader_text.cs#L15

We'll see if anybody complains 😄

EDIT: Wait, that was corefxlab 😄

That's subjective. These helpers could be very fast

It is not just about performance. It is also about usability. What would help to convince me if you show me a delta how these methods will be used in Utf8JsonReader, Kestrel (or some other production grade parser). The motivating example was from a test code and it was not very convincing.

No matter how you feel about that, users have to write the above code today and I want to remove that need in the next version.

We should focus on designing the scenario: parsing data coming from a ReadOnlySequence. We have a SequenceReader for that today. I find it confusing that we are not improving SequenceReader and are instead trying to add an alternative way to parse ReadOnlySequence that is not SequenceReader. How are the users of the platform supposed to pick the right way to parse the ReadOnlySequence?

Compare this with the simple story for TextReader that does not have ambiguity like this.

It is not just about performance. It is also about usability.

The general pattern for using SequenceReader is to use its methods to narrow down a Span or Sequence of content to further parse. We have many examples of production parsers already described in this thread.

Kestrel's header parser https://github.com/aspnet/AspNetCore/blob/dba39d60c89b743d45411ba6bc6b02b5a8019032/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#L180 (ignore the pointers, those will go away soon 😄)

The Form Parser https://github.com/aspnet/AspNetCore/blob/dba39d60c89b743d45411ba6bc6b02b5a8019032/src/Http/WebUtilities/src/FormPipeReader.cs#L219

The motivating example was from a test code and it was not very convincing.

I don't really understand why it wasn't, pretend that code was in HttpClient, that's what we're currently building. It happens to be a client used for performance testing but I'm not sure why that makes a difference.

We should focus on designing the scenario: parsing data coming from a ReadOnlySequence. We have a SequenceReader for that today. I find it confusing that we are not improving SequenceReader and are instead trying to add an alternative way to parse ReadOnlySequence that is not SequenceReader. How are the users of the platform supposed to pick the right way to parse the ReadOnlySequence?

I thought I clarified that but maybe I was unclear. First the specific example:

  • When parsing an HTTP/1.1 response start line, the tokens are split by spaces then a CRLF. The second token is an integer.

Can you tell me how that parsing would look if we added TryParseInt to SequenceReader?

```C#
// Find the space
if (sequenceReader.TryReadTo(out ReadOnlySequence<byte> statusCodeText, (byte)' '))
{
    // Parse the int
    var subReader = new SequenceReader<byte>(statusCodeText);
    if (subReader.TryParseInt(out statusCode))
    {
    }
}
```

That's syntactically noisy, expensive, and unnecessary.

Compare this with the simple story for TextReader that does not have ambiguity like this.

TextReader can read a line, the entire payload, or blocks of chars. It really doesn't do anything like what we're describing here (other than SequenceReader and TextReader sharing the Reader suffix).

Alternative Consideration:

I want to summarize the patterns I've seen in the 3 years of using these APIs "in anger", writing and reviewing lots of parsers (both for play and production). There are only a handful of patterns that need supporting to cover an extremely LARGE range of scenarios:

SequenceReader IMO solves the problem of walking the ReadOnlySequence<byte> in a sane way; once you have narrowed your way down to a single token, the next step is to materialize it into some other form. If it's unclear what I mean, here are some examples:

IMO I do think the ideal solution is to get to a span as quickly as possible, then parse (if the data isn't large).

The options are:

  1. Parse byte by byte or span by span. Some parsers do this (like JSON) and it can be faster but results in re-implementing various primitive parsers. This is usually still required when parsing a custom primitive format we don't ship OOTB.
  2. Add more overloads for things to take ReadOnlySequence and make those APIs do the fast path/slow path thing. This is the most usable but requires adding a lot more overloads.
  3. Add more overloads to SequenceReader, but that means making a SequenceReader from a ReadOnlySequence when we want to create a primitive type (see my comments above about this being wasteful).
  4. Copy the ReadOnlySequence to a Span, then use that span to parse. Incurs copies potentially on every read which defeats some of the purpose of this approach.
  5. Copy only in the multi-segment case (this has hybrids, like stackalloc if small enough and the array pool otherwise, and will use the heap if it's too big).

Hybrids of the above approaches are also valid and I've seen them in the wild.
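Option 5 above could be sketched like this (the helper name is illustrative): single-segment fast path, stackalloc for small multi-segment tokens, ArrayPool for larger ones:

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;

static class HybridParse
{
    // Option 5: copy only in the multi-segment case, stackalloc if small
    // enough, rent from the pool otherwise.
    public static bool TryParseInt32(in ReadOnlySequence<byte> token, out int value, out int bytesConsumed)
    {
        if (token.IsSingleSegment)
            return Utf8Parser.TryParse(token.FirstSpan, out value, out bytesConsumed);

        int length = checked((int)token.Length);
        byte[] rented = null;
        Span<byte> buffer = length <= 128
            ? stackalloc byte[128]
            : (rented = ArrayPool<byte>.Shared.Rent(length));
        try
        {
            token.CopyTo(buffer);
            return Utf8Parser.TryParse(buffer.Slice(0, length), out value, out bytesConsumed);
        }
        finally
        {
            if (rented != null) ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```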

Maybe it's reasonable for primitives like this to incur the "maybe copy" by using overloads like this https://github.com/dotnet/corefx/issues/42252. If we combine this with a further API to do what @scalablecory suggests (but maybe on the reader as well), it might be the best of all worlds but I want to get your feedback.

Thoughts?

I do think the ideal solution is to get to a span as quickly as possible, then parse

Agree.

Parse byte by byte or span by span. Some parsers do this (like JSON) and it can be faster but results in re-implementing various primitive parsers.

Yes, that is the way to implement high-performance production-grade parsers. I do not think any amount of helper APIs is going to change that.

Copy the ReadOnlySequence to a Span
Copy only in the multi-segment case

Yes, I think APIs along these lines are the most powerful building blocks for lower-performance handrolled parsers that just need to get the job done.

For example, I think it would be useful to have ReadLine API for SequenceReader<byte/char> that returns Span and allocates the buffer to cover the splits as necessary. This Span can be then sliced and diced as necessary using regular Span APIs, including number parsing Span APIs. This is the same pattern one uses with TextReader today.
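One possible shape for that ReadLine idea, built only from the existing SequenceReader surface (the helper itself is hypothetical): zero-copy when the line sits in one segment, a fresh array only when it straddles segments:

```csharp
using System;
using System.Buffers;

static class SequenceReaderLineHelper
{
    private static readonly byte[] NewLine = { (byte)'\r', (byte)'\n' };

    // Hypothetical helper sketching the ReadLine idea above.
    public static bool TryReadLine(ref SequenceReader<byte> reader, out ReadOnlySpan<byte> line)
    {
        if (!reader.TryReadTo(out ReadOnlySequence<byte> lineSequence, NewLine))
        {
            line = default;
            return false;
        }

        if (lineSequence.IsSingleSegment)
        {
            line = lineSequence.FirstSpan; // zero-copy fast path
        }
        else
        {
            line = lineSequence.ToArray(); // allocates only across a split
        }
        return true;
    }
}
```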

I agree ref interfaces or ref constrains would be useful.

In terms of language design ref interfaces / constraint is a fairly straight forward extension of our existing work. The ref constraint in particular was discussed at the time of ref struct and the behavior is mostly understood.

The tricky part is, as per usual, all features start at -100 so need to justify the cost is worth it. :smile:

I have a doc I'm already working on for improvements to ref structs that I hope to start sharing around in late November (just booked solid until then). I will add ref constraints and interfaces to the list.

@davidfowl

Instead of

bool TryParseInt(IByteEnumerator enumerator, out int value)

I think we'd effectively need

bool TryParseInt<T>(T enumerator, out int value)
  where T: ref struct, IByteEnumerator

Note: changed ByteEnumerator to IByteEnumerator to make it more clear this is meant to be an interface (at least in my sample)

The method can't be non-generic because that would imply that IByteEnumerator is a boxed value. Even if we supported ref interface instead of allowing ref struct to implement normal interfaces my expectation is a non-generic instance would always require some form of boxing and hence be illegal.

Is that aligned with everyone else's expectations here?

@jaredpar yes.

@jaredpar yes.

@jkotas

Yes, that is the way to implement high-performance production-grade parsers. I do not think any amount of helper APIs is going to change that.

I would beware of how we use the term "production-grade"; I would swap that with "high performance". But I agree. Arguably, byte-by-byte parsers aren't any more difficult with ReadOnlySequence/SequenceReader. The forward-only nature does make it hard to jump around though.

See https://github.com/dotnet/corefx/blob/279292804fa4f6cb856577e59246ac934a2ef9e5/src/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnection.cs#L1262, which uses Array.IndexOf to scan for tokens.

Yes, I think APIs along these lines are the most powerful building blocks for lower-performance handrolled parsers that just need to get the job done.

I think the reader already exposes enough to do this.

For example, I think it would be useful to have ReadLine API for SequenceReader that returns Span and allocates the buffer to cover the splits as necessary.

What happens if the line is really long? Just allocate a long byte[], or return another ReadOnlySequence<byte/char>? That's the crux of the issue here. When you've found a single token, what form do you get it in and how do you materialize it? Your answers still haven't covered that scenario.

This Span can be then sliced and diced as necessary using regular Span APIs, including number parsing Span APIs. This is the same pattern one uses with TextReader today.

The difference being that SequenceReader doesn't have to allocate a buffer; that might be the difference. It's similar to the question above, which remains unanswered. If tokens are small enough, Span is fine; otherwise you can get another ReadOnlySequence and further slice that up into spans. The problem with the latter approach is that it requires getting another SequenceReader.

As an example:

Key1=Value1\r\n
Key2=Value2\r\n
\r\n

Take this imaginary format, you can write the following code:

```C#
void ParsePair(ref SequenceReader<byte> reader, out string key, out string value)
{
    // This may allocate
    reader.TryReadTo(out ReadOnlySpan<byte> keySpan, (byte)'=');

    // This may allocate
    reader.TryReadTo(out ReadOnlySpan<byte> valueSpan, NewLine);

    key = Encoding.UTF8.GetString(keySpan);
    value = Encoding.UTF8.GetString(valueSpan);
}
```

OR you can get the entire line:

```C#
void ParsePair(ref SequenceReader<byte> reader, out string key, out string value)
{
    // After we add https://github.com/dotnet/corefx/issues/42252
    reader.TryReadTo(out ReadOnlySpan<byte> line, NewLine); // This might allocate

    int eq = line.IndexOf((byte)'=');

    ReadOnlySpan<byte> keySpan = line[0..eq];
    ReadOnlySpan<byte> valueSpan = line[(eq + 1)..]; // TryReadTo already excluded the delimiter

    key = Encoding.UTF8.GetString(keySpan);
    value = Encoding.UTF8.GetString(valueSpan);
}
```

I think either approach is fine but one may be more prone to allocations depending on the maximum size of the token and the segment size of the ReadOnlySequence.

Here's a summary of the 3 approaches I think we could take and the pros and cons of each:

1. ReadOnlySpan all the things

Pros:

  • No need to add overloads, ReadOnlySpan<byte/char> is enough
  • Code is simpler for the caller

Cons:

  • Potentially large allocations depending on:

    1. The maximum token size (imagine a single large token, e.g. 32K bytes long)
    2. The size of segments in the ReadOnlySequence

```C#
void Parse(ref SequenceReader<byte> reader)
{
    // This might allocate a large byte[]
    if (!reader.TryReadTo(out ReadOnlySpan<byte> line, NewLine))
    {
        return;
    }

    if (Utf8Parser.TryParse(line, out int value, out int bytesConsumed))
    {
        line = line.Slice(bytesConsumed);
    }
}
```

2. SequenceReader all the things

Pros:

  • Does not allocate

Cons:

  • Expensive to create another SequenceReader on the line (it is a big struct)
  • Requires adding lots of methods to SequenceReader to parse textual data
  • May not be as efficient as the ReadOnlySpan<byte/char> equivalent

```C#
void Parse(ref SequenceReader<byte> reader)
{
    // This doesn't allocate
    if (!reader.TryReadTo(out ReadOnlySequence<byte> line, NewLine))
    {
        return;
    }

    var lineReader = new SequenceReader<byte>(line);

    if (lineReader.TryParse(out int value))
    {
        // 0 allocations
    }
}
```

3. ReadOnlySequence all the things

Pros:

  • Does not allocate

Cons:

  • Requires adding overloads that take ReadOnlySequence<byte/char> to various parts of the BCL
  • May not be as efficient as ReadOnlySpan<byte/char> equivalent

```C#
void Parse(ref SequenceReader<byte> reader)
{
    // This doesn't allocate
    if (!reader.TryReadTo(out ReadOnlySequence<byte> line, NewLine))
    {
        return;
    }

    if (Utf8Parser.TryParse(line, out int value, out SequencePosition bytesConsumed))
    {
        line = line.Slice(bytesConsumed);
    }
}
```

Requires adding lots of methods to SequenceReader to parse textual data
Requires adding overloads that take ReadOnlySequence to various parts of the BCL

This is the same "lots of methods" in both cases, right?

@jkotas but different, because one is all on a single type vs. being on various types in the BCL.

You need a concrete T for these parsers, which makes it impossible to add them as instance methods to SequenceReader<T> itself. So they would be extension methods, spread over as many types as you like.

You need a concrete T for these parsers, which makes it impossible to add them as instance methods to SequenceReader itself. So they would be extension methods, spread over as many types as you like.

Logically they hang off SequenceReader<T>, which is different than changing lots of other places.

I spent a little time writing something to parse a basic HTTP/1.1 request start line (no error handling, best-case scenario) and got these results:

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-alpha1-015523
  [Host]     : .NET Core 3.1.0 (CoreCLR 4.700.19.52202, CoreFX 4.700.19.52317), X64 RyuJIT
  DefaultJob : .NET Core 3.1.0 (CoreCLR 4.700.19.52202, CoreFX 4.700.19.52317), X64 RyuJIT


|            Method |     Mean |    Error |   StdDev | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------ |---------:|---------:|---------:|------:|--------:|-------:|------:|------:|----------:|
|     SequenceParse | 161.1 ns |  3.22 ns |  3.16 ns |  1.25 |    0.04 | 0.0477 |     - |     - |     200 B |
| StreamReaderParse | 498.2 ns | 26.79 ns | 32.90 ns |  3.89 |    0.25 | 0.8678 |     - |     - |    3632 B |
|         SpanParse | 128.6 ns |  2.65 ns |  2.95 ns |  1.00 |    0.00 | 0.0477 |     - |     - |     200 B |
| SequenceFastParse | 148.1 ns |  3.01 ns |  4.22 ns |  1.16 |    0.04 | 0.0477 |     - |     - |     200 B |
|   ByteByByteParse | 205.9 ns |  3.57 ns |  3.34 ns |  1.60 |    0.05 | 0.0401 |     - |     - |     168 B |

https://gist.github.com/davidfowl/323a5a037700ad49692b97c0afff69f6

The StreamReader is obviously a bit of a troll, but I wanted to show that even though the SequenceReader is slower than the direct Span API, it doesn't allocate, and it's possible to write pretty convenient APIs that are reasonably fast.

The SequenceFastParse also shows that with a little more effort, it can be brought a bit closer to Span performance at the cost of a bit more complex code (slow path, fast path).

@davidfowl The benchmark sample you linked doesn't use the TryParse methods you were proposing. Was this intentional?

Yes (but I can add it if need be); the conversation is now about what "production ready" parsers look like. I stand by what I said earlier in this thread and I just wanted some concrete data to back it up.

I want to close on a strategy based on these options here https://github.com/dotnet/corefx/issues/42255#issuecomment-549705509 and then push forward with additions in that direction.

@davidfowl I think your byte-by-byte parser isn't realistic. It should have an implementation similar to SequenceReader.

@scalablecory Send me the codez and I'll update the benchmark.

@davidfowl Here's more what I had in mind -- faster than your version, but JIT unfortunately defeats much of the benefit... this is basically "free" in C++ but has considerable overhead here. Imagine the two helper methods are a single generic one (they have same code).

https://gist.github.com/scalablecory/4b1320044b911c5795531731d3d86f6a

@GrabYourPitchforks to your point, I will add an example of a basic Utf8Parser.TryParse with the various techniques to help answer @jkotas's questions around performance. I'm coming back to just adding the originally proposed overloads based on the existing data, but let's try some more things.

@jkotas 's questions around performance

The question for me is whether the cost of these new APIs (confusion about what are the right parsing APIs to use, code-bloat associated with having yet another set of parsing APIs for everything, ...) is worth it to avoid some GC heap allocations in less common cases.

I'm not sure why it would be confusing (I'd be happy if we were down to that discussion); it's very simple: if you have a ReadOnlySequence, you would call this overload (which is often the case when using pipelines). This is specifically for parsing unknown lengths of textual data. This is why we added similar overloads for encoding APIs.

For binary data, where you're attempting to parse a big/little-endian number of fixed size, span is much easier.

Hopefully that makes the case and we can move on to agreeing that these are useful and needed.
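For contrast, the fixed-size binary case mentioned above is already covered: SequenceReader ships TryReadBigEndian/TryReadLittleEndian extensions, so no new overloads are needed there. For example:

```csharp
using System;
using System.Buffers;

class BinaryDemo
{
    static void Main()
    {
        // Fixed-size big-endian fields already have first-class support on
        // SequenceReader, including across segment boundaries.
        var sequence = new ReadOnlySequence<byte>(new byte[] { 0x00, 0x00, 0x01, 0x00 });
        var reader = new SequenceReader<byte>(sequence);
        if (reader.TryReadBigEndian(out int length))
        {
            Console.WriteLine(length); // 256
        }
    }
}
```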


@davidfowl have you tried using APIs like this in Kestrel, or is this API proposal hypothetical? The question is whether you can get the perf, because parsers should really be over spans, not sequences. We already said we won't do parsers over Memory<T> because accessing the span is not free. Doing it over sequences would be even worse.
