Aspnetcore: Is it possible to tune request parsing any further?

Created on 3 Apr 2020  路  17Comments  路  Source: dotnet/aspnetcore

I am currently working on improving our results in the JSON TechEmpower benchmark.

We got to the point, where everything was already tuned at least once and even 1% matters.

The majority of time is spent on necessary work like sending & receiving messages, epoll & thread pool scheduling that are hard or impossible to optimize any further.

For the JSON Platform benchmark, we spent 5% of the time on parsing headers.

obraz

If I remove it, I get something around 40-50k RPS gain.

@GrabYourPitchforks is there any chance that you could take a look at the parsing logic and see if there are any possibilities to optimize it any further? I know that you have a LOT of expertise in the low-level tuning of text operations.

I've prepared a copy of the TE logic and encapsulated it into a benchmark that can be run by doing the following:

git clone https://github.com/adamsitnik/aspnetcore.git parsing
cd parsing
git checkout techEmpowerParsing
./build.sh
./.dotnet/dotnet run -c Release -f netcoreapp5.0 --filter TechEmpowerHttpParserBenchmark --project ./src/Servers/Kestrel/perf/Kestrel.Performance/Microsoft.AspNetCore.Server.Kestrel.Performance.csproj --cli ./.dotnet/dotnet 
area-servers

Most helpful comment

I am adding some details about where the time is spent exactly:

obraz

All 17 comments

HTTP/1.x parsing has already been pretty heavily optimized by people like @benaadams.

We don't really do text operations at the HttpParser-level. We treat everything as bytes. Instead we use ReadOnlySpan.IndexOf (or ReadOnlySequence.PositionOf for headers that span multiple blocks), to search for the next \n byte in the input stream and slice. These methods have been vectorized where possible.

There's some validation that happens to verify that the headers don't contain any invalid bytes (also vectorized), and there's some copying that happens for headers that span multiple blocks, but there's no decoding, string allocations or anything like that going on in the platform benchmarks. It just calls the no-op BenchmarkApplication.public.OnHeader(ReadOnlySpan<byte> name, ReadOnlySpan<byte> value) implemenation.

When https://github.com/dotnet/runtime/issues/28230 comes online we'll probably be able to plumb it through some of the network protocol parsers. That might give a small perf boost for "text" processing. But IMO the biggest benefits would come from more efficient data structure management.

Some of the other recent gains are from introducing unsafe code in previously-safe code paths. That continues to be an option, but it really shouldn't be our go-to solution.

No more unsafe code in the HttpParser, that鈥檚 hard no from me. We should be moving to a place with less unsafe code

There's some gains still to get if can change the api. I assume that's ok as its "internal"?

e.g. current OnStartLine handler is

public interface IHttpRequestLineHandler
{
    void OnStartLine(
        HttpMethod method,
        HttpVersion version,
        Span<byte> target,
        Span<byte> path,
        Span<byte> query,
        Span<byte> customMethod,
        bool pathEncoded);
}

Which is very heavy on stack usage as well as parameter ABI stack passing.

Changing to something like the following would mean that only one Span<byte> is passed on stack, whereas the other parameters are passed via register:

public interface IHttpRequestLineHandler
{
    void OnStartLine(
        HttpVersionAndMethod versionAndMethod,
        PathAndQuery pathAndQuery,
        Span<byte> startLine);
}

public struct HttpVersionAndMethod
{
    private long _versionAndMethod;

    public HttpVersion Version => (HttpVersion)(sbyte)(_versionAndMethod >> (32 + 8));
    public HttpMethod Method => (HttpMethod)(byte)(_versionAndMethod >> 32);
    public int MethodEnd => (int)_versionAndMethod;
}

public struct PathAndQuery
{
    private long _pathAndQuery;

    public int PathEnd => (int)_pathAndQuery & 0X7FFF;
    public bool PathEncoded => ((int)_pathAndQuery) < 0 ? true : false;
    public int QueryEnd => (int)(_pathAndQuery >> 32);
}

Then use C#8's range operator to get the sub Spans from the startLine in the handler.

Looks like it might be worth a go
Change in method start to ParseRequestLine

-G_M28088_IG01:
-       push     rbp
-       push     r15
-       push     r14
-       push     r13
-       push     r12
-       push     rdi
-       push     rsi
-       push     rbx
-       sub      rsp, 152
-       vzeroupper 
-       lea      rbp, [rsp+D0H]
-       vxorps   xmm4, xmm4
-       vmovdqa  xmmword ptr [rbp-90H], xmm4
-       vmovdqa  xmmword ptr [rbp-80H], xmm4
-       vmovdqa  xmmword ptr [rbp-70H], xmm4
-       vmovdqa  xmmword ptr [rbp-60H], xmm4
-       vmovdqa  xmmword ptr [rbp-50H], xmm4
-       xor      rax, rax
-       mov      qword ptr [rbp-40H], rax
-       mov      gword ptr [rbp+18H], rdx
-       mov      r14, rcx
-       mov      rdi, r8
-       mov      esi, r9d

+G_M63331_IG01:
+       push     r15
+       push     r14
+       push     rdi
+       push     rsi
+       push     rbp
+       push     rbx
+       sub      rsp, 56
+       xor      rax, rax
+       mov      qword ptr [rsp+28H], rax
+       mov      qword ptr [rsp+30H], rax
+       mov      rbp, rcx
+       mov      rbx, rdx
+       mov      rdi, r8
+       mov      esi, r9d

Change in calling OnStartLine in ParseRequestLine

-G_M28088_IG34:
-       mov      rdi, gword ptr [rbp+18H]
-       lea      r9, bword ptr [rbp-60H]
-       mov      bword ptr [r9], rax
-       mov      dword ptr [r9+8], edx
-       lea      rdx, bword ptr [rbp-70H]
-       mov      bword ptr [rdx], rax
-       mov      dword ptr [rdx+8], r11d
-       lea      rdx, bword ptr [rbp-80H]
-       mov      bword ptr [rdx], r10
-       mov      dword ptr [rdx+8], ecx
-       lea      rcx, bword ptr [rbp-90H]
-       mov      rdx, bword ptr [rbp-50H]
-       mov      bword ptr [rcx], rdx
-       mov      edx, dword ptr [rbp-48H]
-       mov      dword ptr [rcx+8], edx
-       mov      dword ptr [rsp+38H], r8d
-       mov      rcx, rdi
-       mov      edx, r15d
-       mov      r8d, ebx
-       lea      r9, bword ptr [rbp-60H]
-       lea      r11, bword ptr [rbp-70H]
-       mov      bword ptr [rsp+20H], r11
-       lea      r11, bword ptr [rbp-80H]
-       mov      bword ptr [rsp+28H], r11
-       lea      r11, bword ptr [rbp-90H]
-       mov      bword ptr [rsp+30H], r11
-       mov      r11, 0xD1FFAB1E
-       mov      rax, 0xD1FFAB1E
-       cmp      dword ptr [rcx], ecx
-       call     qword ptr [rax]IHttpRequestLineHandler:OnStartLine(ubyte,byte,System.Span`1[Byte],System.Span`1[Byte],System.Span`1[Byte],System.Span`1[Byte],bool):this

+G_M63331_IG34:
+       mov      rdx, rcx
+       lea      rcx, bword ptr [rsp+28H]
+       mov      bword ptr [rcx], rdi
+       mov      dword ptr [rcx+8], esi
+       mov      rcx, rbx
+       lea      r9, bword ptr [rsp+28H]
+       mov      r11, 0xD1FFAB1E
+       mov      rax, 0xD1FFAB1E
+       cmp      dword ptr [rcx], ecx
+       call     qword ptr [rax]IHttpRequestLineHandler:OnStartLine(HttpVersionAndMethod,PathAndQuery,System.Span`1[Byte]):this 

I am adding some details about where the time is spent exactly:

obraz

Think I can drop the Contains call from ParseHeaders

I just looked at the current implementation and my first thought is: why do we iterate over a parser line so many times? And why do we use vectorized methods? Vectorization has an overhead and for small inputs, it very often means even worse performance. Are typical HTTP header names and values long? (sorry I am a web n00b).

First, we look for the line end which means iterating N elements:

https://github.com/dotnet/aspnetcore/blob/d6cb79e64ac42dee8b5343f829ce2c4ccbaa90b5/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#L236

Then we search for the End of the Name:

https://github.com/dotnet/aspnetcore/blob/d6cb79e64ac42dee8b5343f829ce2c4ccbaa90b5/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#L336

Which has 4 branches:

https://github.com/dotnet/aspnetcore/blob/d6cb79e64ac42dee8b5343f829ce2c4ccbaa90b5/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#L312-L316

Then we check the value of the header with another 4 nasty branches:

https://github.com/dotnet/aspnetcore/blob/d6cb79e64ac42dee8b5343f829ce2c4ccbaa90b5/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#L350-L354

Then we search once again for CR:

https://github.com/dotnet/aspnetcore/blob/d6cb79e64ac42dee8b5343f829ce2c4ccbaa90b5/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#L362

Would it be possible to iterate over it just once using a simple for loop and store information about the special character we have found (colon, CR, LF)?

@GrabYourPitchforks is there any more performant way for checking for whitespace rather than using comparison with ByteTab, ByteSpace and ByteCR

Are typical HTTP header names and values long

Cookie is often 1Kb+; User-agent is 200+ bytes; authorization is 100+ bytes etc

Would it be possible to ...

Looking to do that in https://github.com/dotnet/aspnetcore/pull/20562 while also moving to safe code e.g. https://github.com/dotnet/aspnetcore/blob/ca806df85fac30201fdc793502c5e8149e6d5e07/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#L308-L365

Currently working on dropping the headerLine.Contains(ByteCR) check

@adamsitnik your suggestions on header parsing are looking good; this is with converting it to be "safe"

Starting PR run on 'e31989b274e627c654b563b2bda3f92ec90eb460'...
| Description |     RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | Build Time (ms) | Published Size (KB) | First Request (ms) | Latency (ms) | Errors | Ratio |
| ----------- | ------- | ------- | ----------- | ----------------- | ------------ | --------------- | ------------------- | ------------------ | ------------ | ------ | ----- |
|      Before | 335,344 |      99 |          88 |              6.88 |          507 |           21509 |              120289 |              157.9 |         0.59 |      0 |  1.00 |
|       After | 342,796 |      99 |          88 |              6.73 |          492 |            7503 |              120289 |             126.63 |         1.58 |      0 |  1.02 |

Sounds like @benaadams is on the case ;).

The if (ch == ByteTab || ch == ByteSpace || ch == ByteCR) check could be gated via if (ch <= 0x20 && (ch == ByteTab || ch == ByteSpace || ch == ByteCR /* rearrange these in order of likelihood */)). That way you only have one jump in the common case of A-Za-z.

Edit: Since the method logic seems to want to reject anything with whitespace, can we simplify and also reject anything with C0 control characters? I don't think it's valid for somebody to want to send a backspace character inside a header name. Then this can be simplified even further:

private unsafe int FindEndOfName(byte* headerLine, int length)
{
    for (int index = 0; index < length; index++)
    {
        byte ch = headerLine[(uint)index]; // indexing by uint is more efficient than int on 64-bit platforms (nuint would be preferred)
        if (ch == ByteColon)
        {
            return index;
        }

        if (ch <= 0x20)
        {
            break; // found a space or C0 control character
        }
    }

    return -1; // indicate invalid
}

Safe and fast header parsing is ready for review #20562

Alas Span doesn't allow this, saw some conversions in the asm due to it

 // indexing by uint is more efficient than int on 64-bit platforms (nuint would be preferred)

@benaadams updated breakdown for your PR:

obraz

Safe and fast startline parsing is ready for review https://github.com/dotnet/aspnetcore/pull/20885

Note: this will break PlatformBenchmarks due to change in api; though it should be pretty quick to fix up (have most of it in-place in another PR)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

githubgitgit picture githubgitgit  路  3Comments

FourLeafClover picture FourLeafClover  路  3Comments

groogiam picture groogiam  路  3Comments

Kevenvz picture Kevenvz  路  3Comments

markrendle picture markrendle  路  3Comments