I am currently working on improving our results in the JSON TechEmpower benchmark.
We got to the point, where everything was already tuned at least once and even 1% matters.
The majority of time is spent on necessary work like sending & receiving messages, epoll & thread pool scheduling that are hard or impossible to optimize any further.
For the JSON Platform benchmark, we spent 5% of the time on parsing headers.

If I remove it, I get something around 40-50k RPS gain.
@GrabYourPitchforks is there any chance that you could take a look at the parsing logic and see if there are any possibilities to optimize it any further? I know that you have a LOT of expertise in the low-level tuning of text operations.
I've prepared a copy of the TE logic and encapsulated it into a benchmark that can be run by doing the following:
git clone https://github.com/adamsitnik/aspnetcore.git parsing
cd parsing
git checkout techEmpowerParsing
./build.sh
./.dotnet/dotnet run -c Release -f netcoreapp5.0 --filter TechEmpowerHttpParserBenchmark --project ./src/Servers/Kestrel/perf/Kestrel.Performance/Microsoft.AspNetCore.Server.Kestrel.Performance.csproj --cli ./.dotnet/dotnet
HTTP/1.x parsing has already been pretty heavily optimized by people like @benaadams.
We don't really do text operations at the HttpParser-level. We treat everything as bytes. Instead we use ReadOnlySpan.IndexOf (or ReadOnlySequence.PositionOf for headers that span multiple blocks), to search for the next \n byte in the input stream and slice. These methods have been vectorized where possible.
There's some validation that happens to verify that the headers don't contain any invalid bytes (also vectorized), and there's some copying that happens for headers that span multiple blocks, but there's no decoding, string allocations or anything like that going on in the platform benchmarks. It just calls the no-op BenchmarkApplication.public.OnHeader(ReadOnlySpan<byte> name, ReadOnlySpan<byte> value) implemenation.
When https://github.com/dotnet/runtime/issues/28230 comes online we'll probably be able to plumb it through some of the network protocol parsers. That might give a small perf boost for "text" processing. But IMO the biggest benefits would come from more efficient data structure management.
Some of the other recent gains are from introducing unsafe code in previously-safe code paths. That continues to be an option, but it really shouldn't be our go-to solution.
No more unsafe code in the HttpParser, that鈥檚 hard no from me. We should be moving to a place with less unsafe code
There's some gains still to get if can change the api. I assume that's ok as its "internal"?
e.g. current OnStartLine handler is
public interface IHttpRequestLineHandler
{
void OnStartLine(
HttpMethod method,
HttpVersion version,
Span<byte> target,
Span<byte> path,
Span<byte> query,
Span<byte> customMethod,
bool pathEncoded);
}
Which is very heavy on stack usage as well as parameter ABI stack passing.
Changing to something like the following would mean that only one Span<byte> is passed on stack, whereas the other parameters are passed via register:
public interface IHttpRequestLineHandler
{
void OnStartLine(
HttpVersionAndMethod versionAndMethod,
PathAndQuery pathAndQuery,
Span<byte> startLine);
}
public struct HttpVersionAndMethod
{
private long _versionAndMethod;
public HttpVersion Version => (HttpVersion)(sbyte)(_versionAndMethod >> (32 + 8));
public HttpMethod Method => (HttpMethod)(byte)(_versionAndMethod >> 32);
public int MethodEnd => (int)_versionAndMethod;
}
public struct PathAndQuery
{
private long _pathAndQuery;
public int PathEnd => (int)_pathAndQuery & 0X7FFF;
public bool PathEncoded => ((int)_pathAndQuery) < 0 ? true : false;
public int QueryEnd => (int)(_pathAndQuery >> 32);
}
Then use C#8's range operator to get the sub Spans from the startLine in the handler.
Looks like it might be worth a go
Change in method start to ParseRequestLine
-G_M28088_IG01:
- push rbp
- push r15
- push r14
- push r13
- push r12
- push rdi
- push rsi
- push rbx
- sub rsp, 152
- vzeroupper
- lea rbp, [rsp+D0H]
- vxorps xmm4, xmm4
- vmovdqa xmmword ptr [rbp-90H], xmm4
- vmovdqa xmmword ptr [rbp-80H], xmm4
- vmovdqa xmmword ptr [rbp-70H], xmm4
- vmovdqa xmmword ptr [rbp-60H], xmm4
- vmovdqa xmmword ptr [rbp-50H], xmm4
- xor rax, rax
- mov qword ptr [rbp-40H], rax
- mov gword ptr [rbp+18H], rdx
- mov r14, rcx
- mov rdi, r8
- mov esi, r9d
+G_M63331_IG01:
+ push r15
+ push r14
+ push rdi
+ push rsi
+ push rbp
+ push rbx
+ sub rsp, 56
+ xor rax, rax
+ mov qword ptr [rsp+28H], rax
+ mov qword ptr [rsp+30H], rax
+ mov rbp, rcx
+ mov rbx, rdx
+ mov rdi, r8
+ mov esi, r9d
Change in calling OnStartLine in ParseRequestLine
-G_M28088_IG34:
- mov rdi, gword ptr [rbp+18H]
- lea r9, bword ptr [rbp-60H]
- mov bword ptr [r9], rax
- mov dword ptr [r9+8], edx
- lea rdx, bword ptr [rbp-70H]
- mov bword ptr [rdx], rax
- mov dword ptr [rdx+8], r11d
- lea rdx, bword ptr [rbp-80H]
- mov bword ptr [rdx], r10
- mov dword ptr [rdx+8], ecx
- lea rcx, bword ptr [rbp-90H]
- mov rdx, bword ptr [rbp-50H]
- mov bword ptr [rcx], rdx
- mov edx, dword ptr [rbp-48H]
- mov dword ptr [rcx+8], edx
- mov dword ptr [rsp+38H], r8d
- mov rcx, rdi
- mov edx, r15d
- mov r8d, ebx
- lea r9, bword ptr [rbp-60H]
- lea r11, bword ptr [rbp-70H]
- mov bword ptr [rsp+20H], r11
- lea r11, bword ptr [rbp-80H]
- mov bword ptr [rsp+28H], r11
- lea r11, bword ptr [rbp-90H]
- mov bword ptr [rsp+30H], r11
- mov r11, 0xD1FFAB1E
- mov rax, 0xD1FFAB1E
- cmp dword ptr [rcx], ecx
- call qword ptr [rax]IHttpRequestLineHandler:OnStartLine(ubyte,byte,System.Span`1[Byte],System.Span`1[Byte],System.Span`1[Byte],System.Span`1[Byte],bool):this
+G_M63331_IG34:
+ mov rdx, rcx
+ lea rcx, bword ptr [rsp+28H]
+ mov bword ptr [rcx], rdi
+ mov dword ptr [rcx+8], esi
+ mov rcx, rbx
+ lea r9, bword ptr [rsp+28H]
+ mov r11, 0xD1FFAB1E
+ mov rax, 0xD1FFAB1E
+ cmp dword ptr [rcx], ecx
+ call qword ptr [rax]IHttpRequestLineHandler:OnStartLine(HttpVersionAndMethod,PathAndQuery,System.Span`1[Byte]):this
I am adding some details about where the time is spent exactly:

Think I can drop the Contains call from ParseHeaders
I just looked at the current implementation and my first thought is: why do we iterate over a parser line so many times? And why do we use vectorized methods? Vectorization has an overhead and for small inputs, it very often means even worse performance. Are typical HTTP header names and values long? (sorry I am a web n00b).
First, we look for the line end which means iterating N elements:
Then we search for the End of the Name:
Which has 4 branches:
Then we check the value of the header with another 4 nasty branches:
Then we search once again for CR:
Would it be possible to iterate over it just once using a simple for loop and store information about the special character we have found (colon, CR, LF)?
@GrabYourPitchforks is there any more performant way for checking for whitespace rather than using comparison with ByteTab, ByteSpace and ByteCR
Are typical HTTP header names and values long
Cookie is often 1Kb+; User-agent is 200+ bytes; authorization is 100+ bytes etc
Would it be possible to ...
Looking to do that in https://github.com/dotnet/aspnetcore/pull/20562 while also moving to safe code e.g. https://github.com/dotnet/aspnetcore/blob/ca806df85fac30201fdc793502c5e8149e6d5e07/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#L308-L365
Currently working on dropping the headerLine.Contains(ByteCR) check
@adamsitnik your suggestions on header parsing are looking good; this is with converting it to be "safe"
Starting PR run on 'e31989b274e627c654b563b2bda3f92ec90eb460'...
| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | Build Time (ms) | Published Size (KB) | First Request (ms) | Latency (ms) | Errors | Ratio |
| ----------- | ------- | ------- | ----------- | ----------------- | ------------ | --------------- | ------------------- | ------------------ | ------------ | ------ | ----- |
| Before | 335,344 | 99 | 88 | 6.88 | 507 | 21509 | 120289 | 157.9 | 0.59 | 0 | 1.00 |
| After | 342,796 | 99 | 88 | 6.73 | 492 | 7503 | 120289 | 126.63 | 1.58 | 0 | 1.02 |
Sounds like @benaadams is on the case ;).
The if (ch == ByteTab || ch == ByteSpace || ch == ByteCR) check could be gated via if (ch <= 0x20 && (ch == ByteTab || ch == ByteSpace || ch == ByteCR /* rearrange these in order of likelihood */)). That way you only have one jump in the common case of A-Za-z.
Edit: Since the method logic seems to want to reject anything with whitespace, can we simplify and also reject anything with C0 control characters? I don't think it's valid for somebody to want to send a backspace character inside a header name. Then this can be simplified even further:
private unsafe int FindEndOfName(byte* headerLine, int length)
{
for (int index = 0; index < length; index++)
{
byte ch = headerLine[(uint)index]; // indexing by uint is more efficient than int on 64-bit platforms (nuint would be preferred)
if (ch == ByteColon)
{
return index;
}
if (ch <= 0x20)
{
break; // found a space or C0 control character
}
}
return -1; // indicate invalid
}
Safe and fast header parsing is ready for review #20562
Alas Span doesn't allow this, saw some conversions in the asm due to it
// indexing by uint is more efficient than int on 64-bit platforms (nuint would be preferred)
@benaadams updated breakdown for your PR:

Safe and fast startline parsing is ready for review https://github.com/dotnet/aspnetcore/pull/20885
Note: this will break PlatformBenchmarks due to change in api; though it should be pretty quick to fix up (have most of it in-place in another PR)
Most helpful comment
I am adding some details about where the time is spent exactly: