We have been having performance issues reading simple POST request with ASP.Net Core hosted as an Azure WebApp. The issue pertains to request body read time of 300ms or more at constant interval.
I setup a test rig with the following code - No middleware or Actionfilters.
[HttpPost]
[Route("/test")]
public ActionResult TestAction()
{
Stopwatch sw = Stopwatch.StartNew();
using (StreamReader inputStream = new StreamReader(Request.Body))
{
var payload = inputStream.ReadToEnd();
}
sw.Stop();
_streamWriterProvider.Writer.WriteLine(sw.ElapsedMilliseconds);
return this.Status(StatusCodes.Status200OK);
}
I do 100 POST request of 290 characters each every 5 seconds(~20 qps) and the chart below shows that it spikes above 300ms just for reading this request at every constant interval.

Our product does not even have this high qps (~20 qps) as this test but the issue still pertains.
Profiling shows Hotpath to be in Kestrel

I tried the same experiment by configuring Kestrel to start with different threadpool values(2, 4, 8) (in KestrelServerOptions) for libuv threads but to no avail, sometimes making it even worse.
The Azure Plan is S2 standard (2 cores, so with the default KestrelServerOptions it might have run as single threaded libuv?). I even upgraded the plan to S4 standard but still the same behavior. I could not effectively repro this in my local machine which is a relatively beefier box with 8 logical cores.
If you go async? (Guessing it would be something like)
public async Task<ActionResult> TestAction()
{
Stopwatch sw = Stopwatch.StartNew();
using (StreamReader inputStream = new StreamReader(Request.Body))
{
var payload = await inputStream.ReadToEndAsync();
}
sw.Stop();
await _streamWriterProvider.Writer.WriteLineAsync(sw.ElapsedMilliseconds);
return this.Status(StatusCodes.Status200OK);
}
Not sure how much it would differ. I can observe the same behavior when I do FromBody(which I assume does async?) and define a ADT for my payload.
public ActionResult TestAction([FromBody] Payload pl) {...}
@antonydeepak Can you order the hot paths by exclusive samples? LoggingThreadpool.<RunAction> is calling into all the request parsing and application code on a threadpool thread, so it isn't that revealing. KestrelThread.ThreadStart is the bottom of the call stack for all the IO operations, so it really isn't surprising that its inclusive samples would also be high.
It's a bit of a shot in the dark, but have you tried configuring System.Threading.ThreadPool.MinThreads to 100 to better handle the concurrent request initially without having to ramp up?
The VS Profiler has a default 'Just My Code' setting that is on by default and separate from the debuggers 'Just My Code'. Clicking 'Show All Code' on the landing page of the profiler will hopefully reveal more.
@halter73 So, I captured a session and analyzed with Symbols turned on.

Here is the complete diagsession for your perusal. From what I see most of the callstack functions lead back to libuv.dll or picohelper.dll (Azure?)
Take into account that each StreamReader instance allocates about 6 KB of buffers, is might be GC hitting you
@halter73 I ran some experiments with the following configuration and I see relatively better results. So something is changing, but still 250ms on request body reads is not up-to the mark.
<gcServer enabled="true"/>
<gcConcurrent enabled="true"/>
System.Threading.ThreadPool.SetMinThreads(100, 100);


@antonydeepak shouldn't your GC settings be
"runtimeOptions": {
"configProperties": {
"System.GC.Server": true
}
}
As per samples
@benaadams I tried that initially but it didn't work; not sure if that is because we host it against a full .NET framework vs the .NET core framework. Hence, I reverted to using app.config and verified that it is the Server GC by checking the System.Runtime.GCSettings.IsServerGC property at runtime.
I think gcConcurrent is a noop is my case. That makes me wonder if I should specifically set the GCLatency to LowLatency? Maybe will give that a shot.
So I ran a sample with System.Runtime.GCLatencyMode.SustainedLowLatency and here is the result. Still spikes maybe with

You might want to run your tests with PerfView running, it'll give you more info about the GC pauses (if there are any)
I was able to get much better perf by configuring the following:
reading in async
gcserver configuration
lowlatency mode
minthreadpool=100
libuv workerthreadpool = 4 (options.ThreadCount)

Which thing has the biggest impact? Did you change them all at the same time?
Yes, the graph shown is for all the changes.
I think async, ReadToEndAsync and System.Threading.ThreadPool.SetMinThreads(100, 100) are the biggest helpers; I tried one without the other and the results were not good. I will do some more experiments in my core project and see how it fares.
@antonydeepak any updates?
@davidfowl i ported the changes to our core project but did not see the performance improvements; some further explorations have to be done. I will get around to it this week.
@pakrym will take a look in what Post does.
@pakrym Anything to do in 2.0.0 for this issue?
@antonydeepak and @muratg yup same experience here with a plain webapi. Tried exposing it through app services as well as k8s both with docker containers. The POST throughput is not even close to the GET benchmarks. It would be great if aspnet/benchmarks#169, all the Techempower test types seem to be GET though.
I was wondering whether the aforementioned configurations could be a workaround...
@pakrym Closing this one, as no more investigation is planned in 2.1.
@muratg Quick question: so does that mean that:
something which looks very much like this issue is hitting me atm :( it's relatively crippling - every request in our app takes an extra 150ms reading the request body (which are small, a couple of hundred bytes). i'll report back if i get anywhere.
i've tracked it down to this point:
Request.Body.CopyToAsync(new MemoryStream())
takes a fixed ~160ms when running kestrel under netcoreapp2.0 in azure app service.
here's the body of the request, which is a multipart form with two short parameters - but i'm not running any of the form-reading code, just copying the body as bytes.
--2656bd3f-7235-4f66-a798-da397838cd2f
Content-Type: text/plain; charset=utf-8
Content-Disposition: form-data; name=format
Xml
--2656bd3f-7235-4f66-a798-da397838cd2f
Content-Type: text/plain; charset=utf-8
Content-Disposition: form-data; name=ticket
<Ticket xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/The.Security"><Subject><Name>user</Name><Realm>pbkdf</Realm></Subject><Token i:nil="true" /></Ticket>
--2656bd3f-7235-4f66-a798-da397838cd2f--
things which did not work: ThreadPool.SetMinThreads; options.ThreadCount; increasing cores allocated in Azure. i already had gcServer and was already async.
@PureKrome @gulbanana We closed this issue because we haven't seen reproducible evidence of POSTs performing badly. The original issue seems to be related to GC collection spikes. POSTs are of course going to be slightly slower than GETs because of the data being uploaded in the body.
@gulbanana The 160ms latency for Request.Body.CopyToAsync(new MemoryStream()) seems weird unless the client is taking nearly that long to upload the POST data. Maybe the client is waiting for a 100-continue before uploading the response body. This extra roundtrip could explain the higher POST latency. The first thing I would do is take a network trace and verify the entire request body is really being sent quickly.
If the upload really taking well less than 160ms, I would try adding the following to the start of your Configure method. If you do this, do you still see the 160ms latency for each request? If you use a Stopwatch to measure the actual call to CopyToAsync do you see it taking over 100ms?
app.Run(context =>
{
context.Request.Body.CopyToAsync(new MemoryStream());
return Task.CompletedTask;
});
If neither the network trace or the simplified app.Run handler shed any light on the issue, please feel free to create a new GitHub issue with enough details for us to reproduce the poor performance.
I tested this myself using the Kestrel SampleApp. I used wrk to measure the perf of GET, GET with the CopyToAsync handler, and POST with the CopyToAsync handler. In all cases logging was disabled. For POSTs, I used this wrk lua script.
Here are my results from a local test on my laptop. In all cases the latency is ~30ms on average.
Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 http://deb9:5000/
Running 30s test @ http://deb9:5000/
32 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 34.36ms 23.53ms 400.20ms 93.59%
Req/Sec 229.54 99.42 484.00 72.09%
203507 requests in 30.09s, 25.62MB read
Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec: 6762.31
Transfer/sec: 0.85MB
Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 http://deb9:5000/
Running 30s test @ http://deb9:5000/
32 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 32.40ms 15.92ms 242.34ms 82.95%
Req/Sec 232.07 103.75 565.00 74.13%
207180 requests in 30.10s, 26.08MB read
Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec: 6883.68
Transfer/sec: 0.87MB
Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 http://deb9:5000/
Running 30s test @ http://deb9:5000/
32 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 29.98ms 12.04ms 206.62ms 80.97%
Req/Sec 255.78 97.70 1.13k 70.71%
221532 requests in 30.10s, 27.89MB read
Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec: 7358.99
Transfer/sec: 0.93MB
Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 http://deb9:5000/
Running 30s test @ http://deb9:5000/
32 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 29.85ms 22.23ms 348.20ms 93.73%
Req/Sec 284.03 107.51 565.00 59.46%
235333 requests in 30.10s, 20.65MB read
Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec: 7818.12
Transfer/sec: 702.41KB
Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 http://deb9:5000/
Running 30s test @ http://deb9:5000/
32 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 26.38ms 9.40ms 198.66ms 82.08%
Req/Sec 300.15 88.40 1.36k 69.44%
251040 requests in 30.10s, 22.03MB read
Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec: 8339.80
Transfer/sec: 749.28KB
Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 http://deb9:5000/
Running 30s test @ http://deb9:5000/
32 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 27.25ms 11.38ms 200.90ms 84.54%
Req/Sec 272.69 102.80 565.00 60.19%
244324 requests in 30.10s, 21.44MB read
Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec: 8115.92
Transfer/sec: 729.17KB
Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 -s post.lua http://deb9:5000/
Running 30s test @ http://deb9:5000/
32 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 32.40ms 14.45ms 256.45ms 87.16%
Req/Sec 247.42 77.96 464.00 68.96%
206205 requests in 30.10s, 18.09MB read
Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec: 6850.57
Transfer/sec: 615.48KB
Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 -s post.lua http://deb9:5000/
Running 30s test @ http://deb9:5000/
32 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 29.89ms 9.46ms 210.83ms 80.22%
Req/Sec 254.95 83.57 484.00 71.77%
220956 requests in 30.10s, 19.39MB read
Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec: 7340.44
Transfer/sec: 659.49KB
Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 -s post.lua http://deb9:5000/
Running 30s test @ http://deb9:5000/
32 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 30.99ms 10.38ms 197.44ms 82.95%
Req/Sec 245.99 86.65 777.00 78.09%
213191 requests in 30.10s, 18.70MB read
Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec: 7083.38
Transfer/sec: 636.40KB
thanks for the tips. i hadn鈥檛 yet tried using a stopwatch instead of an ILogger - my logging is definitely around the CopyToAsync but it could still be producing some sort of misleading result. client side network tracing is also a good idea- while our stack was already tested for this stuff some time ago, the client as well as the server has been ported to .net core, and it鈥檚 quite plausible that something went wrong in the process. (for example, now that i think about it, changes to servicepointmanager usage could have reintroduced 100-continue..)
i鈥檒l do these measurements and if that doesn鈥檛 pan out then i鈥檒l work on creating a repro.
in particular if the client config is at fault then there鈥檚 nothing kestrel could do about it, since it鈥檚 inherent to the design of lazy body-reading. which probably works fine for people who aren鈥檛 in australia :)
@halter73 Thanks for the reply - _really_ appreciate it and the metrics supplied. It also gives us _newbies_ a great understanding of what _we_ can do to help get metric data, to help explain an issue and _hopefully_ not waste your/teams time.
I'm going to wait for @gulbanana to see if he can score some network trace data. We're continuing the convo in the public ASPNET slack #azure channel and if there's more info, then we'll create a new GH issue and reference this for some context.
Again, Stephen - thank you and keep up the awesome work!
yes, it has been very helpful! once i saw this issue i forgot that our client, too, needed to be audited post-port. turns out the problem was actually more Nagle's algorithm than expect: 100-continue; the way we were disabling it in .net framework didn't translate over to whatever apis were available when this code was written. by the combined effect of tcp ack delay and 100-continue two extra round trips were being added to what looked like a trivial POST.
i'm not sure there's anything kestrel can do to help here. it already disables nodelay server-side by default, but on Azure IIS is in the way..
but on Azure IIS is in the way..
Should be able to change IIS's settings in web.config?
perhaps asp.net core publish could include that setting by default? i'm not sure how many people will fall into this trap or whether latency-sensitive small POSTs are a weird edge case.
@gulbanana Can you post more info about what you guys did to change your _current_ setup please? And why/rational for the custom tweaks, prior to testing on Kestrel.
sure. client-side, we added this config of the HttpClient being used:
// force-disable stack options which cause extra roundtrips
client.DefaultRequestHeaders.ExpectContinue = false;
var servicePoint = ServicePointManager.FindServicePoint(client.BaseAddress);
servicePoint.UseNagleAlgorithm = false;
servicePoint.ConnectionLimit = 100;
on the server we also added a nasty hack to read multipart forms in a single data copy: https://gist.github.com/gulbanana/b0a2537e4a75be7b3e1456c0f9e021a7
that's not necessary for most use cases and possibly not a good idea in ours either.
as for why we were disabling Nagle's algorithm before the port to .net core (ServicePointManager was not exposed in .NET Core back then) - the Azure team recommended it. for example, here:
https://blogs.msdn.microsoft.com/windowsazurestorage/2010/06/25/nagles-algorithm-is-not-friendly-towards-small-requests/
looks like they know what they're talking about.
Most helpful comment
Yes, the graph shown is for all the changes.
I think async,
ReadToEndAsyncandSystem.Threading.ThreadPool.SetMinThreads(100, 100)are the biggest helpers; I tried one without the other and the results were not good. I will do some more experiments in my core project and see how it fares.