Kestrelhttpserver: Performance issues in Kestrel when reading POST data from Request Body

Created on 23 Jan 2017 · 33Comments · Source: aspnet/KestrelHttpServer

We have been having performance issues reading simple POST request with ASP.Net Core hosted as an Azure WebApp. The issue pertains to request body read time of 300ms or more at constant interval.

I setup a test rig with the following code - No middleware or Actionfilters.

[HttpPost]
[Route("/test")]
public ActionResult TestAction()
{
    Stopwatch sw = Stopwatch.StartNew();
    using (StreamReader inputStream = new StreamReader(Request.Body))
    {
        var payload = inputStream.ReadToEnd();
    }
    sw.Stop();
    _streamWriterProvider.Writer.WriteLine(sw.ElapsedMilliseconds);

    return this.Status(StatusCodes.Status200OK);
}

I do 100 POST request of 290 characters each every 5 seconds(~20 qps) and the chart below shows that it spikes above 300ms just for reading this request at every constant interval.

Our product does not even have this high qps (~20 qps) as this test but the issue still pertains.

Profiling shows Hotpath to be in Kestrel

I tried the same experiment by configuring Kestrel to start with different threadpool values(2, 4, 8) (in KestrelServerOptions) for libuv threads but to no avail, sometimes making it even worse.

The Azure Plan is S2 standard (2 cores, so with the default KestrelServerOptions it might have run as single threaded libuv?). I even upgraded the plan to S4 standard but still the same behavior. I could not effectively repro this in my local machine which is a relatively beefier box with 8 logical cores.

2 - Working bug

Source

gogsbread

👍4

Most helpful comment

Yes, the graph shown is for all the changes.
I think async, ReadToEndAsync and System.Threading.ThreadPool.SetMinThreads(100, 100) are the biggest helpers; I tried one without the other and the results were not good. I will do some more experiments in my core project and see how it fares.

gogsbread on 30 Jan 2017

👍2

All 33 comments

If you go async? (Guessing it would be something like)

public async Task<ActionResult> TestAction()
{
    Stopwatch sw = Stopwatch.StartNew();
    using (StreamReader inputStream = new StreamReader(Request.Body))
    {
        var payload = await inputStream.ReadToEndAsync();
    }
    sw.Stop();
    await _streamWriterProvider.Writer.WriteLineAsync(sw.ElapsedMilliseconds);

    return this.Status(StatusCodes.Status200OK);
}

benaadams on 23 Jan 2017

Not sure how much it would differ. I can observe the same behavior when I do FromBody(which I assume does async?) and define a ADT for my payload.
public ActionResult TestAction([FromBody] Payload pl) {...}

gogsbread on 23 Jan 2017

@antonydeepak Can you order the hot paths by exclusive samples? LoggingThreadpool.<RunAction> is calling into all the request parsing and application code on a threadpool thread, so it isn't that revealing. KestrelThread.ThreadStart is the bottom of the call stack for all the IO operations, so it really isn't surprising that its inclusive samples would also be high.

It's a bit of a shot in the dark, but have you tried configuring System.Threading.ThreadPool.MinThreads to 100 to better handle the concurrent request initially without having to ramp up?

halter73 on 24 Jan 2017

The VS Profiler has a default 'Just My Code' setting that is on by default and separate from the debuggers 'Just My Code'. Clicking 'Show All Code' on the landing page of the profiler will hopefully reveal more.

rynowak on 24 Jan 2017

@halter73 So, I captured a session and analyzed with Symbols turned on.

Here is the complete diagsession for your perusal. From what I see most of the callstack functions lead back to libuv.dll or picohelper.dll (Azure?)

gogsbread on 25 Jan 2017

Take into account that each StreamReader instance allocates about 6 KB of buffers, is might be GC hitting you

ayende on 25 Jan 2017

@halter73 I ran some experiments with the following configuration and I see relatively better results. So something is changing, but still 250ms on request body reads is not up-to the mark.

<gcServer enabled="true"/>
<gcConcurrent enabled="true"/>
System.Threading.ThreadPool.SetMinThreads(100, 100);

gogsbread on 26 Jan 2017

@antonydeepak shouldn't your GC settings be

"runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true
    }
}

As per samples

benaadams on 26 Jan 2017

@benaadams I tried that initially but it didn't work; not sure if that is because we host it against a full .NET framework vs the .NET core framework. Hence, I reverted to using app.config and verified that it is the Server GC by checking the System.Runtime.GCSettings.IsServerGC property at runtime.

I think gcConcurrent is a noop is my case. That makes me wonder if I should specifically set the GCLatency to LowLatency? Maybe will give that a shot.

gogsbread on 26 Jan 2017

So I ran a sample with System.Runtime.GCLatencyMode.SustainedLowLatency and here is the result. Still spikes maybe with

gogsbread on 27 Jan 2017

You might want to run your tests with PerfView running, it'll give you more info about the GC pauses (if there are any)

mattwarren on 27 Jan 2017

I was able to get much better perf by configuring the following:
reading in async
gcserver configuration
lowlatency mode
minthreadpool=100
libuv workerthreadpool = 4 (options.ThreadCount)

gogsbread on 30 Jan 2017

Which thing has the biggest impact? Did you change them all at the same time?

davidfowl on 30 Jan 2017

gogsbread on 30 Jan 2017

👍2

@antonydeepak any updates?

davidfowl on 19 Feb 2017

@davidfowl i ported the changes to our core project but did not see the performance improvements; some further explorations have to be done. I will get around to it this week.

gogsbread on 19 Feb 2017

@pakrym will take a look in what Post does.

muratg on 20 Mar 2017

@pakrym Anything to do in 2.0.0 for this issue?

muratg on 27 Jun 2017

@antonydeepak and @muratg yup same experience here with a plain webapi. Tried exposing it through app services as well as k8s both with docker containers. The POST throughput is not even close to the GET benchmarks. It would be great if aspnet/benchmarks#169, all the Techempower test types seem to be GET though.

I was wondering whether the aforementioned configurations could be a workaround...

ferantivero on 6 Jul 2017

@pakrym Closing this one, as no more investigation is planned in 2.1.

muratg on 18 Aug 2017

@muratg Quick question: so does that mean that:

this issue still exists in 2.0 (and therefore 2.1)
no one is going to try and resolve this
POST is going to _remain_ slower than GET?

PureKrome on 3 Oct 2017

something which looks very much like this issue is hitting me atm :( it's relatively crippling - every request in our app takes an extra 150ms reading the request body (which are small, a couple of hundred bytes). i'll report back if i get anywhere.

gulbanana on 3 Oct 2017

i've tracked it down to this point:
Request.Body.CopyToAsync(new MemoryStream())
takes a fixed ~160ms when running kestrel under netcoreapp2.0 in azure app service.

here's the body of the request, which is a multipart form with two short parameters - but i'm not running any of the form-reading code, just copying the body as bytes.

--2656bd3f-7235-4f66-a798-da397838cd2f
Content-Type: text/plain; charset=utf-8
Content-Disposition: form-data; name=format

Xml
--2656bd3f-7235-4f66-a798-da397838cd2f
Content-Type: text/plain; charset=utf-8
Content-Disposition: form-data; name=ticket

<Ticket xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/The.Security"><Subject><Name>user</Name><Realm>pbkdf</Realm></Subject><Token i:nil="true" /></Ticket>
--2656bd3f-7235-4f66-a798-da397838cd2f--

gulbanana on 3 Oct 2017

things which did not work: ThreadPool.SetMinThreads; options.ThreadCount; increasing cores allocated in Azure. i already had gcServer and was already async.

gulbanana on 3 Oct 2017

@PureKrome @gulbanana We closed this issue because we haven't seen reproducible evidence of POSTs performing badly. The original issue seems to be related to GC collection spikes. POSTs are of course going to be slightly slower than GETs because of the data being uploaded in the body.

@gulbanana The 160ms latency for Request.Body.CopyToAsync(new MemoryStream()) seems weird unless the client is taking nearly that long to upload the POST data. Maybe the client is waiting for a 100-continue before uploading the response body. This extra roundtrip could explain the higher POST latency. The first thing I would do is take a network trace and verify the entire request body is really being sent quickly.

If the upload really taking well less than 160ms, I would try adding the following to the start of your Configure method. If you do this, do you still see the 160ms latency for each request? If you use a Stopwatch to measure the actual call to CopyToAsync do you see it taking over 100ms?

app.Run(context =>
{
    context.Request.Body.CopyToAsync(new MemoryStream());
    return Task.CompletedTask;
});

If neither the network trace or the simplified app.Run handler shed any light on the issue, please feel free to create a new GitHub issue with enough details for us to reproduce the poor performance.

I tested this myself using the Kestrel SampleApp. I used wrk to measure the perf of GET, GET with the CopyToAsync handler, and POST with the CopyToAsync handler. In all cases logging was disabled. For POSTs, I used this wrk lua script.

Here are my results from a local test on my laptop. In all cases the latency is ~30ms on average.

GET "hello, world"

Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 http://deb9:5000/
Running 30s test @ http://deb9:5000/
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    34.36ms   23.53ms 400.20ms   93.59%
    Req/Sec   229.54     99.42   484.00     72.09%
  203507 requests in 30.09s, 25.62MB read
  Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec:   6762.31
Transfer/sec:      0.85MB
Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 http://deb9:5000/
Running 30s test @ http://deb9:5000/
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    32.40ms   15.92ms 242.34ms   82.95%
    Req/Sec   232.07    103.75   565.00     74.13%
  207180 requests in 30.10s, 26.08MB read
  Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec:   6883.68
Transfer/sec:      0.87MB
Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 http://deb9:5000/
Running 30s test @ http://deb9:5000/
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    29.98ms   12.04ms 206.62ms   80.97%
    Req/Sec   255.78     97.70     1.13k    70.71%
  221532 requests in 30.10s, 27.89MB read
  Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec:   7358.99
Transfer/sec:      0.93MB

GET Request.Body.CopyToAsync

Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 http://deb9:5000/
Running 30s test @ http://deb9:5000/
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    29.85ms   22.23ms 348.20ms   93.73%
    Req/Sec   284.03    107.51   565.00     59.46%
  235333 requests in 30.10s, 20.65MB read
  Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec:   7818.12
Transfer/sec:    702.41KB
Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 http://deb9:5000/
Running 30s test @ http://deb9:5000/
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    26.38ms    9.40ms 198.66ms   82.08%
    Req/Sec   300.15     88.40     1.36k    69.44%
  251040 requests in 30.10s, 22.03MB read
  Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec:   8339.80
Transfer/sec:    749.28KB
Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 http://deb9:5000/
Running 30s test @ http://deb9:5000/
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    27.25ms   11.38ms 200.90ms   84.54%
    Req/Sec   272.69    102.80   565.00     60.19%
  244324 requests in 30.10s, 21.44MB read
  Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec:   8115.92
Transfer/sec:    729.17KB

POST Request.Body.CopyToAsync

Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 -s post.lua http://deb9:5000/
Running 30s test @ http://deb9:5000/
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    32.40ms   14.45ms 256.45ms   87.16%
    Req/Sec   247.42     77.96   464.00     68.96%
  206205 requests in 30.10s, 18.09MB read
  Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec:   6850.57
Transfer/sec:    615.48KB
Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 -s post.lua http://deb9:5000/
Running 30s test @ http://deb9:5000/
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    29.89ms    9.46ms 210.83ms   80.22%
    Req/Sec   254.95     83.57   484.00     71.77%
  220956 requests in 30.10s, 19.39MB read
  Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec:   7340.44
Transfer/sec:    659.49KB
Stephens-MacBook-Pro:SampleApp shalter$ wrk -c 256 -t 32 -d 30 -s post.lua http://deb9:5000/
Running 30s test @ http://deb9:5000/
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    30.99ms   10.38ms 197.44ms   82.95%
    Req/Sec   245.99     86.65   777.00     78.09%
  213191 requests in 30.10s, 18.70MB read
  Socket errors: connect 37, read 0, write 0, timeout 0
Requests/sec:   7083.38
Transfer/sec:    636.40KB

halter73 on 4 Oct 2017

thanks for the tips. i hadn’t yet tried using a stopwatch instead of an ILogger - my logging is definitely around the CopyToAsync but it could still be producing some sort of misleading result. client side network tracing is also a good idea- while our stack was already tested for this stuff some time ago, the client as well as the server has been ported to .net core, and it’s quite plausible that something went wrong in the process. (for example, now that i think about it, changes to servicepointmanager usage could have reintroduced 100-continue..)

i’ll do these measurements and if that doesn’t pan out then i’ll work on creating a repro.

gulbanana on 4 Oct 2017

in particular if the client config is at fault then there’s nothing kestrel could do about it, since it’s inherent to the design of lazy body-reading. which probably works fine for people who aren’t in australia :)

gulbanana on 4 Oct 2017

@halter73 Thanks for the reply - _really_ appreciate it and the metrics supplied. It also gives us _newbies_ a great understanding of what _we_ can do to help get metric data, to help explain an issue and _hopefully_ not waste your/teams time.

I'm going to wait for @gulbanana to see if he can score some network trace data. We're continuing the convo in the public ASPNET slack #azure channel and if there's more info, then we'll create a new GH issue and reference this for some context.

Again, Stephen - thank you and keep up the awesome work!

PureKrome on 4 Oct 2017

yes, it has been very helpful! once i saw this issue i forgot that our client, too, needed to be audited post-port. turns out the problem was actually more Nagle's algorithm than expect: 100-continue; the way we were disabling it in .net framework didn't translate over to whatever apis were available when this code was written. by the combined effect of tcp ack delay and 100-continue two extra round trips were being added to what looked like a trivial POST.

i'm not sure there's anything kestrel can do to help here. it already disables nodelay server-side by default, but on Azure IIS is in the way..

gulbanana on 4 Oct 2017

but on Azure IIS is in the way..

Should be able to change IIS's settings in web.config?

benaadams on 4 Oct 2017

perhaps asp.net core publish could include that setting by default? i'm not sure how many people will fall into this trap or whether latency-sensitive small POSTs are a weird edge case.

gulbanana on 4 Oct 2017

@gulbanana Can you post more info about what you guys did to change your _current_ setup please? And why/rational for the custom tweaks, prior to testing on Kestrel.

PureKrome on 6 Oct 2017

sure. client-side, we added this config of the HttpClient being used:

            // force-disable stack options which cause extra roundtrips
            client.DefaultRequestHeaders.ExpectContinue = false;
            var servicePoint = ServicePointManager.FindServicePoint(client.BaseAddress);
            servicePoint.UseNagleAlgorithm = false;
            servicePoint.ConnectionLimit = 100;

on the server we also added a nasty hack to read multipart forms in a single data copy: https://gist.github.com/gulbanana/b0a2537e4a75be7b3e1456c0f9e021a7
that's not necessary for most use cases and possibly not a good idea in ours either.

as for why we were disabling Nagle's algorithm before the port to .net core (ServicePointManager was not exposed in .NET Core back then) - the Azure team recommended it. for example, here:
https://blogs.msdn.microsoft.com/windowsazurestorage/2010/06/25/nagles-algorithm-is-not-friendly-towards-small-requests/

looks like they know what they're talking about.

gulbanana on 6 Oct 2017

👍1

Was this page helpful?

0 / 5 - 0 ratings