Runtime: HttpRequestException when using SocketsHttpHandler

Created on 27 Jun 2018  ·  114 Comments  ·  Source: dotnet/runtime

Hi all,

we had to drop the SocketsHttpHandler (https://github.com/ravendb/ravendb/commit/9ad7c41233fca1e3f16617ad0eadf191251c0e88)
because for quite some time we had been receiving random test failures on our CI (Jenkins) agents (Hyper-V machines, 6 vCPU, 6 GB RAM, on WS2012R2 and WS2016 - fully updated). We tried to reproduce this on our local dev machines, but without any luck. After turning off the SocketsHttpHandler, the random failures stopped showing up.

The exception we are getting is:

System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request
   --- End of inner exception stack trace ---
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
   at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in c:\\Jenkins\\workspace\\PR_Tests_v4.1\\s\\src\\Raven.Client\\Http\\RequestExecutor.cs:line 738

We are using the latest .NET Core 2.1.1 (2.1.0 was also affected).

Do you know if there are some problems in SocketsHttpHandler that could cause this? What can we do to help you solve this issue? Any pointers?

area-System.Net.Http bug tenet-compatibility tenet-reliability


All 114 comments

Most / all of these requests were using POST operations (or other methods that have request bodies).
We suspect, but aren't sure, that this might be related to connection pooling in some manner.

What can we do to help you solve this issue?

A repro :smile:

We suspect, but aren't sure, that this might be related to connection pooling in some manner.

That error message is for SocketError.OperationAborted, which likely means the socket was closed while the send or receive was in progress.

@stephentoub Is there anything here that we can look at?
To reproduce this we need to schedule > 5 runs of 4,000+ tests, each taking an hour plus.
In such a scenario, we typically see 1-3 such failures using SocketsHttpHandler.

Note that we never use SocketsHttpHandler directly, so I don't think that we are using it improperly.
We go through HttpClient, a shared instance that is re-used across threads, as recommended.

The code is actually doing something like:

```c#
try
{
    return await httpHandler.SendAsync(...);
}
catch (HttpRequestException e)
{
}
```

So we can catch the exception and then check state / logging / something.

Anything you can recommend that we should look at to get to the root cause?
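
For reference, this is roughly the kind of check we could add in that catch block - a minimal sketch only (the client/request variables are placeholders), which just unwraps the inner SocketException so its SocketErrorCode can be logged:

```c#
try
{
    return await httpClient.SendAsync(request, cancellationToken);
}
catch (HttpRequestException e)
{
    // Walk the inner exception chain looking for the underlying SocketException,
    // so we can log its SocketErrorCode (OperationAborted in the failures above).
    for (Exception inner = e.InnerException; inner != null; inner = inner.InnerException)
    {
        if (inner is SocketException socketException)
        {
            Console.WriteLine($"SocketErrorCode: {socketException.SocketErrorCode}");
            break;
        }
    }

    throw;
}
```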

Is it possible you're hitting a timeout? It shouldn't manifest this way, but it would give us a place to start looking. For example, if you change your HttpClient's Timeout to Timeout.InfiniteTimeSpan, does this still happen?

Or if it's potentially related to connection pooling, what happens if you change the SocketsHttpHandler's PooledConnectionLifetime and PooledConnectionIdleTimeout to infinite... does it still repro?
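
Concretely, something like this minimal sketch of that experiment (PooledConnectionLifetime, PooledConnectionIdleTimeout and Timeout are the actual properties; the surrounding setup is illustrative only):

```c#
// Timeout.InfiniteTimeSpan comes from System.Threading.
var handler = new SocketsHttpHandler
{
    // Never proactively recycle pooled connections...
    PooledConnectionLifetime = Timeout.InfiniteTimeSpan,
    // ...and never drop idle ones.
    PooledConnectionIdleTimeout = Timeout.InfiniteTimeSpan
};

var client = new HttpClient(handler)
{
    // Also rule out the per-request timeout as the trigger.
    Timeout = Timeout.InfiniteTimeSpan
};
```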

It is all running locally, and we timed the error: it occurred about 11 ms after calling SendAsync.
We'll try with the connection lifetimes.

Actually, I have a theory for how this could manifest... now I just need to go look through the code for where the cause could be. Is it possible that before these errors occur for you, some other request timed out or could have otherwise been canceled (e.g. via the token you pass in)?

Or... are you potentially disposing of the response stream from a request while it's concurrently being read? e.g. something buggy like:
```c#
Task t = responseStream.ReadAsync(...);
responseStream.Dispose();
```
?

It is entirely possible that the previous request can time out, yes.
In fact, that is probably _guaranteed_. We have a few tests that test what happens when a request is timing out.

I don't think that I have anything like the second option, but it is possible. We might be in the middle of a read and need to abort the connection.
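
To make that second scenario concrete, a hedged sketch of the shape in question (the names below are placeholders, not our actual code): disposing the response while a read of its content stream is still pending tears down the pooled connection, which is the kind of thing that can surface as SocketError.OperationAborted.

```c#
HttpResponseMessage response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
Stream body = await response.Content.ReadAsStreamAsync();

// A read is started but never awaited...
Task<int> pendingRead = body.ReadAsync(buffer, 0, buffer.Length);

// ...and the timeout/abort path disposes the response while that read is still in flight.
response.Dispose();
```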

And... I have a repro. On my machine, this reproduces the scenario instantly.
You were correct about the timeout causing this.

```c#
using System;
using System.Net.Http;
using System.Threading.Tasks;

namespace ConsoleApp15
{
    class Program
    {
        static async Task Main(string[] args)
        {
            // http://slowwly.robertomurray.co.uk/delay/3000/url/http://www.google.co.uk

            // A single shared HttpClient with a very aggressive timeout.
            var client = new HttpClient(new SocketsHttpHandler())
            {
                Timeout = TimeSpan.FromMilliseconds(250)
            };

            var tasks = new[]
            {
                Run(client),
                Run(client),
                Run(client),
                Run(client),
                Run(client),
                Run(client),
                Run(client),
                Run(client),
                Run(client),
            };

            await Task.WhenAll(tasks);

            Console.WriteLine("Done");
        }

        public static async Task Run(HttpClient client)
        {
            for (int i = 0; i < 25; i++)
            {
                try
                {
                    // Delayed by 3 seconds, so this should always hit the 250 ms timeout.
                    await client.GetStringAsync("http://slowwly.robertomurray.co.uk/delay/3000/url/http://www.google.co.uk");
                    Console.WriteLine("Should timeout");
                }
                catch (Exception)
                {
                }

                // Delayed by only 100 ms, so this should succeed on the same client.
                await client.GetStringAsync("http://slowwly.robertomurray.co.uk/delay/100/url/http://www.google.co.uk");
            }
        }
    }
}
```

And... I have a repro. On my machine, this reproduces the scenario instantly.

That's hitting a timeout and throwing an OperationCanceledException... that's expected.

What's not expected is getting an IOException (not wrapped in an OperationCanceledException) out of SendAsync when nothing on the request is being canceled / timing out.

@stephentoub There are two exceptions thrown here. The first is expected, because we are hitting the timeout.
The _second_ exception (which will actually kill the process) is thrown even though the second request shouldn't be timing out. In fact, you can replace the slowwly URL in the second request entirely and still get the same error.

I think that this shows that the timeout on the first request indeed poisons the pool.

The second exception (which will actually kill the process) is thrown even though the second request shouldn't be timing out.

Why shouldn't the second request time out? It takes a very long time to access and the timeout is set at 250ms. (I'm trying to access it now and it's just sitting waiting.)

Hm... looks like I was too excited.

```c#
var sp = Stopwatch.StartNew();
try
{
    await client.GetStringAsync("http://www.google.co.uk");
}
catch (Exception)
{
    Console.WriteLine(sp.ElapsedMilliseconds);
    throw;
}
```

It always takes > 250 ms. I assumed that this was because of the previous run, not that the request is (probably) just waiting there.

Okay, I think that I'm on a better track here. See the following code.
The key here is that we need to be POSTing data, I think.

I was able to reproduce the HttpRequestException when posting data and getting a timeout.
We are going to try to build a better repro for this.
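
To make that concrete, here is a hedged sketch of the shape we are trying to reproduce (run inside an async method; the slowwly URLs and the payload are placeholders, not the actual test code): POST a body with a short client timeout, then reuse the same HttpClient for a request that should succeed.

```c#
var client = new HttpClient(new SocketsHttpHandler())
{
    Timeout = TimeSpan.FromMilliseconds(250)
};

try
{
    // Expected to hit the 250 ms timeout while the request body is being sent.
    await client.PostAsync(
        "http://slowwly.robertomurray.co.uk/delay/3000/url/http://www.google.co.uk",
        new StringContent(new string('x', 64 * 1024)));
}
catch (Exception)
{
    // Timeout / cancellation is expected here.
}

// If the timed-out request poisoned a pooled connection, this follow-up POST
// is where we would expect to see the IOException / OperationAborted failure.
await client.PostAsync(
    "http://slowwly.robertomurray.co.uk/delay/100/url/http://www.google.co.uk",
    new StringContent("{}"));
```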

We have some more information (still no direct repro).

We have managed to find out exactly which particular test is causing those errors. It is RavenDB_7787 - a test that spawns a separate process with our Server and does some basic CRUD operations on it.

The server is started with a random port (we are passing http://127.82.86.78:0) and from our CI logs we can see that the address is generated correctly (Embedded URL: http://127.82.86.78:60101). What is worth noting is that we are killing this Server and the whole external process in a rather rude way - using process.Kill.

20-30 seconds after this test completes, a few tests fail with the exception that I wrote about in the issue. Those tests connect to a global server that is available to the whole test suite (and this server is running on a different URL and port). After a few (2-3) of those failures, the tests start working normally again.

It looks like something happens 20-30 seconds after the process was killed, and for some time it affects other socket connections.

This is our CI log (the 7787 test turns on a switch that writes URLs and HTTP methods to the console as a sanity check - to know if we are hitting the appropriate server). The code is available here: https://github.com/ppekrol/ravendb/commits/v4.1-temp4

FYI, skipping this test resolves the issue.

01:02:45.307     SlowTests.Tests.Sorting.AlphaNumericSorting.order_by_two_parameters_first_long_than_alphanumeric
01:02:45.579     SlowTests.Tests.Sorting.AlphaNumericSorting.basic_alphanumeric_sort
01:02:45.859     SlowTests.Tests.Sorting.AlphaNumericSorting.number_and_decimal_alphanumeric_sort
01:02:46.465     SlowTests.Tests.Sorting.AlphaNumericSorting.random_words(seed: 1144883738)
01:02:49.064     SlowTests.Issues.RavenDB_7787.TestEmbedded
01:02:50.064 Request: http://127.0.0.1:59136/databases/Round_robin_load_balancing_should_workbe797fd1-f48b-4772-9bde-6ed4d45afd71/stats?failure=check. Type: GET
01:02:54.406 Request: http://127.82.86.78:60101/cluster/topology. Type: GET
01:02:54.688 Request: http://127.82.86.78:60101/admin/databases?name=Test&replicationFactor=1. Type: PUT
01:02:56.117 Request: http://127.0.0.1:59136/databases/Round_robin_load_balancing_should_workbe797fd1-f48b-4772-9bde-6ed4d45afd71/stats?failure=check. Type: GET
01:02:58.727 Embedded URL: http://127.82.86.78:60101
01:02:58.727 Request: http://127.82.86.78:60101/topology?name=Test. Type: GET
01:02:58.727 Request: http://127.82.86.78:60101/databases/Test/bulk_docs. Type: POST
01:02:59.334 Request: http://127.82.86.78:60101/databases/Test/configuration/client. Type: GET
01:03:01.920 Request: http://127.0.0.1:59136/databases/Round_robin_load_balancing_should_workbe797fd1-f48b-4772-9bde-6ed4d45afd71/stats?failure=check. Type: GET
01:03:03.338 Request: http://127.82.86.78:60109/cluster/topology. Type: GET
01:03:03.926 Request: http://127.82.86.78:60109/admin/databases?name=Test&replicationFactor=1. Type: PUT
01:03:04.196 Embedded URL: http://127.82.86.78:60109
01:03:04.197 Request: http://127.82.86.78:60109/topology?name=Test. Type: GET
01:03:04.197 Request: http://127.82.86.78:60109/databases/Test/docs?&id=people%2F1. Type: GET
01:03:04.197 Request: http://127.82.86.78:60109/databases/Test/configuration/client. Type: GET
01:03:05.178     SlowTests.Issues.RavenDB_6414.Should_unload_db_and_send_notification_on_catastrophic_failure
01:03:05.470 Request: http://127.0.0.1:60113/cluster/topology. Type: GET
01:03:05.470 Request: http://127.0.0.1:60113/admin/databases?name=Should_unload_db_and_send_notification_on_catastrophic_failure_2534&replicationFactor=1. Type: PUT
01:03:05.470 Request: http://127.0.0.1:60113/info/tcp?tag=Supervisor. Type: GET
01:03:06.068 Request: http://127.0.0.1:60113/rachis/waitfor?index=4. Type: GET
01:03:08.060 Request: http://127.0.0.1:59136/databases/Round_robin_load_balancing_should_workbe797fd1-f48b-4772-9bde-6ed4d45afd71/stats?failure=check. Type: GET
01:03:08.060 Request: http://127.0.0.1:60113/topology?name=Should_unload_db_and_send_notification_on_catastrophic_failure_2534. Type: GET
01:03:08.060 Request: http://127.0.0.1:60113/databases/Should_unload_db_and_send_notification_on_catastrophic_failure_2534/hilo/next?tag=users&lastBatchSize=0&lastRangeAt=0001-01-01T00:00:00.0000000&identityPartsSeparator=/&lastMax=0. Type: GET
01:03:08.334 Request: http://127.0.0.1:60113/databases/Should_unload_db_and_send_notification_on_catastrophic_failure_2534/configuration/client. Type: GET
01:03:08.334 Request: http://127.0.0.1:60113/databases/Should_unload_db_and_send_notification_on_catastrophic_failure_2534/bulk_docs. Type: POST
01:03:08.334 Request: http://127.0.0.1:60113/databases/Should_unload_db_and_send_notification_on_catastrophic_failure_2534/hilo/return?tag=users&end=32&last=1. Type: PUT
01:03:08.334     SlowTests.Issues.RavenDb1962.CanExecuteLazyQueriesInAsyncSession
01:03:08.334 Request: http://127.0.0.1:58232/cluster/topology. Type: GET
01:03:08.334 Request: http://127.0.0.1:58232/admin/databases?name=CanExecuteLazyQueriesInAsyncSession_2535&replicationFactor=1. Type: PUT
01:03:08.610 Request: http://127.0.0.1:58232/rachis/waitfor?index=10760. Type: GET
01:03:08.610 Request: http://127.0.0.1:58232/topology?name=CanExecuteLazyQueriesInAsyncSession_2535. Type: GET
01:03:08.610 Request: http://127.0.0.1:58232/databases/CanExecuteLazyQueriesInAsyncSession_2535/bulk_docs. Type: POST
01:03:08.610 Request: http://127.0.0.1:58232/databases/CanExecuteLazyQueriesInAsyncSession_2535/configuration/client. Type: GET
01:03:08.610 Request: http://127.0.0.1:58232/databases/CanExecuteLazyQueriesInAsyncSession_2535/stats. Type: GET
01:03:08.610 Request: http://127.0.0.1:58232/databases/CanExecuteLazyQueriesInAsyncSession_2535/multi_get. Type: POST
01:03:08.610 Request: http://127.0.0.1:58232/admin/databases. Type: DELETE
01:03:08.610 Request: http://127.0.0.1:58232/databases/CanExecuteLazyQueriesInAsyncSession_2535/stats?failure=check. Type: GET
01:03:08.610     SlowTests.Issues.RavenDb1962.CanExecuteLazyQueriesInAsyncSession [FAIL]
01:03:08.610       Raven.Client.Exceptions.RavenException : System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request
01:03:08.610          --- End of inner exception stack trace ---
01:03:08.610          at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
01:03:08.610          at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
01:03:08.610          at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.610          --- End of inner exception stack trace ---
01:03:08.610          at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.610          at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
01:03:08.612          at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.612          at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.612          at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
01:03:08.612          at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Http\RequestExecutor.cs:line 738
01:03:08.612       Response.StatusCode - InternalServerError
01:03:08.612       Stack Trace:
01:03:08.612         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Http\RequestExecutor.cs(835,0): at Raven.Client.Http.RequestExecutor.ThrowFailedToContactAllNodes[TResult](RavenCommand`1 command, HttpRequestMessage request)
01:03:08.612         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Http\RequestExecutor.cs(756,0): at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token)
01:03:08.612         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Documents\Session\AsyncDocumentSession.Lazy.cs(116,0): at Raven.Client.Documents.Session.AsyncDocumentSession.ExecuteLazyOperationsSingleStep(ResponseTimeInformation responseTimeInformation, List`1 requests, CancellationToken token)
01:03:08.612         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Documents\Session\AsyncDocumentSession.Lazy.cs(88,0): at Raven.Client.Documents.Session.AsyncDocumentSession.ExecuteAllPendingLazyOperationsAsync(CancellationToken token)
01:03:08.612         c:\Jenkins\workspace\PR_Tests_v4.1\s\test\SlowTests\Issues\RavenDb1962.cs(83,0): at SlowTests.Issues.RavenDb1962.CanExecuteLazyQueriesInAsyncSession()
01:03:08.612         --- End of stack trace from previous location where exception was thrown ---
01:03:08.612     SlowTests.Issues.RavenDb1962.CanExecuteLazyLoadsInAsyncSession_CheckSingleCall
01:03:08.612 Request: http://127.0.0.1:58232/cluster/topology. Type: GET
01:03:08.612 Request: http://127.0.0.1:58232/admin/databases?name=CanExecuteLazyLoadsInAsyncSession_CheckSingleCall_2536&replicationFactor=1. Type: PUT
01:03:08.891 Request: http://127.0.0.1:58232/rachis/waitfor?index=10764. Type: GET
01:03:08.891 Request: http://127.0.0.1:58232/topology?name=CanExecuteLazyLoadsInAsyncSession_CheckSingleCall_2536. Type: GET
01:03:08.891 Request: http://127.0.0.1:58232/databases/CanExecuteLazyLoadsInAsyncSession_CheckSingleCall_2536/bulk_docs. Type: POST
01:03:08.891 Request: http://127.0.0.1:58232/databases/CanExecuteLazyLoadsInAsyncSession_CheckSingleCall_2536/configuration/client. Type: GET
01:03:08.891 Request: http://127.0.0.1:58232/databases/CanExecuteLazyLoadsInAsyncSession_CheckSingleCall_2536/multi_get. Type: POST
01:03:08.891 Request: http://127.0.0.1:58232/admin/databases. Type: DELETE
01:03:08.891     SlowTests.Issues.RavenDb1962.CanExecuteLazyLoadsInAsyncSession_CheckSingleCall [FAIL]
01:03:08.891       Raven.Client.Exceptions.RavenException : System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request
01:03:08.891          --- End of inner exception stack trace ---
01:03:08.891          at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
01:03:08.891          at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
01:03:08.891          at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.891          --- End of inner exception stack trace ---
01:03:08.891          at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.891          at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
01:03:08.891          at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.891          at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.891          at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
01:03:08.891          at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Http\RequestExecutor.cs:line 738
01:03:08.891       Response.StatusCode - InternalServerError
01:03:08.891       Stack Trace:
01:03:08.891         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Http\RequestExecutor.cs(835,0): at Raven.Client.Http.RequestExecutor.ThrowFailedToContactAllNodes[TResult](RavenCommand`1 command, HttpRequestMessage request)
01:03:08.891         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Http\RequestExecutor.cs(756,0): at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token)
01:03:08.891         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Documents\Session\AsyncDocumentSession.Lazy.cs(116,0): at Raven.Client.Documents.Session.AsyncDocumentSession.ExecuteLazyOperationsSingleStep(ResponseTimeInformation responseTimeInformation, List`1 requests, CancellationToken token)
01:03:08.891         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Documents\Session\AsyncDocumentSession.Lazy.cs(88,0): at Raven.Client.Documents.Session.AsyncDocumentSession.ExecuteAllPendingLazyOperationsAsync(CancellationToken token)
01:03:08.891         c:\Jenkins\workspace\PR_Tests_v4.1\s\test\SlowTests\Issues\RavenDb1962.cs(52,0): at SlowTests.Issues.RavenDb1962.CanExecuteLazyLoadsInAsyncSession_CheckSingleCall()
01:03:08.891         --- End of stack trace from previous location where exception was thrown ---
01:03:08.891     SlowTests.Issues.RavenDb1962.CanExecuteLazyLoadsInAsyncSession
01:03:08.891 Request: http://127.0.0.1:58232/cluster/topology. Type: GET
01:03:08.891 Request: http://127.0.0.1:58232/admin/databases?name=CanExecuteLazyLoadsInAsyncSession_2537&replicationFactor=1. Type: PUT
01:03:09.161 Request: http://127.0.0.1:58232/rachis/waitfor?index=10767. Type: GET
01:03:09.162 Request: http://127.0.0.1:58232/topology?name=CanExecuteLazyLoadsInAsyncSession_2537. Type: GET
01:03:09.162 Request: http://127.0.0.1:58232/databases/CanExecuteLazyLoadsInAsyncSession_2537/bulk_docs. Type: POST
01:03:09.162 Request: http://127.0.0.1:58232/databases/CanExecuteLazyLoadsInAsyncSession_2537/configuration/client. Type: GET
01:03:09.162 Request: http://127.0.0.1:58232/admin/databases. Type: DELETE
01:03:09.162     SlowTests.Issues.RavenDB_11046.FacetRqlShouldSupportAliasNotation

I see this behavior on nodes in Service Fabric Cluster 1. In SF Cluster 2 it works OK.
I'll try to investigate it further, still have no clues, can't reproduce it on my local machine.

We have a statistical repro. Unfortunately, it requires us to run about 3,400 tests to run into this.

We can try our logging on it. Do you have steps we can "easily" replicate locally?

Get the code from this PR
https://github.com/ravendb/ravendb/pull/6879

Go to test/slowtests and run dotnet xunit

This reproduces in about one in five runs. Typically more often on lower-end machines.
If there is a way to turn on the logging configuration, I'll happily do so and provide the results.

Most of our logging went in AFTER 2.1 :( ... I am not even sure if it is going to cover this particular scenario ...
If we find logging useful, we can get it in 2.1.x servicing.
Is there an easy way to run your test suite against master? (If you don't know, we will figure it out.)

@karelz

I'm not sure what is needed to run from 'master', but if you want to update <RuntimeFrameworkVersion> in the csprojs (all of them) then the best option would be to run:

.\scripts\updateFrameworkVersion.ps1 2.1.1

Also our CI is executing tests in 'Release'.

@rmkerr will take a look on our side - I asked him to try the repro in-house as a first step.
Just to clarify:

  • Was it discovered as part of the .NET Core 2.0->2.1.0/2.1.1 migration? Or is it part of a larger port from .NET Framework to .NET Core? (Just to scope what to look for and whether it is a regression against 2.0)
  • ~Did you try to run it with SocketsHttpHandler disabled on 2.1.1? (Just to confirm if it is really specific to SocketsHttpHandler or if it may be general problem in Networking) (If you didn't it is ok, we can try it too)~

It started happening all of a sudden. At first we thought that it might be related to the 2.0 -> 2.1 migration, but in the end we think it is related to SocketsHttpHandler and RavenDB_7787.cs only.

Check my explanation here: https://github.com/dotnet/corefx/issues/30691#issuecomment-401775334

I'm in the process of setting up the repro now. I think that the connection management logging in 2.1 actually isn't bad, so we can hopefully get some useful info there.
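
In case it helps others collect the same data, here is a rough sketch of an in-process listener for that telemetry. The event source names are an assumption to verify against the 2.1 sources; collecting the same events out of process with PerfView works too.

```c#
using System;
using System.Diagnostics.Tracing;

// Hedged sketch: subscribe to the managed networking EventSources and dump every event,
// so connection open/close can be correlated with the failing request.
internal sealed class NetTraceListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Assumed names of the HTTP and sockets telemetry providers.
        if (eventSource.Name == "Microsoft-System-Net-Http" ||
            eventSource.Name == "Microsoft-System-Net-Sockets")
        {
            EnableEvents(eventSource, EventLevel.Verbose, EventKeywords.All);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        var payload = eventData.Payload == null ? "" : string.Join(", ", eventData.Payload);
        Console.WriteLine($"{eventData.EventSource.Name}/{eventData.EventName}: {payload}");
    }
}
```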

@ayende are there any additional steps I need to follow to get the tests working? I'm currently:
1) checking out the code from the linked PR
2) running dotnet restore in /tests/SlowTests
3) running dotnet xunit in /tests/SlowTests

As output I see a large number of tests skipped, and then several failures that occur every time I try to run the tests. This is running on Windows 10 RS2.


Logs:
D:\ravendb\test\SlowTests>dotnet xunit
Detecting target frameworks in SlowTests.csproj...
Building for framework netcoreapp2.1...
Sparrow -> D:\ravendb\src\Sparrow\bin\Debug\netcoreapp2.1\Sparrow.dll
Sparrow -> D:\ravendb\src\Sparrow\bin\Debug\netstandard2.0\Sparrow.dll
Voron -> D:\ravendb\src\Voron\bin\Debug\netstandard2.0\Voron.dll
Raven.Client -> D:\ravendb\src\Raven.Client\bin\Debug\netstandard2.0\Raven.Client.dll
Raven.Embedded -> D:\ravendb\src\Raven.Embedded\bin\Debug\netstandard2.0\Raven.Embedded.dll
Raven.Server -> D:\ravendb\src\Raven.Server\bin\Debug\netcoreapp2.1\Raven.Server.dll
Tests.Infrastructure -> D:\ravendb\test\Tests.Infrastructure\bin\Debug\netcoreapp2.1\Tests.Infrastructure.dll
FastTests -> D:\ravendb\test\FastTests\bin\Debug\netcoreapp2.1\FastTests.dll
Raven.TestDriver -> D:\ravendb\src\Raven.TestDriver\bin\Debug\netstandard2.0\Raven.TestDriver.dll
SlowTests -> D:\ravendb\test\SlowTests\bin\Debug\netcoreapp2.1\SlowTests.dll
Running .NET Core 2.1.1 tests for framework netcoreapp2.1...
xUnit.net Console Runner v2.4.0-beta.1.build3958 (64-bit .NET Core 4.6.26606.02)
Discovering: SlowTests
Discovered: SlowTests
Starting: SlowTests
Max number of concurrent tests is: 2
To attach debugger to test process (x64), use proc-id: 8640.
SlowTests.Issues.RavenDB_3365.index_pretty_printer_ignores_whitespaces [SKIP]
RavenDB-4185
To attach debugger to test process (x64), use proc-id: 8640. Url http://127.0.0.1:50061
SlowTests.Issues.RavenDB_5256.ShouldLoadUserFromAsyncSessionTwiceInEmbeddedDocumentStore [SKIP]
RavenDB-6283
SlowTests.Issues.RavenDB_5256.ShouldLoadUserFromSessionTwiceInShardingDocumentStore [SKIP]
RavenDB-5918, RavenDB-6283
SlowTests.Issues.RavenDB_5256.ShouldLoadUserFromAsyncSessionTwiceInShardingDocumentStore [SKIP]
RavenDB-5918, RavenDB-6283
SlowTests.Issues.RavenDB_5256.ShouldLoadMultipleUsersWithIncludesFromAsyncSessionInShardingDocumentStore [SKIP]
RavenDB-5918, RavenDB-6283
SlowTests.Issues.RavenDB_5256.ShouldLoadMultipleUsersWithIncludesFromSessionInShardingDocumentStore [SKIP]
RavenDB-5918, RavenDB-6283
SlowTests.Issues.RavenDB_5256.ShouldLoadMultipleUsersWithIncludesFromSessionInEmbeddedDocumentStore [SKIP]
RavenDB-6283
SlowTests.Issues.RavenDB_5256.ShouldLoadMultipleUsersWithIncludesFromAsyncSessionInEmbeddedDocumentStore [SKIP]
RavenDB-6283
SlowTests.Issues.RavenDB_5256.ShouldLoadUserFromSessionTwiceInEmbeddedDocumentStore [SKIP]
RavenDB-6283
SlowTests.MailingList.ShardedFacets.FacetTest [SKIP]
RavenDB-6283
SlowTests.Issues.RavenDB_3465.get_metadata_for_async_sharded [SKIP]
RavenDB-6283
SlowTests.Issues.RavenDB_3465.get_metadata_for_sharded [SKIP]
RavenDB-6283
SlowTests.Tests.Queries.CanQueryOnCustomClass.UsingConverter [SKIP]
RavenDB-6263
SlowTests.MailingList.StreamingHalfWay.ShouldWork [SKIP]
Missing feature: /docs/stream
SlowTests.Issues.RavenDB_4904.can_create_side_by_side_index_to_replace_index_with_errors [SKIP]
RavenDB-5919
SlowTests.Issues.RavenDB_4904.can_create_side_by_side_index_with_errors_to_replace_index_with_errors [SKIP]
RavenDB-5919
SlowTests.Issues.RavenDB_4904.can_create_side_by_side_index_with_errors_to_replace_index [SKIP]
RavenDB-5919
SlowTests.Server.Documents.SqlMigration.MySQLSchemaTest.CanFetchSchema [SKIP]
Test requires MySQL database
SlowTests.Issues.RavenDB_4708.CanUseSingleSyncSharded [SKIP]
RavenDB-6283
SlowTests.Issues.RavenDB_4708.CanUseFirstAsyncSharded [SKIP]
RavenDB-6283
SlowTests.Issues.RavenDB_4708.CanUseFirstSyncSharded [SKIP]
RavenDB-6283
SlowTests.Issues.RavenDB_4708.CanUseLazilySyncShaded [SKIP]
RavenDB-6283
SlowTests.MailingList.JohanNilsson.WithCustomizedTagNameAndIdentityProperty [SKIP]
RavenDB-6124
SlowTests.Issues.RavenDB_3609.Test [SKIP]
RavenDB-6283
SlowTests.Authentication.AuthenticationChangesTests.ChangesWithAuthentication [SKIP]
RavenDB-9580: Does not work on current version .NET core.
SlowTests.Issues.RavenDB_3420.bulk_insert_sharded [SKIP]
RavenDB-6283
SlowTests.Issues.RavenDB_3420.bulk_insert_sharded_when_specifying_default_database [SKIP]
RavenDB-6283
SlowTests.MailingList.DeserializationToObjectTests.Query_GivenDbWithComplexObjects_ShouldDeserializePropertiesToOriginalType [SKIP]
RavenDB-6124
SlowTests.Bugs.Indexing.IndexingEachFieldInEachDocumentSeparately.ForIndexing [SKIP]
Missing feature : RavenDB-6152
SlowTests.Bugs.Indexing.WiseShrek.Isolated [SKIP]
Missing features
SlowTests.Issues.RavenDB_579.OneShardPerSessionStrategy [SKIP]
RavenDB-6283
SlowTests.Issues.RavenDB_579.OneShardPerSessionStrategyAsync [SKIP]
RavenDB-6283
SlowTests.Cluster.ClusterModesForRequestExecutorTest.Fastst_node_should_choose_the_node_without_delay [SKIP]
RavenDB-9020
SlowTests.Issues.RavenDB_6596.Run [SKIP]
Culture tests are disabled. Please set 'RAVEN_ENABLE_CULTURE_TESTS' environment variable to 'true' to enable them.
SlowTests.MailingList.Lindblom.Test [SKIP]
RavenDB-6124
SlowTests.MailingList.Kushnir.SortOnMetadata [SKIP]
RavenDB-6264
SlowTests.Issues.RavenDB_2944.CanCreateTestMapReduceIndexes [SKIP]
RavenDB-6572
SlowTests.Issues.RavenDB_2944.CanCreateTestMapIndexes [SKIP]
RavenDB-6572
SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.upload_archive [SKIP]
Requires Amazon AWS Credentials
SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.put_blob_in_folder [SKIP]
Azure Storage Emulator is not installed
SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.can_get_correct_error_s3 [SKIP]
Requires Amazon AWS Credentials
SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.upload_archive_multipart [SKIP]
Requires Amazon AWS Credentials
SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.put_blob [SKIP]
Azure Storage Emulator is not installed
SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.can_get_correct_error_glacier [SKIP]
Requires Amazon AWS Credentials
SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.put_object_multipart [SKIP]
Requires Amazon AWS Credentials
SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.put_object [SKIP]
Requires Amazon AWS Credentials
SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_70MB [SKIP]
Azure Storage Emulator is not installed
SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_64MB [SKIP]
Azure Storage Emulator is not installed
SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_256MB [SKIP]
Azure Storage Emulator is not installed
SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_100MB [SKIP]
Azure Storage Emulator is not installed
SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_into_folder_500MB [SKIP]
Azure Storage Emulator is not installed
SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_765MB [SKIP]
Azure Storage Emulator is not installed
SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_500MB [SKIP]
Azure Storage Emulator is not installed
SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.can_get_container_not_found [SKIP]
Azure Storage Emulator is not installed
SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_into_folder_765MB [SKIP]
Azure Storage Emulator is not installed
SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_into_folder_64MB [SKIP]
Azure Storage Emulator is not installed
SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_into_folder_256MB [SKIP]
Azure Storage Emulator is not installed
SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_into_folder_70MB [SKIP]
Azure Storage Emulator is not installed
SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.can_get_and_delete_container [SKIP]
Azure Storage Emulator is not installed
SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_into_folder_100MB [SKIP]
Azure Storage Emulator is not installed
SlowTests.MailingList.ScriptedPatchBug.Test [SKIP]
Missing feature: Tasks (operations) and their results
SlowTests.Issues.RavenDB_3726.Test [SKIP]
RavenDB-6283
SlowTests.Server.Documents.SqlMigration.RecursiveMigrationTest.CanEmbedOnParent(provider: MsSQL) [FAIL]
System.InvalidOperationException : Use a valid connection
Stack Trace:
D:\ravendb\test\SlowTests\Server\Documents\ETL\SQL\SqlEtlTests.cs(73,0): at SlowTests.Server.Documents.ETL.SQL.SqlEtlTests.<>c.<.cctor>b__32_0()
at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
at System.Lazy`1.ExecutionAndPublication(LazyHelper executionAndPublication, Boolean useDefaultConstructor)
at System.Lazy`1.CreateValue()
D:\ravendb\test\SlowTests\Server\Documents\SqlMigration\SqlAwareTestBase.cs(113,0): at SlowTests.Server.Documents.SqlMigration.SqlAwareTestBase.WithMsSqlDatabase(String& connectionString, String& databaseName, String dataSet, Boolean includeData)
D:\ravendb\test\SlowTests\Server\Documents\SqlMigration\SqlAwareTestBase.cs(104,0): at SlowTests.Server.Documents.SqlMigration.SqlAwareTestBase.WithSqlDatabase(MigrationProvider provider, String& connectionString, String& schemaName, String dataSet, Boolean includeData)
D:\ravendb\test\SlowTests\Server\Documents\SqlMigration\RecursiveMigrationTest.cs(119,0): at SlowTests.Server.Documents.SqlMigration.RecursiveMigrationTest.CanEmbedOnParent(MigrationProvider provider)
--- End of stack trace from previous location where exception was thrown ---
SlowTests.Server.Documents.SqlMigration.RecursiveMigrationTest.CanEmbedOnParent(provider: MySQL) [SKIP]
Test requires MySQL database
SlowTests.Voron.Bugs.RavenDB_6971.Applying_new_diff_requires_to_zero_destination_bytes_first [FAIL]
Raven.Client.Exceptions.Database.DatabaseLoadTimeoutException : Raven.Client.Exceptions.Database.DatabaseLoadTimeoutException: Database Applying_new_diff_requires_to_zero_destination_bytes_first_42 after 00:01:00 is still loading, try again later. Database initialization log:
[Load Database] 7/11/2018 7:41:01 PM :: Database 'Applying_new_diff_requires_to_zero_destination_bytes_first_42' : Starting database initialization
[Load Database] 7/11/2018 7:41:01 PM :: Database 'Applying_new_diff_requires_to_zero_destination_bytes_first_42' : Creating db.lock file
[Load Database] 7/11/2018 7:41:01 PM :: Database 'Applying_new_diff_requires_to_zero_destination_bytes_first_42' : Initializing NotificationCenter
[Load Database] 7/11/2018 7:41:01 PM :: Database 'Applying_new_diff_requires_to_zero_destination_bytes_first_42' : Initializing DocumentStorage
[Load Database] 7/11/2018 7:41:01 PM :: Database 'Applying_new_diff_requires_to_zero_destination_bytes_first_42' : Starting Recovery
[Load Database] 7/11/2018 7:41:01 PM :: Database 'Applying_new_diff_requires_to_zero_destination_bytes_first_42' : Recovering journal 6 (upto last journal 7)
[Load Database] 7/11/2018 7:46:27 PM :: Database 'Applying_new_diff_requires_to_zero_destination_bytes_first_42' : Recovering journal 7 (upto last journal 7)
at Raven.Server.Routing.RouteInformation.ThrowDatabaseLoadTimeoutWithLog(StringSegment databaseName, TimeSpan timeout, String log) in D:\ravendb\src\Raven.Server\Routing\RouteInformation.cs:line 143
at Raven.Server.Routing.RouteInformation.UnlikelyWaitForDatabaseToLoad(RequestHandlerContext context, Task`1 database, DatabasesLandlord databasesLandlord, StringSegment databaseName) in D:\ravendb\src\Raven.Server\Routing\RouteInformation.cs:line 122
at Raven.Server.Routing.RouteInformation.WaitForDb(Task databaseLoading) in D:\ravendb\src\Raven.Server\Routing\RouteInformation.cs:line 162
at Raven.Server.Routing.RequestRouter.HandlePath(HttpContext context, String method, String path) in D:\ravendb\src\Raven.Server\Routing\RequestRouter.cs:line 73
at System.Threading.Tasks.ValueTask`1.get_Result()
at Raven.Server.RavenServerStartup.RequestHandler(HttpContext context) in D:\ravendb\src\Raven.Server\RavenServerStartup.cs:line 162
Response.StatusCode - ServiceUnavailable
Stack Trace:
D:\ravendb\src\Raven.Client\Http\RequestExecutor.cs(839,0): at Raven.Client.Http.RequestExecutor.ThrowFailedToContactAllNodes[TResult](RavenCommand`1 command, HttpRequestMessage request)
D:\ravendb\src\Raven.Client\Http\RequestExecutor.cs(795,0): at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult]
D:\ravendb\src\Raven.Client\Http\RequestExecutor.cs(828,0): at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult]
D:\ravendb\src\Raven.Client\Documents\Operations\OperationExecutor.Operations.cs(22,0): at Raven.Client.Documents.Operations.OperationExecutor.SendAsync(IOperation`1 operation, SessionInfo sessionInfo, CancellationToken token)
D:\ravendb\src\Raven.Client\Util\AsyncHelpers.cs(104,0): at Raven.Client.Util.AsyncHelpers.<>c__DisplayClass3_0`1.<b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
D:\ravendb\src\Raven.Client\Util\AsyncHelpers.cs(124,0): at Raven.Client.Util.AsyncHelpers.RunSync[T]
D:\ravendb\test\SlowTests\Voron\Bugs\RavenDB_6971.cs(75,0): at SlowTests.Voron.Bugs.RavenDB_6971.Applying_new_diff_requires_to_zero_destination_bytes_first()
SlowTests.Server.Documents.ETL.RavenDB_8866.CanResetEtl [FAIL]
Raven.Client.Exceptions.Commercial.LicenseLimitException : Raven.Client.Exceptions.Commercial.LicenseLimitException: Your current license doesn't include the RavenDB ETL feature
at Raven.Server.Commercial.LicenseManager.AssertCanAddRavenEtl() in D:\ravendb\src\Raven.Server\Commercial\LicenseManager.cs:line 1269
at Raven.Server.Web.System.OngoingTasksHandler.AssertCanAddOrUpdateEtl(String databaseName, BlittableJsonReaderObject& etlConfiguration, JsonOperationContext context) in D:\ravendb\src\Raven.Server\Web\System\OngoingTasksHandler.cs:line 719
at Raven.Server.Documents.DatabaseRequestHandler.DatabaseConfigurations(Func`4 setupConfigurationFunc, String debug, RefAction beforeSetupConfiguration, Action`3 fillJson, HttpStatusCode statusCode) in D:\ravendb\src\Raven.Server\Documents\DatabaseRequestHandler.cs:line 64
at Raven.Server.Web.System.OngoingTasksHandler.AddEtl() in D:\ravendb\src\Raven.Server\Web\System\OngoingTasksHandler.cs:line 685
at Raven.Server.Routing.RequestRouter.HandlePath(HttpContext context, String method, String path) in D:\ravendb\src\Raven.Server\Routing\RequestRouter.cs:line 123
at System.Threading.Tasks.ValueTask`1.get_Result()
at Raven.Server.RavenServerStartup.RequestHandler(HttpContext context) in D:\ravendb\src\Raven.Server\RavenServerStartup.cs:line 162
State for A-term1: Passive=>LeaderElect at 2018-07-11T19:34:45.5363246Z because I'm the only one in the cluster, so I'm the leader
Stack Trace:
D:\ravendb\src\Raven.Client\Exceptions\ExceptionDispatcher.cs(117,0): at Raven.Client.Exceptions.ExceptionDispatcher.Throw(JsonOperationContext context, HttpResponseMessage response, Action`1 additionalErrorInfo)
D:\ravendb\src\Raven.Client\Http\RequestExecutor.cs(1036,0): at Raven.Client.Http.RequestExecutor.HandleUnsuccessfulResponse[TResult]
D:\ravendb\src\Raven.Client\Http\RequestExecutor.cs(786,0): at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult]
D:\ravendb\src\Raven.Client\Http\RequestExecutor.cs(828,0): at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult]
D:\ravendb\src\Raven.Client\Documents\Operations\MaintenanceOperationExecutor.cs(61,0): at Raven.Client.Documents.Operations.MaintenanceOperationExecutor.SendAsync[TResult](IMaintenanceOperation`1 operation, CancellationToken token)
D:\ravendb\src\Raven.Client\Util\AsyncHelpers.cs(104,0): at Raven.Client.Util.AsyncHelpers.<>c__DisplayClass3_0`1.<b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
D:\ravendb\src\Raven.Client\Util\AsyncHelpers.cs(124,0): at Raven.Client.Util.AsyncHelpers.RunSync[T]
D:\ravendb\test\SlowTests\Server\Documents\ETL\EtlTestBase.cs(21,0): at SlowTests.Server.Documents.ETL.EtlTestBase.AddEtl[T](DocumentStore src, EtlConfiguration`1 configuration, T connectionString)
D:\ravendb\test\SlowTests\Server\Documents\ETL\RavenDB_8866.cs(43,0): at SlowTests.Server.Documents.ETL.RavenDB_8866.CanResetEtl()
SlowTests.Server.Documents.ETL.ConnectionStringTests.CanGetConnectionStringByName [FAIL]
System.InvalidOperationException : Use a valid connection
Stack Trace:
D:\ravendb\test\SlowTests\Server\Documents\ETL\SQL\SqlEtlTests.cs(73,0): at SlowTests.Server.Documents.ETL.SQL.SqlEtlTests.<>c.<.cctor>b__32_0()
at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
--- End of stack trace from previous location where exception was thrown ---
at System.Lazy`1.CreateValue()
D:\ravendb\test\SlowTests\Server\Documents\ETL\SQL\SqlEtlTests.cs(1132,0): at SlowTests.Server.Documents.ETL.SQL.SqlEtlTests.GetConnectionString(DocumentStore store)
D:\ravendb\test\SlowTests\Server\Documents\ETL\ConnectionStringTests.cs(171,0): at SlowTests.Server.Documents.ETL.ConnectionStringTests.CanGetConnectionStringByName()

@rmkerr

This is expected. There are over 3500 tests there and it takes around 30-40min for them to complete.

I suggest running dotnet xunit -c Release or dotnet xunit -c Release -verbose if you want to see which test is currently running.

Some of them might fail due to a missing license, or be skipped because of a missing SQL Server, but you can safely ignore that.

@rmkerr

I've pushed 1 additional commit to the branch https://github.com/ppekrol/ravendb/commits/v4.1-temp4 that disables test parallelization.

After that I ran, in test/SlowTests:

dotnet xunit -c Release -verbose > tests.txt

The following two tests failed with the exception that we are trying to trace:

    SlowTests.Tests.Sorting.AlphaNumericSorting.random_words_using_document_query_async(seed: 1472672940)
Request: http://127.0.0.1:54525/cluster/topology. Type: GET
Request: http://127.0.0.1:54525/admin/databases?name=random_words_using_document_query_async_1557&replicationFactor=1. Type: PUT
Request: http://127.0.0.1:54525/rachis/waitfor?index=6223. Type: GET
Request: http://127.0.0.1:54525/topology?name=random_words_using_document_query_async_1557. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/hilo/next?tag=tracks&lastBatchSize=0&lastRangeAt=0001-01-01T00:00:00.0000000&identityPartsSeparator=/&lastMax=0. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/configuration/client. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/operations/next-operation-id. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/bulk_insert?id=1. Type: POST
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/hilo/next?tag=tracks&lastBatchSize=32&lastRangeAt=2018-07-11T21:13:52.7935007Z&identityPartsSeparator=/&lastMax=32. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/hilo/next?tag=tracks&lastBatchSize=64&lastRangeAt=2018-07-11T21:13:52.8037418Z&identityPartsSeparator=/&lastMax=96. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/hilo/next?tag=tracks&lastBatchSize=128&lastRangeAt=2018-07-11T21:13:52.8135658Z&identityPartsSeparator=/&lastMax=224. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/hilo/next?tag=tracks&lastBatchSize=256&lastRangeAt=2018-07-11T21:13:52.8216998Z&identityPartsSeparator=/&lastMax=480. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/hilo/next?tag=tracks&lastBatchSize=512&lastRangeAt=2018-07-11T21:13:52.8433858Z&identityPartsSeparator=/&lastMax=992. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/admin/indexes. Type: PUT
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/stats. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/stats. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/stats. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/stats. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/queries?queryHash=1369170765125549859. Type: POST
Request: http://127.0.0.1:54525/admin/databases. Type: DELETE
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/stats?failure=check. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/stats?failure=check. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/hilo/return?tag=tracks&end=2016&last=1024. Type: PUT
      Raven.Client.Exceptions.RavenException : System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request
         --- End of inner exception stack trace ---
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
         at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
         --- End of inner exception stack trace ---
         at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
         at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
         at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
         at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
         at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
         at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in F:\Workspaces\HR\ravendb_2\src\Raven.Client\Http\RequestExecutor.cs:line 738
      Response.StatusCode - InternalServerError
      Stack Trace:
        F:\Workspaces\HR\ravendb_2\src\Raven.Client\Http\RequestExecutor.cs(835,0): at Raven.Client.Http.RequestExecutor.ThrowFailedToContactAllNodes[TResult](RavenCommand`1 command, HttpRequestMessage request)
        F:\Workspaces\HR\ravendb_2\src\Raven.Client\Http\RequestExecutor.cs(756,0): at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token)
        F:\Workspaces\HR\ravendb_2\src\Raven.Client\Documents\Session\AsyncDocumentQuery.cs(847,0): at Raven.Client.Documents.Session.AsyncDocumentQuery`1.ExecuteActualQueryAsync(CancellationToken token)
        F:\Workspaces\HR\ravendb_2\src\Raven.Client\Documents\Session\AsyncDocumentQuery.cs(839,0): at Raven.Client.Documents.Session.AsyncDocumentQuery`1.InitAsync(CancellationToken token)
        F:\Workspaces\HR\ravendb_2\src\Raven.Client\Documents\Session\AsyncDocumentQuery.cs(797,0): at Raven.Client.Documents.Session.AsyncDocumentQuery`1.ExecuteQueryOperation(Nullable`1 take, CancellationToken token)
        F:\Workspaces\HR\ravendb_2\test\SlowTests\Tests\Sorting\AlphaNumericSorting.cs(454,0): at SlowTests.Tests.Sorting.AlphaNumericSorting.random_words_using_document_query_async(Int32 seed)
        --- End of stack trace from previous location where exception was thrown ---
    SlowTests.Server.Documents.ModifyExistingDocument.ShouldThrowIfChangeCollection
Request: http://127.0.0.1:54525/cluster/topology. Type: GET
Request: http://127.0.0.1:54525/admin/databases?name=ShouldThrowIfChangeCollection_1623&replicationFactor=1. Type: PUT
Request: http://127.0.0.1:54525/rachis/waitfor?index=6497. Type: GET
Request: http://127.0.0.1:54525/topology?name=ShouldThrowIfChangeCollection_1623. Type: GET
Request: http://127.0.0.1:54525/databases/ShouldThrowIfChangeCollection_1623/docs?id=users%2F1. Type: PUT
Request: http://127.0.0.1:54525/databases/ShouldThrowIfChangeCollection_1623/configuration/client. Type: GET
Request: http://127.0.0.1:54525/databases/ShouldThrowIfChangeCollection_1623/docs?id=users%2F1. Type: PUT
Request: http://127.0.0.1:54525/admin/databases. Type: DELETE
      Assert.Contains() Failure
      Not found: Changing 'users/1' from 'Users' to 'UserAddresses' via update is not supported.
      Delete it and recreate the document users/1.
      In value:  System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request
         --- End of inner exception stack trace ---
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
         at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
         --- End of inner exception stack trace ---
         at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
         at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
         at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
         at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
         at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
         at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in F:\Workspaces\HR\ravendb_2\src\Raven.Client\Http\RequestExecutor.cs:line 738
      Response.StatusCode - InternalServerError
      Stack Trace:
           at SlowTests.Server.Documents.ModifyExistingDocument.ShouldThrowIfChangeCollection()
        --- End of stack trace from previous location where exception was thrown ---

Log is available here: https://drive.google.com/file/d/1syhM_yR2KwtWvzp7LGABag5_duU8cKLZ/view?usp=sharing

Thanks for the additional info. I'll try running the tests a few times to see if I hit the same error. You mentioned I should expect some failures without a license -- does this look about right?

=== TEST EXECUTION SUMMARY ===
   SlowTests  Total: 3349, Errors: 0, Failed: 518, Skipped: 96, Time: 113039.155s

I've been able to collect some interesting logs while running the test suite. The logging is missing some info that we added in 2.1, but it's still informative.

I initially thought this issue might be caused by a closed connection somehow ending up back in the connection pool, but the logs indicate that is not the case. The socket appears to be healthy until ~800ms after the request is initiated, at which point it gets closed for reasons that are still unclear.

I don't have any other conclusions yet, but there is a lot of data in the logs that I haven't been able to look at in detail yet. I'll update here if I find anything interesting.


Selected Logs:

| Event Name | Time MSec | Process Name | Rest |
|---|---|---|---|
| HandlerMessage | "204,123.38" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" handlerId=""9288003"" workerId=""5429267"" requestId=""46227566"" memberName=""SendAsyncCore"" message=""HttpConnection(HttpConnectionPoolhttps://127.0.0.1:63912): Sending request: Method: PUT RequestUri: 'https://127.0.0.1:63912/admin/databases?name=AskServerForClientCertificate_138&replicationFactor=1' Version: 2.0 Content: Raven.Client.Json.BlittableJsonContent Headers: { Raven-Client-Version: 4.1.0.41 Topology-Etag: ""0"" Transfer-Encoding: chunked Accept-Encoding: gzip Accept-Encoding: deflate Content-Encoding: gzip }"" ActivityID=""/#3076/1/3710/""" |
| HandlerMessage | "204,123.49" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" handlerId=""9288003"" workerId=""5429267"" requestId=""46227566"" memberName=""WriteToStreamAsync"" message=""HttpConnection(HttpConnectionPoolhttps://127.0.0.1:63912): Writing 747 bytes."" ActivityID=""/#3076/1/3710/""" |
| Enter        | "204,123.50" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" thisOrContextObject=""Socket#34540026"" memberName=""SendAsync"" parameters=""(AwaitableSocketAsyncEventArgs#7711490)"" ActivityID=""/#3076/1/3710/""" |
| DumpBuffer   | "204,123.53" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" thisOrContextObject=""AwaitableSocketAsyncEventArgs#7711490"" memberName=""LogBuffer"" buffer=""17030303030000000000000003162400..."" ActivityID=""/#3076/1/3710/""" |
| Exit         | "204,123.53" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" thisOrContextObject=""Socket#34540026"" memberName=""SendAsync"" result=""False"" ActivityID=""/#3076/1/3710/""" |
| Exit            | "204,123.66" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" thisOrContextObject=""HttpClient#55553870"" memberName=""SendAsync"" result=""System.Runtime.CompilerServices.AsyncTaskMethodBuilder1+AsyncStateMachineBox1[System.Net.Http.HttpResponseMessageSystem.Net.Http.DiagnosticsHandler+d__2]""" |
| Enter        | "204,944.19" | dotnet (3076) | "ThreadID=""7084"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""ReceiveAsync"" parameters=""(AwaitableSocketAsyncEventArgs#11480187)"" ActivityID=""/#3076/1/3722/1/""" |
| Exit         | "204,944.23" | dotnet (3076) | "ThreadID=""7084"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""ReceiveAsync"" result=""True"" ActivityID=""/#3076/1/3722/1/""" |
| Enter        | "204,944.30" | dotnet (3076) | "ThreadID=""7084"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""SendAsync"" parameters=""(AwaitableSocketAsyncEventArgs#60314028)"" ActivityID=""/#3076/1/3722/1/""" |
| Exit         | "204,944.40" | dotnet (3076) | "ThreadID=""7084"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""SendAsync"" result=""False"" ActivityID=""/#3076/1/3722/1/""" |
| Enter        | "204,949.57" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""InternalShutdown"" parameters=""how:Both"" ActivityID=""/#3076/1/3722/1/""" |
| Enter        | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Close"" parameters=""(-1)"" ActivityID=""/#3076/1/3722/1/""" |
| Info         | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Close"" message=""timeout = -1"" ActivityID=""/#3076/1/3722/1/""" |
| Info         | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Dispose"" message=""timeout = -1"" ActivityID=""/#3076/1/3722/1/""" |
| Enter        | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Dispose"" parameters="""" ActivityID=""/#3076/1/3722/1/""" |
| Info         | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Dispose"" message=""disposing:true CleanedUp:False"" ActivityID=""/#3076/1/3722/1/""" |
| Enter        | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Dispose"" parameters="""" ActivityID=""/#3076/1/3722/1/""" |
| Enter        | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""SetToDisconnected"" parameters="""" ActivityID=""/#3076/1/3722/1/""" |
| Info         | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Dispose"" message=""Calling _handle.CloseAsIs()"" ActivityID=""/#3076/1/3722/1/""" |
| Enter        | "204,949.66" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""UpdateStatusAfterSocketError"" parameters=""""" |
| ErrorMessage | "204,949.67" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""UpdateStatusAfterSocketError"" message=""errorCode:OperationAborted""" |
| Exit         | "204,950.00" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Dispose"" result="""" ActivityID=""/#3076/1/3722/1/""" |
| Exit         | "204,950.00" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Close"" result=""-1"" ActivityID=""/#3076/1/3722/1/""" |
| HandlerMessage | "204,954.88" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" handlerId=""66602504"" workerId=""65572011"" requestId=""65860881"" memberName=""SendAsyncCore"" message=""HttpConnection(HttpConnectionPoolhttps://127.0.0.1:63918): Error sending request: System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request --- End of inner exception stack trace --- at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error) at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token) at System.Net.Security.SslStreamInternal.g__InternalFillBufferAsync|38_0TReadAdapter at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request CancellationToken cancellationToken)"" ActivityID=""/#3076/1/3722/1/""" |

@rmkerr, thanks. Are you able to get a call stack that shows where the close on the socket is coming from? My hypothesis has been that we're putting a connection back into the pool while there's still a CancellationToken from a previous request registered to close it; the connection gets put back into the pool, another request takes it out to use it, and then the previous request's CancellationToken gets a cancellation signal and cancels the wrong connection. But I've been unable to find where in the code that could be happening; everywhere I've looked it seems we've handled this correctly. I still think it's the most likely explanation, so getting a stack for the close would be super useful.

I can't currently see the call stack, but here's what I was able to piece together from the logs:

HttpConnection.RegisterCancellation
HttpConnection.Dispose
NetworkStream.Dispose
Socket.InternalShutdown
Socket.Close
Socket.Dispose

That seems consistent with your theory. If it would be helpful I can try to capture another trace with more info. I can also share the existing trace if you'd like to take a look.

@stephentoub @rmkerr I'm wondering if this might be some sort of issue with managed Sockets instead of SocketsHttpHandler. Starting in early July, we started noticing flakiness in a number of Kestrel tests on Windows, and failing with the same exception. Specifically "System.IO.IOException : Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request."

We see the same failure when testing against both Kestrel's libuv and Socket-based transports which points to this being a client issue. I have not seen this failure when running tests on the desktop framework or on other OSs.

The main difference from the original issue is there's no HttpClient involved. Since Kestrel tests involve a bunch of invalid requests and weird client behavior, Kestrel uses its own Socket-based TestConnection for a large number of its functional tests. I can file a new issue if that's helpful, but I think the Kestrel test failures likely have the same root cause as the ravendb test failures.

There are a number of Kestrel functional tests that fail with this exception, but the test that reproduces this failure most reliably on my Windows 10 workstation is ConnectionNotClosedWhenClientSatisfiesMinimumDataRateGivenLargeResponseHeaders. This test verifies that Kestrel doesn't incorrectly close the connection due to a response body data rate timeout. Our logs indicate that there is indeed no timeout when this test fails.

Despite being the most reliable way to reproduce the failure, it's still fairly hard to reproduce. I haven't seen the issue when running the test in isolation. The best way to repro the failure is to clone the Kestrel repo, navigate to test\Kestrel.Transport.Sockets.FunctionalTests and run dotnet test -f netcoreapp2.2. The entire test project takes approximately 40-50 seconds, but you might need to run it several times to see the failure. The output for the repro looks like the following:

G:\dev\aspnet\KestrelHttpServer\test\Kestrel.Transport.Sockets.FunctionalTests [release/2.2 ≡ +0 ~3 -0 !]> dotnet test -f netcoreapp2.2
Microsoft (R) Build Engine version 15.7.179.6572 for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.

  Restore completed in 54.82 ms for G:\dev\aspnet\KestrelHttpServer\src\Kestrel.Core\Kestrel.Core.csproj.
  Restore completed in 54.64 ms for G:\dev\aspnet\KestrelHttpServer\src\Connections.Abstractions\Connections.Abstractions.csproj.
  Restore completed in 55.6 ms for G:\dev\aspnet\KestrelHttpServer\src\Kestrel\Kestrel.csproj.
  Restore completed in 54.8 ms for G:\dev\aspnet\KestrelHttpServer\src\Kestrel.Https\Kestrel.Https.csproj.
  Restore completed in 54.65 ms for G:\dev\aspnet\KestrelHttpServer\src\Kestrel.Transport.Abstractions\Kestrel.Transport.Abstractions.csproj.
  Restore completed in 54.64 ms for G:\dev\aspnet\KestrelHttpServer\src\Kestrel.Transport.Sockets\Kestrel.Transport.Sockets.csproj.
  Restore completed in 64.34 ms for G:\dev\aspnet\KestrelHttpServer\test\Kestrel.Transport.Sockets.FunctionalTests\Kestrel.Transport.Sockets.FunctionalTests.csproj.
Build started, please wait...
Build completed.

Test run for G:\dev\aspnet\KestrelHttpServer\test\Kestrel.Transport.Sockets.FunctionalTests\bin\Debug\netcoreapp2.2\Sockets.FunctionalTests.dll(.NETCoreApp,Version=v2.2)
Microsoft (R) Test Execution Command Line Tool Version 15.7.0
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...
[xUnit.net 00:00:22.14]     RunIndividualTestCase(testCase: hpack/4.2/1, HTTPS:True, Sends a dynamic table size update at the end of header block) [SKIP]
Skipped  RunIndividualTestCase(testCase: hpack/4.2/1, HTTPS:True, Sends a dynamic table size update at the end of header block)
[xUnit.net 00:00:25.20]     RunIndividualTestCase(testCase: http2/5.1/8, HTTPS:True, closed: Sends a DATA frame after sending RST_STREAM frame) [SKIP]
Skipped  RunIndividualTestCase(testCase: http2/5.1/8, HTTPS:True, closed: Sends a DATA frame after sending RST_STREAM frame)
[xUnit.net 00:00:26.05]     RunIndividualTestCase(testCase: http2/5.1/8, HTTPS:False, closed: Sends a DATA frame after sending RST_STREAM frame) [SKIP]
Skipped  RunIndividualTestCase(testCase: http2/5.1/8, HTTPS:False, closed: Sends a DATA frame after sending RST_STREAM frame)
[xUnit.net 00:00:33.79]     RunIndividualTestCase(testCase: hpack/4.2/1, HTTPS:False, Sends a dynamic table size update at the end of header block) [SKIP]
Skipped  RunIndividualTestCase(testCase: hpack/4.2/1, HTTPS:False, Sends a dynamic table size update at the end of header block)
[xUnit.net 00:00:34.99]     ConnectionNotClosedWhenClientSatisfiesMinimumDataRateGivenLargeResponseHeaders [FAIL]
Failed   ConnectionNotClosedWhenClientSatisfiesMinimumDataRateGivenLargeResponseHeaders
Error Message:
 System.IO.IOException : Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request..
---- System.Net.Sockets.SocketException : The I/O operation has been aborted because of either a thread exit or an application request.
Stack Trace:
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
   at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.AspNetCore.Server.Kestrel.FunctionalTests.ResponseTests.AssertStreamCompleted(Stream stream, Int64 minimumBytes, Int32 targetBytesPerSecond) in G:\dev\aspnet\KestrelHttpServer\test\Kestrel.Transport.FunctionalTests\ResponseTests.cs:line 901
   at Microsoft.AspNetCore.Server.Kestrel.FunctionalTests.ResponseTests.ConnectionNotClosedWhenClientSatisfiesMinimumDataRateGivenLargeResponseHeaders() in G:\dev\aspnet\KestrelHttpServer\test\Kestrel.Transport.FunctionalTests\ResponseTests.cs:line 856
--- End of stack trace from previous location where exception was thrown ---
----- Inner Stack Trace -----

Standard Output Messages:
 | [0.001s] TestLifetime Information: Starting test ConnectionNotClosedWhenClientSatisfiesMinimumDataRateGivenLargeResponseHeaders at 2018-08-14T02:10:59
 | [0.006s] Microsoft.AspNetCore.Hosting.Internal.WebHost Debug: Hosting starting
 | [0.007s] Microsoft.AspNetCore.Hosting.Internal.WebHost Debug: Hosting started
 | [0.007s] Microsoft.AspNetCore.Hosting.Internal.WebHost Debug: Loaded hosting startup assembly Sockets.FunctionalTests, Version=2.2.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60
 | [0.008s] Microsoft.AspNetCore.Server.Kestrel.FunctionalTests.ResponseTests Debug: Listening on 5000
 | [0.008s] Microsoft.AspNetCore.Server.Kestrel Debug: Listening on 5000
 | [0.009s] Microsoft.AspNetCore.Server.Kestrel.FunctionalTests.ResponseTests Debug: Connection id "0HLG1HCD3P866" started.
 | [0.009s] Microsoft.AspNetCore.Server.Kestrel Debug: Connection id "0HLG1HCD3P866" started.
 | [0.010s] Microsoft.AspNetCore.Hosting.Internal.WebHost Information: Request starting HTTP/1.1 GET http:///
 | [0.957s] Microsoft.AspNetCore.Hosting.Internal.WebHost Debug: Hosting shutdown
 | [0.970s] Microsoft.AspNetCore.Hosting.Internal.WebHost Information: Request finished in 960.4081ms 200
 | [0.970s] Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets Debug: Connection id "0HLG1HCD3P866" reset.
 | [0.970s] Microsoft.AspNetCore.Server.Kestrel.FunctionalTests.ResponseTests Debug: Connection id "0HLG1HCD3P866" disconnecting.
 | [0.970s] Microsoft.AspNetCore.Server.Kestrel Debug: Connection id "0HLG1HCD3P866" disconnecting.
 | [0.971s] Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets Debug: Connection id "0HLG1HCD3P866" sending FIN.
 | [0.974s] Microsoft.AspNetCore.Server.Kestrel.FunctionalTests.ResponseTests Debug: Connection id "0HLG1HCD3P866" stopped.
 | [0.974s] Microsoft.AspNetCore.Server.Kestrel Debug: Connection id "0HLG1HCD3P866" stopped.
 | [0.977s] TestLifetime Information: Finished test ConnectionNotClosedWhenClientSatisfiesMinimumDataRateGivenLargeResponseHeaders in 0.9759387s



Total tests: 363. Passed: 358. Failed: 1. Skipped: 4.
Test Run Failed.
Test execution time: 36.2769 Seconds

I also managed to collect a wireshark network trace which shows that the connection was indeed reset by the client despite there being no attempt in the managed test code to abort the connection. I can provide the pcap file containing the network trace upon request.

[Screenshot: Wireshark capture showing the client resetting the connection]

The above repro was run with version 2.2.0-preview1-26618-02 of the core runtime.

(copying notes from email)

I’ve done some simple tracing on this but nothing obvious pops out. I think we will need a more involved investigation here, unfortunately.

Some thoughts on how to make progress here:

The test suite is doing lots of connection resets, and none of them look suspicious in and of themselves. To narrow this down I think we should try just running this single test, but run it many times simultaneously, to try to induce the repro. If we can, then we can be pretty sure that any connection reset that happens is unexpected and we can investigate further from there.

Alternatively, we could try to reproduce this in standalone code, which might help reduce the variables here but is more work and may be hard to reproduce.

Thanks for the advice Geoff. I'll try running the failing test in a loop to see if I can hit it more consistently.

If running in a loop doesn't do it, perhaps running multiple concurrent instances will help.

Interestingly, I haven't even been able to get a failure from running the whole Kestrel.Transport.Sockets.FunctionalTests suite. I'm going to see if I can get things working on a separate machine, as it probably isn't worth diagnosing what is different about my current setup.

@rmkerr One thing that I think improves the reliability of the repro in the Kestrel.Transport.Sockets.FunctionalTests suite is running the H2SpecTests in parallel. These tests won't be enabled unless you are running Windows 8.1 (or the server equivalent) or higher.

It should be obvious whether or not the H2Spec tests are being run because you'll see about 359 passed tests in that project if they're being run and only about 71 passed tests if they're not.

I'm finally able to repro this consistently. Thanks for the help :)

I'm going to start with collecting System.Net logs, and see what I can work out from there.

For timing issues like this, I find it's often helpful to run some CPU intensive operation in the background, like a corefx build or HttpClient test suite.

For what it's worth, we had a production site fail with the same error today on await httpClient.SendAsync(request), and once it failed, it snowballed (the error showed up on every HTTP request) until I restarted the app. I'm interested to see it resolved.

That's a really interesting datapoint. Were the requests all to the same endpoint?

@rmkerr Have you seen https://github.com/aspnet/KestrelHttpServer/issues/2789 ?

This might be related, and there is a very small repro with 'UseLibuv' on the Kestrel side as a workaround. It suggests that the issue is in managed sockets. Maybe it will be easier to trace it there and see if it might help with this issue? Just an idea.

To the same host but not the same path

I don't see how https://github.com/aspnet/KestrelHttpServer/issues/2789 is related. Maybe I am missing something.

@geoffkizer

My thinking was that the issue here could be 'easily' reproduced by me (I had to run a 1h test suite) by spawning an external process with Kestrel and executing a few HTTP requests from the parent application. At the end, after killing the external process, all connections (unrelated ones, to different ports) were poisoned. Check here: https://github.com/dotnet/corefx/issues/30691#issuecomment-404318426

And https://github.com/aspnet/KestrelHttpServer/issues/2789 is doing a similar thing, but the error is different, yet from the same category of errors. I was thinking that they could be related and have a similar root cause. I might be wrong; this is pure speculation.

It's definitely possible that the issues are related, but it looks like aspnet/KestrelHttpServer#2789 manifests very differently than the issue tracked here. If I'm not able to get any useful data from the current repros it may be worth revisiting, though.

I've been trying to figure out whether the Kestrel test failure repros on its own while the CPU is under load, or if it also requires some interaction with one of the other test classes that run in parallel. I don't have an answer yet, but I have time to focus on this issue today and will report back when I know more.

I was finally able to collect some useful System.Net traces using perfview. So far I've only been able to confirm the observations made earlier in this thread: It's a socket abort, but the socket is not aborted in the test code, and the socket has not been open long enough for it to be a timeout.

I'm going to dig into the logs in more depth to see if I can gather any new info, but may end up having to enable some additional logging in order to learn anything new.

We are hitting this bug as well; errors occur randomly after lots of concurrent tasks and requests at PutAsync on the same client. Below are some exception stack traces.

We have tried changing settings on SocketsHttpHandler without success.

```c#
var handler = new SocketsHttpHandler
{
    AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate,
    PooledConnectionLifetime = TimeSpan.FromMinutes(1),
    ConnectTimeout = TimeSpan.FromSeconds(10),
};
```

After this error occurs, the next request on the HttpClient hangs on GetProxyForUrl.
![Visual Studio with process dump attached showing stack and hanging threads/tasks](https://i.imgur.com/bn9jTZy.png)

A theory:
This is happening in 3 different processes at the same time on the same server. We run the application inside 3 containers, and all get the same error at almost the same time. Can this error occur if we hit the maximum number of allowed sockets on the host OS?

Example 1

```
2:45:40 PM/00:00:12.7154296 System.OperationCanceledException: The operation was canceled.
at System.Net.Http.HttpClient.HandleFinishSendAsyncError(Exception e, CancellationTokenSource cts)
at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
at Sing.Price.Actor.Gateways.StoriesGateway.PutWorkAsync(ExpositionWork expositionWork, Price price) in D:\BA\1_work\833\s\src\Sing.Price.Actor\Price.Actor\Gateways\StoriesGateway.cs:line 41
at Sing.Price.Actor.Application.Processor.Process(ExpositionWork story) in D:\BA\1_work\833\s\src\Sing.Price.Actor.Application\Processor.cs:line 65
at Sing.Price.Actor.Program.<>c__DisplayClass12_0.<b__3>d.MoveNext() in D:\BA\1_work\833\s\src\Sing.Price.Actor\Price.Actor\Program.cs:line 121
```

Example 2

```
4:48:09 AM/00:00:18.6366218 System.Threading.Tasks.TaskCanceledException: The operation was canceled. ---> System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Net.Sockets.NetworkStream'.
at System.Net.Sockets.NetworkStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
at System.Net.Http.HttpConnection.WriteToStreamAsync(ReadOnlyMemory`1 source)
at System.Net.Http.HttpConnection.FlushAsync()
at System.Net.Http.HttpConnection.SendRequestContentAsync(HttpRequestMessage request, HttpContentWriteStream stream, CancellationToken cancellationToken)
at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
at Sing.Price.Actor.Gateways.StoriesGateway.PutWorkAsync(ExpositionWork expositionWork, Price price) in D:\BA\2_work\1117\s\src\Sing.Price.Actor\Price.Actor\Gateways\StoriesGateway.cs:line 41
at Sing.Price.Actor.Application.Processor.Process(ExpositionWork story) in D:\BA\2_work\1117\s\src\Sing.Price.Actor.Application\Processor.cs:line 60
at Sing.Price.Actor.Program.<>c__DisplayClass11_0.<b__3>d.MoveNext() in D:\BA\2_work\1117\s\src\Sing.Price.Actor\Price.Actor\Program.cs:line 128
```

Example 3

```
4:48:11 AM/00:00:10.0411834 System.Threading.Tasks.TaskCanceledException: A task was canceled.
at System.Threading.Tasks.ValueTask`1.get_Result()
at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
at Sing.Price.Actor.Gateways.StoriesGateway.GetWorkAsync() in D:\BA\2_work\1117\s\src\Sing.Price.Actor\Price.Actor\Gateways\StoriesGateway.cs:line 26
at Sing.Price.Actor.Program.<>c__DisplayClass11_0.<b__3>d.MoveNext() in D:\BA\2_work\1117\s\src\Sing.Price.Actor\Price.Actor\Program.cs:line 127
```

Does the ASP.NET repro involve a CancellationToken firing at any level of the stack?

I'm wondering if https://github.com/dotnet/coreclr/issues/19839 may be related. It's possible it's not, or it's possible there are multiple unrelated issues here. But the behavior that results from the bug in that issue could potentially result in the kind of symptom originally reported, since if the right sequence happened, we could end up closing an unrelated connection:

  1. Connection is used for request A.
  2. As part of processing A, an operation involving a CancellationToken is invoked.
  3. That operation finishes, but concurrently with it finishing, the token has cancellation requested.
  4. The processing of request A unregisters from the CancellationToken, and CancellationTokenRegistration.Dispose() returns, which should signal that either the registration ran or will never run, but because of that issue, it's possible it could still end up running subsequently.
  5. The connection is returned to the pool, as that CancellationToken's registrations continue to run.
  6. The connection is taken from the pool to serve request B.
  7. The CancellationToken's processing finally gets around to running the supposed-to-have-been-removed-but-still-there registration from the previous request, and it ends up closing the connection now used by B.

Again, this is just a theory, and it may be unrelated. I'm working on a fix for that issue, and hopefully folks able to repro this issue could then try out the new corelib with the fix to validate whether it does or doesn't address the problem.
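
To make the suspected interleaving concrete, here is a minimal sketch of the pattern (the Connection type, the pool, and the timing below are hypothetical stand-ins, not the actual HttpConnection code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a pooled connection; in the real code Dispose() closes the socket.
sealed class Connection : IDisposable
{
    public volatile bool Closed;
    public void Dispose() => Closed = true;
}

class RegistrationRaceSketch
{
    static readonly ConcurrentQueue<Connection> Pool = new ConcurrentQueue<Connection>();

    static async Task Main()
    {
        var connection = new Connection();
        using var cts = new CancellationTokenSource();

        // Steps 1-2: request A uses the connection and registers "abort this connection on cancellation".
        CancellationTokenRegistration registration = cts.Token.Register(() => connection.Dispose());

        // Step 3: request A finishes while cancellation is being requested concurrently.
        Task cancellation = Task.Run(() => cts.Cancel());

        // Step 4: request A unregisters. Dispose() should guarantee the callback already ran or never will;
        // the coreclr issue above is about that guarantee being broken.
        registration.Dispose();

        // Steps 5-6: the connection goes back to the pool and is handed to request B.
        Pool.Enqueue(connection);
        Pool.TryDequeue(out Connection usedByRequestB);

        await cancellation;

        // Step 7: if the stale registration fired after Dispose(), request B now holds a dead connection.
        Console.WriteLine($"Connection handed to request B was closed: {usedByRequestB.Closed}");
    }
}
```

(Even with correct semantics the callback may legitimately run before Dispose() returns, in which case request A itself observes the close; the bug is specifically about the callback running after Dispose() has returned.)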

There are definitely cancellations occurring in the test suite. We don't currently have the logging we need to say for sure if the specific failure is caused by a cancellation token firing, but it's possible. @halter73 can you confirm?

I do have a negative result to share. I'm not sure it got me any closer to a solution, but the info may be useful. I don't think any of the data conflicts with what you proposed, Stephen.

Geoff suggested looking into the possibility that we were double closing a socket handle, and proposed that the following situation might be occurring:

  1. One instance of the test creates managed Socket A with native socket handle 0x01 (or whatever), uses it and then closes it. At this point the handle can be reused by the OS.
  2. Another instance creates managed Socket B, which gets socket handle 0x01 again.
  3. There’s some sort of bug that causes Socket A to close handle 0x01 again.
  4. This causes the OS to return OperationAborted on the pending receive for Socket B.

To check this I captured both System.Net logs and Winsock logs and tracked down the socket that was unexpectedly reset. Correlating the logs showed that the socket close is triggered in the .NET layer, not Winsock as we would expect if the above were occurring. I can share those logs and my notes if anyone is interested.

Neither Kestrel server itself nor any of the client code in the Socket transport functional tests pass a CancellationToken to any Socket/NetworkStream operations.

Neither Kestrel server itself nor any of the client code in the Socket transport functional tests pass a CancellationToken to any Socket/NetworkStream operations.

Thanks, though that's not exactly what I was wondering. Do any cancellation tokens fire in the repro, and are there any registrations with those tokens that would cause the observed behavior if they were to fire unexpectedly, e.g. after a registration was disposed? For example, does any registration abort a connection?

Again, I don't know that it's related, I'm just speculating. We could test the theory if someone who can run the repro could try it with my fix in https://github.com/dotnet/coreclr/pull/20145. If it still repros even with that fix, then it's unrelated. If it doesn't, there's a really good chance that's the culprit.

Ah. I looked at the PR and see it's a CancellationToken, not a Socket bug. I still doubt it's related since each Kestrel test starts a new server and connection(s).

The test that fails most frequently due to this issue makes a single request to a newly instantiated server. This test is verifying that the request/response does not get canceled, and all our logs indicate that this is indeed the case, but the client resets the connection anyway while claiming the I/O operation was aborted even though it wasn't.

That being said, I can give the repro a shot tomorrow. Are build artifacts for the PR available anywhere, or do I need to build it myself?

This test is verifying that the request/response does not get canceled, and all our logs indicate that this is indeed the case

Does that mean no CTS has cancellation requested? If that's the case then for better or worse it's definitely not the issue I suggested.

@rmkerr It sounds like you're able to correlate the exact Close that's causing the exceptions, right?

You might be able to figure out what's causing the Close by capturing call stacks for your net events. See here: https://stackoverflow.com/questions/42970508/c-sharp-etw-how-view-call-stack-in-perfview

@geoffkizer Yeah. It takes a while to capture the logs, but once I do I can find the failure. I'll give the additional logging a try.

@halter73 Thanks for clarifying :)

Does that mean no CTS has cancellation requested?

Correct. There is no CTS that gets cancelled until after the bug has occurred.

Are there any updates on this issue? We are still running into it in our tests. FYI, it seems like we only see this failure on Windows 10.

@JunTaoLuo no update yet. Our work was interrupted by other high-priority items, but this is now near top priority for @rmkerr.

I captured System.Net logs and Wireshark traces of another failure. Here's where I see the failure in Wireshark:

1990370 214.439692    127.0.0.1 → 127.0.0.1    TCP 84 51686 → 51687 [FIN, ACK] Seq=1 Ack=10334254 Win=525568 Len=0
1990371 214.439736    127.0.0.1 → 127.0.0.1    TCP 84 51687 → 51686 [ACK] Seq=10334254 Ack=2 Win=525568 Len=0
1990408 274.618073    127.0.0.1 → 127.0.0.1    TCP 84 51687 → 51686 [RST, ACK] Seq=10334254 Ack=2 Win=0 Len=0

The RST looks unusual to me, especially since it's occurring after the connection should have been closed. The timestamp on the RST makes it look like it was sent 60 seconds after the connection was closed, but I'm still investigating whether that's accurate or just an issue with the logs I collected.

I'm going to try to capture another network trace to see if the issue occurs after a FIN again, since that could be valuable info.

We are seeing the exact same thing when running our applications on Linux. A temporary "workaround" for us has been to set the timeout to 100 seconds (it's not an acceptable workaround).

When we have the timeout set to something more reasonable, like 10 seconds, we get the error roughly every 500 requests.

The HTTP endpoint that we are calling is under our control, and when the error happens, we can't even see that the request has reached the service. So it seems like the request doesn't leave the box.
We are using .NET Core 2.1.

I've started a stopwatch for each request, and I'm catching TaskCanceledException and logging the following (this is when we have the timeout set to 15 seconds):

Task was cancelled after 15000ms System.Threading.Tasks.TaskCanceledException:
The operation was canceled. ---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled. ---> System.Net.Sockets.SocketException: Operation canceled
--- End of inner exception stack trace ---
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
at System.Net.Security.SslStreamInternal.g__InternalFillBufferAsync|38_0TReadAdapter
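
The measurement itself is roughly the following sketch (client and request stand for whatever the calling code already uses; the console is a placeholder for the real logger):

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

static class TimedSend
{
    public static async Task SendWithTimingAsync(HttpClient client, HttpRequestMessage request)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            using HttpResponseMessage response = await client.SendAsync(request);
            // ... handle the response ...
        }
        catch (TaskCanceledException ex)
        {
            // Produces log lines like the one quoted above: elapsed time plus the exception details.
            Console.WriteLine($"Task was cancelled after {sw.ElapsedMilliseconds}ms {ex}");
        }
    }
}
```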

@joseftw Thanks for the additional info. It's possible the behavior you're seeing is related, but it sounds more like an actual timeout. In this case there are two differences from what I've observed: (1) in all other cases I've seen, the connection is actually sending the data, as observed via Wireshark, and (2) in other cases the failure occurs well before a timeout should have occurred.

There's an open issue right now (#20296) tracking the fact that we throw TaskCanceledException on an actual request timeout. I think that's probably the actual root cause here.

If you think that your issue isn't actually being caused by timeouts though, feel free to open a new issue with all of the info you shared here.

Hey @halter73, I've been running your repro targeting .NET Core 2.2. In order to test whether any of the recent changes in master help, I want to run it against 3.0. Any suggestions on how to do that?

You can use Kestrel's master branch which targets 3.0. You'll probably want to use a commit prior to https://github.com/aspnet/KestrelHttpServer/commit/2a610ee1b8d00eed8afe77252c966ecc1e695272 where I moved the h2spec tests to a new test group/project. This seems to prevent repro or at least reduce the likelihood of it occurring.

@rmkerr

Interesting development. We have encountered this issue after adding to our test suite a test that spawns a process running a PowerShell script (a very simple one that writes 1 line to a file). ~30 seconds after this test there is a failure in 1-3 random subsequent tests.

Because of that we thought this might be related to processes inheriting handles from the parent one, and it appears that we are right.

We have made a PR that unskips those tests and switches the implementation of Process to one that sets 'inherit handles' to 'false'. After doing so everything seems to be fine; it looks like the issue is with handle inheritance and the behavior of those handles after the child process ends.

The PR in question is here: https://github.com/ravendb/ravendb/pull/8028

Our idea is that, at least as a workaround, it would be good to be able to set the handle-inheritance flag at https://github.com/dotnet/corefx/blob/f798c1a103c9b7c899ea43e0d978f7e1f7458473/src/System.Diagnostics.Process/src/System/Diagnostics/Process.Windows.cs#L575 to 'false'. Maybe even via reflection, by setting some private field inside the Process class.

That's _very_ interesting -- thanks for sharing that info. I briefly looked into the possibility that this was related to socket handles a few weeks ago after observing similar behavior with the h2spec process spawned by the Kestrel tests. It's probably worth looking into in more detail if you're seeing the same behavior.

(Note that as of 3.0, socket handles are no longer inherited: https://github.com/dotnet/corefx/pull/32903)

@stephentoub that would probably solve the issue, but:

  1. Can this be applied to 2.1?

or

  1. Can you add a private bool _inheritHandles = true field to Process.cs that we would at least be able to set to false via reflection? (A rough sketch of what we mean is below.)
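
Roughly what we have in mind (a hypothetical sketch: the _inheritHandles field does not exist today, it is exactly what we are asking to be added; the script path is a placeholder):

```csharp
using System.Diagnostics;
using System.Reflection;

var process = new Process
{
    // Placeholder command/script; any child process would do.
    StartInfo = new ProcessStartInfo("Powershell.exe", "script.ps1")
};

// Hypothetical: this only does something if a private '_inheritHandles' field is added to Process.
FieldInfo inheritHandlesField = typeof(Process).GetField(
    "_inheritHandles", BindingFlags.Instance | BindingFlags.NonPublic);
inheritHandlesField?.SetValue(process, false);

process.Start();
process.WaitForExit();
```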

In Kestrel, we moved the h2spec tests to their own test project about two weeks ago, and I don't think we've seen these SocketExceptions at all since then. If we determine that dotnet/corefx#32903 fixes these SocketExceptions, maybe we should look at backporting it.

maybe we should look at backporting it.

If that actually "fixes" it, I have trouble believing it's anything other than masking a symptom, and we should actually find the root cause and fix that.

Unless we believe the root cause really is that the socket handle is getting inherited and the child process closing is somehow causing problems for the socket? In which case we should be able to repro that pretty easily, and if that's actually the problem, then I agree it would be a candidate for backporting.

Fixing the root cause would be best, but until then could you consider applying point 2 so we can work around this?

but until then could you consider applying point 2 so we can work around this?

Applying it where... as a servicing fix to 2.1? That's unlikely.

Would adding that private field in Process.cs (probably Process.Windows.cs) and using it instead of https://github.com/dotnet/corefx/blob/f798c1a103c9b7c899ea43e0d978f7e1f7458473/src/System.Diagnostics.Process/src/System/Diagnostics/Process.Windows.cs#L575 be a problem?

The bar for servicing fixes is very high. You're suggesting adding unexposed, unvalidated, functionality-accessible-only-via-reflection-and-that-we-encourage-developers-to-use-via-reflection as a servicing fix. Hence "unlikely". If you need it immediately, you can of course make the change in your own local fork.

Let's get to the bottom of the problem and then figure out the right fix.

cc: @danmosemsft

I agree with @stephentoub.
I feel we are getting closer. @rmkerr has already invested weeks of his time trying to get to the bottom of this (incl. extensive discussions with the Winsock team in Windows). Let's see if this new information can help root cause it and address the root cause.
If it doesn't help too much, we can think about either porting the change dotnet/corefx#32903 to 2.1, or doing something to enable people to work around the suspected root cause.

I've been trying to build a repro that launches a process and then has several sockets sending and receiving data, but I have not been able to force the issue to occur. @ppekrol, would it be possible to set up a project with just the affected test and the code you're using to spawn the process?

If not I'll put some more time into the repro when I'm back in the office next week.

Hi @rmkerr
Apologies for the delay; I really doubt that this would be doable. We already spent a significant number of hours trying to isolate the issue and create a smaller subset of tests, but for an unknown reason it only appears for us on a larger set of tests, e.g. our full suite. I do not have an idea what else could be done to isolate this. We tried moving that from the SlowTests project (takes around 20 min to complete) to FastsTests (around 90 seconds), but it does not reproduce then.

No problem @ppekrol. The info you shared is still valuable. I'm back in the office today, and will spend some time trying to cut down the size of the repro.

I'm able to reproduce the behavior described by @ppekrol with a modified version of the kestrel test suite.

To start, I disabled all of the tests except for the H2Spec tests (which are launched as a separate process) and ConnectionNotClosedWhenClientSatisfiesMinimumDataRateGivenLargeResponseHeaders, which was the test that I observed failing the most frequently. In order to reproduce the issue more consistently, I set up the ConnectionNotClosed... test to run five times.

I was able to further minimize the repro by replacing the H2Spec tests with the following code:

```csharp
[Fact]
public void LaunchProcess()
{
    for (int i = 0; i < 100; ++i)
    {
        //Thread.Sleep(100);
        Process.Start("Powershell.exe", @"""C:\Tools\HelloWorld.ps1""").WaitForExit();
    }
}
```

The targeted script just prints "hello world" to the console.

When running the test suites after this modification, I was able to reproduce the failure on 6 out of 10 runs. After replacing Process.Start with Thread.Sleep (as commented out above), the tests passed on 10 out of 10 runs.

Obviously this is pretty rough testing, and there isn't a lot of data. But I do think the behavior is interesting.

@stephentoub I think you mentioned you're looking into this one ...

I haven't but it's on my list for some point...

I'm not sure if this is related, but we are seeing this issue very consistently on one test in our test suite when we have a debugger attached (it works when the debugger is not attached). The only way we've been able to get around it is to disable keep-alive. Disabling keep-alive also resolved a lot of other random timeout problems we were having. Since I can't make our source public, I'd be happy to work with someone at Microsoft directly to dig into this further.
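
For reference, disabling keep-alive from the client side amounts to sending Connection: close on every request; a minimal sketch (whether this is exactly how we did it isn't spelled out above):

```csharp
using System.Net.Http;

// Ask the handler not to reuse pooled connections by sending "Connection: close" on every request.
var client = new HttpClient(new SocketsHttpHandler());
client.DefaultRequestHeaders.ConnectionClose = true;
```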

@stephentoub is it still on your radar?

I'd like to add that I'm also seeing this issue consistently in my application when POSTing to a popular service's API, in addition to random timeouts. In some cases the exception happens in isolation and in other cases it seems to affect other concurrent requests (these are 3 separate HttpClient.SendAsync calls):

16:26:23.990: Exception: System.Threading.Tasks.TaskCanceledException: The operation was canceled. ---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled. ---> System.Net.Sockets.SocketException (125): Operation canceled
   --- End of inner exception stack trace ---
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
   at System.Net.Security.SslStream.<FillBufferAsync>g__InternalFillBufferAsync|215_0[TReadAdapter](TReadAdapter adap, ValueTask`1 task, Int32 min, Int32 initial)
   at System.Net.Security.SslStream.ReadAsyncInternal[TReadAdapter](TReadAdapter adapter, Memory`1 buffer)
   at System.Net.Http.HttpConnection.FillAsync()
   at System.Net.Http.HttpConnection.ReadNextResponseHeaderLineAsync(Boolean foldedHeadersAllowed)
   at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithNtConnectionAuthAsync(HttpConnection connection, HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   ...
16:26:23.991: Exception: System.OperationCanceledException: The operation was canceled.
   at System.Net.Http.HttpClient.HandleFinishSendAsyncError(Exception e, CancellationTokenSource cts)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   ...
16:26:23.991: Exception: System.Threading.Tasks.TaskCanceledException: The operation was canceled.
   at System.Net.Http.ConnectHelper.EstablishSslConnectionAsyncCore(Stream stream, SslClientAuthenticationOptions sslOptions, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean allowHttp2, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.GetHttpConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   ...

This is the only remaining issue preventing me from switching from framework to core. The same application on framework (using an HttpWebRequest implementation) has never encountered this type of exception nor timeouts.

  • No use of cancellation tokens
  • Static HttpClient shared across threads
  • Ubuntu 18.04
  • .NET core 3 preview 6

```csharp
var socketsHttpHandler = new SocketsHttpHandler {
    AllowAutoRedirect = false,
    MaxConnectionsPerServer = int.MaxValue,
    PooledConnectionIdleTimeout = TimeSpan.FromSeconds((3 * 60) - 5), // Server spec states idle connections are closed after 3 minutes
    AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate,
    Proxy = null,
    UseProxy = false
};

_HttpClient.DefaultRequestHeaders.ExpectContinue = false;
_HttpClient.DefaultRequestHeaders.ConnectionClose = false;
_HttpClient.DefaultRequestHeaders.CacheControl = new CacheControlHeaderValue() { NoCache = true };
```

@tomwar, can you share a complete repro?

Hi Stephen, I will try to build a complete repro this week.

Thanks, @tomwar.

Hi @stephentoub

Is this issue solved? We are facing the same error. This happens with multiple threads. Works fine with single thread. Any help will be highly appreciated.
Thanks

Unhandled Exception Occurred.: An existing connection was forcibly closed by the remote host
at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithNtConnectionAuthAsync(HttpConnection connection, HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at Microsoft.Extensions.Http.Logging.LoggingHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at Microsoft.Extensions.Http.Logging.LoggingScopeHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
at TU.RSS.PES.IntegrationPartner.API.Controllers.ControllerBase.Post[T](String endpoint, Object postData) in C:\TFS\MicroBusiness\PreEmployment\Source\Dev\app\TU.RSS.PES.IntegrationPartner.Api\Controllers\ControllerBase.cs:line 156
at TU.RSS.PES.IntegrationPartner.API.Controllers.ExamsController.Post(Int32 screeningRequestApplicantId, ExamRequestModel examRequest) in C:\TFS\MicroBusiness\PreEmployment\Source\Dev\app\TU.RSS.PES.IntegrationPartner.Api\Controllers\ExamsController.cs:line 47
at lambda_method(Closure , Object )
at Microsoft.Extensions.Internal.ObjectMethodExecutorAwaitable.Awaiter.GetResult()
at Microsoft.AspNetCore.Mvc.Internal.ActionMethodExecutor.TaskOfActionResultExecutor.Execute(IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments)
at System.Threading.Tasks.ValueTask`1.get_Result()
at Microsoft.AspNetCore.Mvc.Internal.ControllerActionInvoker.InvokeActionMethodAsync()
at Microsoft.AspNetCore.Mvc.Internal.ControllerActionInvoker.InvokeNextActionFilterAsync()
at Microsoft.AspNetCore.Mvc.Internal.ControllerActionInvoker.Rethrow(ActionExecutedContext context)
at Microsoft.AspNetCore.Mvc.Internal.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
at Microsoft.AspNetCore.Mvc.Internal.ControllerActionInvoker.InvokeInnerFilterAsync()
at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.InvokeNextExceptionFilterAsync()
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
at System.Net.Security.SslStreamInternal.<FillBufferAsync>g__InternalFillBufferAsync|38_0[TReadAdapter](TReadAdapter adap, ValueTask`1 task, Int32 min, Int32 initial)
at System.Net.Security.SslStreamInternal.ReadAsyncInternal[TReadAdapter](TReadAdapter adapter, Memory`1 buffer)
at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)    

We are facing the same error. This happens with multiple threads. Works fine with single thread. Any help will be highly appreciated.

Can you share a repro?

We have an API in .NET Core 2.2 that is talking to an internal API (.NET Core 2.2). Single requests are working fine, but when sending multiple requests through our automation using multiple threads it fails with this error. Is this what you are looking for?

Is this what you are looking for?

No, I was actually looking for code that I could run to see the issue. What you describe is very unlikely to be the same underlying problem as this issue.

I will work on it and get back to you. How do I give the code to you?

How do I give the code to you?

You can post it here. Hopefully you can create a standalone repro that highlights what you're seeing and is generally shareable.

I also had this issue and was working on code to reproduce it, but I realized it can be reproduced fairly trivially by setting HttpClient.Timeout to an extremely low value and sending an HTTPS POST to an IIS Express server running the default ASP.NET Core project template. However, that might be a "false positive": in my case there really was a timeout, and the problem would be that the error message is misleading (it should clearly say there was a timeout instead of "either a thread exit or an application request").

Setting AppContext.SetSwitch("System.Net.Http.UseSocketsHttpHandler", false); also caused an exception due to a timeout, it just changed the error message and stack trace.

Correction: not exactly the default ASP.NET Core project template; it needs to be modified to add a POST operation, and that operation must wait some delay (longer than the client's timeout) before returning a response.
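
Roughly the client side of that repro (a sketch: the URL/port are placeholders for the local IIS Express instance, and the server's POST action is assumed to delay longer than the client timeout):

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class TimeoutRepro
{
    static async Task Main()
    {
        // Deliberately tiny timeout so the request is cancelled client-side before the server replies.
        var client = new HttpClient { Timeout = TimeSpan.FromMilliseconds(50) };

        try
        {
            // Placeholder address: a local IIS Express server whose POST action waits longer than 50 ms.
            await client.PostAsync("https://localhost:44300/api/values",
                new StringContent("{}", Encoding.UTF8, "application/json"));
        }
        catch (TaskCanceledException ex)
        {
            // The inner IOException/SocketException still mentions "thread exit or an application request",
            // even though this is really just a client-side timeout.
            Console.WriteLine(ex);
        }
    }
}
```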

Thanks @ckarras.
@stephentoub We created a repro, but the error is not happening in it, though we are clearly seeing it in our API. Any ideas? I'm going to try the timeout approach that ckarras mentioned.

Since I’m facing this issue I wanted to add my input to help find a solution.

@stephentoub I can add that I get the error in System.Net.Http.HttpConnection.SendAsyncCore.

Some details about my specific case:
I'm currently facing this issue in an Azure Durable Functions project (.NET Core 2.2) using HttpClient, which I use to integrate with our API.
In some cases, the durable function runs up to 5 POSTs in a row to the API via asynchronous function chaining, and I can see the error mentioned here in the same activity function several times, but not always. I'm thinking it is most likely caused by a timeout, since nothing is wrong in the API from what I can see.

If you have any questions about the implementation of this please let me know.

When I compare, I can see similarities to the previously posted logs.
From my logs:

```
Function 'function name (Activity)' failed with an error. Reason: System.Threading.Tasks.TaskCanceledException: The operation was canceled. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request --- End of inner exception stack trace --- at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error) at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token) at System.Net.Security.SslStreamInternal.<FillBufferAsync>g__InternalFillBufferAsync|38_0[TReadAdapter](TReadAdapter adap, ValueTask1 task, Int32 min, Int32 initial)
at System.Net.Security.SslStreamInternal.ReadAsyncInternal[TReadAdapter](TReadAdapter adapter, Memory1 buffer) at System.Net.Http.HttpConnection.FillAsync() at System.Net.Http.HttpConnection.ReadNextResponseHeaderLineAsync(Boolean foldedHeadersAllowed) at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken) --- End of inner exception stack trace --- at xx.DurableFunctions.Activity.ArchivePdfAfterBPMApprovedAct.Run(DurableActivityContext context, IWebApiAdapter webApiAdapter, IEmailHelper emailHelper, ILogger log) in C:\xx\PN.DF.DurableFunctions\Activity\ArchivePdfAfterBPMApprovedAct.cs:line 59 at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker2.InvokeAsync(Object instance, Object[] arguments) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionInvoker.cs:line 52
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.FunctionInvocationFilterInvoker.InvokeAsync(Object instance, Object[] arguments) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 1176
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.FunctionInvocationFilterInvoker.InvokeAsync(Object instance, Object[] arguments) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 1213
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeAsync(IFunctionInvoker invoker, ParameterHelper parameterHelper, CancellationTokenSource timeoutTokenSource, CancellationTokenSource functionCancellationTokenSource, Boolean throwOnTimeout, TimeSpan timerInterval, IFunctionInstance instance) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 584
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 531
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, IFunctionOutputDefinition outputDefinition, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 467
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 277
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 321
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.TryExecuteAsyncCore(IFunctionInstanceEx functionInstance, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 117. IsReplay: False. State: Failed. HubName: DFTaskHubNameProdV1. AppName: PNDigitalFormsDurableFunc. SlotName: Production. ExtensionVersion: 1.8.0. SequenceNumber: 9014.

```

@stephentoub I see you moved it to Future, while we had it tracking for servicing 2.1/2.2 (and 3.0). Was there new info about the scenario impact? AFAIK we had 2-3 independent customers affected.

@karelz, no one has been able to supply a repro, and I can't repro it.

@rmkerr was able to repro it locally. He was able to minimize the repro (to running one test massively in parallel). I asked him to transfer it to you. Did it not happen? I can check my old emails to see if I was on cc.

@karelz, he pointed me to https://github.com/dotnet/corefx/issues/30691#issuecomment-443926134, but that's from 8 months ago, and a lot has changed since then in both the product and in the ASP.NET tests he refers to. I tried various permutations of what he suggested and never got such a failure.

@ppekrol, have you moved to .NET Core 3.0? If yes, are you still hitting this?

@stephentoub would it make sense to ask someone on our team to try to sync back and repro it on older 3.0 build or on 2.2? (with Max's steps)

@stephentoub We haven't moved to 3.0 yet. We will start migration closer to the release date, probably when RC builds hit the shelves.

I might be wrong, but I see two separate issues here that people are submitting: the 1st is probably expected (the operation has timed out) and the 2nd one (the one that we submitted) is more generic (an error occurred while sending the request).

I might be wrong, but I see two separate issues here that people are submitting: the 1st is probably expected (the operation has timed out) and the 2nd one (the one that we submitted) is more generic (an error occurred while sending the request).

Yes, there are multiple reasons this could occur. As you say, most are expected error conditions; that's just the nature of networking. The one you highlighted is not expected and is much more elusive.

@stephentoub would it make sense to ask someone on our team to try to sync back and repro it on older 3.0 build or on 2.2?

@karelz, if you think it will be fruitful, go for it.

@scalablecory can you please help with it?

Have you tried playing with the MaxConnectionsPerServer setting on SocketsHttpHandler?
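
For reference, a minimal sketch of setting that limit (the value 20 is arbitrary; by default SocketsHttpHandler does not cap connections per server):

```csharp
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    // Cap the number of concurrent connections per server; 20 is just an example value.
    MaxConnectionsPerServer = 20
};
var client = new HttpClient(handler);
```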

I am running into this too; if I can get any real information I will share it.

For me it stems from System.IO.Stream.ReadAsync.

Unable to replicate with current code; I tried pulling old versions of RavenDB and ASP.NET, and it didn't happen after some playing around either.

Going to close this; we can open a new ticket if we find something in 3.0. Many of the reproductions in this thread appear to be different from what was originally reported, so it's hard to get a solid idea of how much relevance the issue has today.

This issue drove me crazy for a really long time. In case it helps anyone, I added some retries with Polly:

```csharp
var apiExceptionPolicy = Policy
    .Handle<Refit.ApiException>() // in my case I was using Refit, but change this to HttpRequestException or whatever works for you
    .RetryAsync(3, (exception, retryCount) => Task.Delay(50));

await apiExceptionPolicy.ExecuteAsync(() => randomlyFailingCode());
```

This seems to have fixed it for me - I haven't had any random exceptions during local development or emails from the production server for several weeks now 👍
