Runtime: HttpClient throwing System.Net.Sockets.SocketException after 21 seconds

Created on 14 Dec 2018  ·  32 Comments  ·  Source: dotnet/runtime

We are getting random 21-second timeouts when making HTTP calls using HttpClient. As far as I know this is a low-level TCP timeout. The exception message is the same as in issue dotnet/runtime#27232. What causes these errors?

The real-world use case is that we have an Azure Web App running the latest version of .NET Core 2.1. When this web app receives a request, it makes a request (using HttpClient) to another API that is accessed through Azure API Management. When our web app gets maybe 30+ requests at the same time (causing 30+ requests to be made to APIM), we might get this timeout on some of the requests made to APIM; after a retry it works.

I am trying to figure out why this timeout is happening. Can this exception indicate an error in my code, setup, or web app performance? Or is the issue in API Management?

Labels: area-System.Net.Http, needs more info

Most helpful comment

@jarrodd07 I still have an active Azure support ticket on this. It's been exactly 2 months today since I created it. It's been escalated to the engineering team (multiple teams are looking into it, actually) and I will update this thread once I hear anything new.

All 32 comments

Are you using HttpClientFactory?

Yes, I am using a typed client with Polly for retries.

Maybe a silly question... are you using .GetResult() in some code path?

Simplified, I use the following. The HttpClient is injected using the pattern described at https://docs.microsoft.com/sv-se/aspnet/core/fundamentals/http-requests?view=aspnetcore-2.1

```c#
var request = new HttpRequestMessage
{
    Method = method,
    RequestUri = new Uri(apiPath, UriKind.Relative),
    Content = content
};

using (var response = await _httpClient.SendAsync(request))
{
    return await response.Content.ReadAsAsync();
}
```
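
For reference, a minimal sketch of what such a typed-client registration with a Polly retry policy might look like; MyApiClient, the base address, and the retry parameters are hypothetical and not taken from the original post:

```c#
// Sketch only: assumes the Microsoft.Extensions.Http.Polly package is referenced.
// MyApiClient is a hypothetical typed client that wraps the injected HttpClient.
services.AddHttpClient<MyApiClient>(client =>
    {
        client.BaseAddress = new Uri("https://example-apim.azure-api.net/"); // placeholder
    })
    .AddTransientHttpErrorPolicy(policy =>
        // Retry up to 3 times with exponential backoff on 5xx, 408, or HttpRequestException.
        policy.WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt))));
```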

/cc @davidsh

21 seconds is the built-in default for the TCP connection timeout on Windows. It is 21 seconds because the Windows TCP stack allows 7 seconds per leg of the 3-leg TCP SYN/ACK connection setup.

There is currently no override for this timeout; it is built into the Windows TCP layer.
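
For readers on .NET Core 2.1 or later, the SocketsHttpHandler.ConnectTimeout property can cap how long the application itself waits for a connect attempt; it does not change the OS-level SYN retry behavior, it only makes the request fail sooner. A minimal sketch (the 5-second value is an arbitrary example):

```c#
// Sketch: surface connect failures after 5 seconds instead of waiting out the
// full ~21-second Windows TCP connect timeout. The OS TCP retry behavior
// itself is unchanged; the request just fails earlier at the application level.
var handler = new SocketsHttpHandler
{
    ConnectTimeout = TimeSpan.FromSeconds(5)
};
var client = new HttpClient(handler);
```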

This issue is basically a duplicate of dotnet/runtime#27232

cc: @karelz

@emilssonn I am curious how often it happens and if you are able to reproduce it in some debuggable/traceable environment outside of Azure WebApps.

I have not been able to reproduce it "manually" yet.
I apologize if this issue is a duplicate. My main question is whether this exception always indicates an error in the target API (API Management in this case), or whether this error/timeout can occur in some way where the cause is the caller and not the target. I have an ongoing ticket with Azure support and I am basically trying to rule out as many things as possible, since we have not been able to find the reason yet.

It happens randomly when the load spikes from 0 rps to "high" rps. For example:
2018-12-08T18:12:42: 78 requests to target, OK
2018-12-08T18:12:43: 24 requests to target, OK
2018-12-08T18:12:43: 27 requests to target, 21 sec timeout

Unless there is some bug somewhere, it is an indication that the endpoints you call are not responding as they should at the TCP level, triggering this timeout.

I have been troubleshooting this for quite some time. After discussion with Azure support, one possible cause is resource exhaustion on the Azure Web App. The main reason I created this issue was that almost all references to hitting the connection limit in Azure Web Apps say it results in an exception: "An attempt was made to access a socket in a way forbidden by its access permissions aaa.bbb.ccc.ddd."

And the reason for hitting(?) the connection limit is that the web app will, under heavy load, make a lot of outgoing requests at the same time, limiting connection reuse (even when using HttpClient correctly).

under heavy load, make a lot of outgoing requests at the same time

You could try setting the MaxConnectionsPerServer property on the handler, in order to throttle how many connections you actually open.

@stephentoub, is that only per HttpClient object (I have multiple services using typed HttpClients that connect to the same domain)? What happens when the MaxConnectionsPerServer limit is hit?

At the moment I use a shared Polly bulkhead policy to throttle outgoing requests.
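
For context, a shared Polly bulkhead of that kind might look roughly like the following sketch; the concurrency and queue limits shown are made up for illustration:

```c#
// Sketch: one bulkhead policy instance shared by every outgoing call through
// this named client. Requires the Polly and Microsoft.Extensions.Http.Polly
// packages; the numbers are illustrative only.
IAsyncPolicy<HttpResponseMessage> bulkhead =
    Policy.BulkheadAsync<HttpResponseMessage>(20, 100); // max 20 parallel, up to 100 queued

services.AddHttpClient("throttled")
    .AddPolicyHandler(bulkhead); // reusing the same policy instance shares the throttle
```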

is that only per HttpClient object

It's per handler, so if you were to use the same handler instance with multiple HttpClient instances, all of those clients would share the same max. The parameterless HttpClient ctor creates its own handler, in which case it ends up having the same effect as being per client.
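
In other words (a minimal sketch with an arbitrary limit, just to illustrate the sharing):

```c#
// Sketch: both clients flow through the same handler instance, so they share
// the same MaxConnectionsPerServer limit. The value 50 is only an example.
var sharedHandler = new HttpClientHandler { MaxConnectionsPerServer = 50 };

var clientA = new HttpClient(sharedHandler, disposeHandler: false);
var clientB = new HttpClient(sharedHandler, disposeHandler: false);
```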

@emilssonn did you ever get a resolution to this issue? We are using .NET Core 3.1 with a named client from HttpClientFactory and are having the same issues. The Azure App Service diagnostics show no problem with SNAT port exhaustion, so we are at a loss as to what the problem could be.

We are also seeing this a lot over the past week, .NET Core 2.2 and 3.1 both using HttpClientFactory and hosted in Azure App Services (Windows).

@krispenner I am interested that you said you have been seeing this over the last week. Has this just started happening for you without changing the code? That would imply something environmental potentially. I know for instance I can't replicate locally.

@paulandrewc, our problem was with SNAT port exhaustion. The "resolution" we had to implement was to limit the number of concurrent outbound requests to the same ip/domain.

```c#
services.AddHttpClient("name")
    .ConfigurePrimaryHttpMessageHandler(() => new HttpClientHandler { MaxConnectionsPerServer = 110 })
    .SetHandlerLifetime(Timeout.InfiniteTimeSpan);
```

This will limit the number of concurrent outbound requests to 110. If you have multiple HTTP clients targeting the same IP/domain, they need to use the same handler.

https://4lowtherabbit.github.io/blogs/2019/10/SNAT/

UPDATE

The issue is not related to Azure's outbound NAT for App Services, see my comment on October 23, 2020 below for the actual reason and solution for our specific issue.

ORIGINAL COMMENT

What we have found or believe at this time, is that it's an issue with Azure's outbound NAT for App Services. The outbound port is being re-used too quickly across different App Services and is causing the destination to think it's a continuation of a previously established connection.

Example

  1. App Service A connects to Destination D on port 443, Azure uses port 24064 as the source port.
  2. Destination D sees this as TCP connection from source 52.1.2.3:24064.
  3. App Service A finishes its request.
  4. 54 seconds later....
  5. App Service B connects to Destination D on port 443, Azure uses port 24064 as the source port.
  6. Destination D sees this as TCP connection from source 52.1.2.3:24064.
  7. However, because there hasn't been at least four minutes or so of inactivity on the stream (the TCP TIME_WAIT period), Destination D thinks this new connection is actually a continuation of the previous connection and gets confused by the SYN packets; it returns ACK instead of SYN/ACK. This then causes further SYN attempts 3 and 6 seconds later, and eventually the connection fully times out at 21 seconds. Windows WinSock then tries a new connection and it succeeds.

The only solution I can think of currently is to move all our App Services onto distinct App Service Plans that have different outbound IP address ranges as this would ensure that Destination D always sees the connections from distinct sources.

I've engaged with Azure support and they have forwarded my findings to the App Service Network team for comment. I'm not aware of any configuration or code changes that I could apply to fix this myself. It seems to me to be an issue with Azure and how it re-uses ports too quickly across different App Services, without taking into consideration that the destination for all these requests is the same. If it were a different destination, it wouldn't matter.

@krispenner I just wanted to add that I started seeing this issue from June 24th 2020 coinciding with the rollout of a new version of App Services. The new version is 88.0.7.45 (master-837f2f8549e). It might be a coincidence, but the timing of my issues quite perfectly fits the rollout.

@krispenner @HansOlavS We are experiencing a similar issue with two http triggered functions in our ASE. Was this issue ever resolved for you?

@jarrodd07 I still have an active Azure support ticket on this. It's been exactly 2 months today since I created it. It's been escalated to the engineering team (multiple teams are looking into it, actually) and I will update this thread once I hear anything new.

Seeing the same issue on a service plan (P3v2: 1) with 20+ services on it.

Not seeing any SNAT or TCP connection issues on any of the actual services running under the service plan, but we are getting high numbers of SocketExceptions.

The stack trace in our case does not seem to be tied to any of our own code, but I am sure that we are, of course, somehow causing the issue.

This is the stack trace we see. It references ApplicationInsightsProfiler and Microsoft.ServiceProfiler.Uploaders.StampEtlUploader, which suggests to me there is an issue somewhere with offloading logs, but I have no idea if that's correct.

```
System.Net.Http.HttpRequestException:
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Net.Http.HttpClient+<FinishSendAsyncBuffered>d__58.MoveNext (System.Net.Http, Version=4.2.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Microsoft.ServiceProfiler.Agent.FrontendClient.StampFrontendClient+<>c__DisplayClass9_01+<b__0>d.MoveNext (Microsoft.ServiceProfiler.Agent.FrontendClient, Version=2.6.1909.2701, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Microsoft.ServiceProfiler.Agent.FrontendClient.StampFrontendClient+d__121.MoveNext (Microsoft.ServiceProfiler.Agent.FrontendClient, Version=2.6.1909.2701, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Microsoft.ServiceProfiler.Agent.FrontendClient.StampFrontendClient+<HttpGetAsync>d__91.MoveNext (Microsoft.ServiceProfiler.Agent.FrontendClient, Version=2.6.1909.2701, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Microsoft.ServiceProfiler.Agent.FrontendClient.ProfilerFrontendClient+d__3.MoveNext (Microsoft.ServiceProfiler.Agent.FrontendClient, Version=2.6.1909.2701, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Microsoft.ServiceProfiler.Uploaders.StampEtlUploader+d__1.MoveNext (ApplicationInsightsProfiler, Version=2.6.1909.2701, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Microsoft.ServiceProfiler.Collectors.DetailedTraceCollector+d__41.MoveNext (ApplicationInsightsProfiler, Version=2.6.1909.2701, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Microsoft.ServiceProfiler.Collectors.DetailedTraceCollector+d__38.MoveNext (ApplicationInsightsProfiler, Version=2.6.1909.2701, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Microsoft.ServiceProfiler.Engine+<>c__DisplayClass1_1+<b__2>d.MoveNext (ApplicationInsightsProfiler, Version=2.6.1909.2701, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Microsoft.ServiceProfiler.Agent.Orchestration.Orchestrator+d__18.MoveNext (Microsoft.ServiceProfiler.Agent.Orchestration, Version=2.6.1909.2701, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
Inner exception System.Net.WebException handled at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw:
   at System.Net.HttpWebRequest.EndGetResponse (System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Net.Http.HttpClientHandler.GetResponseCallback (System.Net.Http, Version=4.2.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
Inner exception System.Net.Sockets.SocketException handled at System.Net.HttpWebRequest.EndGetResponse:
   at System.Net.Sockets.Socket.InternalEndConnect (System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Net.Sockets.Socket.EndConnect (System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Net.ServicePoint.ConnectSocketInternal (System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
```


@emilssonn I don't get why you've closed this issue. Is it because it should be handled somewhere else? This is very much still a big issue in the wild.

@HansOlavS I closed it because the errors we had went away as soon as we changed our code to limit the number of connections, as described in https://github.com/dotnet/runtime/issues/28205#issuecomment-650946712.

Example code:

```c#
var handler = new HttpClientHandler { MaxConnectionsPerServer = 120 };

services.AddHttpClient("name1")
    .ConfigurePrimaryHttpMessageHandler(() => handler)
    .SetHandlerLifetime(Timeout.InfiniteTimeSpan);

services.AddHttpClient("name2")
    .ConfigurePrimaryHttpMessageHandler(() => handler)
    .SetHandlerLifetime(Timeout.InfiniteTimeSpan);
```

I can reopen the issue if needed, but I do not have the issue anymore.

@emilssonn I see. I'm still hitting this same issue in Azure App Service today: the same 21-second timeout, which seems to be the same thing. I use MaxConnectionsPerServer = 40.

I will share my solution in case it helps others. We experienced this for several months and it looked related to the SNAT on the underlying App Service Plan of our App Services. It turns out that the vendor's destination firewall had two IP addresses mapped to its hostname; that is, www.example.com resolved to two IP addresses. Azure's NAT software would occasionally connect to both destination IP addresses from the same source IP address and port. This is technically legitimate, as it still creates a unique 5-tuple for each connection (source IP, source port, destination IP, destination port, protocol). However, the vendor's firewall could not handle two connections from the same source IP address and port, since for some reason it could not differentiate which destination IP address we were connecting to. This caused the vendor's firewall to send TCP RST packets, which confused the TCP connection handshake in our HttpClient; it eventually timed out at 21 seconds and then retried on the other destination IP address from a different source IP and port, which always succeeded.

Our current solution is to connect to the vendor using only one of their IP addresses instead of the hostname in the URL. We are hoping the vendor will fix this issue on their end so we can target the hostname instead.

So you should double-check with a tool such as https://digwebinterface.com/ that your destination hostname resolves to only one IP address. If it has multiple, then try targeting just one IP address rather than the hostname: replace https://www.example.com/api with https://1.2.3.4/api and include an HTTP Host header with a value of www.example.com.
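
A rough sketch of what that might look like with HttpClient (the IP address and host name below are placeholders; note that with HTTPS the server certificate is issued for the host name, so certificate validation may need separate handling):

```c#
// Sketch: connect to one specific IP address but keep the logical host name
// in the Host header. Address and host are placeholders.
var request = new HttpRequestMessage(HttpMethod.Get, "https://1.2.3.4/api");
request.Headers.Host = "www.example.com";

using (var response = await httpClient.SendAsync(request))
{
    response.EnsureSuccessStatusCode();
    // ... read the response as usual
}
```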

@krispenner Thanks so much for giving an update. My issue is between two Azure App Functions inside the West Europe or North Europe region. It's a simple app function queue processor that calls an HTTP endpoint in another app function, and I'm getting hit with socket timeout errors that all have a duration of 21 seconds, even though I've explicitly set the connection timeout to 30 seconds. This is why I'm suspecting that I'm hit by Azure's outbound NAT for App Services that you describe above.

To be clear, our issue was not related to Azure's outbound NAT for App Services as I originally thought. In the end we traced it to the vendor's firewall and their two IP addresses. At this time I don't believe there are any issues with Azure's outbound NAT for App Services.

Also, the timeout of 21 seconds you are experiencing is typically due to the lower-level TCP handshake in the WinSock Win32 OS code. Your 30-second timeout would only be triggered if it failed to find the destination target host, I believe. For us, it was in fact finding the destination target host, but then the TCP handshake itself was failing and timing out. That 21-second timeout cannot be changed, to my understanding, at least not easily through .NET. You should run a network capture on both your source and destination App Services and see what the trace shows. This is how we were able to isolate our issue.

HTH

Thanks for the extra info, @krispenner! 👍

FYI, today I was hit with this same issue in one of my app functions running in Azure West Europe again. Will update you if there's any news on this, though I doubt that very strongly.

It seems we have the same issue. I have 2 App Services on a Basic plan (West Europe). I tried them in North Europe as well, with the same result. (Linux web app, .NET Core 3.1.)

I set up a health check for one service (Service A) which sends an HTTP request to the other service (Service B).
Service B responds with a 200 status code and I didn't see any issue with Service B, according to the logs and Application Insights. But Service A throws a SocketException on every second call. (Interval: 5 minutes.)

I changed the interval to 10 minutes and still had the issue, though with a somewhat better result. Finally I changed it to 15 minutes and it looks good (but not ideal).

I use a typed client and IHttpClientFactory.

It has been almost 3 days and I couldn't figure out why we get this issue. I built the same solution 8-9 months ago and there was no issue (same service plan, same platform, and .NET Core 3.1).
