Runtime: Address "System.Net.Sockets.SocketException: Address already in use" on K8S/Linux using HttpClient/TCP

Created on 25 Jan 2019 · 138 Comments · Source: dotnet/runtime

~Assumption: Duplicate of dotnet/runtime#27274 which was fixed by dotnet/corefx#32046 - goal: Port it (once confirmed it is truly duplicate).~
This is the HttpClient/TCP spin-off. UdpClient is covered fully by dotnet/runtime#27274.

Issue Title

"System.Net.Sockets.SocketException: Address already in use" on Linux

General

Our .NET Core (v2.2.0) services are running on an Azure Kubernetes Linux environment. Recently we experienced a lot of "System.Net.Http.HttpRequestException: Address already in use" errors while calling dependencies, e.g. Active Directory, CosmosDB and other services. Once the issue started, we kept getting the same errors and had to restart the service to get rid of them. Our HTTP clients use DNS addresses, not specific IPs and ports. The following is the call stack of one example. What can cause such issues and how do we fix them?

```
System.Net.Http.HttpRequestException: Address already in use --->
System.Net.Sockets.SocketException: Address already in use
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.WaitForCreatedConnectionAsync(ValueTask`1 creationTask)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.DiagnosticsHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Http.HttpClientWrapper.GetResponseAsync()
   at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Http.AdalHttpClient.GetResponseAsync[T](Boolean respondToDeviceAuthChallenge)
   at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Http.AdalHttpClient.GetResponseAsync[T]()
   at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Flows.AcquireTokenHandlerBase.SendHttpMessageAsync(IRequestParameters requestParameters)
   at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Flows.AcquireTokenHandlerBase.SendTokenRequestAsync()
   at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Flows.AcquireTokenHandlerBase.CheckAndAcquireTokenUsingBrokerAsync()
   at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Flows.AcquireTokenHandlerBase.RunAsync()
   at Microsoft.IdentityModel.Clients.ActiveDirectory.AuthenticationContext.AcquireTokenForClientCommonAsync(String resource, ClientKey clientKey)
   at Microsoft.IdentityModel.Clients.ActiveDirectory.AuthenticationContext.AcquireTokenAsync(String resource, ClientCredential clientCredential)
```
Labels: bug, tenet-compatibility, tenet-reliability

Most helpful comment

Having the same issue on microsoft/dotnet:2.2-runtime-deps using ElasticSearch NEST 5.6.6. Very annoying issue. Can't go back to 2.1 since we invested a lot of time upgrading from 2.1 to 2.2. Upgrading to 3.0 Preview is not an option.

+1 to include this fix in the next 2.2 release.

All 138 comments

@karelz , @davidsh - can you have a look?

Can you please try .NET Core 3.0? It was fixed there ...

I'll try upgrading. Thanks!

@karelz do you plan to backport this fix to 2.2 ?

@yanrez not unless there is some good business justification - a higher number of affected customers. Is that the case? (this is only the 2nd ask so far)
Also, it would help to validate this is truly the root cause, e.g. by trying a 3.0 preview/daily build.

It is happening in some of the clusters. It doesn't seem to repro consistently, but some pods go into this state and stay in it until they are terminated. I understand 3.0 is still a few months away (I don't know the actual timeline though), so my question about 2.2 was based on the assumption that a hotfix for 2.2 could come earlier than the 3.0 release.
We will look into trying to upgrade and see if it solves the issue.

BTW: It might be good for you to register as MS employees - at least by linking your accounts: https://github.com/dotnet/core/blob/master/Documentation/microsoft-team.md ... that allows other FTEs to see you are MSFT ;)

@yanrez what is the service? How large is it roughly? How often does it happen? ... That may support the business justification (even if you were not MSFT ;)).
If you can verify on 3.0 that would be great. Either way, we may need to verify on an early 2.1/2.2 build as part of "test signoff" to make sure we are fixing the real root cause here.

I will follow up offline, but in some of the regions we see it happening more often - taking down several pods in our k8s cluster per day. It's very annoying at the moment, costing us a few dev-hours a day to act on it and mitigate.
We are also looking into automated mitigation using a liveness probe wired in to check whether we start getting these exceptions and signal k8s to kill the pod. Unfortunately, it's also a non-trivial amount of dev work to build and deploy. Considering we can't exactly predict the frequency of the issue, the risk is that the liveness probe might still impact our availability and cause us to miss our SLA.

I have the same exception on AKS (v1.11.4) with container microsoft/dotnet:2.2-aspnetcore-runtime, Region West Europe.

Just chiming in to say that my business is also experiencing this issue:
AKS (v1.9.6)
Region: Central US
Image: microsoft/dotnet:2.2-aspnetcore-runtime

We applied automated mitigation to count the number of these exceptions and report a negative signal to the k8s liveness check. It helped us mitigate the impact. We haven't verified yet whether the latest builds of .NET Core 3 would resolve the issue.

FWIW, we implemented the same liveness check but then subsequently managed to fix the issue altogether in our deployment.

For us, we had a service client that was using HttpClient internally. The class was getting instantiated for each incoming request by the DI container (resulting in a new HttpClient for each incoming request). We changed the way the client was registered such that it is only instantiated once for the entire application and the issue was resolved.
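
A minimal sketch of that kind of registration change in ASP.NET Core DI; `MyServiceClient` is a hypothetical stand-in for a class that owns an HttpClient internally:

```c#
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Before: services.AddTransient<MyServiceClient>();
        //   -> a new MyServiceClient (and a new HttpClient) per incoming request.
        // After: one instance, and one HttpClient, for the entire application.
        services.AddSingleton<MyServiceClient>();
    }
}
```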

I have been experiencing this problem since yesterday; everything that uses sockets throws "System.Net.Sockets.SocketException: Address already in use", like MySQL connections, Redis, and HttpClient.

For us, we had a service client that was using HttpClient internally. The class was getting instantiated for each incoming request by the DI container (resulting in a new HttpClient for each incoming request). We changed the way the client was registered such that it is only instantiated once for the entire application and the issue was resolved.

@antoinne85 this is a common mistake people make with HttpClient and one of the reasons why IHttpClientFactory was added in .NET Core 2.1 (another being that singleton HttpClients don't respect DNS changes by default).

See https://docs.microsoft.com/en-us/dotnet/standard/microservices-architecture/implement-resilient-applications/use-httpclientfactory-to-implement-resilient-http-requests

The original and well-known HttpClient class can be easily used, but in some cases, it isn't being properly used by many developers.

As a first issue, while this class is disposable, using it with the using statement is not the best choice because even when you dispose HttpClient object, the underlying socket is not immediately released and can cause a serious issue named ‘sockets exhaustion’. For more information about this issue, see You're using HttpClient wrong and it's destabilizing your software blog post.
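
For reference, a minimal typed-client registration with IHttpClientFactory might look like the sketch below; `CatalogClient` and the URL are illustrative names, not from this thread:

```c#
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;

public class CatalogClient
{
    private readonly HttpClient _client;
    public CatalogClient(HttpClient client) => _client = client;

    public Task<string> GetCatalogAsync() => _client.GetStringAsync("/catalog");
}

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // The factory pools and recycles the underlying handlers, avoiding both
        // socket exhaustion and the stale-DNS problem of a never-recycled client.
        services.AddHttpClient<CatalogClient>(c => c.BaseAddress = new Uri("https://example.com"));
    }
}
```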

Same problem with the NEST Elasticsearch client on Linux under Core 2.2. Backporting the fix to 2.2 would be nice.

Same here @Kirides, @karelz, we are hitting this issue too on a site with a lot of traffic (10-15 requests per second, maybe more). It worked fine in 2.1; the issue has been happening since we updated to 2.2. It happens with our HttpClients, even though we are already using IHttpClientFactory.

We have to restart our docker containers to fix the problem, and it happens at least once a day. I am also afraid to update to 3, since this is a production site and the official release is not ready yet.

We have to restart our docker containers to fix the problem

I'm doing the same. I set the docker container restart policy to "restart=always", and in my app I catch this SocketException. If it's caught, I kill the app and the docker engine restarts it. Works fine, but for complex apps it should be fixed properly at the dotnet level.

@karelz I have an app with a large number of users affected by this. Any Dlna media app will be affected by this. A back-port would be much appreciated. Thanks.

@LukePulverenti did you validate your problem is indeed the same root cause and is fixed in .NET Core 3.0? (and that it is not just same symptom)
There seems to be enough +1s to justify backport, we just need to be sure it is the right fix ... first step would be to validate on 3.0. Then we can cherry pick and ask for private validation on 2.2/2.1 build.

It is becoming really frustrating. Our code on 2.1 has a memory leak, which we fixed by moving to 2.2, but we cannot update our servers because of this. I really do not want to update to 3, since it is not prod-ready yet, the migration is not so straightforward, and we have some dependencies which we are not sure will work as-is in Core 3 (like StructureMap, which we are changing to Lamar).

@karelz I will try to update my project and let you know. The problem is that I would need to update to 3 and publish to prod, since this error is only seen after some hours (sometimes 2, sometimes more than a day) with high traffic, so I need to be really careful.

@antonioortizpola understood. I was hoping someone has a "repro" in an environment where trying out and deploying 3.0 for a few days would be OK.
Alternatively, if someone is capable of building a private version out of the 2.2/2.1 servicing branch with the cherry-picked fix, that would be preferred even for us. It is just a bit more involved on the prep side.

@karelz Ok, after some hard work I was able to update to Core 3. I was excited since our project is a gRPC server, and I tried the gRPC template. Sadly, after just around 8 hours running, we hit the same issue.

The project is very simple: it just receives the gRPC request and makes an HTTP call or a WSDL call to an external service. These services have various response times, from 200 milliseconds to timeouts after one minute. It then returns the response object as-is; no complex processing, no database connections or anything weird.

When the error starts happening, all the HTTP clients start showing the errors, both the direct ones and the ones coming from a WSDL definition.

_HttpClient throwing the exception:_ [screenshot]

_WSDL client throwing the same exception:_ [screenshot]

The csproj is:

```xml
<Project Sdk="Microsoft.NET.Sdk.Web">

    <PropertyGroup>
        <TargetFramework>netcoreapp3.0</TargetFramework>
        <DockerDefaultTargetOS>Linux</DockerDefaultTargetOS>
    </PropertyGroup>

    <ItemGroup>
        <PackageReference Include="Grpc.AspNetCore.Server" Version="0.1.19-pre1" />
        <PackageReference Include="Microsoft.AspNet.WebApi.Client" Version="5.2.7" />
        <PackageReference Include="Microsoft.VisualStudio.Azure.Containers.Tools.Targets" Version="1.4.10" />
        <PackageReference Include="System.ServiceModel.Http" Version="4.5.3" />
    </ItemGroup>

</Project>
```

If it helps, we are running the project on Amazon Linux on an EC2 instance with docker. The Dockerfile is:

```dockerfile
FROM mcr.microsoft.com/dotnet/core/aspnet:3.0-stretch-slim AS base
WORKDIR /app
EXPOSE 80

FROM mcr.microsoft.com/dotnet/core/sdk:3.0-stretch AS build
WORKDIR /src
COPY ["vtae.myProject.gateway/vtae.myProject.gateway.csproj", "vtae.myProject.gateway/"]
COPY ["vtae.myProject.gateway.proto/vtae.myProject.gateway.proto.csproj", "vtae.myProject.gateway.proto/"]
RUN dotnet restore "vtae.myProject.gateway/vtae.myProject.gateway.csproj"
COPY . .
WORKDIR "/src/vtae.myProject.gateway"
RUN dotnet build "vtae.myProject.gateway.csproj" -c Release -o /app

FROM build AS publish
RUN dotnet publish "vtae.myProject.gateway.csproj" -c Release -o /app

FROM base AS final
WORKDIR /app
COPY --from=publish /app .
ENTRYPOINT ["dotnet", "vtae.myProject.gateway.dll"]
```

There was no increase in CPU after the error, but no request succeeded after the first error showed up. Again, this was not happening in 2.1, but it is happening in 2.2 and 3.

All my HTTP clients are typed. I do not know if this dependency affects something:

<PackageReference Include="System.ServiceModel.Http" Version="4.5.3" />

But I am using `response.Content.ReadAsAsync<SomeClass>()` and `_httpClient.PostAsJsonAsync(_serviceUrl, someRequestObject)`.

I would also like to know a way to stop the app from within the app itself, so I can catch the exception and stop the server to let docker restart the container. I do not like the idea of just doing an Environment.Exit, but I could not find another way to do it in Core 3.

EDIT

Ok, I ended up restarting the app, first adding a static host reference in Program.cs (a little dirty, but I guess it is temporary until a fix is found).

```c#
public class Program
{
    public static IHost SystemHost { get; private set; }

    public static void Main(string[] args)
    {
        SystemHost = CreateHostBuilder(args).Build();
        SystemHost.Run();
    }

    public static IHostBuilder CreateHostBuilder(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(webBuilder =>
            {
                webBuilder
                    .UseStartup<Startup>()
                    .ConfigureKestrel((context, options) => { options.Limits.MinRequestBodyDataRate = null; });
            });
}
```

Then in my interceptor I catch the exception with a `Contains` check. This is because if the error comes from a plain `HttpClient` it is thrown as `HttpRequestException`, but if it comes from a WSDL service it is thrown as `CommunicationException`.

```c#
public async Task<T> ScopedLoggingExceptionWsdlActionService<T>(Func<TService, Task<T>> action)
{
    try
    {
        return await _scopedExecutorService.ScopedActionService(async service => await action(service));
    }
    catch (CommunicationException e)
    {
        await HandleAddressAlreadyInUseBug(e);
        var errorMessage = $"There was a communication error calling the wsdl service in '{typeof(TService)}' action '{action}'";
        _logger.LogError(e, errorMessage);
        throw new RpcException(new Status(StatusCode.Unavailable, errorMessage + ". Error message: " + e.Message));
    }
    catch (Exception e)
    {
        var errorMessage = $"There was an error calling the service '{typeof(TService)}' action '{action}'";
        _logger.LogError(e, errorMessage);
        throw new RpcException(new Status(StatusCode.Unknown, errorMessage + ". Error message: " + e.Message));
    }
}

// TODO: Remove this after https://github.com/dotnet/core/issues/2253 is fixed    
private async Task HandleAddressAlreadyInUseBug(Exception e)
{
    if (string.IsNullOrWhiteSpace(e.Message) || !e.Message.Contains("Address already in use"))
        return;
    var errorMessage = "Hitting bug 'Address already in use', stopping server to force restart. More info at https://github.com/dotnet/core/issues/2253";
    _logger.LogCritical(e, errorMessage);
    await Program.SystemHost.StopAsync();
    throw new RpcException(new Status(StatusCode.ResourceExhausted, errorMessage + ". Error message: " + e.Message));
}
```

Having the same issue on microsoft/dotnet:2.2-runtime-deps using ElasticSearch NEST 5.6.6. Very annoying issue. Can't go back to 2.1 since we invested a lot of time upgrading from 2.1 to 2.2. Upgrading to 3.0 Preview is not an option.

+1 to include this fix in the next 2.2 release.

@sapleu do not update to 3 to fix this problem; as https://github.com/dotnet/core/issues/2253#issuecomment-482918706 states, this still happens in Core 3.

We just got hit by this as well. It's very rare, but I'm (somewhat) glad to see it's a known issue.

Got hit by this issue 2 days ago as well. It doesn't happen very often, but as soon as the first 'Address already in use' shows up, we can't make any other calls until the system is restarted.

Still waiting for someone who has an environment where it happens with some frequency (aka a production repro) and who can try to deploy a private patch out of the 2.1 or 2.2 branch.
Do we have someone like that? Without that this issue is sadly blocked ...

Assumption: Duplicate of dotnet/runtime#27274 which was fixed by dotnet/corefx#32046 - goal: Port it (once confirmed it is truly duplicate).

This assumption is not correct. The fix is for UDP, the issues reported here are for HTTP (which is TCP).

Getting "Address already in use" on a TCP connect is weird. If the local end isn't bound, it should pick a port that is not in use.
You may be running out of port numbers. Running netstat can help you find out what sockets are around and who owns them.

When I run netstat I do not see anything weird; the ports look pretty much the same as with 2.1.

Still waiting for someone who has an environment where it happens with some frequency (aka a production repro) and who can try to deploy a private patch out of the 2.1 or 2.2 branch.
Do we have someone like that? Without that this issue is sadly blocked ...

@karelz, I already updated to 3 and the problem still exists; the error shows up in 8-12 hours. Is there anything else that I can do to help with the problem?

I know that Bing is running on Core 2.1; have you updated to 2.2 yourselves? This problem is becoming really frustrating. I do not understand how a simple project that just calls some HTTP services is causing this issue. It is really causing trust issues in the team; now I want to update for security updates, but I am not sure something internal and hidden is going to break with the next release.

When I run netstat I do not see anything weird; the ports look pretty much the same as with 2.1.

Did you run this after a few hours? How does it change over time?

@tmds Yes, we have a load balancer in AWS, so we put the 2.1 version on one side and Core 3 on the other. After around 4-8 hours running, the server with the 3 version (or the 2.2 version, which I also tried) started crashing. I did a netstat -a on both servers; there were many connections open, but it seemed very much the same as the 2.1 server, which was still working with no problems.

If it is really necessary I can do the test again and send some screen caps. Sadly this won't be easy, since we already ported the project to .NET Core 3 and much of the new code is not in the other versions.

Netstat on the server with Core 2.1: [screenshot]

Netstat on the server with Core 3: [screenshot]

This was captured with both servers working (there was no error at capture time). I will remove my workaround that restarts the server and take a capture while the error is happening, in case I missed something, because some days ago I did that test and the outputs looked the same.

Also, I tried running netstat again and again, but I did not catch anything weird. I must admit I do not know if I am using the netstat command right, so if I am missing something please tell me and I will try again.

I'd run netstat -at to show all TCP connections.
You should run it once at the start, and then again once your application has been running for a couple of hours.
The netstat output you provided doesn't have any HTTP connections, so I guess you took it at the start.

You can see the local port range that your system is choosing from like this:

$ cat /proc/sys/net/ipv4/ip_local_port_range

We're also seeing this issue in 2.2 on Ubuntu 18.04 VMs. Netstat seems to show a very large number of connections (outbound HTTPS) in CLOSE_WAIT. Restarting the app fixes the issue, but the connections start climbing again.

It takes several days for us to see the issue, so I haven't yet been able to observe the number of connections when we hit the error, but I would assume we're hitting the limit of ~31k and that's what's causing it.

We're seeing it in two different apps which make very different outbound HTTP connections to different endpoints.

We're also seeing this issue in 2.2 on Ubuntu 18.04 VMs. Netstat seems to show a very large number of connections (outbound HTTPS) in CLOSE_WAIT.

@karelz @davidsh @stephentoub @wfurt what may be the issue: the HTTP server closes the TCP connection, but that doesn't result in a close of the socket used by HttpClient. Over a long time, these unclosed sockets cause you to run out of local ports.

what may be the issue: the HTTP server closes the TCP connection, but that doesn't result in a close of the socket used by HttpClient. Over a long time, these unclosed sockets cause you to run out of local ports.

In theory that could be contributing to the issue if all of the connections were to different hosts. It's much less likely to be the issue if the number of hosts being targeted is limited; in that case, when the client goes back to the connection pool to grab a connection, it'll see that the connection has been closed by the server and properly dispose of it before creating a new connection. Further, the pool also has a timer that fires every X seconds to clean out such connections, so they shouldn't be building up in the pool.

@robjwalker how many CLOSE_WAIT connections do you see for the same host? If you watch netstat over a short period of time (e.g. 2 minutes), do you see CLOSE_WAIT connections change state to something else?

what may be the issue: the HTTP server closes the TCP connection, but that doesn't result in a close of the socket used by HttpClient.

A proper HTTP server will always send "Connection: close" just before it closes the TCP connection. That will alert clients (browsers or HttpClient) that they should also close their side of the TCP connection.

If a server doesn't do that, then a client doesn't know that the socket was closed on the other side unless it tries to do a send() or receive() on the socket.

HTTP stacks like SocketsHttpHandler will test a potentially idle connection (which might have been closed by the server) by testing the socket before declaring that the connection is usable. If not usable, the socket will be closed by the client. SocketsHttpHandler will also close connections on its own, without testing, once its "idle timeout" has expired.
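
For illustration, the usual way to test whether the peer has closed an otherwise idle socket looks roughly like this (a sketch of the general technique, not the handler's actual code):

```c#
using System.Net.Sockets;

static class SocketProbe
{
    // Returns true if the peer has closed (or reset) the connection.
    // Poll(0, SelectRead) returning true with zero bytes available means EOF.
    public static bool PeerClosed(Socket socket)
    {
        try
        {
            return socket.Poll(0, SelectMode.SelectRead) && socket.Available == 0;
        }
        catch (SocketException)
        {
            return true; // a failed poll also means the connection is unusable
        }
    }
}
```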

@robjwalker how many CLOSE_WAIT connections do you see for the same host? If you watch netstat over a short period of time (e.g. 2 minutes), do you see CLOSE_WAIT connections change state to something else?

Over the course of 24 hours, we saw it build to approximately 14,000 CLOSE_WAIT states. They don't seem to ever change once in that state. A different app generates about 3,500 CLOSE_WAIT states in the same time period, probably because it connects outbound less. In both cases all connections from each app are to one IP, but the two apps are connecting to different IPs (if that makes sense). One is an endpoint under our control, the other is Google Pub/Sub.

Our dev team is looking into it; they are wondering if we are "creating multiple clients" and/or mismanaging HttpClient. (I'm just quoting them at this point; I'm an Ops engineer, not a developer.)

A proper HTTP server will always send "Connection: close" just before it closes the TCP connection.

Load balancers in between will just close the connection when they want.

If a server doesn't do that, then a client doesn't know that the socket was closed on the other side unless it tries to do a send() or receive() on the socket.

You could poll (that is: use poll/epoll/...) to get notified that the peer closes the connection (timeout or active checking is also fine).

@stephentoub @davidsh the observations from @robjwalker seem to indicate that the expected socket close (when re-using connection, on timeout) is not taking place.

This is where TCP keep-alive helps. On the client side, an idle or maximum timeout should kick in as well.
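
As a sketch, enabling keep-alive on a raw Socket looks like this; to my knowledge the TCP-level tuning options are only available starting with .NET Core 3.0:

```c#
using System.Net.Sockets;

static class KeepAlive
{
    public static void Enable(Socket socket)
    {
        // Turn on TCP keep-alive so dead peers are eventually detected.
        socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);

        // .NET Core 3.0+ only: tune the probe timing (values in seconds).
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, 60);
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, 10);
    }
}
```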

@tmds Ok, yes, I can confirm: it is not fixed in Core 3.

https://gist.github.com/antonioortizpola/78f4a57170841fb221b117fcb7a5ec45

For us it takes around 4 hours to run out of sockets. The workaround of catching the exception and restarting the app has somewhat mitigated the problem, but we lose some requests when it happens.

BTW: tested on .NET Core 3 preview 3 and 4, also with stretch-slim and alpine; all the same.

Just a quick update: our development team has fixed one of our apps that was suffering from this issue. I'm afraid I don't have a huge amount of detail, just that they found a place in our code where "HttpClient wasn't being shared".

Sorry I don't have more details. I'm not sure if this means we're not suffering from the same bug as others, or that we've just worked around it.

@robjwalker let us know how netstat looks with the new version after a few hours.

It's been running for around 24 hours now, and netstat is very clean. Only one connection in CLOSE_WAIT, which appears not to be related.

So, to sum it up: a bunch of folks confirmed that the 3.0 fix we made does NOT help their scenarios.
In at least one case it was clarified that it is actually an application issue.

I will close this issue (as its original intent to port a fix to 2.1 is not reasonable at this point).
I'd like to ask whoever is willing to dig deeper to file a new issue against 3.0 with some details and be prepared for back-and-forth on the investigation. A repro or something would be really lovely. Verification of HttpClient reuse should happen prior to filing such an issue.

Let me know if I missed anything.

@robjwalker it would be good to know how you are using the HttpClients, since we are using typed HttpClients for our REST endpoints.

However, our WSDL services are being used like this:

```c#
public async Task<BalanceQueryResponse> GetBalance(BscsServiceRequest balanceQueryRequest)
{
    var bscsClient = new InterfaceBSCSClient(
        InterfaceBSCSClient.EndpointConfiguration.InterfaceBSCSPort, _bscsConfig.BscsEndpoint); // WSDL client
    var timedWsdlRequestWithLog = new TimedWsdlRequestWithLog(_logger, ServiceName);

    var response = await timedWsdlRequestWithLog.ExecuteAndLogRequestDuration(
        bscsClient.Endpoint, async () =>
            await bscsClient.balanceQueryAsync(balanceQueryRequest.SService, _bscsConfig.SAccount)
    );
    return new BalanceQueryResponse { Result = response.@return };
}
```

This service is registered as transient, but I do not know if I should wrap the WSDL client in a using block, since it implements IDisposable. All the examples use the client without a using block, and I also do not know if this could affect the internals of the client; reading from this comment, I should not be using `using` on an HTTP request.
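
For what it's worth, the commonly cited pattern for WCF-generated clients is to Close() them and fall back to Abort() on failure, rather than relying on Dispose. A hedged sketch reusing the names from the snippet above (adapt to your own service):

```c#
// Sketch only: InterfaceBSCSClient stands in for any WCF-generated client.
public async Task<BalanceQueryResponse> GetBalanceSafely(BscsServiceRequest request)
{
    var client = new InterfaceBSCSClient(
        InterfaceBSCSClient.EndpointConfiguration.InterfaceBSCSPort, _bscsConfig.BscsEndpoint);
    try
    {
        var response = await client.balanceQueryAsync(request.SService, _bscsConfig.SAccount);
        client.Close(); // graceful: completes the session, then closes the channel
        return new BalanceQueryResponse { Result = response.@return };
    }
    catch
    {
        client.Abort(); // Close() can throw on a faulted channel; Abort() never does
        throw;
    }
}
```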

We did not think much of this because in 2.1 we never had any issue of this kind, but maybe with the update our bad practices started to cause problems.

@karelz any comments on this? Or should I create a new issue to get clarity on that? Also, it would be good to know what changed from 2.1 to 2.2 that caused this issue, to have more knowledge about what to avoid.

@antonioortizpola I'm having the same issues as you: 2.1 works but 2.2 doesn't. I agree with you that I may be using bad practices that didn't cause big issues like this before, but I don't know what those are.

@karelz it's not clear to me what you mean by "HttpClient reuse". What do you mean exactly by reusing an HttpClient? That I can't make 2 or more calls using the same instance?

@rbrugnollo according to the docs, you should not be using HttpClient directly; you should be using an IHttpClientFactory or one of the other kinds of clients (named, typed or generated).

Our team is using typed clients for the REST requests, so there should not be a problem there. However, thinking more deeply: with the WSDL clients we do not have access to the HttpClient directly. I do not know if that could be related to the socket exhaustion problem, in which case I would not know how to fix or work around it, short of dropping all my WSDL clients and using direct requests, but that is too much work and would basically mean dropping support for WSDL clients.

according to the docs, you should not be using HttpClient directly

It's fine to use HttpClient directly. HttpClientFactory layers on top of that to provide additional management. When you do use HttpClient directly, you should reuse instances as much as possible.

@stephentoub I do not know how to define "fine", or how much "as much as possible" is. On the other hand, no, you should not simply hold the client as long as possible, since it will not respect DNS changes.

Again, from the docs

...But there’s a second issue with HttpClient that you can have when you use it as singleton or static object. In this case, a singleton or static HttpClient doesn't respect DNS changes

This is a real problem when you have infrastructure in the cloud; it is not as simple as "hold on to your client". That is exactly what HttpClientFactory is trying to mitigate.

In this case, a singleton or static HttpClient doesn't respect DNS changes

This is a real problem when you have infrastructure in the cloud

That information is out-of-date and no longer accurate.

@stephentoub well, if the docs are wrong then I am lost.

The last update is from 01/06/2019. Should I ask for an update? Could you please tell us exactly what is wrong so they can make the updates?

Also, if the best solution is to keep the HttpClient for as long as you can, would it not be better to just use it as a singleton? That would render the IHttpClientFactory pretty much useless; it would be just a fancy name for a singleton.

Also, it would be great to make that clear in the Core documentation. That you can use HttpClient as a singleton should be an option in the "Making HTTP requests" part, since it still states:

Manages the pooling and lifetime of underlying HttpClientMessageHandler instances to avoid common DNS problems that occur when manually managing HttpClient lifetimes

could you please tell us exactly what is wrong so they can make the updates?

SocketsHttpHandler, which is the default handler implementation backing the HttpClient starting in .NET Core 2.1, has two properties on it: PooledConnectionIdleTimeout (https://docs.microsoft.com/en-us/dotnet/api/system.net.http.socketshttphandler.pooledconnectionidletimeout?view=netcore-2.2) and PooledConnectionLifetime (https://docs.microsoft.com/en-us/dotnet/api/system.net.http.socketshttphandler.pooledconnectionlifetime?view=netcore-2.2). The latter governs how long a connection is allowed to be reused. It defaults to Infinite, which means it won't be proactively torn down by the client just because of how long it's been around. But if you set it to something shorter, like 5 minutes, that tells the handler that once the connection has been around for 5 minutes it shouldn't be reused again; the handler will prevent further requests from using it and will tear it down. Any subsequent requests would be forced to get a new connection, which will again consult DNS to determine where the connection should be opened.
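
In code, that advice amounts to something like the following minimal sketch (the timeout values are examples only, not recommendations from this thread):

```c#
using System;
using System.Net.Http;

static class SharedHttp
{
    // One long-lived client for the whole app. Connections are recycled after
    // PooledConnectionLifetime, so a reconnect re-consults DNS; idle connections
    // are dropped even sooner via PooledConnectionIdleTimeout.
    public static readonly HttpClient Client = new HttpClient(new SocketsHttpHandler
    {
        PooledConnectionLifetime = TimeSpan.FromMinutes(5),
        PooledConnectionIdleTimeout = TimeSpan.FromMinutes(2)
    });
}
```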

SocketsHttpHandler, which is the default handler implementation backing the HttpClient starting in .NET Core 2.1, has two properties on it

So, nothing changed from 2.1 to 2.2 that would explain the errors showing up on 2.2 but not 2.1, right?!

So, nothing changed from 2.1 to 2.2 that would explain the errors showing up on 2.2 but not 2.1, right?!

Correct. We did just a handful of targeted servicing-level bug fixes in 2.2/2.2.x over 2.1.x.
It is quite possible that the "regression" is caused by another component, or we're just "lucky" to hit it on 2.2 due to "random" reasons.
Just to confirm: Did anyone hit it in 2.1 at all?

Just to confirm: Did anyone hit it in 2.1 at all?

Not for us, I can confirm: no problems in 3 months with 2.1; the problem started the day we switched to 2.2.

I am in the process of making my WSDL clients singletons. I hope to be done by next week; that way I can confirm whether it is a problem with the HttpClient alone or whether WCF is doing something wrong.

Just to confirm: Did anyone hit it in 2.1 at all?

App working for 6 months on 2.1 without any issues, now happening on 2.2.

I'm trying to filter exactly which call is throwing the error, so I can isolate and run more tests.

Based on the replies here, I don't think it will be simple to create a repro (although it would be most helpful).
I would recommend getting any repro environment where we can experiment: collect more data, try private builds, etc.
If anyone has such an environment (incl. production) where they can experiment and work closely with us, please let me know and let's dig deeper into it ...

I have set up a number of ASP.NET Core 3.0 trial projects and applied Kubernetes orchestration support. In every single case, the first time I run a debug session everything works fine. Then when I stop the session and start it up again I get this error. The only way to get around it is to close VS2019 (not the preview version) and restart it for the next debugging session.

This does not happen with ASP.NET Core 2.2.

A screen grab of the issue:

[screenshot]

@simonziegler that problem is not related to this issue; please feel free to open a new issue so it can be discussed appropriately.

Also, it seems an odd error, like the debugger is not finalizing and releasing the port. I would try installing the latest versions of VS and .NET Core 3 to be sure, and I would use the "Report a Problem" option shipped with Visual Studio instead of this GitHub repo, which is code-related only.

System.Net.Http.HttpRequestException: Address already in use ---> System.Net.Sockets.SocketException: Address already in use at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)

This exception happens in both .NET Core 2.1 (sdk 2.1.505 & runtime 2.1.9) and .NET Core 2.2 (sdk 2.2.105 & runtime 2.2.3) in a k8s environment (v1.6.7/v1.9.7) after a long run (a few days), and never happens in .NET Core 2.0. I'll refactor each per-call HttpClient instance to use HttpClientFactory to try to resolve this problem, although this exception may still happen based on the previous replies.

@LukePulverenti did you validate your problem is indeed the same root cause and is fixed in .NET Core 3.0? (and that it is not just same symptom)
There seems to be enough +1s to justify backport, we just need to be sure it is the right fix ... first step would be to validate on 3.0. Then we can cherry pick and ask for private validation on 2.2/2.1 build.

For us, this is the one that we need:
https://github.com/dotnet/corefx/pull/32046/files

@LukePulverenti did you confirm that particular change helps your case? Or did you use latest .NET Core 3.0 to validate that?
@antonioortizpola above mentioned that the change (in .NET Core 3.0) does not help their scenario at all: https://github.com/dotnet/corefx/issues/37044#issuecomment-486335084

Reopening to track solution at least in 3.0

We still need someone to help us track this down:
Does anyone have an environment where it happens on a somewhat regular basis, where we could work with you to collect more logs and experiment? It would be a great help. Thanks!

It seems like we are mixing multiple issues here. Part of the discussion is about UDP and part about HttpClient.

My problem is with HttpClient, my project has two ways to use it:

  • Directly for our restful endpoints
  • Indirectly using WCF for soap services

It is the only thing that my project does and we are hitting the issue as this comment states.

Sadly, due to time pressure we just set up a workaround to restart the program each time this happens, and have been working on other things. But if it helps, I could separate my calls into two projects so I can be sure whether the problem comes from the SOAP services or our REST services.

OK, for HTTP: the error was really puzzling to me. Even though the man page for connect() mentions EADDRINUSE, I could not find it while looking at the Linux kernel sources.
The only place I could find it is for bind(), and we don't use that in HttpClient. It turned out we actually do, in Socket.ConnectAsync():

```c#
if (_rightEndPoint == null)
{
    if (endPointSnapshot.AddressFamily == AddressFamily.InterNetwork)
    {
        InternalBind(new IPEndPoint(IPAddress.Any, 0));
    }
    else if (endPointSnapshot.AddressFamily != AddressFamily.Unix)
    {
        InternalBind(new IPEndPoint(IPAddress.IPv6Any, 0));
    }
}
```

and then https://github.com/torvalds/linux/blob/master/net/ipv4/af_inet.c#L526-L532

```c
if (snum || !(inet->bind_address_no_port ||
              force_bind_address_no_port)) {
        if (sk->sk_prot->get_port(sk, snum)) {
                inet->inet_saddr = inet->inet_rcv_saddr = 0;
                err = -EADDRINUSE;
                goto out_release_sock;
        }
```

So this error can pop up if the system runs out of port numbers.

This can also happen if IPAddress.IPv6Any is not available but we try to connect on an IPv6/dual-mode socket. But that should be pretty deterministic, and I would not expect it to fail only some of the time (or be fixed by a restart).

I would suggest checking that and, for example, following https://www.cyberciti.biz/tips/linux-increase-outgoing-network-sockets-range.html

Note that if anybody can give it a try, you can do:

strace -f -o trace.txt -s 200 -t -e trace=connect,bind ./myCoolApp

I know the message is confusing, but at the bottom of this may be the system running out of resources.
Also note that it may be worth checking process limits for file descriptors and buffers.

It may be worth monitoring /proc/<PID>/fd to see if the descriptor count is going up.
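
For example, a quick way to watch the descriptor count from inside the app on Linux (a minimal sketch; /proc/self/fd is the calling process's own descriptor table):

```c#
using System;
using System.IO;
using System.Linq;

static class FdMonitor
{
    public static int CountOpenDescriptors()
    {
        // Each entry in /proc/self/fd is one open descriptor (sockets included).
        return Directory.EnumerateFileSystemEntries("/proc/self/fd").Count();
    }

    public static void Report() =>
        Console.WriteLine($"Open file descriptors: {CountOpenDescriptors()}");
}
```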

So this error can pop up if system runs out of port numbers.

This matches with my earlier comment https://github.com/dotnet/corefx/issues/37044#issuecomment-485055490

So either the system is running out of ports due to a limited port range, or HttpClient is leaking sockets (or keeping them open too long).

Repro Scenario for the UDP Bug

@karelz - you wrote:

@LukePulverenti did you validate your problem is indeed the same root cause and is fixed in .NET Core 3.0? (and that it is not just same symptom)
There seems to be enough +1s to justify backport, we just need to be sure it is the right fix ... first step would be to validate on 3.0. Then we can cherry pick and ask for private validation on 2.2/2.1 build.

and

We still need someone to help us track this down:
Does anyone have an environment where it happens on a somewhat regular basis, where we could work with you to collect more logs and experiment? It would be a great help. Thanks!

Following up on your chat with @LukePulverenti about backporting the fix to 2.2, I have created a reproduction scenario for you: https://github.com/softworkz/ReuseBug

The solution contains a native Linux app and a .NET Core console app, multi-targeting .NET Core 2.0, 2.2 and 3.0.

This demonstrates:

  • works in 2.0
  • fails in 2.2
  • works again in 3.0

I hope this helps get the fix backported to 2.2...

@softworkz thank you !

@karelz Yes, it would be great to get this back-ported, because ever since the 2.1 release we've had to tell users to shut down all other UPnP or DLNA software on the machine in order to prevent this from happening.

@softworkz @LukePulverenti I think we may be dealing with multiple problems here, as some people on this thread said that 3.0 does not fix it for them.
Either way, we have a repro now, so let's try it -- @tmds or @wfurt, will you have time to try it out and reproduce? If we can reproduce it in-house, it should be easier for us to track it down. I'd also be interested in the repro result on 2.1.

Thanks @softworkz for repro!!! That is a HUGE step towards root cause and solution. Let's hope we can reproduce it too :)

@karelz - Yes, ours is about the UDP bug https://github.com/dotnet/corefx/issues/32027, which was correctly fixed for 3.0 by PR https://github.com/dotnet/corefx/pull/32046, and we're hoping to get it backported to 2.2. It's getting a bit embarrassing having to tell users that our software cannot coexist with other DLNA software, especially once they've found out that any two other (non-netcore) applications can do that.. ;-)
(even worse is that it had been working in a previous version with netcore 2.0)

Regarding 2.1: It fails with 2.1 as well. I've just updated the repro solution (https://github.com/softworkz/ReuseBug) by adding 2.1 as additional target framework and publishing target.

Thanks @softworkz for repro!!! That is a HUGE step towards root cause and solution. Let's hope we can reproduce it too :)

@karelz @softworkz is talking about a UDP issue https://github.com/dotnet/corefx/issues/32027 which was decided not to be backported: https://github.com/dotnet/corefx/issues/32027#issuecomment-418447086.

The main issue reported here is a TCP issue observed when using HttpClient.

@karelz @softworkz is talking about a UDP issue dotnet/runtime#27274 which was decided not to be backported: #32027 (comment).

And still we're asking for it. It's a bug - not a "corner case".

The main issue reported here is a TCP issue observed when using HttpClient.

Not quite. We're not the only ones referring to the UDP bug here.

Not quite. We're not the only ones referring to the UDP bug here.

Yes, this is causing confusion, so it's good to make the difference clear. The issue reported here is for HttpClient/TCP, and it was assumed the UDP fix would solve it, which is not the case.

Agreed, this issue is already pretty confusing even without mixing it up with UdpClient. Let's keep this issue specific to HttpClient/TCP (I will update the title).
Let's move the discussion about backporting dotnet/runtime#27274 into a separate issue please (we can reuse dotnet/runtime#27274 for now) -- BTW: I would be interested in confirmation of what exactly the UdpClient symptoms are, in that issue, not here please. Note: so far I believe we have 2 customers hitting it.

@karelz - Can you move posts? Or should I repeat my information in the other issue?

@softworkz posts cannot be moved, please copy relevant information over. Thank you!

BTW: I hid the UdpClient comments from the thread to avoid further confusion.

Just to reiterate:
We still need someone to help us track the HttpClient problem down:
Does anyone have an environment where it happens on a somewhat regular basis, where we could work with you to collect more logs and experiment? It would be a great help that would unblock us. Thanks!
See instructions from @wfurt above: https://github.com/dotnet/corefx/issues/37044#issuecomment-495425689

@yuezengms @yanrez @arsenhovhannisyan @antoinne85 @blurhkh @EvilBeaver @antonioortizpola @LukePulverenti @sapleu @rrudduck @rbrugnollo @robjwalker @OpenSourceAries I'd like to ask you for 2 favors:

  1. Can you please confirm if your repro is truly on HttpClient/TCP and NOT UdpClient? (please confirm you're on HttpClient/TCP by upvoting this reply)
  2. Is any one of you in a position to collect additional logs and work with us to root-cause this problem? We would love to address it, but we have nothing actionable at this moment without help from someone who can hit the problem and collect additional info. Thanks!

We hit this issue on a .NET Core 2.2 application running on Azure Linux Kubernetes. We tried using .NET Core 3, and while this improved some of the issues we still consistently hit this error. Our investigations found that the HttpClient wasn't releasing ports despite the clients being disposed, though when running on Windows the client was releasing the ports. We updated our dependency injection to use the IHttpClientFactory and used it to create the HttpClients, which fixed our issues.

We tried using .NET Core 3, and while this improved some of the issues we still consistently hit this error. Our investigations found that the HttpClient wasn't releasing ports despite the clients being disposed.

Got a repro you can share?

We tried using .NET Core 3, and while this improved some of the issues we still consistently hit this error. Our investigations found that the HttpClient wasn't releasing ports despite the clients being disposed.

Got a repro you can share?

Sorry, unfortunately not.

These are the pages we found that helped us:
https://docs.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-2.2
https://docs.microsoft.com/en-us/dotnet/standard/microservices-architecture/implement-resilient-applications/use-httpclientfactory-to-implement-resilient-http-requests

@dmiller02 are you in a position to get back into the bad state and help us collect some logs?

@dmiller02 are you in a position to get back into the bad state and help us collect some logs?

Shouldn't be too difficult. We have the images for the service in question and can re-create the error.
What logging would you need?

@wfurt @stephentoub what kind of logs may help us confirm what is going on, on Linux?

To begin with, can we collect the output of netstat -natu when it happens and then run sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535" to see if that improves the situation?

I have run netstat -natu on our server, and it looks like the error is with the WSDL endpoints (the address 172.17.72.150 is a WSDL service).

To work with the WSDL we are adding the service as transient:

```c#
public static IServiceCollection AddExternalServices(
    this IServiceCollection serviceCollection,
    IConfiguration configuration)
{
    serviceCollection.AddGrpc();

    serviceCollection.AddTransient<CenamWsAddClaroMicroCredit>();
    return serviceCollection;
}
```

And then we use it in the service like this

```c#
public async Task<AddClaroMicroCreditResponse> AddClaroMicroCredit(RequestInfo requestInfo)
{
    var wsdlServiceClient = new WsdlServiceClient(
        CenamOperationClient.EndpointConfiguration.CenamOperationPort, _cenamOpConfig.CreditsEndpoint);
    var timedWsdlRequestWithLog = new TimedWsdlRequestWithLog(_logger, ServiceName);

    await _cenamWsThrottling.WaitToActionAndIncrement();
    var response = await timedWsdlRequestWithLog.ExecuteLogDurationAndReturnRequest(
        wsdlServiceClient.Endpoint, async () =>
            await wsdlServiceClient.addClaroMicroCreditAsync(
                _cenamOpConfig.Entity,
                requestInfo.Data1,
                requestInfo.Data2,
                requestInfo.Data3)
    );
    // work with the result...
}
```

The logging method is just a function that logs the request and response with the service name:

```c#
public async Task<RequestRawAndResult<T>> ExecuteLogDurationAndReturnRequest<T>(
    ServiceEndpoint serviceEndpoint, Func<Task<T>> action)
{
    var requestLogAndTime = new RequestLogAndTimeEndpointBehavior();
    serviceEndpoint.EndpointBehaviors.Add(requestLogAndTime);

    T response;
    try
    {
        response = await action();
    }
    catch (Exception e)
    {
        _logger.LogError(e,
            "Unexpected error requesting to wsdl service '{serviceName}' in {responseTime}ms to '{serviceEndpointAddress}' with body '{request}' responded: {response}",
            _serviceName,
            requestLogAndTime.LastResponseTimeInMillis,
            serviceEndpoint.Address,
            requestLogAndTime.LastRequestXml,
            requestLogAndTime.LastResponseXml);
        throw;
    }

    _logger.LogInformation(
        "Request to service '{serviceName}' in {responseTime}ms to '{serviceEndpointAddress}' with body '{request}' responded: {response}",
        _serviceName,
        requestLogAndTime.LastResponseTimeInMillis,
        serviceEndpoint.Address,
        requestLogAndTime.LastRequestXml,
        requestLogAndTime.LastResponseXml);

    return new RequestRawAndResult<T>(response, requestLogAndTime.LastRequestXml, requestLogAndTime.LastResponseXml);
}
```

Maybe something inside the WSDL client is causing the issue (like creating an HttpClient by itself), but then how could I work around this? This approach had no problems at all with Core 2.1.

From the log:

tcp        1      0 172.20.0.2:38736        172.17.72.150:28085     CLOSE_WAIT

That means the server or client did not finish closing the socket.
You should also see this with lsof -i -n -p <PID>.
Can you please do a packet capture for a few requests, @antonioortizpola? I'm wondering if this happens on every request and it just takes some time to use up all the port numbers.

I'm not familiar with the WSDL code. Can you craft a runnable repro, just like the one we got for the UDP case? I think this also depends on the server not closing the socket, so we may not be able to reproduce it with only the client side.

We do not need to hit the bind error. All we need to reproduce is getting a new socket stuck in the CLOSE_WAIT state.

@wfurt sure, I can strip my project down by removing all the code, leaving just one WSDL client.

I will try to work on this over the weekend (right now I am in the office and have some other tasks), creating a simple solution and making some packet captures with Wireshark; then I will let you know what I find.

thanks @antonioortizpola. It would be nice to get to the bottom of this. Seeing half-closed TCP is certainly a clue.

BTW, any chance @antonioortizpola that you can share a core dump from the time when it is failing? (It does not need to reach port exhaustion; we just need a few sockets in the half-closed state.)
It will be large, and it can contain secrets or private data. But if we can work that out, I think we would be able to sort this out.
If that is not possible, we may be able to script the dump-file processing, or I can guide you through a sequence to get useful info out.

Ok, I have the example!!! @wfurt, @karelz, I will write a readme, but the results are clear.

I created two projects, one with Core 3 and one with Core 2.1. It should be virtually the same code, but after some stress tests, the version with Core 3 does not release the ports:

....
tcp        1      0 172.18.0.3:39787        172.18.0.4:80           CLOSE_WAIT
tcp        1      0 172.18.0.3:34693        172.18.0.4:80           CLOSE_WAIT
tcp        1      0 172.18.0.3:46675        172.18.0.4:80           CLOSE_WAIT
tcp        1      0 172.18.0.3:38431        172.18.0.4:80           CLOSE_WAIT
tcp        1      0 172.18.0.3:45375        172.18.0.4:80           CLOSE_WAIT
tcp        1      0 172.18.0.3:46011        172.18.0.4:80           CLOSE_WAIT
tcp6       0      0 :::80                   :::*                    LISTEN
udp        0      0 127.0.0.11:49014        0.0.0.0:*

While the 2.1 version does:

root@1b70aaaed9f2:/app# netstat -natu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.11:33615        0.0.0.0:*               LISTEN
tcp        0      0 172.18.0.2:35071        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:43885        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:43295        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:33653        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:45709        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:37075        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:36519        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:34009        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:35771        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:46763        172.18.0.4:80           TIME_WAIT
udp        0      0 127.0.0.11:55536        0.0.0.0:*
root@1b70aaaed9f2:/app# netstat -natu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.11:33615        0.0.0.0:*               LISTEN
udp        0      0 127.0.0.11:55536        0.0.0.0:*
root@1b70aaaed9f2:/app#

It is a little late now, but tomorrow I can upload the project and give you access so you can run it yourselves

Ok, I have the repo; I invited @karelz and @wfurt. I hope this helps, and please let me know if I can help with anything else.

thanks @antonioortizpola, I will take a look. Are you suggesting 3.0 fixes the problem?

@wfurt nooo, the problem did not happen in Core 2.1, but it is happening in 2.2 and 3. And you are welcome; again, if I can help with anything else just let me know.

You can simply update the app from 2.1 to 2.2 and the problem will appear.

This is AWESOME! Thanks a lot @antonioortizpola, fingers crossed that we will be now able to quickly root-cause it and fix it in 3.0/2.2! 🙏

@wfurt, were you able to reproduce the scenario? Just to know if the repo code worked for you, or if there is something more I can help with.

I'm still digging through it, @antonioortizpola.
I got the services up and I could see:

root@9262db4be2d7:/app# netstat -natu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.11:42827        0.0.0.0:*               LISTEN
tcp        0      0 172.18.0.3:43294        5.153.231.4:80          TIME_WAIT
tcp        0      0 172.18.0.3:43286        5.153.231.4:80          TIME_WAIT
tcp        0      0 172.18.0.3:34212        151.101.52.204:80       TIME_WAIT
tcp        0      0 172.18.0.3:34218        151.101.52.204:80       TIME_WAIT
tcp        0      0 172.18.0.3:34214        151.101.52.204:80       TIME_WAIT
tcp6       0      0 :::80                   :::*                    LISTEN
udp        0      0 127.0.0.11:34452        0.0.0.0:*

but if I wait a little bit, all the connections go away.

root@9262db4be2d7:/app# netstat -natu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.11:42827        0.0.0.0:*               LISTEN
tcp6       0      0 :::80                   :::*                    LISTEN
udp        0      0 127.0.0.11:34452        0.0.0.0:*

I also could not find any direct usage of HttpClient. Everything seems to be wrapped in some high-level calls so I'll need to unwind that.

@wfurt, did you run the Core 3 version or 2.1? That is the behavior I got from the 2.1 version. In the Core 3 version the sockets would never exit the CLOSE_WAIT state, even an hour after the test ended.

A little confused by the comment from @karelz. The fix should be applied to 3.0/2.2, as 2.1 does not have this socket issue, but rather the memory thing that should be fixed/merged in 2.2.

@jarlehal typo, fixed, thanks for pointing it out.

@wfurt has been doing a good job digging into this, and shared with me that he noticed something suspicious: in a repro analyzed with SOS, there ended up being a small number of Sockets on the heap but a large number of SafeSocketHandles. Based on that, I have a theory that this is due to https://github.com/dotnet/corefx/pull/32845 / https://github.com/dotnet/corefx/pull/32793. I don't think it actually caused the problem so much as the bug it was fixing was masking this problem, which has existed for a long time.

SocketsHttpHandler creates a Socket for each connection. Each Socket creates a SafeSocketHandle (that’s its name in 3.0; prior to that it was internal and named SafeCloseSocket), a SafeHandle that wraps the underlying file descriptor (there’s actually a secondary SafeHandle in the middle, but that’s not relevant). On Unix, when the Socket is connected, it’s registered with the SocketAsyncEngine, which is the code responsible for running the event loop interacting with the epoll handle. Whenever the epoll wait shows that there’s work available to be done, the event loop maps the relevant file descriptor back to the appropriate SafeSocketHandle so that the relevant work can be performed and callbacks invoked. In order to do that mapping, the SocketAsyncEngine stores a ConcurrentDictionary, and the engines themselves are stored in a static readonly SocketAsyncEngine[] array… the punchline here is that these SafeSocketHandles end up being strongly rooted by a static array.
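To make that rooting concrete, here is a minimal sketch of the object graph described above. It is illustrative only, not the actual corefx source; `FakeSocketAsyncEngine` and its members are invented names.

```c#
// Minimal sketch of the rooting described above (illustrative names only).
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;

internal sealed class FakeSocketAsyncEngine
{
    // Static array of engines -> each engine holds a dictionary -> every
    // registered handle is strongly rooted for the lifetime of the process.
    private static readonly FakeSocketAsyncEngine[] s_engines =
        new[] { new FakeSocketAsyncEngine() };

    // Maps the file descriptor reported by epoll back to the managed handle.
    private readonly ConcurrentDictionary<IntPtr, SafeHandle> _handleMap =
        new ConcurrentDictionary<IntPtr, SafeHandle>();

    public static void Register(IntPtr fd, SafeHandle handle) =>
        s_engines[0]._handleMap[fd] = handle;          // strong reference: GC cannot reclaim it

    public static void Unregister(IntPtr fd) =>
        s_engines[0]._handleMap.TryRemove(fd, out _);  // only explicit Dispose paths call this
}
```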

The other important piece of information is that there’s a Timer inside SocketsHttpHandler that runs periodically to check whether connections in the connection pool are still viable, and if they’re not, Dispose’s of them. The bug that the aforementioned issues fixed was that there was an unexpected cycle formed between the timer and the connection pool that ended up keeping everything alive indefinitely, resulting in a non-trivial memory leak. However, as a side effect of that leak, it meant that the timer would continue to run, and every time it fired, it would loop through all of the open connections and Dispose of the ones that were no longer viable. In the fullness of time, all of them would get Dispose’d. Disposing of the connection would dispose of the Socket which would Dispose of the SafeSocketHandle and remove it from the SocketAsyncEngine’s dictionary.

Now, with the aforementioned fixes, if code fails to Dispose of the HttpClient/SocketsHttpHandler when done with them and drops the references to them, the timer gets collected, as does the connection pool, as do all of the HttpConnection objects in the pool. None of those have finalizers, nor should they need them. But here’s the rub. Socket does have a finalizer, yet its finalizer ends up being a nop. Since the storing of the SafeSocketHandle into the static dictionary isn’t something that can be undone automatically by GC, we actually need a finalizer to remove that registration should everything get dropped. Since all those objects don’t have finalizers, and since Socket’s finalizer isn’t doing the unregistration, everything gets collected above the SafeSocketHandle, which then remains registered effectively forever, never being disposed of, and never closing its file descriptor.

I don’t know for certain whether this is the cause of this issue. It’s just a theory, and @wfurt is working through the repro, debugging, and testing out theories. If this doesn’t turn out to be the root cause here, I suspect it’s still a bug we need to fix. If it does turn out to be the root cause, I don’t think the fix is to revert the aforementioned fixes: they were valid, they just revealed this existing problem they had been masking by creating a different leak that in turn allowed the timer to dispose of these resources... plus, this issue would apply to all uses of Sockets that weren't disposed of, not just those used from SocketsHttpHandler. The actual fix would likely be to either use a weak reference when storing the SafeSocketHandle into the dictionary (which might be the right fix but could also potentially cause perf or otherwise unforeseen problems), or to ensure that a finalizer is put in place to undo that registration (most likely changing Socket’s finalizer accordingly on Unix).
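Here is a hedged sketch of the second option (the finalizer), building on the `FakeSocketAsyncEngine` sketch above. The key subtlety is that the finalizer has to live on the outer Socket-like object rather than on the registered handle itself, because the static map keeps the handle reachable and would prevent its own finalizer from ever running.

```c#
// Illustrative only: the finalizer goes on the Socket-level wrapper, which is
// NOT in the static map and therefore can still be collected and finalized.
using System;
using Microsoft.Win32.SafeHandles;

internal sealed class SketchSocket : IDisposable
{
    private readonly SafeFileHandle _handle;  // stands in for SafeSocketHandle
    private readonly IntPtr _fd;

    public SketchSocket(IntPtr fd, SafeFileHandle handle)
    {
        _fd = fd;
        _handle = handle;
        FakeSocketAsyncEngine.Register(fd, handle);  // handle is now strongly rooted
    }

    public void Dispose()
    {
        FakeSocketAsyncEngine.Unregister(_fd);  // deterministic cleanup
        _handle.Dispose();                      // closes the file descriptor
        GC.SuppressFinalize(this);
    }

    // If the caller drops the socket without disposing it, this removes the
    // last root of the handle so its own SafeHandle finalizer can close the fd.
    ~SketchSocket() => FakeSocketAsyncEngine.Unregister(_fd);
}
```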

In the meantime, assuming this is the cause, in addition to fixing it in System.Net.Sockets, code using HttpClient/HttpClientHandler/SocketsHttpHandler should also be Dispose'ing of those instances when done with them. If you just create a single HttpClient instance that's stored in a static, there's no real need to dispose of it, as everything will go away when the process ends. But if you're doing something that creates an instance, uses it for one or more requests, and then get rid of it, when getting rid of it it should be disposed.
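In code, the two patterns look roughly like this (a minimal sketch of the guidance above; names are illustrative):

```c#
using System.Net.Http;
using System.Threading.Tasks;

public static class HttpClientUsage
{
    // Preferred: one long-lived instance reused for all requests. No real need
    // to dispose it; everything goes away when the process ends.
    private static readonly HttpClient s_client = new HttpClient();

    public static Task<string> GetWithSharedClientAsync(string url) =>
        s_client.GetStringAsync(url);

    // If you do create a short-lived instance, dispose it when done so the
    // handler, its connection pool, and the sockets are closed deterministically.
    public static async Task<string> GetWithTransientClientAsync(string url)
    {
        using (var client = new HttpClient())
        {
            return await client.GetStringAsync(url);
        }
    }
}
```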

cc: @geoffkizer, @tmds

I'm making some progress. On the note above: if you add cenamOperationClient.Close() to GetSubscriberDetailsF() after the response is received to close the WCF client, there are no lingering sockets at all @antonioortizpola.
With the old platform handlers the socket could be closed independently, but now the usual reference counting applies, and failing to dispose the HttpClient when it is no longer used can lead to delayed release of resources. Likewise, any live reference to the HttpResponseStream will keep the underlying socket open.
I think there is definitely an issue with 2.2+, but there can be more than one reason for the observed behavior.

We have the same issue. All connections created with IHttpClientFactory stay in the CLOSE_WAIT state.

All connections created with IHttpClientFactory stay in the CLOSE_WAIT state

@glennc, @rynowak, is HttpClientFactory disposing of all handlers it creates?

@vasicvuk, are you disposing of all HttpResponseMessages you're given and response Streams you're given?

@wfurt, thanks a lot for your investigation! I will change the code so the service closes the connection.

I am glad the repository could help to replicate the problem and I hope it could help others.

Please correct me if I am wrong, but I think the main problems are:

  • People who are not using HttpClient correctly.
  • People who use libraries that use sockets or HttpClient and make assumptions based on previous behavior (like WCF and me).

For the first group, please make sure you are using HttpClient correctly; most probably that will fix the problem and improve your system.

For the second group, search for methods that close the connection or for IDisposable implementations, and run tests while monitoring your sockets (for example with netstat -natu) to check whether that fixes the problem. Or check if you can reuse your connections.

If the problem persists, tell us how you are using the client, socket, or library, and if possible create a simple repository with a reproduction case.

The repro was extremely useful, thanks @antonioortizpola.

In either case we should not leak OS resources, and right now we do in some cases with the 2.2 code.
Disposing explicitly is best, as everything is released as soon as it is no longer needed. Otherwise the socket can stay open until the GC kicks in, and that may take some time depending on many variables.

I think the main problems are

There are two issues here:

  1. There's a bug in Sockets on Unix where if you allow a connected Socket to be collected without it having been disposed, it'll leak the underlying handle.
  2. Consumers of HttpClient are sometimes not using it correctly, leading to the above bug getting triggered.

Fixing either of those is technically sufficient to address this issue, although even when (1) is fixed, it's important for (2) to be done, as the fix for (1) is still non-deterministic and could take an unbounded amount of time to kick in.

@stephentoub Hi,
We are using HttpClientFactory and using blocks for HttpResponseMessages. So as I understand it, the issue is that the socket is not disposed for some time if there is no explicit dispose in the code. I guess that a using block will fix this?

I guess that a using block will fix this?

Yes, a using block causes Dispose to be called.
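For example (a sketch; `client` and `url` stand in for whatever your own code supplies):

```c#
// Disposing the response releases its content stream, which lets the pooled
// connection be reused or closed instead of lingering.
public static async Task<string> GetBodyAsync(HttpClient client, string url)
{
    using (HttpResponseMessage response = await client.GetAsync(url))
    {
        return await response.Content.ReadAsStringAsync();
    }
}
```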

I submitted a fix to 3.0 master. It would be great if anybody could grab a daily build and verify that it solves their issue.

Since this is a somewhat generic error, there may be more than one issue under the covers. In either case any feedback would be useful.

kudos to @antonioortizpola who was able to isolate repro.

Will it be possible to get this fix in 2.2?

Possibly. It will be easier to get approval if we can confirm that this fix solves the observed issues, e.g. try 3.0 before and after.

For us it is hard to get the exact issue symptoms isolated from the application code.

This issue is causing big trouble for us in production apps.

We reviewed all HttpClient usages in our code and migrated to HttpClientFactory. But it still happens from time to time.

The interesting thing is that we have 2 similar apps (they use HttpClient the same way) on Azure Kubernetes and GKE (Google's Kubernetes), and the issue only happens on Azure.

Any plans to fix it in 2.2?

2.2 port is waiting for verification @alxbog. We need to get enough evidence that dotnet/corefx#38499 fixes it or we need separate repro for 2.2. Until then it is unlikely we get permission for 2.x changes.

If someone is willing to try a private build with the fix ported to 2.2/2.1, that would help us prove it is worth porting.
If you are that person, let us know and we can provide private builds with the change ported for testing ...

Can anyone confirm if this documentation is correct? I've seen some comments suggesting we should be disposing HttpClients now even though the docs pretty much say the opposite: https://docs.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-3.0#httpclient-and-lifetime-management

@phillijw note that closed issues are not monitored. The docs you linked are for IHttpClientFactory (part of ASP.NET code).
In general, you should not dispose HttpClient, but if you want to avoid the stale DNS records problem, it is healthy to recycle your static instance on a regular basis (HttpClientFactory does it for you).
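For those on 2.1+ who want DNS refresh without recycling the client manually, SocketsHttpHandler also exposes PooledConnectionLifetime. A sketch (the 15-minute value is an arbitrary example, not a recommendation):

```c#
using System;
using System.Net.Http;

// Capping the pooled connection lifetime forces fresh connections (and fresh
// DNS lookups) periodically, while the HttpClient instance itself stays static.
var handler = new SocketsHttpHandler
{
    PooledConnectionLifetime = TimeSpan.FromMinutes(15)  // arbitrary example value
};
var client = new HttpClient(handler);  // keep and reuse this single instance
```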

Can anyone confirm if this documentation is correct? I've seen some comments suggesting we should be disposing HttpClients now even though the docs pretty much say the opposite

The short answer is: HttpClient is an IDisposable, and as with any IDisposable, you should Dispose of it whenever you're done with it.

The question then becomes "when should I be done with it?"

If you're creating your own HttpClient instance, e.g. new HttpClient() or new HttpClient(new SocketsHttpHandler()), then you should ideally be reusing the instance over and over and over, rather than creating a new one per request. That's because the underlying handler owns the connection pool; disposing of the handler will dispose of the connection pool, and the aforementioned constructors end up using the public HttpClient(HttpMessageHandler handler, bool disposeHandler) constructor with disposeHandler:true. You still want to Dispose of the HttpClient when you're done with it, but in the common case you shouldn't ever be "done" with it, as you just stash it into a static field and use it for all subsequent requests. If you do decide to be done with it at some point, such as if you want to replace it with a different instance for some reason, then you'd want to Dispose of it then. Again, in this way, it's no different from any other IDisposable: it owns resources, so dispose of it when you no longer need those resources.

However, the docs you link to are for IHttpClientFactory. It muddies the waters a little, because it maintains and manages its own set of HttpMessageHandler instances, and its design is to give you back a new HttpClient instance every time you ask for one: the intent is that you use that HttpClient for the lifetime of your request, at which point you're "done" with it and, per my previous comments, it should be disposed. That instance wraps one of these shared HttpMessageHandler instances, but was constructed with the disposeHandler:false argument:
https://github.com/aspnet/Extensions/blob/c2147ae6a07c5ebf6aa6ef2f8de86e0851fc13ca/src/HttpClientFactory/Http/src/DefaultHttpClientFactory.cs#L117-L134
such that disposing of the HttpClient won't dispose of the underlying shared handler.
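Put differently, the factory pattern looks like this (a sketch with a hypothetical MyService; the factory owns the shared handlers, so disposing the per-request HttpClient wrapper is cheap and safe):

```c#
using System.Net.Http;
using System.Threading.Tasks;

public class MyService
{
    private readonly IHttpClientFactory _factory;

    public MyService(IHttpClientFactory factory) => _factory = factory;

    public async Task<string> GetAsync(string url)
    {
        // A fresh, cheap HttpClient wrapper per request...
        using (HttpClient client = _factory.CreateClient())
        {
            // ...over a shared handler that the factory manages and recycles.
            return await client.GetStringAsync(url);
        }
    }
}
```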

Thanks for the explanation @stephentoub. The clarity on disposeHandler is what I was really missing. I feel like the examples in the docs could be updated to discuss that point a bit more, or to show examples where the HttpClient DOES get disposed. For instance, https://docs.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-3.0#use-ihttpclientfactory-in-a-console-app does not dispose even though it fits your example of where it should be disposed, I think.

does not dispose even though it fits your example of where it should be disposed

Yes, the client in the GetPage method in that sample should be disposed. Thanks for pointing that out.
cc: @glennc, @rynowak

@karelz I experience this problem with .net core 2.2. Is it possible to get private libs to test the backport?

@wfurt can you please create a private build against 2.2 for @MrZoidberg to test?

UPDATE: Disregard this comment... the System.Net.Sockets.SocketException: Address already in use bug kept happening in both cases. I only solved it by moving my class from services.AddScoped to services.AddSingleton.

PREVIOUS COMMENT (again, disregard):
I'm getting this error a lot now in a situation where I've switched from:

```c#
httpResponse = await _client.GetAsync(method);
```

to:

```c#
using (var req = create(HttpMethod.Get, finalUrl)) {
    httpResponse = await _client.SendAsync(req);
}

protected virtual HttpRequestMessage create(HttpMethod method, string url, string postBody = null) {
    var req = new HttpRequestMessage(method, url);
    return req;
}
```

I'm disposing of httpResponse after deserializing the response in both cases. Any opinion on the code @karelz @stephentoub? I'm on .NET Core 2.2 and would prefer not to upgrade to .NET Core 3.0, unless it would help you further debug and solve the problem.

The problem with the socket comes from the HttpClient, not the HttpRequestMessage. How are you generating the HttpClient? Are you using DI or HttpClientFactory?

Have you registered your HttpClient at bootstrap? I.e. something like this:
```c#
// Typed-client registration (MyService is the example service below).
services.AddHttpClient<MyService>(c =>
{
    c.BaseAddress = new Uri("https://base_uri");
    c.Timeout = TimeSpan.FromSeconds(30);
}).SetHandlerLifetime(TimeSpan.FromSeconds(30));
```

Then from your `MyService` inject it in:
```c#
private readonly HttpClient _client;

public MyService(HttpClient client)
{
    _client = client;
}
```

Implementation should be as simple as:
```c#
using (var response = await _client.GetAsync(url))
{
    return await response.Content?.ReadAsStringAsync();
}
```
Since you're using the GET method, the GetAsync(url) function should be sufficient.

This lets .NET Core handle all your connection pooling for you so you don't have to worry about managing it yourself.

I'm creating the HttpClient by myself in both cases. However, if I use HttpRequestMessage and SendAsync, problems with System.Net.Sockets.SocketException: Address already in use start.

If I just use GetAsync - no problems.

@karelz @stephentoub can you please review this and provide any feedback? I'm hoping that if you investigate in this direction you can potentially solve the problem for others as well.

However, if I use HttpRequestMessage and SendAsync, problems with System.Net.Sockets.SocketException: Address already in use start.

Can you share the code you use with SendAsync? Are you calling it with ResponseHeadersRead?
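For context on why that question matters, here is a sketch (illustrative method and names; `client` and `url` are assumed): with HttpCompletionOption.ResponseHeadersRead the body is not buffered, so the connection stays checked out of the pool until the response and its stream are disposed.

```c#
// With ResponseHeadersRead, forgetting any of these usings keeps the
// connection (and its socket) checked out of the pool.
public static async Task ConsumeAsync(HttpClient client, string url)
{
    using (var request = new HttpRequestMessage(HttpMethod.Get, url))
    using (var response = await client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead))
    using (var stream = await response.Content.ReadAsStreamAsync())
    {
        // read from the stream here; disposing everything releases the connection
    }
}
```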

I got the same error in production too (.NET Core 2.2, Ubuntu, K8s).
We followed the sample code the ASP.NET docs suggest; here is the demo code:

In Startup.cs:
```c#
public void ConfigureServices(IServiceCollection services)
{
    services.AddMvc();

    // clients
    services.AddHttpClient<IAddressClient, AddressClient>()
        .AddHttpMessageHandler(handler => new TimingOutDelegatingHandler())
        .AddHttpMessageHandler(handler => new RetryPolicyDelegatingHandler());
}
```

Here is AddressClient.cs:
```c#
public AddressClient(
    HttpClient httpClient,
    IOptions<UsersApiDomainConfig> usersApiConfig,
    ILogger<AddressClient> logger)
    : base(usersApiConfig)
{
    _httpClient = httpClient;
    _logger = logger;
}
```

Here is the use of the HttpClient:
```c#
public async Task SyncAddressAsync(AddressModel address)
{
    var url = GetUrl(Path);
    var response = await _httpClient.PutAsync(
        url,
        new StringContent(JsonConvert.SerializeObject(address), Encoding.UTF8, HttpClientConstants.ApplicationJson));
}
```

I think we did as the docs suggest.

However, if I use HttpRequestMessage and SendAsync, problems with System.Net.Sockets.SocketException: Address already in use start.

Can you share the code you use with SendAsync? Are you calling it with ResponseHeadersRead?

Thanks for your help @stephentoub and @OpenSourceAries ... in the end I solved my problem in a different fashion since I was still getting errors... updated the original post: https://github.com/dotnet/corefx/issues/37044#issuecomment-545038295

I'm also seeing this exception with the MongoDB driver in K8s, which isn't using HttpClient. So does this fix cover that, or is the scope larger than originally anticipated?
