runtime 🚀 - Address "System.Net.Sockets.SocketException: Address already in use" on K8S/Linux using HttpClient/...

@karelz , @davidsh - can you have a look?

leecow on 26 Jan 2019

Duplicate of https://github.com/dotnet/corefx/issues/32027

karelz on 26 Jan 2019

Can you please try .NET Core 3.0? It was fixed there ...

karelz on 26 Jan 2019

😄2 👍2

I'll try upgrading. Thanks!

yuezengms on 26 Jan 2019

❤1

@karelz do you plan to backport this fix to 2.2 ?

yanrez on 26 Jan 2019

👍5

@yanrez not unless there is some good business justification - higher number of affected customers. Is that the case? (this is 2nd ask "only" so far)
Also, it would help to validate this is truly the root cause, e.g. by trying 3.0 preview/daily build.

karelz on 26 Jan 2019

It is happening in some of the clusters. It doesn't seem to repro consistently, but some pods go into this state and stay in it until they are terminated. I understand 3.0 is still a few months away (I don't know the actual timeline though), so my question about 2.2 was based on the assumption that a hotfix for 2.2 could come earlier than the 3.0 release.
We will look into trying to upgrade and see if it solves the issue.

yanrez on 26 Jan 2019

❤1

BTW: It might be good for you to register as MS employees - at least by linking your accounts: https://github.com/dotnet/core/blob/master/Documentation/microsoft-team.md ... that allows other FTEs to see you are MSFT ;)

karelz on 26 Jan 2019

👍1

@yanrez what is the service? How large is it roughly? How often does it happen? ... That may support the business justification (even if you were not MSFT ;)).
If you can verify on 3.0 that would be great. Either way, we may need to verify on early 2.1/2.2 build as part of "test signoff" to make sure we are fixing the real root cause here.

karelz on 26 Jan 2019

I will follow up offline, but in some of the regions we see it happening more often, taking down several pods in our k8s cluster per day. It's very annoying at the moment, costing us a few dev-hours a day to act on it and mitigate.
We are also looking into automated mitigation: a liveness probe that checks whether we have started getting these exceptions and signals k8s to kill the pod. Unfortunately, that is also a non-trivial amount of dev work to build and deploy. Since we can't predict the frequency of the issue, the risk is that the liveness probe might still impact our availability and cause us to miss our SLA.

yanrez on 26 Jan 2019

👍3

I have the same exception on AKS (ver. 1.11.4) with the container microsoft/dotnet:2.2-aspnetcore-runtime, region West Europe.

arsenhovhannisyan on 4 Feb 2019

Just chiming in to say that my business is also experiencing this issue:
AKS (v1.9.6)
Region: Central US
Image: microsoft/dotnet:2.2-aspnetcore-runtime

antoinne85 on 7 Feb 2019

We applied an automated mitigation that counts these exceptions and reports a negative signal to the k8s liveness check. It helped us mitigate the impact. We haven't yet verified whether the latest builds of .NET Core 3 resolve the issue.

yanrez on 7 Feb 2019
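
The thread does not show how this counting mitigation was built, so the following is only a minimal sketch of the idea. The `SocketFailureTracker` name and the threshold are hypothetical; a liveness endpoint would return a failing status once `IsHealthy` turns false so that k8s restarts the pod.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not from the thread): counts "Address already in use"
// failures so that a liveness endpoint can start reporting unhealthy.
public sealed class SocketFailureTracker
{
    private readonly int _threshold;
    private int _failures;

    public SocketFailureTracker(int threshold)
    {
        if (threshold < 1) throw new ArgumentOutOfRangeException(nameof(threshold));
        _threshold = threshold;
    }

    // Call this from the code path that catches the socket failure.
    public void RecordFailure() => Interlocked.Increment(ref _failures);

    // Healthy while fewer than 'threshold' failures have been recorded.
    public bool IsHealthy => Volatile.Read(ref _failures) < _threshold;
}
```

The counter never resets in this sketch, on the assumption that the process is restarted once it reports unhealthy.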

FWIW, we implemented the same liveness check but then subsequently managed to fix the issue altogether in our deployment.

For us, we had a service client that was using HttpClient internally. The class was getting instantiated for each incoming request by the DI container (resulting in a new HttpClient for each incoming request). We changed the way the client was registered such that it is only instantiated once for the entire application and the issue was resolved.

antoinne85 on 7 Feb 2019
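
A minimal sketch of the "instantiate once" idea described above, assuming a hypothetical `BackendClient` wrapper (the real service client and its DI registration are not shown in the thread). Every wrapper instance shares one static `HttpClient`, so a per-request wrapper no longer creates a per-request connection pool.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical service client; the name and members are illustrative only.
public sealed class BackendClient
{
    // One HttpClient, and therefore one connection pool, for the whole process.
    private static readonly HttpClient SharedClient = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30) // arbitrary value for the sketch
    };

    // Exposed so callers can verify that the instance is reused.
    public HttpClient Http => SharedClient;

    public Task<string> GetAsync(Uri uri) => SharedClient.GetStringAsync(uri);
}
```

Registering the wrapper itself as a singleton in the DI container, as the comment describes, achieves the same effect without the static field.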

I have been experiencing this problem since yesterday: everything that uses a socket throws "System.Net.Sockets.SocketException: Address already in use", including the MySQL connection, Redis, and HttpClient.

EventHorizon1024 on 28 Feb 2019

> For us, we had a service client that was using HttpClient internally. The class was getting instantiated for each incoming request by the DI container (resulting in a new HttpClient for each incoming request). We changed the way the client was registered such that it is only instantiated once for the entire application and the issue was resolved.

@antoinne85 this is a common mistake people make with HttpClient, and one of the reasons why IHttpClientFactory was added in .NET Core 2.1 (another being that singleton HttpClients do not respect DNS changes by default).

See https://docs.microsoft.com/en-us/dotnet/standard/microservices-architecture/implement-resilient-applications/use-httpclientfactory-to-implement-resilient-http-requests

> The original and well-known HttpClient class can be easily used, but in some cases, it isn't being properly used by many developers.
>
> As a first issue, while this class is disposable, using it with the using statement is not the best choice because even when you dispose HttpClient object, the underlying socket is not immediately released and can cause a serious issue named ‘sockets exhaustion’. For more information about this issue, see You're using HttpClient wrong and it's destabilizing your software blog post.

Kirides on 28 Feb 2019

Same problem with NEST elasticsearch client on Linux under Core 2.2. Backporting fix to 2.2 would be nice

EvilBeaver on 6 Mar 2019

👍2

Same here @Kirides, @karelz. We are also hitting this issue on a site with a lot of traffic (10-15 requests per second, maybe more). It worked fine in 2.1; the issue has been happening since we updated to 2.2. It is happening with our HttpClients, even though we are already using IHttpClientFactory.

We have to restart our docker containers to fix the problem, and it is happening at least once a day. I am also afraid to update to 3, since it is a production site and the official release is not ready yet.

antonioortizpola on 25 Mar 2019

> We have to restart our docker containers to fix the problem

I'm doing the same. I set the docker container mode to "restart=always", and in my app I catch this SocketException. If it's caught, I kill the app and the docker engine restarts it. This works fine, but for complex apps it should be fixed properly at the dotnet level.

EvilBeaver on 26 Mar 2019

👍1
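
A minimal sketch of the detection half of this catch-and-restart workaround, checking the socket error code instead of matching on the message text. The `SocketErrors` helper is hypothetical, and it assumes the `SocketException` is preserved somewhere in the `InnerException` chain (the trace quoted in this thread shows that for `HttpRequestException`; whether WCF exceptions preserve it is not confirmed here).

```csharp
using System;
using System.Net.Sockets;

// Hypothetical helper (not from the thread).
public static class SocketErrors
{
    // Walks the InnerException chain looking for a SocketException whose
    // error code is AddressAlreadyInUse.
    public static bool IsAddressAlreadyInUse(Exception exception)
    {
        for (var current = exception; current != null; current = current.InnerException)
        {
            if (current is SocketException socketException &&
                socketException.SocketErrorCode == SocketError.AddressAlreadyInUse)
            {
                return true;
            }
        }

        return false;
    }
}
```

A caller could then terminate the process (for example with `Environment.Exit(1)`) when this returns true and let the container restart policy bring the app back up.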

@karelz I have an app with a large number of users affected by this. Any Dlna media app will be affected by this. A back-port would be much appreciated. Thanks.

LukePulverenti on 5 Apr 2019

👍1

@LukePulverenti did you validate your problem is indeed the same root cause and is fixed in .NET Core 3.0? (and that it is not just same symptom)
There seems to be enough +1s to justify backport, we just need to be sure it is the right fix ... first step would be to validate on 3.0. Then we can cherry pick and ask for private validation on 2.2/2.1 build.

karelz on 5 Apr 2019

It is becoming really frustrating. Our code with 2.1 has a memory leak, which we fixed with 2.2, but we cannot update our servers because of this issue. I really do not want to update to 3, since it is not prod ready yet, the migration is not so straightforward, and we have some dependencies that we are not sure will work as-is in Core 3 (like StructureMap, which we are changing to Lamar).

@karelz I will try to update my project and let you know. The problem is that I would need to update to 3 and publish to prod, since this error is only seen after some hours (sometimes 2, sometimes more than a day) with high traffic, so I need to be really careful.

antonioortizpola on 5 Apr 2019

@antonioortizpola understood. I was hoping someone has a "repro" in environment where trying out and deploying 3.0 for a few days would be ok.
Alternatively, if someone is able to build a private version out of the 2.2/2.1 servicing branch with the cherry-picked fix, that would be preferred even for us. It is just a bit more involved on the prep side.

karelz on 5 Apr 2019

@karelz OK, after some hard work I was able to update to Core 3. I was excited, since our project is a gRPC server and I tried the gRPC template. Sadly, after only around 8 hours of running, we hit the same issue.

The project is very simple: it just receives the gRPC request and makes an HTTP call or a WSDL call to an external service. These services have various response times, from 200 milliseconds to timeouts after one minute. It then returns the response object as is, with no complex processing, no database connections or anything weird.

When the error starts happening, all the HTTP clients start showing the errors, both the direct ones and the ones coming from a WSDL definition.

_HttpClient throwing exception_

_WSDL Client throwing the same exception_

The csproj is

```xml
<Project Sdk="Microsoft.NET.Sdk.Web">

    <PropertyGroup>
        <TargetFramework>netcoreapp3.0</TargetFramework>
        <DockerDefaultTargetOS>Linux</DockerDefaultTargetOS>
    </PropertyGroup>

    <ItemGroup>
        <PackageReference Include="Grpc.AspNetCore.Server" Version="0.1.19-pre1" />
        <PackageReference Include="Microsoft.AspNet.WebApi.Client" Version="5.2.7" />
        <PackageReference Include="Microsoft.VisualStudio.Azure.Containers.Tools.Targets" Version="1.4.10" />
        <PackageReference Include="System.ServiceModel.Http" Version="4.5.3" />
    </ItemGroup>

</Project>
```

If it helps, we are running the project on Amazon Linux in an EC2 instance with docker. The Dockerfile is

```dockerfile
FROM mcr.microsoft.com/dotnet/core/aspnet:3.0-stretch-slim AS base
WORKDIR /app
EXPOSE 80

FROM mcr.microsoft.com/dotnet/core/sdk:3.0-stretch AS build
WORKDIR /src
COPY ["vtae.myProject.gateway/vtae.myProject.gateway.csproj", "vtae.myProject.gateway/"]
COPY ["vtae.myProject.gateway.proto/vtae.myProject.gateway.proto.csproj", "vtae.myProject.gateway.proto/"]
RUN dotnet restore "vtae.myProject.gateway/vtae.myProject.gateway.csproj"
COPY . .
WORKDIR "/src/vtae.myProject.gateway"
RUN dotnet build "vtae.myProject.gateway.csproj" -c Release -o /app

FROM build AS publish
RUN dotnet publish "vtae.myProject.gateway.csproj" -c Release -o /app

FROM base AS final
WORKDIR /app
COPY --from=publish /app .
ENTRYPOINT ["dotnet", "vtae.myProject.gateway.dll"]
```

There was no increase in CPU usage after the error, but no request succeeded after the first error showed up. Again, this was not happening in 2.1, but it is happening in 2.2 and 3.

All my http clients are Typed, I do not know if this dependency affects something

<PackageReference Include="System.ServiceModel.Http" Version="4.5.3" />

But I am using `response.Content.ReadAsAsync<SomeClass>()` and `_httpClient.PostAsJsonAsync(_serviceUrl, someRequestObject)`.

I would also like to know a way to stop the app from within the app itself, so I can catch the exception and stop the server to let docker restart the container. I do not like the idea of just calling Environment.Exit, but I could not find another way to do it in Core 3.

EDIT

OK, I ended up restarting the app, first adding a static reference to the host in Program.cs (a little dirty, but I guess it is temporary until a fix is found).

```c#
public class Program
{
    // Static reference so other code can stop the host (temporary workaround).
    public static IHost SystemHost { get; private set; }

    public static void Main(string[] args)
    {
        SystemHost = CreateHostBuilder(args).Build();
        SystemHost.Run();
    }

    public static IHostBuilder CreateHostBuilder(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(webBuilder =>
            {
                webBuilder
                    .UseStartup<Startup>()
                    .ConfigureKestrel((context, options) => { options.Limits.MinRequestBodyDataRate = null; });
            });
}
```

Then in my interceptor I catch the exception and check the message with a `Contains`. This is because if the error comes from a simple `HttpClient`, it is thrown as `HttpRequestException`, but if it comes from a WSDL service, it is thrown as `CommunicationException`.

```c#
public async Task<T> ScopedLoggingExceptionWsdlActionService<T>(Func<TService, Task<T>> action)
{
    try
    {
        return await _scopedExecutorService.ScopedActionService(async service => await action(service));
    }
    catch (CommunicationException e)
    {
        await HandleAddressAlreadyInUseBug(e);
        var errorMessage = $"There was a communication error calling the wsdl service in '{typeof(TService)}' action '{action}'";
        _logger.LogError(e, errorMessage);
        throw new RpcException(new Status(StatusCode.Unavailable, errorMessage + ". Error message: " + e.Message));
    }
    catch (Exception e)
    {
        var errorMessage = $"There was an error calling the service '{typeof(TService)}' action '{action}'";
        _logger.LogError(e, errorMessage);
        throw new RpcException(new Status(StatusCode.Unknown, errorMessage + ". Error message: " + e.Message));
    }
}

// TODO: Remove this after https://github.com/dotnet/core/issues/2253 is fixed    
private async Task HandleAddressAlreadyInUseBug(Exception e)
{
    if (string.IsNullOrWhiteSpace(e.Message) || !e.Message.Contains("Address already in use"))
        return;
    var errorMessage = "Hitting bug 'Address already in use', stopping server to force restart. More info at https://github.com/dotnet/core/issues/2253";
    _logger.LogCritical(e, errorMessage);
    await Program.SystemHost.StopAsync();
    throw new RpcException(new Status(StatusCode.ResourceExhausted, errorMessage + ". Error message: " + e.Message));
}
```

antonioortizpola on 14 Apr 2019

🚀3 ❤1

Having the same issue on microsoft/dotnet:2.2-runtime-deps using ElasticSearch NEST 5.6.6. Very annoying issue. We can't go back to 2.1 since we invested a lot of time upgrading from 2.1 to 2.2. Upgrading to the 3.0 Preview is not an option.

+1 to include this fix into next 2.2 release.

sapleu on 15 Apr 2019

👍8

@sapleu do not update to 3 to fix this problem, as https://github.com/dotnet/core/issues/2253#issuecomment-482918706 states, this still happens in Core 3

antonioortizpola on 15 Apr 2019

👍5 ❤1

We just got hit by this as well. It's very rare, but I'm (somewhat) glad to see it's a known issue.

rrudduck on 18 Apr 2019

👍1

Got hit by this issue 2 days ago as well. It doesn't happen very often but as soon as first 'Address already in use' shows up, we can't make any other calls until the system is restarted.

rbrugnollo on 19 Apr 2019

Still waiting for someone who has an environment where it happens with some frequency (aka production repro) and who can try to deploy a private patch out of the 2.1 or 2.2 branch.
Do we have someone like that? Without that this issue is sadly blocked ...

karelz on 20 Apr 2019

> Assumption: Duplicate of dotnet/runtime#27274 which was fixed by dotnet/corefx#32046 - goal: Port it (once confirmed it is truly duplicate).

This assumption is not correct. The fix is for UDP, the issues reported here are for HTTP (which is TCP).

Getting "Address already in use" on a TCP connect is weird. If the local end isn't bound, it should pick a port that is not in use.
You may be running out of port numbers. Running netstat can help you find out what sockets are around and who owns them.

tmds on 20 Apr 2019

👍3

When I run netstat I do not see anything weird; the ports look pretty much the same as with 2.1.

> Still waiting for someone who has an environment where it happens with some frequency (aka production repro) and who can try to deploy a private patch out of the 2.1 or 2.2 branch.
> Do we have someone like that? Without that this issue is sadly blocked ...

@karelz, I already updated to 3 and the problem still exists; the error shows up within 8-12 hours. Is there anything else I can do to help with the problem?

I know that Bing is running on Core 2.1; have you updated to 2.2 yourselves? This problem is becoming really frustrating. I do not understand how a simple project that just calls some HTTP services is causing this issue. It is really causing trust issues in the team: I want to update for security fixes, but now I am not sure whether something internal and hidden is going to be broken in the next release.

antonioortizpola on 20 Apr 2019

👍1

> When I run netstat I do not see anything weird; the ports look pretty much the same as with 2.1.

Did you run this after a few hours? How does it change over time?

tmds on 23 Apr 2019

@tmds Yes, we have a load balancer in AWS, so we put the 2.1 version on one side and the Core 3 version on the other. After around 4-8 hours of running, the server with 3 (or 2.2, I also tried with that) started failing. I ran netstat -a on both servers; there were many connections open, but the output looked much the same as on the 2.1 server, which was still working with no problems.

If it is really necessary I can do the test again and send some screen caps. Sadly this won't be easy, since we have already ported the project to .NET Core 3 and much of the new code is not in the other versions.

Netstat in a server with core 2.1

Netstat in a server with core 3

This was captured with both servers working (there was no error at capture time). I will remove my workaround that restarts the server and take a capture while the error is happening, in case I missed something, because some days ago I did that test and the outputs looked the same.

Also, I tried running netstat again and again, but I did not catch anything weird. I must admit I do not know if I am using the netstat command right, so if I am missing something please tell me so I can try again.

antonioortizpola on 23 Apr 2019

I'd run netstat -at to show all tcp connections.
You should run it once at the start, and then once when your application has been running for a couple of hours.
The netstat output you provided doesn't have any http connections. So I guess you made this at the start.

You can see the local port range that your system is choosing from like this:

$ cat /proc/sys/net/ipv4/ip_local_port_range

tmds on 24 Apr 2019
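
For anyone who wants to check the range from inside the app instead of a shell, here is a minimal sketch that reads the same file and computes how many local ports are available. `EphemeralPorts` is a hypothetical helper, and the code assumes the file contains two whitespace-separated integers (first and last port, inclusive).

```csharp
using System;
using System.Globalization;
using System.IO;

// Hypothetical helper (not from the thread).
public static class EphemeralPorts
{
    // Parses "low<whitespace>high", the layout of
    // /proc/sys/net/ipv4/ip_local_port_range, and returns the port count.
    public static int CountFromRangeText(string text)
    {
        var parts = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length != 2)
        {
            throw new FormatException("Expected two whitespace-separated port numbers.");
        }

        int low = int.Parse(parts[0], CultureInfo.InvariantCulture);
        int high = int.Parse(parts[1], CultureInfo.InvariantCulture);
        return high - low + 1; // both ends are usable
    }

    // Linux only: reads the live value from procfs.
    public static int CountFromSystem() =>
        CountFromRangeText(File.ReadAllText("/proc/sys/net/ipv4/ip_local_port_range"));
}
```

Comparing this number against the count of open outbound connections over time gives a rough idea of how close the process is to running out of local ports.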

We're also seeing this issue in 2.2 on Ubuntu 18.04 VMs. Netstat seems to show a very large number of connections (outbound HTTPS) in CLOSE_WAIT. Restarting the app fixes the issue, but the connections start climbing again.

It takes several days for us to see the issue, so I haven't yet been able to observe the number of connections when we hit the error, but I would assume we're hitting the limit of ~31k and that's what's causing it.

We're seeing it in two different apps which make very different outbound HTTP connections to different endpoints.

robjwalker on 24 Apr 2019
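
A minimal sketch for tracking the CLOSE_WAIT count over time without running netstat by hand. It assumes Linux and the `/proc/net/tcp` text layout, where the fourth whitespace-separated field is the hex connection state and `08` means CLOSE_WAIT; `ProcNetTcp` is a hypothetical helper, and IPv6 sockets (listed in `/proc/net/tcp6`) are not covered.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper (not from the thread).
public static class ProcNetTcp
{
    private const string CloseWaitState = "08";

    // Counts rows whose state column is CLOSE_WAIT. The first line is the
    // column header and is skipped.
    public static int CountCloseWait(IEnumerable<string> lines)
    {
        int count = 0;
        foreach (var line in lines.Skip(1))
        {
            var fields = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            if (fields.Length > 3 && fields[3] == CloseWaitState)
            {
                count++;
            }
        }

        return count;
    }

    // Linux only: reads the live IPv4 TCP table from procfs.
    public static int CountCloseWaitOnThisHost() =>
        CountCloseWait(File.ReadLines("/proc/net/tcp"));
}
```

Logging this value every few minutes would show whether the count climbs steadily, as described above, or stays flat.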

> We're also seeing this issue in 2.2 on Ubuntu 18.04 VMs. Netstat seems to show a very large number of connections (outbound HTTPS) in CLOSE_WAIT.

@karelz @davidsh @stephentoub @wfurt what may be the issue: the HTTP server closes the TCP connection, but that doesn't result in a close of the socket used by HttpClient. Over a long time, these unclosed sockets cause you to run out of local ports.

tmds on 24 Apr 2019

> what may be the issue: the HTTP server closes the TCP connection, but that doesn't result in a close of the socket used by HttpClient. Over a long time, these unclosed sockets cause you to run out of local ports.

In theory that could be contributing to the issue if all of the connections were to different hosts. It's much less likely to be the issue if the number of hosts being targeted is limited; in that case, when the client goes back to the connection pool to grab a connection, it'll see that the connection has been closed by the server and properly dispose of it before creating a new connection. Further, the pool also has a timer that fires every X seconds to clean out such connections, so they shouldn't be building up in the pool.

stephentoub on 24 Apr 2019

@robjwalker how many CLOSE_WAIT connections do you see for the same host? If you watch netstat over a short period of time (e.g. 2 minutes) do you see CLOSE_WAIT connections change state to something else?

tmds on 24 Apr 2019

> what may be the issue: the HTTP server closes the TCP connection, but that doesn't result in a close of the socket used by HttpClient.

A proper HTTP server will always send "Connection: close" just before it closes the TCP connection. That will alert clients (browsers or HttpClient) that they should also close their side of the TCP connection.

If a server doesn't do that, then a client doesn't know that the socket was closed on the other side unless it tries to do a send() or receive() on the socket.

HTTP stacks like SocketsHttpHandler will test a potentially idle connection (which might have been closed by the server) by testing the socket before declaring that the connection is usable. If not usable, then the socket will be closed by the client. SocketsHttpHandler will also close connections on its own without testing if its "idle timeout" has expired.

davidsh on 24 Apr 2019

> @robjwalker how many CLOSE_WAIT connections do you see for the same host? If you watch netstat over a short period of time (e.g. 2 minutes) do you see CLOSE_WAIT connections change state to something else?

Over the course of 24 hours, we saw it build to approximately 14,000 CLOSE_WAIT states. They don't seem to ever change once in that state. A different app seems to generate about 3500 CLOSE_WAIT states in the same time period. Probably because it is connecting outbound less. In both cases all connections from each app are to one IP, but the two apps are connecting to different IPs (if that makes sense.) One is an endpoint under our control, the other is Google Pub/Sub.

Our dev team is looking into it; they are wondering if we are "creating multiple clients" and/or mismanaging HttpClient. (I'm just quoting them at this point. I'm an Ops engineer, not a developer.)

robjwalker on 24 Apr 2019

> A proper HTTP server will always send "Connection: close" just before it closes the TCP connection.

Load balancers in between will just close the connection when they want.

> If a server doesn't do that, then a client doesn't know that the socket was closed on the other side unless it tries to do a send() or receive() on the socket.

You could poll (that is: use poll/epoll/...) to get notified that the peer closes the connection (timeout or active checking is also fine).

@stephentoub @davidsh the observations from @robjwalker seem to indicate that the expected socket close (when re-using connection, on timeout) is not taking place.

tmds on 24 Apr 2019

This is where TCP keep-alive helps. On client side, idle or maximum timeout should kick in as well.

wfurt on 24 Apr 2019

@tmds Ok, yes, i can confirm, it is not fixed in core 3

https://gist.github.com/antonioortizpola/78f4a57170841fb221b117fcb7a5ec45

For us it takes around 4 hours to run out of sockets. The workaround of catching the exception and restarting the app has mitigated the problem somewhat, but we lose some requests when this happens.

BTW, I tested on .NET Core 3 preview 3 and 4, with both stretch-slim and alpine, and the result is the same.

antonioortizpola on 24 Apr 2019

Just a quick update, our development team have fixed one of our apps that was suffering from this issue. I'm afraid I don't have a huge amount of detail, just that they found a place in our code where "HttpClient wasn't being shared".

Sorry I don't have more details. I'm not sure if this means we're not suffering from the same bug as others, or that we've just worked around it.

robjwalker on 25 Apr 2019

@robjwalker let us know how netstat looks with the new version after a few hours.

tmds on 25 Apr 2019

It's been running around 24 hours now, and netstat is very clean. Only one connection in CLOSE_WAIT which appears not to be related.

robjwalker on 25 Apr 2019

So, to sum it up - bunch of folks confirmed that the 3.0 fix we made does NOT help their scenarios.
At least one case clarified it is actually application issue.

I will close this issue (as its original intent to port a fix to 2.1 is not reasonable at this point).
I'd like to ask whoever is willing to dig deeper to file a new issue against 3.0 with some details and be prepared for back-and-forth on investigation. A repro or something would be really lovely. Verification of HttpClient reuse should happen prior to filing such issue.

Let me know if I missed anything.

karelz on 25 Apr 2019

@robjwalker It would be good to know how you are using the HttpClients, since we are using typed HttpClients for our REST endpoints.

However, our WSDL services are being used like this:

```c#
public async Task<BalanceQueryResponse> GetBalance(BscsServiceRequest balanceQueryRequest)
{
    // A new WSDL client is created on every call.
    var bscsClient = new InterfaceBSCSClient(
        InterfaceBSCSClient.EndpointConfiguration.InterfaceBSCSPort, _bscsConfig.BscsEndpoint);
    var timedWsdlRequestWithLog = new TimedWsdlRequestWithLog(_logger, ServiceName);

    var response = await timedWsdlRequestWithLog.ExecuteAndLogRequestDuration(
        bscsClient.Endpoint, async () =>
            await bscsClient.balanceQueryAsync(balanceQueryRequest.SService, _bscsConfig.SAccount)
    );
    return new BalanceQueryResponse() { Result = response.@return };
}
```

This service is registered as Transient, but I do not know if I should wrap the WSDL client in a using block, since it implements IDisposable. All the examples use the client without a using block, and I do not know whether disposing it could affect the internals of the client; reading from this comment, I should not be using `using` for an HTTP request.

We did not think much of this because with 2.1 we never had any issue of this kind, but maybe with the update our bad practices started to cause problems.

@karelz any comments on this? Or should I create a new issue to get clarity on that? Also, it would be good to know what changed from 2.1 to 2.2 that caused this issue, to know more about what to avoid.

antonioortizpola on 25 Apr 2019

@antonioortizpola I'm having the same issues as you do.. 2.1 works but 2.2 doesn't. I agree with you that I may be using bad practices that didn't cause big issues like this. But I don't know what those are.

@karelz it's not clear to me what you mean by "HttpClient reuse". What do you mean exactly by reusing a HttpClient, that I can't make 2 or more calls using the same instance?

rbrugnollo on 26 Apr 2019

@rbrugnollo according to the docs, you should not be using HttpClient directly; you should be using an IHttpClientFactory or one of the other client types (named, typed or generated).

Our team is using typed clients for the REST requests, so there should not be a problem. However, thinking about it more, with the WSDL clients we do not have direct access to the HttpClient. I do not know if that could be related to the socket exhaustion problem, in which case I would not know how to make a fix or workaround, unless I drop all my WSDL clients and use direct requests, but that is too much work and would basically mean dropping support for WSDL clients.

antonioortizpola on 26 Apr 2019

👍1

> according to the docs, you should not be using HttpClient directly

It's fine to use HttpClient directly. HttpClientFactory layers on top of that to provide additional management. When you do use HttpClient directly, you should reuse instances as much as possible.

stephentoub on 26 Apr 2019

@stephentoub I do not know how to define "fine", or how much is "as much as possible". On the other hand, no, you should not simply hold the client as long as possible, since it will not respect DNS changes.

Again, from the docs

> ...But there’s a second issue with HttpClient that you can have when you use it as singleton or static object. In this case, a singleton or static HttpClient doesn't respect DNS changes

This is a real problem when you have infrastructure on the cloud. It is not as simple as "hold on to your client"; that is exactly what HttpClientFactory is trying to mitigate.

antonioortizpola on 26 Apr 2019

> In this case, a singleton or static HttpClient doesn't respect DNS changes

> This is a real problem when you have infrastructure on the cloud

That information is out-of-date and no longer accurate.

stephentoub on 26 Apr 2019

@stephentoub well, if the docs are wrong then I am lost.

The last update is from 01/06/2019. Should I ask for an update? Could you please tell us exactly what is wrong so they can make the updates?

Also, if the best solution is to keep the HttpClient for as long as you can, wouldn't it be better to just use it as a singleton? That would render IHttpClientFactory pretty much useless; it would be just a fancy name for a singleton.

It would also be great to make that clear in the Core documentation: that you can use HttpClient as a singleton should be listed as an option in the "Making Http requests" part, since it still states:

> Manages the pooling and lifetime of underlying HttpClientMessageHandler instances to avoid common DNS problems that occur when manually managing HttpClient lifetimes

antonioortizpola on 29 Apr 2019

❤1

> could you please tell us exactly what is wrong so they can make the updates?

SocketsHttpHandler, which is the default handler implementation backing the HttpClient starting in .NET Core 2.1, has two properties on it: PooledConnectionIdleTimeout (https://docs.microsoft.com/en-us/dotnet/api/system.net.http.socketshttphandler.pooledconnectionidletimeout?view=netcore-2.2) and PooledConnectionLifetime (https://docs.microsoft.com/en-us/dotnet/api/system.net.http.socketshttphandler.pooledconnectionlifetime?view=netcore-2.2). The latter governs how long a connection is allowed to be reused. It defaults to Infinite, which means it won't be proactively torn down by the client for reasons of how long it's been around. But if you set it to something shorter, like 5 minutes, that tells the handler that once the connection has been around for 5 minutes, it shouldn't be reused again and the handler will prevent further requests from using it and will tear it down. Any subsequent requests would be forced to get a new connection, which will again consult DNS to determine to where the connection should be opened.

stephentoub on 29 Apr 2019

👍1
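
A minimal sketch of the configuration described above: one long-lived `HttpClient` backed by a `SocketsHttpHandler` whose pooled connections are retired after 5 minutes. The `HttpClientSetup` name and the idle timeout value are assumptions made for illustration; only the two property names and the 5-minute lifetime example come from the comment.

```csharp
using System;
using System.Net.Http;

// Hypothetical helper (not from the thread).
public static class HttpClientSetup
{
    public static SocketsHttpHandler CreateHandler()
    {
        return new SocketsHttpHandler
        {
            // After 5 minutes a connection is no longer reused, so the next
            // request opens a new connection and resolves DNS again.
            PooledConnectionLifetime = TimeSpan.FromMinutes(5),

            // Arbitrary value for the sketch: close connections that have
            // been sitting idle in the pool for this long.
            PooledConnectionIdleTimeout = TimeSpan.FromMinutes(2)
        };
    }

    // Intended to be called once and the result shared for the app lifetime.
    public static HttpClient CreateClient() => new HttpClient(CreateHandler());
}
```

With this in place the client can be held as a singleton while still picking up DNS changes within the configured lifetime.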

> SocketsHttpHandler, which is the default handler implementation backing the HttpClient starting in .NET Core 2.1, has two properties on it

So, nothing was changed from 2.1 to 2.2 that would explain the errors showing up on 2.2 but not 2.1 right?!

rbrugnollo on 29 Apr 2019

> So, nothing was changed from 2.1 to 2.2 that would explain the errors showing up on 2.2 but not 2.1 right?!

Correct. We did just handful of targeted servicing-level bug fixes in 2.2/2.2.x over 2.1.x.
It is quite possible that the "regression" is caused by another component, or we're just "lucky" to hit it on 2.2 due to "random" reasons.
Just to confirm: Did anyone hit it in 2.1 at all?

karelz on 29 Apr 2019

> Just to confirm: Did anyone hit it in 2.1 at all?

Not for us, I can confirm, no problems in 3 months with 2.1, the problem started the day we switched to 2.2.

I am in the process of making my WSDL clients singletons. I hope to finish by next week; that way I can confirm whether it is a problem with HttpClient alone or whether WCF is doing something wrong.

antonioortizpola on 30 Apr 2019

👍1

> Just to confirm: Did anyone hit it in 2.1 at all?

App working for 6 months on 2.1 without any issues, now happening on 2.2.

I'm trying to filter exactly which call is throwing the error, so I can isolate and run more tests.

rbrugnollo on 30 Apr 2019

👍1

Based on the replies here, I don't think it will be simple to create a repro (although it would be most helpful).
I would recommend to get any repro environment where we can experiment - collect more data, try private builds, etc.
If anyone has such environment (incl. production) where they can experiment and work closely with us, please let me know and let's dig deeper into it ...

karelz on 2 May 2019

👍1

I have set up a number of ASP.NET Core 3.0 trial projects and applied Kubernetes orchestration support. In every single case, the first time I run a debug session everything works fine. Then when I stop the session and start it up again I get this error. The only way to get around it is to close VS2019 (not the preview version) and restart it for the next debugging session.

This does not happen with ASP.NET Core 2.2.

A screen grab of the issue:

simonziegler on 15 May 2019

@simonziegler that problem is not related to this issue; please feel free to open a new issue so it can be discussed appropriately.

Also, it seems an odd error, as if the debugger is not shutting down and releasing the port. I would try installing the latest versions of VS and .NET Core 3 to be sure, and I would use the "report problem" option shipped with Visual Studio instead of this repo, which is for code-related issues only.

antonioortizpola on 15 May 2019

System.Net.Http.HttpRequestException: Address already in use ---> System.Net.Sockets.SocketException: Address already in use at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)

This exception happens in both .NET Core 2.1 (SDK 2.1.505, runtime 2.1.9) and .NET Core 2.2 (SDK 2.2.105, runtime 2.2.3) in a k8s environment (v1.6.7/v1.9.7) after running for a few days, and it never happens in .NET Core 2.0. I'll refactor each call that instantiates HttpClient to use HttpClientFactory to try to resolve this problem, although the exception may still happen based on the previous replies.

OpenSourceAries on 22 May 2019

@LukePulverenti did you validate your problem is indeed the same root cause and is fixed in .NET Core 3.0? (and that it is not just same symptom)
There seems to be enough +1s to justify backport, we just need to be sure it is the right fix ... first step would be to validate on 3.0. Then we can cherry pick and ask for private validation on 2.2/2.1 build.

For us, this is the one that we need:
https://github.com/dotnet/corefx/pull/32046/files

LukePulverenti on 22 May 2019

@LukePulverenti did you confirm that particular change helps your case? Or did you use latest .NET Core 3.0 to validate that?
@antonioortizpola above mentioned that the change (in .NET Core 3.0) does not help their scenario at all: https://github.com/dotnet/corefx/issues/37044#issuecomment-486335084

karelz on 23 May 2019

Reopening to track solution at least in 3.0

karelz on 23 May 2019

We still need someone to help us track this down:
Does anyone have an environment where it happens on a somewhat regular basis, where we could work with you to collect more logs and experiment? It would be a great help. Thanks!

karelz on 23 May 2019

It seems like we are mixing multiple issues here. Part of the discussion is about UDP and part about HttpClient.

wfurt on 23 May 2019

👍2

My problem is with HttpClient, my project has two ways to use it:

- Directly for our RESTful endpoints
- Indirectly using WCF for SOAP services

It is the only thing that my project does and we are hitting the issue as this comment states.

Sadly, due to time pressure we just set up a workaround that restarts the program each time this happens, and have been working on other things. But if it helps, I could split my calls into two projects so I can determine whether the problem comes from the SOAP services or our RESTful services.

antonioortizpola on 23 May 2019

OK, for HTTP: the error was really puzzling to me. Even though the man page for connect() mentions EADDRINUSE, I could not find it while looking at the Linux kernel sources.
The only place I could find it is in bind(), and we don't call that in HttpClient. It turned out we actually do, in Socket.ConnectAsync():

``` c#
if (_rightEndPoint == null)
{
    if (endPointSnapshot.AddressFamily == AddressFamily.InterNetwork)
    {
        InternalBind(new IPEndPoint(IPAddress.Any, 0));
    }
    else if (endPointSnapshot.AddressFamily != AddressFamily.Unix)
    {
        InternalBind(new IPEndPoint(IPAddress.IPv6Any, 0));
    }
}
```

and then https://github.com/torvalds/linux/blob/master/net/ipv4/af_inet.c#L526-L532

```c
if (snum || !(inet->bind_address_no_port ||
              force_bind_address_no_port)) {
        if (sk->sk_prot->get_port(sk, snum)) {
                inet->inet_saddr = inet->inet_rcv_saddr = 0;
                err = -EADDRINUSE;
                goto out_release_sock;
        }
```

So this error can pop up if system runs out of port numbers.

This can also happen if IPAddress.IPv6Any is not available but we try to connect on an IPv6/dual-mode socket. But that should be pretty deterministic, and I would not expect it to fail only sometimes (or be fixed by a restart).

I would suggest to check that and for example follow https://www.cyberciti.biz/tips/linux-increase-outgoing-network-sockets-range.html

If anybody can give it a try, you can run:

strace -f -o trace.txt -s 200 -t -e trace=connect,bind ./myCoolApp

I know the message is confusing, but the root cause may be the system running out of resources.
Also note that it may be worth checking the process limits for file descriptors and buffers.

It may be worth monitoring /proc/<PID>/fd to see if the descriptor count is going up.
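
The descriptor and port-range checks above can also be done from inside the process. What follows is only a minimal sketch, assuming Linux with /proc mounted; `ResourceProbe` and its method names are hypothetical and not part of any .NET API:

```c#
using System;
using System.IO;
using System.Linq;

// Hypothetical helper, for illustration only; it relies on Linux /proc paths.
public static class ResourceProbe
{
    // Counts the entries in /proc/<pid>/fd ("self" means the current process).
    public static int CountOpenDescriptors(string pid = "self")
    {
        return Directory.EnumerateFileSystemEntries($"/proc/{pid}/fd").Count();
    }

    // Parses the content of /proc/sys/net/ipv4/ip_local_port_range,
    // which holds two whitespace-separated numbers (lowest and highest port).
    public static (int Low, int High) ParsePortRange(string content)
    {
        // A null separator array makes Split break on any whitespace.
        string[] parts = content.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return (int.Parse(parts[0]), int.Parse(parts[1]));
    }

    // Number of ephemeral ports available for outgoing connections.
    public static int EphemeralPortCount(string content)
    {
        var (low, high) = ParsePortRange(content);
        return high - low + 1;
    }
}
```

Logging `ResourceProbe.CountOpenDescriptors()` periodically would show whether the count keeps climbing, and `EphemeralPortCount(File.ReadAllText("/proc/sys/net/ipv4/ip_local_port_range"))` gives the number of local ports the count can be compared against.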

wfurt on 24 May 2019

> So this error can pop up if system runs out of port numbers.

This matches with my earlier comment https://github.com/dotnet/corefx/issues/37044#issuecomment-485055490

So either the system is running out of ports due to a limited port range, or HttpClient is leaking sockets (or keeping them open too long).

tmds on 24 May 2019

Repro Scenario for UDP Bug

@karelz - you wrote:

> @LukePulverenti did you validate your problem is indeed the same root cause and is fixed in .NET Core 3.0? (and that it is not just same symptom)
> There seems to be enough +1s to justify backport, we just need to be sure it is the right fix ... first step would be to validate on 3.0. Then we can cherry pick and ask for private validation on 2.2/2.1 build.

and

> We still need someone to help us track this down:
> Does anyone have an environment where it happens on a somewhat regular basis, where we could work with you to collect more logs and experiment? It would be a great help. Thanks!

Following up your chat with @LukePulverenti about backporting the fix to 2.2, I have created a reproduction scenario for you: https://github.com/softworkz/ReuseBug

The solution contains a native Linux app and a netcore console app, multi-targeting netcore 2.0, 2.2 and 3.0

This demonstrates:

- works in 2.0
- fails in 2.2
- works again in 3.0

I hope this helps getting the fix backported to 2.2...

softworkz on 24 May 2019

@softworkz thank you !

@karelz Yes, it would be great to get this back-ported, because ever since the 2.1 release we've had to tell users to shut down all other UPnP or DLNA software on the machine in order to prevent this from happening.

LukePulverenti on 24 May 2019

👍2

BTW Here is nice reading: https://idea.popcount.org/2014-04-03-bind-before-connect/

wfurt on 24 May 2019

@softworkz @LukePulverenti I think we may be dealing with multiple problems here as some people on this thread said that 3.0 does not fix it for them.
Either way, we have a repro now, so let's try it -- @tmds or @wfurt will you have time to try it out and reproduce? If we can reproduce in-house, it should be easier for us to track it down. I'd be also interested in the repro result on 2.1.

Thanks @softworkz for repro!!! That is a HUGE step towards root cause and solution. Let's hope we can reproduce it too :)

karelz on 24 May 2019

@karelz - Yes, ours is about the UDP bug https://github.com/dotnet/corefx/issues/32027 which was correctly fixed for 3.0 by PR https://github.com/dotnet/corefx/pull/32046 and we're hoping to get it backported to 2.2. It's getting a bit embarrassing having to tell users that our software cannot coexist with other DLNA software, especially once they've found out that any other two (non-netcore) applications can do that.. ;-)
(even worse is that it has been working in a previous version with netcore 2.0)

Regarding 2.1: It fails with 2.1 as well. I've just updated the repro solution (https://github.com/softworkz/ReuseBug) by adding 2.1 as additional target framework and publishing target.

softworkz on 24 May 2019

> Thanks @softworkz for repro!!! That is a HUGE step towards root cause and solution. Let's hope we can reproduce it too :)

@karelz @softworkz is talking about a UDP issue https://github.com/dotnet/corefx/issues/32027 which was decided not to be backported: https://github.com/dotnet/corefx/issues/32027#issuecomment-418447086.

The main issue reported here is a TCP issue observed when using HttpClient.

tmds on 25 May 2019

👍1

> @karelz @softworkz is talking about a UDP issue dotnet/runtime#27274 which was decided not to be backported: #32027 (comment).

And still we're asking for it. It's a bug - not a "corner case".

> The main issue reported here is a TCP issue observed when using HttpClient.

Not quite. We're not the only ones referring to the UDP bug here.

softworkz on 25 May 2019

> Not quite. We're not the only ones referring to the UDP bug here.

Yes, this is causing confusion, so it's good to make clear the difference. The issue reported here is for HttpClient/TCP, and it was assumed the UDP fix would solve it, which is not the case.

tmds on 25 May 2019

👍1

Agreed, this issue is already pretty confusing even without mixing it up with UdpClient. Let's keep this issue specific to HttpClient/TCP (I will update the title).
Let's move the discussion about backporting dotnet/runtime#27274 into a separate issue please (we can reuse dotnet/runtime#27274 for now) -- BTW: I would be interested in confirmation of what exactly the symptoms of the UdpClient problem are - in that issue, not here please. Note: So far I believe we have 2 customers hitting it.

karelz on 25 May 2019

@karelz - Can you move posts? Or should I repeat my information in the other issue?

softworkz on 25 May 2019

@softworkz posts cannot be moved, please copy relevant information over. Thank you!

BTW: I hid the UdpClient comments from the thread to avoid further confusion.

karelz on 25 May 2019

Just to reiterate:
We still need someone to help us track the HttpClient problem down:
Does anyone have an environment where it happens on a somewhat regular basis, where we could work with you to collect more logs and experiment? It would be a great help that would unblock us. Thanks!
See instructions from @wfurt above: https://github.com/dotnet/corefx/issues/37044#issuecomment-495425689

karelz on 25 May 2019

@yuezengms @yanrez @arsenhovhannisyan @antoinne85 @blurhkh @EvilBeaver @antonioortizpola @LukePulverenti @sapleu @rrudduck @rbrugnollo @robjwalker @OpenSourceAries I'd like to ask you for 2 favors:

1. Can you please confirm if your repro is truly on HttpClient/TCP and NOT UdpClient? (please confirm you're on HttpClient/TCP by upvoting this reply)
2. Is any one of you in a position to collect additional logs and work with us to root-cause this problem? We would love to address it, but we have nothing actionable at this moment without help from someone who can hit the problem and can collect additional info. Thanks!

karelz on 25 May 2019

👍5

We hit this issue on a .net core 2.2 application running on Azure Linux Kubernetes. We tried using .net core 3 and while this improved some of the issues we still consistently hit this error. Our investigations found that the HttpClient wasn't releasing ports despite the clients being disposed. Though when running in windows the client was releasing the ports. We updated our dependency injection to use the IHttpClientFactory and used this to create the HttpClients which fixed our issues.

dmiller02 on 31 May 2019

> We tried using .net core 3 and while this improved some of the issues we still consistently hit this error. Our investigations found that the HttpClient wasn't releasing ports despite the clients being disposed.

Got a repro you can share?

stephentoub on 31 May 2019

> We tried using .net core 3 and while this improved some of the issues we still consistently hit this error. Our investigations found that the HttpClient wasn't releasing ports despite the clients being disposed.

> Got a repro you can share?

Sorry, unfortunately not.

These are the pages we found that helped us:
https://docs.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-2.2
https://docs.microsoft.com/en-us/dotnet/standard/microservices-architecture/implement-resilient-applications/use-httpclientfactory-to-implement-resilient-http-requests

dmiller02 on 31 May 2019

@dmiller02 are you in position to get back into bad state and help us collect some logs?

karelz on 31 May 2019

> @dmiller02 are you in position to get back into bad state and help us collect some logs?

Shouldn't be too difficult. We have the images for the service in question and can re-create the error.
What logging would you need?

dmiller02 on 31 May 2019

❤1

@wfurt @stephentoub what kind of logs may help us confirm what is going on, on Linux?

karelz on 31 May 2019

To begin with, can we collect the output of netstat -natu when it happens, and then run sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535" to see if that improves the situation?

wfurt on 31 May 2019

I have run netstat -natu in our server and it looks like the error is with the WSDL endpoints (the address 172.17.72.150 is a WSDL service).

To work with the WSDL we are adding the service as Transient:

```c#
public static IServiceCollection AddExternalServices(
    this IServiceCollection serviceCollection,
    IConfiguration configuration)
{
    serviceCollection.AddGrpc();

    // Transient: a new instance is created every time the service is resolved.
    serviceCollection.AddTransient<CenamWsAddClaroMicroCredit>();
    return serviceCollection;
}
```


And then we use it in the service like this

```c#
public async Task<AddClaroMicroCreditResponse> AddClaroMicroCredit(RequestInfo requestInfo)
{
    // A new WSDL client is created for every call.
    var wsdlServiceClient = new WsdlServiceClient(
        CenamOperationClient.EndpointConfiguration.CenamOperationPort, _cenamOpConfig.CreditsEndpoint);
    var timedWsdlRequestWithLog = new TimedWsdlRequestWithLog(_logger, ServiceName);

    await _cenamWsThrottling.WaitToActionAndIncrement();
    var response = await timedWsdlRequestWithLog.ExecuteLogDurationAndReturnRequest(
        wsdlServiceClient.Endpoint, async () =>
            await wsdlServiceClient.addClaroMicroCreditAsync(
                _cenamOpConfig.Entity,
                requestInfo.Data1,
                requestInfo.Data2,
                requestInfo.Data3)
    );
    // work with the result...
}
```

The method to log is just a function to log the request and response with the service name

```c#
public async Task<RequestRawAndResult<T>> ExecuteLogDurationAndReturnRequest<T>(
    ServiceEndpoint serviceEndpoint, Func<Task<T>> action)
{
    // Attach a behavior that captures the raw request/response XML and the timing.
    var requestLogAndTime = new RequestLogAndTimeEndpointBehavior();
    serviceEndpoint.EndpointBehaviors.Add(requestLogAndTime);

    T response;
    try
    {
        response = await action();
    }
    catch (Exception e)
    {
        _logger.LogError(e,
            "Unexpected error requesting to wsdl service '{serviceName}' in {responseTime}ms to '{serviceEndpointAddress}' with body '{request}' responded: {response}",
            _serviceName,
            requestLogAndTime.LastResponseTimeInMillis,
            serviceEndpoint.Address,
            requestLogAndTime.LastRequestXml,
            requestLogAndTime.LastResponseXml);
        throw;
    }

    _logger.LogInformation(
        "Request to service '{serviceName}' in {responseTime}ms to '{serviceEndpointAddress}' with body '{request}' responded: {response}",
        _serviceName,
        requestLogAndTime.LastResponseTimeInMillis,
        serviceEndpoint.Address,
        requestLogAndTime.LastRequestXml,
        requestLogAndTime.LastResponseXml);

    return new RequestRawAndResult<T>(response, requestLogAndTime.LastRequestXml, requestLogAndTime.LastResponseXml);
}
```

Maybe something inside the WSDL client is causing the issue (like creating the HttpClient by itself), but then how could I work around this? (This approach had no problems at all with Core 2.1.)

antonioortizpola on 31 May 2019

From the log:

tcp        1      0 172.20.0.2:38736        172.17.72.150:28085     CLOSE_WAIT

That means the server or client did not finish closing the socket.
You should also see this with lsof -i -n -p <PID>
Can you please do packet capture for few requests @antonioortizpola ? I'm wondering if this happens on every request and it just takes some time to use all port numbers.

I'm not familiar with the WSDL code. Can you craft a runnable repro, just like the one we got for the UDP case? I think this also depends on the server not closing the socket, so we may not be able to reproduce it with only the client side.

We do not need to hit the bind error. All we need to reproduce is getting new socket stuck in CLOSE_WAIT state.
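
To watch for sockets piling up in CLOSE_WAIT without shelling out to netstat, the same information can be read from /proc/net/tcp. The following is a minimal sketch, assuming Linux; `TcpStateCounter` is a hypothetical name, and only a few of the kernel's state codes are mapped to names (unmapped codes are reported as their raw hex value):

```c#
using System;
using System.Collections.Generic;

// Hypothetical helper: tallies TCP socket states from /proc/net/tcp content.
public static class TcpStateCounter
{
    // Kernel state codes as printed in the "st" column (hexadecimal).
    private static readonly Dictionary<string, string> StateNames =
        new Dictionary<string, string>
        {
            ["01"] = "ESTABLISHED",
            ["06"] = "TIME_WAIT",
            ["08"] = "CLOSE_WAIT",
            ["0A"] = "LISTEN",
        };

    public static Dictionary<string, int> Count(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>();
        foreach (string line in lines)
        {
            string[] fields = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

            // Data rows start with "<n>:"; the header row starts with "sl" and is skipped.
            if (fields.Length < 4 || !fields[0].EndsWith(":", StringComparison.Ordinal))
            {
                continue;
            }

            string code = fields[3].ToUpperInvariant();
            string name = StateNames.TryGetValue(code, out string known) ? known : code;
            counts[name] = counts.TryGetValue(name, out int n) ? n + 1 : 1;
        }
        return counts;
    }
}
```

A call such as `TcpStateCounter.Count(File.ReadLines("/proc/net/tcp"))` (with `using System.IO;`) covers IPv4 sockets; IPv6 sockets are listed separately in /proc/net/tcp6.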

wfurt on 31 May 2019

@wfurt sure, I can work in my project by removing all code, just leaving one WSDL client.

I will try to work on this over the weekend (right now I am in the office and have some other tasks), creating a simple solution and making some packet captures with Wireshark; then I will let you know what I can find.

antonioortizpola on 31 May 2019

Thanks @antonioortizpola. It would be nice to get to the bottom of this. Seeing half-closed TCP connections is certainly a clue.

wfurt on 31 May 2019

👍1

BTW, any chance @antonioortizpola that you can share a core dump from a time when it is failing? (It does not need to reach port exhaustion - we just need a few sockets in the half-closed state.)
It will be large and it may contain secrets or private data. But if we can work that out, I think we would be able to sort this out.
If that is not possible, we may be able to script the dump file processing, or I can guide you through the steps to get useful info out.

wfurt on 4 Jun 2019

👍1

Ok, I have the example!!! @wfurt, @karelz, I will make a readme, but the results are clear.

I created two projects, one with Core 3 and one with Core 2.1. It should be virtually the same code, but after some stress tests, the version with Core 3 does not release the ports:

....
tcp        1      0 172.18.0.3:39787        172.18.0.4:80           CLOSE_WAIT
tcp        1      0 172.18.0.3:34693        172.18.0.4:80           CLOSE_WAIT
tcp        1      0 172.18.0.3:46675        172.18.0.4:80           CLOSE_WAIT
tcp        1      0 172.18.0.3:38431        172.18.0.4:80           CLOSE_WAIT
tcp        1      0 172.18.0.3:45375        172.18.0.4:80           CLOSE_WAIT
tcp        1      0 172.18.0.3:46011        172.18.0.4:80           CLOSE_WAIT
tcp6       0      0 :::80                   :::*                    LISTEN
udp        0      0 127.0.0.11:49014        0.0.0.0:*

While the 2.1 version does:

root@1b70aaaed9f2:/app# netstat -natu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.11:33615        0.0.0.0:*               LISTEN
tcp        0      0 172.18.0.2:35071        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:43885        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:43295        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:33653        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:45709        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:37075        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:36519        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:34009        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:35771        172.18.0.4:80           TIME_WAIT
tcp        0      0 172.18.0.2:46763        172.18.0.4:80           TIME_WAIT
udp        0      0 127.0.0.11:55536        0.0.0.0:*
root@1b70aaaed9f2:/app# netstat -natu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:50051           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.11:33615        0.0.0.0:*               LISTEN
udp        0      0 127.0.0.11:55536        0.0.0.0:*
root@1b70aaaed9f2:/app#

It is a little late now, but tomorrow I can upload the project and give you access so you can run it yourselves

antonioortizpola on 7 Jun 2019

Ok, I have the repo, I invited @karelz and @wfurt, I hope this can help with something, please let me know if I can help with something else.

antonioortizpola on 7 Jun 2019

❤1 👍1

thanks @antonioortizpola, I will take a look. Are you suggesting 3.0 fixes the problem?

wfurt on 7 Jun 2019

👍1

@wfurt nooo, the problem did not happen in core 2.1, but it is happening in 2.2 and 3, and you are welcome, again, if I can help in something else just let me know.

You can simply try to update the app in 2.1 to 2.2 and the problem will appear.

antonioortizpola on 7 Jun 2019

This is AWESOME! Thanks a lot @antonioortizpola, fingers crossed that we will be now able to quickly root-cause it and fix it in 3.0/2.2! 🙏

karelz on 7 Jun 2019

👍5

@wfurt, were you able to reproduce the scenario? just to know if the repo code worked for you, or if there is something more that I can help with

antonioortizpola on 10 Jun 2019

I'm still digging through it @antonioortizpola.
I got services up and I could see

root@9262db4be2d7:/app# netstat -natu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.11:42827        0.0.0.0:*               LISTEN
tcp        0      0 172.18.0.3:43294        5.153.231.4:80          TIME_WAIT
tcp        0      0 172.18.0.3:43286        5.153.231.4:80          TIME_WAIT
tcp        0      0 172.18.0.3:34212        151.101.52.204:80       TIME_WAIT
tcp        0      0 172.18.0.3:34218        151.101.52.204:80       TIME_WAIT
tcp        0      0 172.18.0.3:34214        151.101.52.204:80       TIME_WAIT
tcp6       0      0 :::80                   :::*                    LISTEN
udp        0      0 127.0.0.11:34452        0.0.0.0:*

but if I wait a little bit, all the connections go away.

root@9262db4be2d7:/app# netstat -natu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.11:42827        0.0.0.0:*               LISTEN
tcp6       0      0 :::80                   :::*                    LISTEN
udp        0      0 127.0.0.11:34452        0.0.0.0:*

I also could not find any direct usage of HttpClient. Everything seems to be wrapped in some high-level calls so I'll need to unwind that.

wfurt on 10 Jun 2019

@wfurt, did you run the 3 version or 2.1? That is the behavior that I got from the 2.1 version. In the Core 3 version the sockets would never exit the CLOSE_WAIT state, even an hour after the test ended.

antonioortizpola on 10 Jun 2019

a little confused by the comment from @karelz. the fix should be applied to 3.0/2.2, as 2.1 does not have this socket issue, but the memory thing that should be fixed/merged in 2.2

jarlehal on 10 Jun 2019

@jarlehal typo, fixed, thanks for pointing it out.

karelz on 10 Jun 2019

👍1

@wfurt has been doing a good job digging into this, and shared with me that he noticed something suspicious, that in a repro when analyzing it with SOS there ended up being a small number of Sockets on the heap but a large number of SafeSocketHandles. Based on that, I have a theory that this is due to https://github.com/dotnet/corefx/pull/32845 / https://github.com/dotnet/corefx/pull/32793. I don’t think it actually caused the problem so much as the bug it was fixing was actually masking this problem that’s existed for a long time.

SocketsHttpHandler creates a Socket for each connection. Each Socket creates a SafeSocketHandle (that’s its name in 3.0; prior to that it was internal and named SafeCloseSocket), a SafeHandle that wraps the underlying file descriptor (there’s actually a secondary SafeHandle in the middle, but that’s not relevant). On Unix, when the Socket is connected, it’s registered with the SocketAsyncEngine, which is the code responsible for running the event loop interacting with the epoll handle. Whenever the epoll wait shows that there’s work available to be done, the event loop maps the relevant file descriptor back to the appropriate SafeSocketHandle so that the relevant work can be performed and callbacks invoked. In order to do that mapping, the SocketAsyncEngine stores a ConcurrentDictionary, and the engines themselves are stored in a static readonly SocketAsyncEngine[] array… the punchline here is that these SafeSocketHandles end up being strongly rooted by a static array.

The other important piece of information is that there’s a Timer inside SocketsHttpHandler that runs periodically to check whether connections in the connection pool are still viable, and if they’re not, Dispose’s of them. The bug that the aforementioned issues fixed was that there was an unexpected cycle formed between the timer and the connection pool that ended up keeping everything alive indefinitely, resulting in a non-trivial memory leak. However, as a side effect of that leak, it meant that the timer would continue to run, and every time it fired, it would loop through all of the open connections and Dispose of the ones that were no longer viable. In the fullness of time, all of them would get Dispose’d. Disposing of the connection would dispose of the Socket which would Dispose of the SafeSocketHandle and remove it from the SocketAsyncEngine’s dictionary.

Now, with the aforementioned fixes, if code fails to Dispose of the HttpClient/SocketsHttpHandler when done with them and drops the references to them, the timer gets collected, as does the connection pool, as do all of the HttpConnection objects in the pool. None of those have finalizers, nor should they need them. But here’s the rub. Socket does have a finalizer, yet its finalizer ends up being a nop. Since the storing of the SafeSocketHandle into the static dictionary isn’t something that can be undone automatically by GC, we actually need a finalizer to remove that registration should everything get dropped. Since all those objects don’t have finalizers, and since Socket’s finalizer isn’t doing the unregistration, everything gets collected above the SafeSocketHandle, which then remains registered effectively forever, never being disposed of, and never closing its file descriptor.

I don’t know for certain whether this is the cause of this issue. It’s just a theory, and @wfurt is working through the repro, debugging, and testing out theories. If this doesn’t turn out to be the root cause here, I suspect it’s still a bug we need to fix. If it does turn out to be the root cause, I don’t think the fix is to revert the aforementioned fixes: they were valid, they just revealed this existing problem they had been masking by creating a different leak that in turn allowed the timer to dispose of these resources... plus, this issue would apply to all uses of Sockets that weren't disposed of, not just those used from SocketsHttpHandler. The actual fix would likely be to either use a weak reference when storing the SafeSocketHandle into the dictionary (which might be the right fix but could also potentially cause perf or otherwise unforeseen problems), or to ensure that a finalizer is put in place to undo that registration (most likely changing Socket’s finalizer accordingly on Unix).
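
To make the weak-reference option concrete, here is a small self-contained sketch. It is not the actual System.Net.Sockets code: `HandleRegistry` and its members are hypothetical names, and a plain `object` stands in for the SafeSocketHandle. The only point is that a static table holding `WeakReference<T>` entries no longer keeps the registered object alive by itself:

```c#
using System;
using System.Collections.Concurrent;

// Hypothetical illustration; this is not the real SocketAsyncEngine.
public static class HandleRegistry
{
    // The static table holds weak references, so it does not root the
    // registered object once every other reference to it has been dropped.
    private static readonly ConcurrentDictionary<IntPtr, WeakReference<object>> s_map =
        new ConcurrentDictionary<IntPtr, WeakReference<object>>();

    public static bool Register(IntPtr fd, object handle) =>
        s_map.TryAdd(fd, new WeakReference<object>(handle));

    // Returns false if the fd is unknown or its target has been collected.
    public static bool TryGet(IntPtr fd, out object handle)
    {
        handle = null;
        return s_map.TryGetValue(fd, out WeakReference<object> weak)
            && weak.TryGetTarget(out handle);
    }

    // Explicit removal is still needed to drop the (now small) stale entry.
    public static bool Unregister(IntPtr fd) => s_map.TryRemove(fd, out _);
}
```

One cost visible even in this sketch: every lookup pays for an extra `TryGetTarget` call, and entries whose target was collected linger until something removes them, which is in line with the perf concern mentioned above.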

In the meantime, assuming this is the cause, in addition to fixing it in System.Net.Sockets, code using HttpClient/HttpClientHandler/SocketsHttpHandler should also be Dispose'ing of those instances when done with them. If you just create a single HttpClient instance that's stored in a static, there's no real need to dispose of it, as everything will go away when the process ends. But if you're doing something that creates an instance, uses it for one or more requests, and then get rid of it, when getting rid of it it should be disposed.
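
The two lifetimes described above can be sketched as follows. This is an illustration under stated assumptions, not code from the repro: `TrackingHandler` is a hypothetical stub that answers requests locally (so no network is involved) and records whether it was disposed, and the URL is a placeholder:

```c#
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stub handler: answers every request locally and records disposal.
public sealed class TrackingHandler : HttpMessageHandler
{
    public bool IsDisposed { get; private set; }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var response = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent("ok"),
        };
        return Task.FromResult(response);
    }

    protected override void Dispose(bool disposing)
    {
        IsDisposed = true;
        base.Dispose(disposing);
    }
}

public static class HttpLifetimeDemo
{
    // Pattern 1: one long-lived client for the whole process; never disposed.
    public static readonly HttpClient Shared = new HttpClient();

    // Pattern 2: a short-lived client. The using blocks dispose the response
    // and the client; disposing the client also disposes the handler it was
    // constructed with (the default for this HttpClient constructor).
    public static async Task<bool> ShortLivedAsync(TrackingHandler handler)
    {
        using (var client = new HttpClient(handler))
        using (HttpResponseMessage response = await client.GetAsync("http://example.invalid/"))
        {
            response.EnsureSuccessStatusCode();
        }
        return handler.IsDisposed;
    }
}
```

With a real SocketsHttpHandler in place of the stub, disposing the handler is what tears down its connection pool deterministically instead of leaving that to the GC.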

cc: @geoffkizer, @tmds

stephentoub on 11 Jun 2019

❤6

I'm making some progress. On the note above: if you add cenamOperationClient.Close() to GetSubscriberDetailsF() after the response is received to close the WCF client, there are no lingering sockets at all @antonioortizpola
With the old platform handlers, the socket can be closed independently, but now the usual reference counting works and disposing HttpClient when not used can lead to delayed release of resources. Likewise, any reference to HttpResponseStream would keep the underlying socket open.
I think there is definitely issue with 2.2+ but there can be more than one reason for observed behavior.

wfurt on 11 Jun 2019

👍1

We have the same issue. All ports created with IHttpClientFactory stays in CLOSE_WAIT state

vasicvuk on 12 Jun 2019

> All ports created with IHttpClientFactory stays in CLOSE_WAIT state

@glennc, @rynowak, is HttpClientFactory disposing of all handlers it creates?

@vasicvuk, are you disposing of all HttpResponseMessages you're given and response Streams you're given?

stephentoub on 12 Jun 2019

@wfurt, thanks a lot for your investigation! I will change the code so the service closes the connection.

I am glad the repository could help to replicate the problem and I hope it could help others.

Please correct me if I am wrong, but I think the main problems are:

1. People who are not using HttpClient correctly.
2. People who use libraries that use sockets or HttpClient and make assumptions based on previous behaviors (like WCF and me).

For the first group, please make sure you are using HttpClient correctly, most probably, it will fix the problem and improve your system.

For the second group, search for methods that could close the connection or IDisposable interfaces, and make tests monitoring your sockets (like with netstat -natu) to check if it can help fix the problem. Or check if you can reuse your connections.

If the problem persists, tell us how you are using the client, socket, or library, and if possible create a simple repository with a reproduction case.

antonioortizpola on 13 Jun 2019

❤1 👍1

The repro was extremely useful, thanks @antonioortizpola.

In either case we should not leak OS resources, and right now we do in some cases with the 2.2 code.
Disposing explicitly is best, as everything is released as soon as it is no longer needed. Otherwise, the socket can stay open until the GC kicks in, and that may take some time depending on many variables.

wfurt on 13 Jun 2019

👍1

> I think the main problems are

There are two issues here:

1. There's a bug in Sockets on Unix where if you allow a connected Socket to be collected without it having been disposed, it'll leak the underlying handle.
2. Consumers of HttpClient are sometimes not using it correctly, leading to the above bug getting triggered.

Fixing either of those is technically sufficient to address this issue, although even when (1) is fixed, it's important for (2) to be done, as the fix for (1) is still non-deterministic and could take an unbounded amount of time to kick-in.

stephentoub on 13 Jun 2019

👍1

@stephentoub Hi,
We are using HttpClientFactory and using for HttpResponseMessages. So as I understand it, the issue is that the socket is not being disposed after some time if there is no explicit dispose in the code. I guess that using block will fix this?

vasicvuk on 13 Jun 2019

> I guess that using block will fix this?

yes, a using block causes Dispose to be called.
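
As a small illustration of that point: a using block behaves like try/finally, so Dispose runs even when the body throws. `TrackedResource` below is a hypothetical type that only records whether Dispose was called:

```c#
using System;

// Hypothetical resource type used only to observe when Dispose runs.
public sealed class TrackedResource : IDisposable
{
    public bool IsDisposed { get; private set; }

    public void Dispose() => IsDisposed = true;
}

public static class UsingDemo
{
    // Returns true if Dispose ran even though the body of the using block threw.
    public static bool DisposesOnException()
    {
        var resource = new TrackedResource();
        try
        {
            using (resource)
            {
                throw new InvalidOperationException("simulated failure");
            }
        }
        catch (InvalidOperationException)
        {
            // By the time we get here, the using block has already called Dispose.
        }
        return resource.IsDisposed;
    }
}
```

The same applies to HttpResponseMessage and response streams: wrapping them in a using block releases them on both the success path and the failure path.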

tmds on 13 Jun 2019

I submitted a fix to 3.0 master. It would be great if anybody could grab a daily build and verify that it solves their issue.

Since this is a somewhat generic error, there may be more than one issue under the covers. In either case, any feedback would be useful.

Kudos to @antonioortizpola, who was able to isolate a repro.

wfurt on 16 Jun 2019

❤3 👍2

Will it be possible to get this fix in 2.2?

jarroda on 16 Jun 2019

👍2

Possibly. It will be easier to get approval if we can confirm that this fix solves the observed issues, e.g. by trying 3.0 before and after.

wfurt on 17 Jun 2019

👍3

We are seeing exactly these symptoms, and we have isolated them to the application code.

anilpras on 8 Jul 2019

This issue is causing big trouble for us in production apps.

We reviewed all HttpClient usages in our code and migrated to HttpClientFactory. But it still happens from time to time.

The interesting thing is that we have two similar apps (they use HttpClient the same way) on Azure Kubernetes and GKE (Google's Kubernetes), and the issue only happens on Azure.

Any plans to fix it in 2.2?

alxbog on 2 Aug 2019

2.2 port is waiting for verification @alxbog. We need to get enough evidence that dotnet/corefx#38499 fixes it or we need separate repro for 2.2. Until then it is unlikely we get permission for 2.x changes.

wfurt on 2 Aug 2019

👍2

If someone is willing to try a private build with the fix ported to 2.2/2.1, that would help us prove it is worth porting.
If you are that person let us know and we can provide privates with the change ported for testing ...

karelz on 3 Aug 2019

Can anyone confirm if this documentation is correct? I've seen some comments suggesting we should be disposing HttpClients now even though the docs pretty much say the opposite: https://docs.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-3.0#httpclient-and-lifetime-management

phillijw on 5 Sep 2019

@phillijw note that closed issues are not monitored. The docs you linked are for IHttpClientFactory (part of ASP.NET code).
In general, you should not dispose HttpClient, but if you want to avoid the stale DNS records problem, it is healthy to recycle your static instance on a regular basis (HttpClientFactory does it for you).

karelz on 5 Sep 2019

Can anyone confirm if this documentation is correct? I've seen some comments suggesting we should be disposing HttpClients now even though the docs pretty much say the opposite

The short answer is: HttpClient is an IDisposable, and as with any IDisposable, you should Dispose of it whenever you're done with it.

The question then becomes "when should I be done with it?"

If you're creating your own HttpClient instance, e.g. new HttpClient() or new HttpClient(new SocketsHttpHandler()), then you should ideally be reusing the instance over and over and over, rather than creating a new one per request. That's because the underlying handler owns the connection pool, disposing of the handler will dispose of the connection pool, and the aforementioned constructors end up using the public HttpClient(HttpMessageHandler handler, bool disposeHandler) constructor with disposeHandler:true. You still want to Dispose of the HttpClient when you're done with it, but in the common case you shouldn't ever be "done" with it, as you just stash it into a static field and use it for all subsequent requests. If you do decide to be done with it at some point, such as if you want to replace it with a different instance for some reason, then you'd want to Dispose of it then. Again, in this way, it's no different from any other IDisposable: it owns resources, dispose of it when you no longer need those resources.
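
As a rough sketch of that "static field" pattern, under stated assumptions: `SharedHttp` and `Replace` are hypothetical names, not part of any library, and the swap assumes no requests are still in flight on the old instance.

```csharp
using System.Net.Http;
using System.Threading;

public static class SharedHttp
{
    // Created once and reused for every request, so the handler's
    // connection pool is shared instead of being rebuilt per call.
    private static HttpClient _client = new HttpClient();

    public static HttpClient Client => Volatile.Read(ref _client);

    // Hypothetical helper for the "replace it with a different instance" case:
    // swap the shared client, then dispose the one that is no longer needed.
    // Assumes nothing is still in flight on the old instance, because
    // disposing an HttpClient cancels its pending requests.
    public static void Replace(HttpClient replacement)
    {
        HttpClient old = Interlocked.Exchange(ref _client, replacement);
        old.Dispose();
    }
}
```

Callers would use `SharedHttp.Client.GetAsync(...)` everywhere and only call `Replace` in the rare case where the instance really has to change.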

However, the docs you link to are for IHttpClientFactory. It muddies the waters a little, because it maintains and manages its own set of HttpMessageHandler instances, and its design is to give you back a new HttpClient instance every time you ask for one: the intent here is that you use that HttpClient for the lifetime of your request, at which point you're "done" with it and, per my previous comments, it should be disposed. That instance wraps one of these shared HttpMessageHandler instances, but was constructed with the disposeHandler:false argument:
https://github.com/aspnet/Extensions/blob/c2147ae6a07c5ebf6aa6ef2f8de86e0851fc13ca/src/HttpClientFactory/Http/src/DefaultHttpClientFactory.cs#L117-L134
such that disposing of the HttpClient won't dispose of the underlying shared handler.
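
The effect of the disposeHandler argument can be checked in isolation. The sketch below is illustrative only and does not reproduce what IHttpClientFactory does internally; `TrackingHandler` and `DisposeHandlerDemo` are hypothetical names, and the handler never opens a connection.

```csharp
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical handler that records whether it has been disposed.
public sealed class TrackingHandler : HttpMessageHandler
{
    public bool Disposed { get; private set; }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
        => Task.FromResult(new HttpResponseMessage(HttpStatusCode.OK));

    protected override void Dispose(bool disposing)
    {
        Disposed = true;
        base.Dispose(disposing);
    }
}

public static class DisposeHandlerDemo
{
    public static (bool ownedDisposed, bool sharedDisposed) Run()
    {
        var owned = new TrackingHandler();
        var shared = new TrackingHandler();

        // This constructor behaves like disposeHandler: true,
        // so the client owns the handler and disposes it.
        using (var client = new HttpClient(owned)) { }

        // With disposeHandler: false the handler outlives the client,
        // so it can be shared by many short-lived clients.
        using (var client = new HttpClient(shared, disposeHandler: false)) { }

        return (owned.Disposed, shared.Disposed);
    }
}
```

Under these assumptions `Run()` should report the first handler as disposed and the second as still alive, which is why disposing a factory-created client per request does not tear down the pooled connections.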

stephentoub on 5 Sep 2019

Thanks for the explanation @stephentoub. The clarity on disposeHandler is what I was really missing. I feel like the examples on the docs could be updated to maybe discuss that point a bit more or to show examples where the http client DOES get disposed. For instance, https://docs.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-3.0#use-ihttpclientfactory-in-a-console-app does not dispose even though it fits your example of where it should be disposed, I think.

phillijw on 5 Sep 2019

does not dispose even though it fits your example of where it should be disposed

Yes, the client in the GetPage method in that sample should be disposed. Thanks for pointing that out.
cc: @glennc, @rynowak

stephentoub on 5 Sep 2019

👍1

@karelz I experience this problem with .net core 2.2. Is it possible to get private libs to test the backport?

MrZoidberg on 1 Oct 2019

@wfurt can you please create a private build against 2.2 for @MrZoidberg to test?

karelz on 1 Oct 2019

UPDATE: Disregard this comment... the System.Net.Sockets.SocketException: Address already in use bug kept happening in both cases. I only solved it by moving my class from services.AddScoped to services.AddSingleton.

PREVIOUS COMMENT (again, disregard):
I'm getting this error a lot now in a situation where I've switched from:

```c#
httpResponse = await _client.GetAsync(method);
```

to:

```c#
using (var req = create(HttpMethod.Get, finalUrl)) {
    httpResponse = await _client.SendAsync(req);
}

protected virtual HttpRequestMessage create(HttpMethod method, string url, string postBody = null) {
    var req = new HttpRequestMessage(method, url);
    return req;
}
```

I'm disposing of httpResponse after deserializing response in both cases. Any opinion on the code @karelz @stephentoub? I'm on .NET Core 2.2 and would prefer not upgrading to .NET Core 3.0, unless it would help you further debug and solve the problem.

rockstardev on 22 Oct 2019

The problem with the socket comes from the HttpClient, not the HttpRequestMessage. How are you creating the HttpClient? Are you using DI or HttpClientFactory?

antonioortizpola on 22 Oct 2019

Have you registered your HttpClient at bootstrap? I.e. something like this:
```c#
services.AddHttpClient<MyService>(c =>
{
    c.BaseAddress = new Uri("https://base_uri");
    c.Timeout = TimeSpan.FromSeconds(30);
}).SetHandlerLifetime(TimeSpan.FromSeconds(30));
```

Then from your `MyService` inject it in:
```c#
private readonly HttpClient _client;

public MyService(HttpClient client)
{
    _client = client;
}
```

Implementation should be as simple as:
```c#
using (var response = await _client.GetAsync(url))
{
    return await response.Content?.ReadAsStringAsync();
}
```
Since you're using the GET method the GetAsync(url) function should be sufficient.

This lets .NET Core handle all your connection pooling for you so you don't have to worry about managing it yourself.

adelhelal on 23 Oct 2019

I'm creating HttpClient by myself in both cases. However if I use it HttpRequestMessage and SendAsync - problems with System.Net.Sockets.SocketException: Address already in use start.

If I just use GetAsync - no problems.

@karelz @stephentoub can you please review this and provide any feedback? I'm hoping if you investigate in this direction that you can potentially solve the problem for others as well.

rockstardev on 25 Oct 2019

However if I use it HttpRequestMessage and SendAsync - problems with System.Net.Sockets.SocketException: Address already in use start.

Can you share the code you use with SendAsync? Are you calling it with ResponseHeadersRead?

stephentoub on 25 Oct 2019

I'm hoping if you

GetAsync is just a wrapper around SendAsync.

https://github.com/dotnet/corefx/blob/092ede257ff5521c6bae20b27f1334536b1b985c/src/System.Net.Http/src/System/Net/Http/HttpClient.cs#L300-L304

OpenSourceAries on 25 Oct 2019

I got the same error in production too (.NET Core 2.2, Ubuntu, k8s).
We followed the sample code suggested in the ASP.NET docs; here is the demo code:

In Startup.cs:
```c#
public void ConfigureServices(IServiceCollection services)
{
    services.AddMvc();

    // clients
    services.AddHttpClient<IAddressClient, AddressClient>()
        .AddHttpMessageHandler(handler => new TimingOutDelegatingHandler())
        .AddHttpMessageHandler(handler => new RetryPolicyDelegatingHandler());
```

Here is AddressClient.cs:
```c#
public AddressClient(
    HttpClient httpClient,
    IOptions<UsersApiDomainConfig> usersApiConfig,
    ILogger<AddressClient> logger)
    : base(usersApiConfig)
{
    _httpClient = httpClient;
    _logger = logger;
}
```

Here is the use of the HttpClient:
```c#
public async Task SyncAddressAsync(AddressModel address)
{
    var url = GetUrl(Path);
    var response = await _httpClient.PutAsync(
        url,
        new StringContent(JsonConvert.SerializeObject(address), Encoding.UTF8, HttpClientConstants.ApplicationJson));
```

I think we are doing what the docs suggest.

draco111 on 25 Dec 2019

However if I use it HttpRequestMessage and SendAsync - problems with System.Net.Sockets.SocketException: Address already in use start.

Can you share the code you use with SendAsync? Are you calling it with ResponseHeadersRead?

Thanks for your help @stephentoub and @OpenSourceAries ... in the end I solved my problem in different fashion since I was still getting errors... updated original post: https://github.com/dotnet/corefx/issues/37044#issuecomment-545038295

rockstardev on 2 Jan 2020

I'm also seeing this exception with the MongoDB driver in K8s, which isn't using HttpClient. So does this fix cover that case, or is the scope larger than originally anticipated?

chrisdrobison on 1 Apr 2020
