Actix-web: UDP Socket Leak

Created on 7 Dec 2018 · 35Comments · Source: actix/actix-web

I have a service that after ~18 - 24 hours runs out of file descriptors after a lot of investigation and some production debugging we were able to track it down to being leaked UDP sockets. My first thought is something about DNS resolution is wrong. When caught in the act the server had 418 UDP sockets open and it climbs up to the ulimit of 1024 and then server CPU spikes and the health check fails and the container is cycled.

C-bug-upstream

Source

glademiller

Most helpful comment

@bluejekyll thanks for publishing that. I have been running it for about 2 hours so far and am not seeing the UDP socket leak. Previously it would show up almost immediately for my use case -- lots of outgoing connections (~3000/s) with ~300 ms timeouts -- and about 5% of the connections ultimately timeout.

jgrimes on 17 Dec 2018

👍3

All 35 comments

Hmm, unexpected.

Actix uses udp only for dns, via trust-dns.
@bluejekyll do you aware about any leaks? actix code is pretty simple.

fafhrd91 on 7 Dec 2018

This is the first mention of a potential issue here, but of course anything is possible. Is there anything interesting about these work loads? Like a lot of unique outbound requests to different DNS names?

bluejekyll on 7 Dec 2018

For some context the application is executing in a docker container running in an AWS ECS cluster. I have been unable to reproduce the problem locally using the same docker container so I think it must be something peculiar to the DNS configuration in AWS. I'm working on getting more information.

glademiller on 7 Dec 2018

As far as the requests that are being made they fall into the following categories:
1) Requests to Servers within the same ECS cluster resolved via Route53 private DNS
2) Requests to a server inside another ECS cluster
3) Requests to SAP hosted servers

glademiller on 7 Dec 2018

did you try previous version of actix?

fafhrd91 on 7 Dec 2018

I will attempt a downgrade now

glademiller on 7 Dec 2018

Initial results after the downgrade are that file descriptors no longer appear to be climbing. I will monitor for a while and report back if that ceases to be the case.

glademiller on 7 Dec 2018

trust-dns shouldn't be hanging onto UDP sockets after Timeouts, so this is definitely strange. On top of that, it should de-prioritize nameservers if they aren't returning results. If downgrading does fix the issue, it would be good to publish the Cargo.lock files from both versions so we can try and track this down.

btw, trust-dns doesn't currently have any bounds on request queues, so it's feasible that under a high load and/or delayed responses that sockets backup, though I would expect those to get cleaned-up after timeouts are hit.

bluejekyll on 7 Dec 2018

It has been running several hours now and I haven't seen any evidence of problems since I downgraded. I will check again on Monday but at this point I'm reasonably confident something about the trust-dns upgrade is probably causing the problem.

glademiller on 8 Dec 2018

@glademiller can you post the versions of arctix you were using and, if you have it, the trust-dns versions?

bluejekyll on 8 Dec 2018

Latest actix uses trust-dns 0.5 and 0.10, version that doesn’t leak uses 0.4.3 and 0.9.1

fafhrd91 on 8 Dec 2018

👍1

Just in case it isn't absolutely clear I wasn't able to get the leak to
occur locally running on docker on my machine. It only happened in an ecs
cluster on aws. The Actix versions mentioned were the ones I was/am on.

On Sat, Dec 8, 2018, 13:21 Nikolay Kim <[email protected] wrote:

Latest actix uses trust-dns 0.5 and 0.10, version that doesn’t leak uses
0.4.3 and 0.9.1

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/actix/actix-web/issues/616#issuecomment-445487103,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AB6chL-ilH0GRSri6S33wEhRPozpbV2nks5u3B86gaJpZM4ZI1Op
.

glademiller on 8 Dec 2018

I may have to eat my words I just checked the logs and it looks like file descriptor use has risen to ~700 on one of the containers running the downgraded version. I won't be able to access this particular container to confirm it is an accumulation of udp sockets but it seems likely. The only other possibility would be http client sockets that aren't being cleaned up. If happens in the other environment I have more full access to I will confirm the cause.

glademiller on 10 Dec 2018

The issue has since manifested itself on the other (staging) cluster as well even while running the downgraded version. 617 open file descriptors all UDP sockets.

glademiller on 10 Dec 2018

@bluejekyll @fafhrd91 I guess several hours wasn't enough to diagnose whether the problem was gone or not on the older version.

glademiller on 10 Dec 2018

That is strange. actix itself does not use udp in any form except dns resolution. You can replace resolver actor with with custom version that uses thread pool for name resolution

fafhrd91 on 10 Dec 2018

Are you sure these sockets are owned by this process? Do you have any log collectors or other things on this system that might be using UDP for message delivery?

As to trust-dns, if you have a way of monitoring out bound connections (tcpdump, etc), then you could see if these sockets are being used for DNS, as all the outbound requests would be for port 53.

bluejekyll on 10 Dec 2018

1) I am 100% certain the sockets belong to the service process netstat and lsof -p both agree on that. Being in a docker container I have also confirmed no other processes are running apart from my shell and ps when I have attached to the container.
2) Log collection is being handled by docker outside of the container taking the results of STDOUT etc and sending them to Graylog.
3) As far as confirming all the sockets belong to DNS that is a good idea just to make sure there isn't something else going on.

A subset of the sockets as reported by ss is below. The Peer Address:Port unfortunately is just showing up as * perhaps the case because the sockets are in UNCONN state?

State      Recv-Q Send-Q Local Address:Port               Peer Address:Port              
UNCONN     0      0            *:41388                    *:*                  
UNCONN     0      0            *:49589                    *:*                  
UNCONN     0      0            *:51638                    *:*                  
UNCONN     0      0            *:10679                    *:*                  
UNCONN     0      0            *:20921                    *:*                  
UNCONN     0      0            *:47562                    *:*                  
UNCONN     0      0            *:4562                     *:*                  
UNCONN     0      0            *:25044                    *:*                  
UNCONN     0      0            *:22998                    *:*                  
UNCONN     0      0            *:45531                    *:*                  
UNCONN     0      0            *:55772                    *:*                  
UNCONN     0      0            *:25059                    *:*                  
UNCONN     0      0            *:43493                    *:*                  
UNCONN     0      0            *:49638                    *:*                  
UNCONN     0      0            *:57831                    *:*                  
UNCONN     0      0            *:23019                    *:*                  
UNCONN     0      0            *:8684                     *:*                  
UNCONN     0      0            *:10732                    *:*                  
UNCONN     0      0            *:35319                    *:*

glademiller on 10 Dec 2018

afaik there is no state in UDP, so I don't think there should be any other state than UNCONN

You can try to replace trust with own resolver by providing recipient to own resolver https://actix.rs/actix-web/actix_web/client/struct.ClientConnector.html#method.resolver

DoumanAsh on 10 Dec 2018

Yes, you need to actually do a packet capture to see what is being sent from those sockets. This would be a bit annoying, but can be done with tcpdump.

bluejekyll on 11 Dec 2018

I'll see what I can do the difficult part about this is capturing it in action. I upgraded all dependencies and removed my dependency on futures-await in favor of tokio-async-await. So far on the staging environment the problem hasn't replicated itself again. That isn't to say it won't. I'll keep watching and try tcpdump the next time it happens and keep you posted. Thanks for all your support on this it has been a very puzzling issue to deal with.

glademiller on 11 Dec 2018

👍1

I turned on trace logging in trust dns and the issue started manifesting itself today. I don't have TCP dump but I have no doubt the extra sockets are coming from Trust DNS. With trace logging on I see a large number of log lines that look like either
trust_dns_resolver::name_server_pool] reconnecting: NameServerConfig { socket_addr: V4(10.10.160.2:53), protocol: Udp, tls_dns_name: None }
or
trust_dns_resolver::name_server_pool] connecting: Udp { socket_addr: V4(10.10.160.2:53), timeout: 2s }

In the meantime I've switched to a custom resolver based on https://crates.io/crates/tokio-dns-unofficial.

glademiller on 12 Dec 2018

Ok. Thanks. I’ll look at building a test case to see if I can reproduce this.

bluejekyll on 12 Dec 2018

I think I'm hitting the same problem. I've recently added several dns resolutions on a 5s loop and upgraded my crates, now my application crashes on OpenWRT after about 4-5 hours with similar file descriptor symptoms.

jkilpatr on 12 Dec 2018

@jkilpatr do you have additional info on dependencies and when you started experiencing this?

I've recently added several dns resolutions on a 5s loop

Was this before or after upgrading?

I'll be publishing anything I find in regards to this issue here: https://github.com/bluejekyll/trust-dns/issues/633

bluejekyll on 13 Dec 2018

@bluejekyll

https://github.com/althea-mesh/althea_rs/blob/release/rita/Cargo.toml

this is what we currently have running in prod doing dns resolution and other things on 5s loops it's been up and stable on ~15ish devices for 3-4 weeks now so thousands of operational hours with no issues.

https://github.com/althea-mesh/althea_rs/blob/master/rita/Cargo.toml

This is what we are running now, the newer code does perform more dns queries but it's just additional uses of the same codepaths, maximum operational time I've seen with these dep versions is 5 hours. I've been trying to downgrade all morning to test with new code but older deps. Downgrading does not seem to be viable though because cargo will always chose the newer actix and with it the newer trustdns probably because of complicated deps tree issues.

jkilpatr on 13 Dec 2018

FYI: https://github.com/bluejekyll/trust-dns/pull/635 has what I believe should fix these issues. It ensures that in the case of timeouts, UDP sockets are closed.

Feel free to test against it. I'm working to get it merged into the library, and then will publish a 0.10.1 release.

bluejekyll on 16 Dec 2018

If anyone wants to verify the patch before I publish 0.10, please do so. This should fix the issues, though I haven't been able to reproduce the errors mentioned, through code review I do believe this fixes a potential issue around timeouts as they related to UDP operations.

bluejekyll on 17 Dec 2018

trust-dns-resolver 0.10.1 has been published with a fix. please reach out if issues continue.

bluejekyll on 17 Dec 2018

👍2 ❤1

@glademiller could you check new version of trust-dns?

fafhrd91 on 17 Dec 2018

jgrimes on 17 Dec 2018

👍3

I will setup a test today and say something if it happens within the next few days. It sounds like @jgrimes has a lot more quick and reliable way to reproduce the problem so I doubt I will have any issues.

glademiller on 17 Dec 2018

Thanks for verifying @jgrimes

bluejekyll on 17 Dec 2018

@bluejekyll Hmm just wanted to let you know interesting thing about recent update of trust-dns it seems with it we started hitting some problem with compiler when not having recursion_limit

It only happens in doctest so far, JFYI maybe someone else will have it too😄
https://ci.appveyor.com/project/fafhrd91/actix-n9e64/builds/21095175/job/r9g5ai9ufyypjdd4#L522

DoumanAsh on 18 Dec 2018

hm, interesting. I haven't noticed that, though I did override that inside trust-dns a long time ago to #![recursion_limit = "2048"] in proto, and #![recursion_limit = "128"], I can't say why those numbers were picked.

We used to have a dependency on error_chain that required pretty high recursion_limits, but that's all been replaced with failure at this point, so I'm not sure why we might be triggering that now.

bluejekyll on 18 Dec 2018

Was this page helpful?

0 / 5 - 0 ratings