Actix-web: Connection not closed properly

Created on 11 Jan 2020 · 11 comments · Source: actix/actix-web

I've been using the 1.x version of actix-web for months and have had to restart my app every now and then (sometimes after minutes, sometimes after days) because a lot of ESTABLISHED connections are left hanging, eventually causing a "too many open files" error (even though I've increased the limit drastically). I'm running my server with keep-alive disabled; the rest of the settings are the defaults. I have since upgraded to 2.0.0 to see if that solves the problem, but the behavior is the same.

The service itself gets around 500-1000 requests per second in production currently.

needs-investigation

orangesoup

All 11 comments

Could you create a reproducible example?

fafhrd91 on 11 Jan 2020

Basically it's a simple hello world app:

use std::io;

use actix_http::KeepAlive;
use actix_web::{web, App, HttpResponse, HttpServer};

#[actix_rt::main]
async fn main() -> io::Result<()> {
  let app = move || {
    App::new().service(web::resource("/").route(
      web::get().to(|| HttpResponse::Ok().body("Hello World!")),
    ))
  };

  HttpServer::new(app)
    .keep_alive(KeepAlive::from(None))
    .backlog(8192)
    .bind("0.0.0.0:14444")?
    .run()
    .await
}

Cargo:

[dependencies]
actix-rt = "1.0.0"
actix-http = "1.0.1"
actix-web = "2.0.0"

I ran wrk from another server like this:

wrk -t20 -c40000 -d15 http://url:14444

After this short test, 23 unclosed connections were left behind, all in the ESTABLISHED state.

orangesoup on 11 Jan 2020

@orangesoup would you be so kind as to provide more information about your environment? That may be quite helpful.

OS, distro and architecture
any special/uncommon/unobvious firewall/iptables rules

Also, does this issue reproduce on only one type of server, or in every environment you've tried?

Thank you!

tyranron on 21 Jan 2020

I've run the example above with different wrk workloads and the described problem has not appeared. So we have at least one platform where it seems to be OK:

System Version: macOS 10.12.6 (16G2136)
Kernel Version: Darwin 16.7.0
x86_64

tyranron on 21 Jan 2020

@orangesoup did you try to debug/trace what is going on inside actix-web/actix-http with this?

tyranron on 21 Jan 2020

I have tried that as well and was able to reproduce it:

Linux max 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Common KVM processor
80 CPU

After 15 seconds of wrk, about 30 ESTABLISHED connections remain for a long while; I actually had to kill the server.

Is there a reason that client_shutdown is applied only for TLS and not for regular HTTP/1 connections?
The setting at https://docs.rs/actix-web/2.0.0/src/actix_web/server.rs.html#177-180 ends up being used only for TLS in the builder:
https://docs.rs/actix-http/1.0.1/src/actix_http/builder.rs.html#114-117

dunnock on 22 Jan 2020

I can help and run any tracing, though I need some advice: what's the best way to trace through the stack, e.g. how can I trace the connections that stay open? Maybe there are some tools, modules, or hints on how to enable tracing. It seems App::wrap is of no use here, as the tracing would need to go one level deeper, down to the connection.

dunnock on 22 Jan 2020

@dunnock

std::env::set_var("RUST_LOG", "trace");
env_logger::init();

at the beginning of main should help.

You can also opt in to or out of logs granularly per crate/module; see the docs.

tyranron on 22 Jan 2020

@orangesoup what is ulimit -n in the shell where you are running the actix server?

Mine was 1024, which is the default, and I spotted ~24 errors in the log and ~10 hung ESTABLISHED connections:

[2020-01-22T18:46:52Z ERROR actix_server::accept] Error accepting connection: Too many open files (os error 24)

After increasing the limit with ulimit -n 65535 and restarting, I did not see any errors in the log and all connections were closed after the test. Can you please check and confirm whether you see the same on your side?

@tyranron if that is confirmed, the workaround would be proper server setup; maybe this should be documented, as those errors do not show up unless a logger is enabled as you advised. But we should still look for the reason why connections sometimes hang (I suppose some unsafe code in an error handler, etc.).
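Since the limit depends on the shell that launched the server, it can help to have the process report the limit it actually inherited. The following is a minimal sketch, assuming Linux and the usual layout of /proc/self/limits; the names OpenFileLimits and parse_open_file_limits are made up for illustration and are not part of actix-web.

```rust
use std::fs;

/// Soft and hard "Max open files" limits; `None` means "unlimited".
/// (Hypothetical helper type, not part of actix-web.)
#[derive(Debug, PartialEq)]
struct OpenFileLimits {
    soft: Option<u64>,
    hard: Option<u64>,
}

/// Parses the "Max open files" row from the text of /proc/<pid>/limits.
fn parse_open_file_limits(limits: &str) -> Option<OpenFileLimits> {
    const LABEL: &str = "Max open files";
    let line = limits.lines().find(|l| l.starts_with(LABEL))?;
    let mut fields = line[LABEL.len()..].split_whitespace();
    let soft = fields.next()?;
    let hard = fields.next()?;
    // A value is either a number or the literal word "unlimited".
    let parse = |s: &str| {
        if s == "unlimited" {
            Some(None)
        } else {
            s.parse::<u64>().ok().map(Some)
        }
    };
    Some(OpenFileLimits {
        soft: parse(soft)?,
        hard: parse(hard)?,
    })
}

fn main() {
    // Linux-specific path; on other platforms the read simply fails.
    match fs::read_to_string("/proc/self/limits") {
        Ok(text) => match parse_open_file_limits(&text) {
            Some(limits) => println!("open file limits: {:?}", limits),
            None => println!("no 'Max open files' row found"),
        },
        Err(e) => println!("could not read /proc/self/limits: {}", e),
    }
}
```

Calling something like this at the start of main, before HttpServer::new, would show in the log whether the server really runs with 1024 or with the raised value.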

dunnock on 22 Jan 2020

@tyranron, @dunnock Sorry for the late response, I haven't checked back since the project was gone for a while. :)

As for the OS, I'm using

4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

I've been increasing the ulimit since forever, even further than the mentioned 65535; that does not solve the problem.

The thing is, sometimes it's fine for days. By fine, I mean it's slowly building up, so it's not causing major issues. And then suddenly everything snaps. This could happen hours or days after starting my app. For example, my production app has been running for 3 days now and this is what lsof shows:

lsof -p 6508 | wc -l
10195

More than 10k of that are ESTABLISHED connections that are not cleaned up.

I could definitely use a lower ulimit to prevent other apps from failing, but I'm pretty sure this is still a problem that shouldn't happen at all. How can I help to solve this issue faster?
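To watch the build-up over time instead of taking a single lsof snapshot, the ESTABLISHED sockets on the server port can be counted from /proc/net/tcp. This is only a sketch, assuming Linux and an IPv4 listener (IPv6 sockets are listed in /proc/net/tcp6); count_established is a hypothetical helper name. Note that the file covers the whole network namespace, not just one process, so filtering by the local port (14444 in the example above) is what ties it to the server.

```rust
use std::fs;

/// Counts rows of /proc/net/tcp whose local port matches `local_port`
/// and whose state column is "01" (TCP_ESTABLISHED).
fn count_established(proc_net_tcp: &str, local_port: u16) -> usize {
    proc_net_tcp
        .lines()
        .skip(1) // header row
        .filter(|line| {
            let mut fields = line.split_whitespace();
            let local = fields.nth(1); // "HEXADDR:HEXPORT"
            let _remote = fields.next();
            let state = fields.next(); // two hex digits
            let port = local
                .and_then(|l| l.rsplit(':').next())
                .and_then(|p| u16::from_str_radix(p, 16).ok());
            port == Some(local_port) && state == Some("01")
        })
        .count()
}

fn main() {
    // 14444 is the port from the hello world example in this thread.
    match fs::read_to_string("/proc/net/tcp") {
        Ok(text) => println!(
            "ESTABLISHED on :14444 = {}",
            count_established(&text, 14444)
        ),
        Err(e) => println!("could not read /proc/net/tcp: {}", e),
    }
}
```

Running this periodically (for example from a cron job or a background thread) and logging the number next to the request rate would show whether the leak grows steadily or jumps at specific moments.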

orangesoup on 28 Jan 2020

I have the same issue: an application that receives around 10 requests per minute fails over the span of a couple of hours because it runs out of available file descriptors.

4.4.0-165-generic #193-Ubuntu SMP Tue Sep 17 17:42:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux (Ubuntu 16.04.7)

Interestingly, if I look in /proc/<pid>/fd, it looks like almost all the sockets were created at the same time. However, that time correlates with when I first checked there, so that might be a kernel artifact.

Edit: I have tried reproducing the issue on another computer using ab (Apache Benchmark) with 10k requests, 1k concurrent, and did not find any issues. Information on that system: 5.8.10-arch1-1 #1 SMP PREEMPT Thu, 17 Sep 2020 18:01:06 +0000 x86_64 GNU/Linux (Arch Linux)
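Because the timestamps in /proc/<pid>/fd may not reflect when a socket was opened, counting the socket entries is probably a more reliable signal than their dates. A rough sketch, assuming Linux, where socket descriptors show up as symlinks of the form socket:[inode]; socket_inode and count_socket_fds are hypothetical names used only for this example.

```rust
use std::fs;
use std::io;

/// Returns the inode number if a /proc/<pid>/fd link target
/// has the form "socket:[12345]".
fn socket_inode(link_target: &str) -> Option<u64> {
    link_target
        .strip_prefix("socket:[")?
        .strip_suffix(']')?
        .parse()
        .ok()
}

/// Counts descriptors of the given process that point at sockets.
/// `pid` may be a numeric pid as a string, or "self".
fn count_socket_fds(pid: &str) -> io::Result<usize> {
    let mut count = 0;
    for entry in fs::read_dir(format!("/proc/{}/fd", pid))? {
        let entry = entry?;
        // A descriptor can be closed between listing and readlink;
        // such entries are skipped instead of failing the whole count.
        if let Ok(target) = fs::read_link(entry.path()) {
            if socket_inode(&target.to_string_lossy()).is_some() {
                count += 1;
            }
        }
    }
    Ok(count)
}

fn main() {
    match count_socket_fds("self") {
        Ok(n) => println!("socket descriptors held by this process: {}", n),
        Err(e) => println!("could not inspect /proc/self/fd: {}", e),
    }
}
```

The inode numbers returned by socket_inode match the inode column of /proc/net/tcp, so the two can be joined to find out which leaked descriptors are ESTABLISHED connections.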