Actix-web: Actix-web hangs forever if request processing takes a long time.

Created on 25 Jun 2019 · 15Comments · Source: actix/actix-web

The code (for actix-web-0.7, I didn't try reproducing on 1.0 yet):

use actix_web::{HttpRequest, HttpResponse};

use std::sync::{Arc, Mutex};

struct Unit;
struct SlowStuff {}

impl SlowStuff {
    pub fn new() -> SlowStuff {
        // some delay here seems crucial, since I couldn't reproduce without it
        // it does not need to be thread::sleep, but can be
        // e.g. reading and parsing a big file.
        std::thread::sleep(std::time::Duration::from_secs(2));
        SlowStuff {}
    }
}

struct State {
    unit: Arc<Mutex<Unit>>,
    // wrapping the slow stuff into Arc<Mutex<>> can be done but doesn't matter
    slow_stuff: SlowStuff,
}

fn index(req: &HttpRequest<State>) -> HttpResponse {
    println!("got request to: {:?}", req.uri());

    {
        // Taking the lock is moderately important.
        // I think I've reproduces this several times without taking the
        // lock but it's very inconsistent.
        let unit: &Unit = &req.state().unit.lock().unwrap();
        // this delay is important. Couldn't reproduce without
        // it does not need to be thread::sleep, but can be
        // e.g. reading and parsing a big file.
        std::thread::sleep(std::time::Duration::from_millis(1000));
        println!("got unit: {:?}", unit as *const Unit);
    }

    let res = format!("hello");
    let resp = HttpResponse::Ok().body(res);
    println!("made a response");

    resp
}

fn main() {
    actix_web::server::new(move || {
        actix_web::App::with_state(State {
            unit: Arc::new(Mutex::new(Unit)),
            slow_stuff: SlowStuff::new(),
        })
        .resource("/", |r| r.f(index))
    })
    // worker count does not matter, but I assume 1 is easier to debug?
    .workers(1)
    .bind("127.0.0.1:8081")
    .unwrap()
    .run();
}

Go to http://localhost:80801/, see the page showing fine.
If you refresh the page several times (it might need several refreshes in quick succession? unsure.. I just tap F5 real quick), the page would get into an eternal "loading" state, further attempts to access the endpoint (also from other browser tabs) would never give response.

Stacktrace as reported by ProcessExplorer in the frozen state (identical for both the main thread, and the two others):

ntoskrnl.exe!KiCpuId+0xaa
ntoskrnl.exe!ExAcquireSpinLockExclusiveAtDpcLevel+0x2c2
ntoskrnl.exe!KeWaitForSingleObject+0x1a3
ntoskrnl.exe!ExReleaseResourceForThreadLite+0x60
ntoskrnl.exe!KeInsertQueueApc+0x725
ntoskrnl.exe!ExAcquireSpinLockExclusiveAtDpcLevel+0x4cd
ntoskrnl.exe!KeRemoveQueueEx+0x323
ntoskrnl.exe!PsAssignImpersonationToken+0x13c
ntoskrnl.exe!RtlPinAtomInAtomTable+0x1338
ntoskrnl.exe!longjmp+0x5b93
ntdll.dll!ZwRemoveIoCompletionEx+0xa
KERNELBASE.dll!GetQueuedCompletionStatusEx+0xdf
kernel32.dll!GetQueuedCompletionStatusEx+0x19
actix-hang.exe!miow::iocp::CompletionPort::get_many+0x215

Windows 7, firefox browser.

P.s.: I cannot reproduce it on linux, so probably this is windows-specific..

C-bug O-windows needs-investigation v1.x

Source

moxian

Most helpful comment

I have delved into this a bit.
The callstack is the following:
actix_server::accept::Accept::process_cmd https://github.com/actix/actix-net/blob/server-v0.5.1/actix-server/src/accept.rs#L303
mio::poll::Poll::register https://github.com/tokio-rs/mio/blob/v0.6.19/src/poll.rs#L790
mio::sys::windows::selector::ReadyBinding::register_socket https://github.com/tokio-rs/mio/blob/v0.6.19/src/sys/windows/selector.rs#L361
mio::sys::windows::selector::Binding::register_socket https://github.com/tokio-rs/mio/blob/v0.6.19/src/sys/windows/selector.rs#L200
miow::iocp::CompletionPort::add_socket https://github.com/alexcrichton/miow/blob/0.2.1/src/iocp.rs#L74
miow::iocp::CompletionPort::_add https://github.com/alexcrichton/miow/blob/0.2.1/src/iocp.rs#L80
winapi::CreateIoCompletionPort

This is not exposed in the docs for some reason, but there is this interesting comment inside mio::register:

        /*
         * Undefined behavior:
         * - Reusing a token with a different `Evented` without deregistering
         * (or closing) the original `Evented`.
         */

Following this all way down to CreateIoCompletionPort MSDN entry, we see it mention, in passing: This completion key should be unique for each file handle, and it accompanies the file handle throughout the internal completion queuing process. And, evidently, violating this requirement gets is treated as passing in incorrect parameter, which is fair, I guess.

When I add diagnostic print, I see, that CreateIoCompletionPort is indeed called with the same non-null parameters multiple times. And this makes sense, since in the above trace logs you can see (The number in (brackets) is said to be thread number):

06:52:28 [TRACE] (6) mio::sys::windows::selector: [C:\Users\moxian\.cargo\registry\src\github.com-1ecc6299db9ec823\mio-0.6.19\src\sys\windows\selector.rs:359] register Token(100) Readable
<...>
06:52:28 [TRACE] (12) mio::sys::windows::selector: [C:\Users\moxian\.cargo\registry\src\github.com-1ecc6299db9ec823\mio-0.6.19\src\sys\windows\selector.rs:359] register Token(100) Readable
<...>
06:52:30 [TRACE] (12) mio::sys::windows::selector: [C:\Users\moxian\.cargo\registry\src\github.com-1ecc6299db9ec823\mio-0.6.19\src\sys\windows\selector.rs:359] register Token(100) Readable
06:52:30 [ERROR] Can not resume socket accept process: The parameter is incorrect. (os error 87)
<...>

Why is token the same? I'm not sure... I tried reading actix-server code but didn't get far, unfortunately.

Hope this helps!..

(This is probably not needed, but just in case: another Trace log with a tiny bit more logging (and slightly different line numbers as a result): listener-1561709109-trace.log )

moxian on 28 Jun 2019

👍3

All 15 comments

std::thread::sleep blocks entire thread so no other request being served by that thread. so eventually you block all threads.

you can not execute blocking synchronous operation on async thread

fafhrd91 on 25 Jun 2019

you can not execute blocking synchronous operation on async thread
Would that mean reading a file to serve it from a handler is not allowed, as that is a synchronous blocking operation?

I don't find this explanation compelling.

Yes, std::thread::sleep blocks the entire thread, but it only does so for until the sleep is over. Once the specified duration has passed, the thread should become unblocked and ready to serve further requests.
In my case, however, the application hangs forever.

I've just restarted the binary, refreshed the page, refreshed it again, and then pressed F5 5 times in quick succession. It's been at least a minute since then, the page is still "loading", and trying to open it in a new tab yields no results. The console has:

PS D:\work\trash\repros\tera-hang> cargo run
   Compiling tera-hang v0.1.0 (D:\work\trash\repros\tera-hang)
warning: field is never used: `slow_stuff`
  --> src\main.rs:21:5
   |
21 |     slow_stuff: SlowStuff,
   |     ^^^^^^^^^^^^^^^^^^^^^
   |
   = note: #[warn(dead_code)] on by default

    Finished dev [unoptimized + debuginfo] target(s) in 29.53s
     Running `target\debug\tera-hang.exe`
SystemTime { intervals: 132059921846500000 } got request to: /
got unit: 0x257a89
made a response
SystemTime { intervals: 132059921866920000 } got request to: /
got unit: 0x257a89
made a response
SystemTime { intervals: 132059921886330000 } got request to: /
got unit: 0x257a89
made a response

The application is not being stuck inside my handler, but somewhere in actix instead.

moxian on 26 Jun 2019

Could you port your example to 1.0,
0.7 is in maintenance mode

fafhrd91 on 26 Jun 2019

Here is what seems to be 1.0 port:

use actix_web::{HttpRequest, HttpResponse};

use std::sync::{Arc, Mutex};

struct Unit;
struct SlowStuff {}

impl SlowStuff {
    pub fn new() -> SlowStuff {
        // some delay here seems crucial, since I couldn't reproduce without it
        // it does not need to be thread::sleep, but can be
        // e.g. reading and parsing a big file.
        std::thread::sleep(std::time::Duration::from_secs(2));
        SlowStuff {}
    }
}

struct State {
    unit: Arc<Mutex<Unit>>,
    // wrapping the slow stuff into Arc<Mutex<>> can be done but doesn't matter
    slow_stuff: SlowStuff,
}

fn index(req: HttpRequest, data: actix_web::web::Data<State>) -> HttpResponse {
    // let data = req.app_data::<State>().unwrap();
    println!(
        "{:?} got request to: {:?}",
        std::time::SystemTime::now(),
        req.uri()
    );

    {
        // Taking the lock is moderately important.
        // I think I've reproduces this several times without taking the
        // lock but it's very inconsistent.
        let unit: &Unit = &data.unit.lock().unwrap();
        // this delay is important. Couldn't reproduce without
        // it does not need to be thread::sleep, but can be
        // e.g. reading and parsing a big file.
        std::thread::sleep(std::time::Duration::from_millis(1000));
        println!("got unit: {:?}", unit as *const Unit);
    }

    let res = format!("hello");
    let resp = HttpResponse::Ok().body(res);
    println!("made a response");

    resp
}

fn main() {
    actix_web::HttpServer::new(move || {
        actix_web::App::new()
            .data(State {
                unit: Arc::new(Mutex::new(Unit)),
                slow_stuff: SlowStuff::new(),
            })
            .service(actix_web::web::resource("/").to(index))
    })
    // worker count does not matter, but I assume 1 is easier to debug?
    .workers(1)
    .bind("127.0.0.1:8080")
    .unwrap()
    .run()
    .unwrap();
}

but I cannot reproduce the issue with that.

Without understanding of what the root cause of the issue is in 0.7, I feel a little bit uneasy not knowing if it was actually fixed in 1.0, or just hidden (and now needs a somehow slightly tweaked example to resurface).
In other words, I would be happy to see an investigation into what's going on, but not necessarily a fix.

That said, closing the issue now is probably also totally reasonable.

moxian on 26 Jun 2019

Hello, I ported my real code to actix 1.0, but unfortunately hit the same issue...

Here's the reproduction case - the only difference from above that was needed is an extra function that takes a reference to mutex-guarded contents:

Code

```rust
use actix_web::{HttpRequest, HttpResponse};

use std::sync::{Arc, Mutex};

struct Unit;
struct SlowStuff {}

impl SlowStuff {
pub fn new() -> SlowStuff {
// some delay here seems crucial, since I couldn't reproduce without it
// it does not need to be thread::sleep, but can be
// e.g. reading and parsing a big file.
std::thread::sleep(std::time::Duration::from_secs(2));
SlowStuff {}
}
}

struct State {
unit: Arc>,
// wrapping the slow stuff into Arc slow_stuff: SlowStuff,
}

fn index(req: HttpRequest, data: actix_web::web::Data) -> HttpResponse {
// let data = req.app_data::().unwrap();
println!(
"{:?} got request to: {:?}",
std::time::SystemTime::now(),
req.uri()
);

let res = {
    let unit: &Unit = &data.unit.lock().unwrap();
    // it is somehow important to pass the lock to a different function
    stuff(unit)
};

let resp = HttpResponse::Ok().body(res);
println!("made a response");

resp

}
fn stuff(_unit: &Unit) -> String {
std::thread::sleep(std::time::Duration::from_millis(1000));
// this can be a constant string, using time only for demonstration
return format!("{:?}", std::time::SystemTime::now());
}

fn main() {
actix_web::HttpServer::new(move || {
actix_web::App::new()
.data(State {
unit: Arc::new(Mutex::new(Unit)),
slow_stuff: SlowStuff::new(),
})
.service(actix_web::web::resource("/").to(index))
})
// worker count does not matter, but I assume 1 is easier to debug?
.workers(1)
.bind("127.0.0.1:8080")
.unwrap()
.run()
.unwrap();
}
```

P.s.: A data point: If there are several HttpServers in the actix system, then only the one bound to this particular address gets unresponsive. The rest continue working fine.

moxian on 27 Jun 2019

I tried running the example under a logger and got the following:

    Finished dev [unoptimized + debuginfo] target(s) in 6.00s
     Running `target\debug\main1.exe`
2019-06-27 05:53:36,219 ←[36mINFO ←[0m [actix_server::builder] Starting 1 workers
2019-06-27 05:53:36,220 ←[36mINFO ←[0m [actix_server::builder] Starting server on 127.0.0.1:8080
2019-06-27 05:53:38,223 ←[31mERROR←[0m [actix_server::accept] Can not resume socket accept process: The parameter is incorrect. (os error 87)
SystemTime { intervals: 132060884182248000 } got request to: /
made a response
SystemTime { intervals: 132060884203728000 } got request to: /
made a response

So this is the same root cause as in https://github.com/actix/actix-web/issues/905

I couldn't reproduce it with Debug or lower logging level so far, but I'll update the bug if I manage to

P.s.: I mentioned that my go-to reproduction steps here is to just press F5 several times in quick succession, but when I encountered that in real code, i was refreshing the page as one normally would, with no urgency, when the server was totally idle otherwise.

P.p.s: Managed to get Debug level:

log

06:30:13 [ INFO] Starting 1 workers
06:30:13 [ INFO] Starting server on 127.0.0.1:8080
06:30:13 [DEBUG] (4) tokio_reactor::background: starting background reactor
06:30:13 [DEBUG] (1) tokio_reactor: adding I/O source: 0
06:30:13 [DEBUG] (1) tokio_reactor: adding I/O source: 4194305
06:30:13 [DEBUG] (1) tokio_reactor::registration: scheduling Read for: 0
06:30:13 [DEBUG] (1) tokio_reactor::registration: scheduling Read for: 1
06:30:15 [ERROR] Can not resume socket accept process: The parameter is incorrect. (os error 87)
06:30:15 [DEBUG] (2) tokio_reactor: adding I/O source: 0
06:30:15 [DEBUG] (2) tokio_reactor::registration: scheduling Read for: 0
06:30:15 [ INFO] got a request
06:30:16 [ INFO] made a response, returning
06:30:16 [DEBUG] (2) tokio_reactor: dropping I/O source: 0

moxian on 27 Jun 2019

Caught the thing with Trace logging on - listener-trace-1561618348.log . This is a real application, so a bunch of unrelated stuff might be going on.
There are, unfortunately, two http servers. The one affected is on port 8081

Edit: another one - listener-trace-1561619274.log. This time there's an Info line when exiting and entering the handler. The socket gets lost way before getting there. I believe (but am surprisingly unsure) that the content did get served to the browser, and it's only the subsequent requests that got stalled.

moxian on 27 Jun 2019

I don't have access to windows, and I can not reproduce this on osx

fafhrd91 on 27 Jun 2019

Is there anything I can help here with?
I don't understand actix enough to debug this on my own, but would be happy to add whatever diagnostics needed.

moxian on 27 Jun 2019

This is not exposed in the docs for some reason, but there is this interesting comment inside mio::register:

        /*
         * Undefined behavior:
         * - Reusing a token with a different `Evented` without deregistering
         * (or closing) the original `Evented`.
         */

06:52:28 [TRACE] (6) mio::sys::windows::selector: [C:\Users\moxian\.cargo\registry\src\github.com-1ecc6299db9ec823\mio-0.6.19\src\sys\windows\selector.rs:359] register Token(100) Readable
<...>
06:52:28 [TRACE] (12) mio::sys::windows::selector: [C:\Users\moxian\.cargo\registry\src\github.com-1ecc6299db9ec823\mio-0.6.19\src\sys\windows\selector.rs:359] register Token(100) Readable
<...>
06:52:30 [TRACE] (12) mio::sys::windows::selector: [C:\Users\moxian\.cargo\registry\src\github.com-1ecc6299db9ec823\mio-0.6.19\src\sys\windows\selector.rs:359] register Token(100) Readable
06:52:30 [ERROR] Can not resume socket accept process: The parameter is incorrect. (os error 87)
<...>

Why is token the same? I'm not sure... I tried reading actix-server code but didn't get far, unfortunately.

Hope this helps!..

(This is probably not needed, but just in case: another Trace log with a tiny bit more logging (and slightly different line numbers as a result): listener-1561709109-trace.log )

moxian on 28 Jun 2019

👍3

@moxian thanks for investigation, that is helpful. i'll try to come up with some fixes

fafhrd91 on 29 Jun 2019

👍2

I ran into this on Windows 10 with actix-web 2.0 but unfortunately I did not have debug logging enabled and now I can't replicate. @moxian do you have any more information about this bug apart from what is mentioned in this issue? I'm happy to help troubleshoot in any way :)

imor on 23 Feb 2020

No, sorry, I didn't use actix much since filing this report. This issue captures all I know about the problem.

moxian on 24 Feb 2020

I was able to see this issue, using React useEffect without adding a return [] in the effect method,
I have a Rust/Actix app,
in my routes

.route(
        "/api/v1/users{_:/?}",
        web::post().to(users_controller::index),
 )

all what my user controller is doing is somehow calling reqwest (this is some sort of proxy http call)
```let json_response: serde_json::Value = reqwest::Client::new()
.post("http://someotherwebservice:3333/api/legacy_users")
.json(&request_body)
.send()
.await
.expect("error")
.json()
.await
.expect("error");


When i call this from the React client:
this responds successfully the first time with data, but then second time it completely hangs Actix

React.useEffect(() => {
await fetch('http://localhost_actix_port/api/v1/users');
});


But when adding [] to the React.useEffect it doesn't block the backend. This is a strange behavior because basically any app can completely create a DDoS attach to any backend that uses an endpoint that proxy http-requests in this way

React.useEffect(() => {
await fetch('http://localhost_actix_port/api/v1/users');
}, []); <- this avoids an infinite loop
```

actix-cors = "0.2.0"
actix-web = "2"
actix-service = "1.0.5"
actix-rt = "1"
reqwest = { version = "0.10", features = ["blocking", "json"] }

I can create an example if necessary with that part of the app.

pikitgb on 30 Jun 2020

After digging in this issue a little more, my issue is actually different, seems like i was sending a burst of request to actix_web from the frontend (React), after x amount of request let's say a > 50+ per second actix was locking internally, and all the other subsequents requests also failed. I was able to resolve my issue by using this:
add this dep to your Cargo.toml
actix-ratelimit = "0.2.1"

use std::time::{Duration};
use actix_ratelimit::{RateLimiter, MemoryStore, MemoryStoreActor};

add this middleware:


.wrap(
        RateLimiter::new(
        MemoryStoreActor::from(store.clone()).start())
            .with_interval(Duration::from_secs(1))
            .with_max_requests(5))

Idk if this helps someone else with the same issue.

pikitgb on 17 Jul 2020

Was this page helpful?

0 / 5 - 0 ratings