Actix-web: client timeout starts too early causing "408" errors

Created on 11 Jan 2020  路  20Comments  路  Source: actix/actix-web

I see occasional "408 Request Timeout" errors when the system is busy. Tracing it shows e.g.

13:43:46.715691 accept4(17, {sa_family=AF_INET, sin_port=htons(48642), sin_addr=inet_addr("127.0.0.1")}, [128->16], SOCK_CLOEXEC) = 86
13:43:52.199250 ioctl(86, FIONBIO, [1]) = 0
13:43:52.199293 epoll_ctl(24, EPOLL_CTL_ADD, 86, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLET, {u32=7, u64=64768106823687}}) = 0
13:43:52.199324 getpeername(86, {sa_family=AF_INET, sin_port=htons(48642), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 0
13:43:52.201563 sendto(86, "HTTP/1.1 408 Request Timeout\r\ncontent-length: 0\r\nconnection: close\r\ndate: Sat, 11 Jan 2020 12:43:46 GMT\r\n\r\n", 107, MSG_NOSIGNAL, NULL, 0) = 107
13:43:52.201657 shutdown(86, SHUT_WR)   = 0

on the actix-web server and

13:43:46.715522 socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 60
13:43:46.715560 ioctl(60, FIONBIO, [1]) = 0
13:43:46.715584 epoll_ctl(9, EPOLL_CTL_ADD, 60, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3586986337, u64=139705988223329}}) = 0
13:43:46.715607 connect(60, {sa_family=AF_INET, sin_port=htons(3001), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
13:43:46.715688 getsockopt(60, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
13:43:46.715714 writev(60, [{iov_base="HEAD /api/v1/download/...iov_", len=372}], 1) = 372
13:43:52.201614 recvfrom(60, "HTTP/1.1 408 Request Timeout\r\ncontent-length: 0\r\nconnection: close\r\ndate: Sat, 11 Jan 2020 12:43:46 GMT\r\n\r\n", 4096, 0, NULL, NULL) = 107
13:43:52.201735 close(60)               = 0

on the client (nginx proxy).

E.g. it appears the actix-web does:

  1. accepts connection
  2. starts timeout timer
  3. starts listening for events
  4. throws "408" when timeout expired

Here, there are more than 5 seconds between 2. and 3. so that the timeout triggers.

I am not sure, why there are more than 5 seconds between 1 and 3; usually it are <10 ms but in cases like above it affects a cluster of connections.

But imo it would be better to start the timer only when there we are ready to handle the request (e.g. at the moment of the first EPOLL_CTL_ADD).

actix-web 2.0.0
actix-rt 1.0.0
Linux 5.4.7

C-bug

All 20 comments

Same issue here ! If all the workers are busy all other incoming request receive 408.
This is a big issue for me because i generate image that can take up to 10sc so with the default settings i can only serve 8 clients (because default setting is set to 8 workers), all the other requests got timeout.
Setting client_timeout(0) doesn't fix the issue :cry:
I create a repo where you can find an exemple with benchmark -> https://github.com/Farkal/actix-test-timeout

So does it happen because the request is blocking the worker, what about implement that work as a future?

I tried but doesn't work. I have finally fixed it using web::block. My repo have been updated with the solution.
I think we could add some explanation to the doc to prevent block the actix thread because lot of people will just see the 408 and may have trouble to understand that the main thread is busy.

Well if all the requests are handled to 1 actor's mailbox per worker it makes sense because of the nature of actors handling messages by order but I guess we should research the request process internal further to fully understand what is actually happening there

@Farkal If I'm not mistaken web::block is exactly that (implementing that work as a future, with that future just being a thread pool). What you did was essentially move the work out of the thread handling the request, freeing it to handle more requests...

@vmalloc Yeah but first i thought that actix pop new thread for each new request. That was a mistake from my understanding of the actix server. I just think it could be more explicit in the doc.

Yes I agree. The runtime behavior of actix should appear somewhere, like how extractors are run and what鈥檚 the blocking behavior. I can try to write a PR for it.

Where in the docs do you think this can appear so that it makes sense?

I think it should be on the server page, at least in the multi-threading section https://actix.rs/docs/server/
Thank you for the PR :blush:

@Farkal I created a PR adding some information. Currently awaiting some feedback from the community: https://github.com/actix/actix-website/pull/150

@vmalloc Seems great ! Thanks !

As correct usage is documented now can we close issue?

I do not think that this is only a documentation issue and imo the documentation fix is incomplete. Here, a single request takes <50ms (some database queries) and it is to expensive to convert this to a future. It can happen that there are 50 parallel requests which make the requests to take longer (e..g 500ms) and trigger the seen timeout.

It would be ok, when the client generates the timeout because the response has not been seen within 5s.

But it is bad that actix-web generates this error because it erroneously thinks that the client did not respond within 5s.

My dirty hack was to increase client_timeout() blindly. But meaning of this function is

Defines a timeout for reading client request header. If a client does not transmit the entire set headers within this time, the request is terminated with the 408 (Request Time-out) error.

and it might be useful to limit the client request time in some cases.

At least the client_timeout() documentation should be changed to explain this. It might be even useful to rename this method to match the actual behaviour (e.g. to handshake_timeout()).

But I would rather see that the timeout starts ticking after the server waits for data from client.

hi @ensc, not sure I got about 50 parallel requests and why it is expensive to convert to a future? Parallel requests processing that's exactly the case why you would need async, otherwise you might just run 1000 workers in parallel but that will be more expensive. Running blocking work in a separate thread pool is a common pattern across the whole async Rust ecosystem, AFAIK there is no alternative to that. It is cheaper to run blocking code from a dedicated thread pool, than starting new threads for every new blocking request. It is also better on performance than waiting until your blocking request would complete. Sending data between threads also takes significantly less than a network packet transmission. That the right way to do it and would unlock workers to read data from incoming requests while waiting for response from external resource.

Consider that you have external queries which take 10ms on average. While worker is locked all other connection requests just queue up, though If you unlock worker it will be able to poll on other requests sending those requests in parallel. Now if you get one exceptional request which would block for 5 seconds all other requests will just halt hanging client request.

why it is expensive to convert to a future

afais diesel (which I am using) does not support async i/o and I fear that porting to it (when available) would require a large code rewrite (additional Send requirements, many ugly Arc hacks). Rust does not seem to support async closures yet which are required e.g. for transaction (not only) in diesel.

Having client_timeout() to work as documented would help to solve practical issues.

using diesel in the correct way is as simple as wrapping calls in web::block (and .awaiting) to run then on a background task queue and keeping the HTTP workers free to accept new connections

For large queries this might be the correct way. But for short queries (perhaps <50ms) it is just overkill and will make the normal case slower resp. opens vectors for DDOS attacks.

Again, having client_timeout() to work as documented (and expected by its naming) would be a better fix.

... and putting SQL queries into web::block would mean at the end, that every HTTP request is executed in an own thread because there is almost every time some session information which is read/updated

UPDATE: although off-topic for this issue, some numbers:

normal response time

Query ran in 0.6 ms: SELECT 1
Query ran in 0.2 ms: BEGIN
Query ran in 0.7 ms: SELECT "session"."id", "session"."token",
Query ran in 0.9 ms: UPDATE "session" SET "expires" = LEAST("s
Query ran in 12.4 ms: COMMIT
Query ran in 0.4 ms: SELECT "session_setup"."id", "session_set
Query ran in 4.3 ms: UPDATE "session_info" SET "dl_get_tot" =
Query ran in 0.4 ms: SELECT "project_ref"."id", "project_ref".
Query ran in 2.6 ms: SELECT "upload"."id", "upload"."file", "u
Query ran in 0.4 ms: SELECT "file_data"."id", "file_data"."fsi
Query ran in 0.4 ms: SELECT "upload"."id", "upload"."package",
Query ran in 0.2 ms: BEGIN
Query ran in 0.8 ms: UPDATE "session_info" SET "dl_get_hit" =
Query ran in 0.8 ms: UPDATE "file" SET "rtime" = GREATEST(COAL
Query ran in 3.8 ms: COMMIT

slow response time (parallel queries)

Query ran in 0.5 ms: SELECT 1
Query ran in 0.2 ms: BEGIN
Query ran in 0.7 ms: SELECT "session"."id", "session"."token",
Query ran in 10.5 ms: UPDATE "session" SET "expires" = LEAST("
Query ran in 40.3 ms: COMMIT
Query ran in 0.3 ms: SELECT "session_setup"."id", "session_set
Query ran in 138.1 ms: UPDATE "session_info" SET "dl_get_tot"
Query ran in 0.3 ms: SELECT "project_ref"."id", "project_ref".
Query ran in 3.2 ms: SELECT "upload"."id", "upload"."file", "u
Query ran in 0.3 ms: SELECT "file_data"."id", "file_data"."fsi
Query ran in 0.3 ms: SELECT "upload"."id", "upload"."package",
Query ran in 0.2 ms: BEGIN
Query ran in 2.5 ms: UPDATE "session_info" SET "dl_get_hit" =
Query ran in 0.6 ms: UPDATE "file" SET "rtime" = GREATEST(COAL
Query ran in 3.5 ms: COMMIT

Where would you set the limit for web::block?

You're going to run into these issues either way. When an operation does network I/O you cannot assume it will take <50ms no matter how simple the query or small the data-set. Using web::block is a well-tested, good and simple way to use these blocking calls in ways that don't result in HTTP worker overloading.

[sorry; updated my previous comment shortly be I saw your response...]

But web::block will increase server load and reduce response time in normal case.

I can deal with longer response times for rare overload situations (clients will have their timeout just in case).

But it is bad that whole request fails in actix-web with 408 because client_timeout() is triggered in the described way.

Yes this is indeed off topic, lets discuss this on the gitter chat and if needed we can reopen this issue later.

Was this page helpful?
0 / 5 - 0 ratings