Node: Regression issue with keep alive connections

Created on 23 Apr 2019 · 23 comments · Source: nodejs/node

  • Version: 10.15.3
  • Platform: Linux
  • Subsystem: http

Hi,

We updated the Node version from 10.15.0 to 10.15.3 for a service which runs behind the AWS Application Load Balancer. After that, our test suite revealed an issue we didn't see before the update: HTTP 502 errors returned by the load balancer. Previously, this happened when the Node.js server closed a connection before the load balancer did. We had solved it by setting server.keepAliveTimeout = X, where X is higher than the keep-alive timeout on the load balancer side.
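
For reference, this is roughly what our workaround looked like, assuming the ALB is left at its default 60-second idle timeout (the port and the 65-second value below are placeholders, not our real configuration):

const http = require('http');

const server = http.createServer((req, res) => {
  res.end('ok');
});

// The ALB idle timeout defaults to 60 s; keeping Node's keep-alive timeout
// above it means the load balancer always closes idle connections first.
server.keepAliveTimeout = 65 * 1000; // placeholder, must exceed the ALB idle timeout

server.listen(3000);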

With version 10.15.3, setting server.keepAliveTimeout = X does not work anymore and we see regular 502 errors from the load balancer. I have checked the changelog for Node.js, and it seems there was a change related to keep-alive connections in 10.15.2 (https://github.com/nodejs/node/commit/1a7302bd48) which might have caused the issue we are seeing.

Does anyone know if the mentioned change can cause the issue we are seeing? In particular, I believe the problem is that the connection is closed before the specified keep-alive timeout.

All 23 comments

// cc @nodejs/http

The slowloris mitigations only apply to the HTTP headers parsing stage. Past that stage the normal timeouts apply (barring bugs, of course.)

Is it an option for you to try out 10.15.1 and 10.15.2, to see if they exhibit the same behavior?

In our test suite, there are about 250 HTTP requests. I have run the test suite four times for each of the following Node versions: 10.15.0, 10.15.1, and 10.15.2. For 10.15.0 & 10.15.1 there were zero HTTP failures. For 10.15.2 there were on average two failures (HTTP 502) per test suite run. A different test case fails in every run, so the failures are not deterministic.

I tried to build a simple Node server and reproduce the issue with it, but so far without success. We will try to figure out the exact pattern and volume of requests needed to reproduce it. Timing and the speed of the client might matter.

I guess that headersTimeout should be longer than keepAliveTimeout because, after the first request of a keep-alive connection, headersTimeout is applied to the period between the end of the previous request (even before its response is sent) and the first parsing of the next request.

@OrKoN What happens with your test suite if you put a headersTimeout longer than keepAliveTimeout?

Created a test case that reproduces the issue. It fails on 10.15.2 and 10.15.3. (Somehow headersTimeout seems to work only when headers are sent in multiple packets.)
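
For anyone who wants to poke at this locally, here is a rough sketch of the kind of raw client that splits the headers across two packets (this is not the actual test case; the port is a placeholder for a local HTTP server):

const net = require('net');

// Connect to a local HTTP server (placeholder port) and send the request
// headers in two separate writes with a pause in between, so that header
// parsing spans more than one packet.
const socket = net.connect(3000, 'localhost', () => {
  socket.write('GET / HTTP/1.1\r\nHost: localhost\r\n');
  setTimeout(() => {
    socket.write('Connection: keep-alive\r\n\r\n');
  }, 100);
});

socket.on('data', (chunk) => process.stdout.write(chunk));
socket.on('end', () => console.log('server closed the connection'));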

To illustrate the issue with an example of two requests on a keep-alive connection:

  1. A connection is made
  2. The server receives the first packet of the first request's headers
  3. The server receives the second packet of the first request's headers
  4. The server sends the response for the first request
  5. (...idle time...)
  6. The server receives the first packet of the second request's headers
  7. The server receives the second packet of the second request's headers

keepAliveTimeout works for 4-6 (the period between 4 and 6). headersTimeout works for 3-7. So headersTimeout should be longer than keepAliveTimeout in order to keep connections until keepAliveTimeout.

I wonder whether headersTimeout should include 3-6. 6-7 seems more intuitive for the name and should be enough for mitigating Slowloris DoS because 3-4 is up to the server and 4-6 is covered by keepAliveTimeout.

@shuhei so you mean that headersTimeout spans multiple requests on the same connection? I haven't tried changing headersTimeout because I expected it to apply to a single request only, and we have no long requests in our test suite. It looks like the headers timer should reset when a new request arrives, but it's defined by the first request for a connection.

@OrKoN Yes, headersTimeout spans parts of two requests on the same connection including the interval between the two requests. Before https://github.com/nodejs/node/commit/1a7302bd48, it was only applied to the first request. The commit started resetting the headers timer when a request is done in order to apply headersTimeout to subsequent requests in the same connection.

I see. So it looks like an additional place to reset the timer would be at the beginning of a new request? And parserOnIncoming is only called once the headers are parsed, so it needs to be some other place then.

P.S. I will run our tests with an increased headersTimeout today to see if it helps.

So we have applied the workaround (headersTimeout > keepAliveTimeout) and the errors are gone. 🎉
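
In code, the workaround boils down to something like this (the numbers are only illustrative, not the exact values from our service):

const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));

// Idle keep-alive sockets are kept open for 70 s...
server.keepAliveTimeout = 70 * 1000;
// ...and the headers timer is slightly longer, so it can never fire
// before the keep-alive timeout does on an idle connection.
server.headersTimeout = 75 * 1000;

server.listen(3000);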

I faced this issue too. I configured my Nginx load balancer to use keepalive when connecting to Node upstreams. I already saw it dropping connections and found the reason. I switched to Node 10 after that and was surprised to see this happening again: Nginx reports that Node closed the connection unexpectedly and then Nginx disables that upstream for a while.

I have not seen this problem after tweaking header timeouts yesterday as proposed by @OrKoN above. I think this is a serious bug, since it results in load balancers switching nodes off and on.

Why doesn't anybody else find this bug alarming? My guess is that:

  1. there are no traces of it on Node instances itself. No log messages, nothing.
  2. Web users connecting to Node services directly may simply not notice that a few connections are dropped. The rate was not high in my case (maybe a couple dozen per day while we serve millions of connections daily), so the chance of a particular visitor experiencing this is relatively small.
  3. I found the bug indirectly, based on the load balancer's logs, and not everyone keeps a close eye on those logs.

We're having the same problem after upgrading from 8.x to 10.15.3.
However, I don't think it's a regression from 10.15.2 to 10.15.3.
This discussion goes way back to issue #13391.
I forked the example code there and created a new test case that fails on all the following versions:
12.3.1, 10.15.3, 10.15.2, 10.15.1, 10.15.0, 8.11.2

The original code did not fail in a consistent way, which led me to believe there's some kind of race condition where, between the keepAliveTimeout check and the connection termination, a new request can try to reuse the connection.

So I tweaked the test so that:

  1. The server's time to answer a request is 3 * keepAliveTimeout minus a few (random) milliseconds (keeping 2-3 requests alive).
  2. The client fires not just one request, but a request every (exactly) keepAliveTimeout. This makes sure that the client requests are aligned with the server connection's keepAliveTimeout (a rough sketch of the idea follows).
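
A rough sketch of that setup (not the exact code from the repo linked below; port, delays and error handling are simplified):

const http = require('http');

const KEEP_ALIVE_TIMEOUT = 5000; // matches --keepAliveTimeout 5000 below

const server = http.createServer((req, res) => {
  // Answer after roughly 3 * keepAliveTimeout minus a few random milliseconds,
  // so that 2-3 requests are in flight on the connection at any time.
  const delay = 3 * KEEP_ALIVE_TIMEOUT - Math.floor(Math.random() * 10);
  setTimeout(() => res.end('ok'), delay);
});
server.keepAliveTimeout = KEEP_ALIVE_TIMEOUT;

server.listen(3000, () => {
  const agent = new http.Agent({ keepAlive: true });
  // Fire a request every keepAliveTimeout, aligned with the server's timer.
  setInterval(() => {
    http.get({ port: 3000, agent }, (res) => res.resume())
      .on('error', (err) => console.error(err.code)); // ECONNRESET shows up here
  }, KEEP_ALIVE_TIMEOUT);
});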

The results are pretty consistent:

Error: socket hang up
    at createHangUpError (_http_client.js:343:17)
    at Socket.socketOnEnd (_http_client.js:444:23)
    at Socket.emit (events.js:205:15)
    at endReadableNT (_stream_readable.js:1137:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:9) {
  code: 'ECONNRESET'
}

You can clone the code from yoavain/node8keepAliveTimeout

npm install
npm run test -- --keepAliveTimeout 5000
(Note that the keepAliveTimeout is also the client request interval.)

When setting keepAliveTimeout to 0, the problem is gone.

npm run test -- --keepAliveTimeout 5000 --keepAliveDisabled

Thanks for the info, guys! This is a nasty issue that reared its head when we went straight from 10.14 to 12. Node kept dropping our connections before the AWS Load Balancer knew about it. Once I set the ELB timeout < keepAliveTimeout < headersTimeout (we weren't even setting that last one), the problem went away.

The original code did not fail in a consistent way, which led me to believe there's some kind of race condition where, between the keepAliveTimeout check and the connection termination, a new request can try to reuse the connection.

I'm pretty sure we're seeing this as well (v10.13.0). We have Nginx in front of Node.js within K8s. We were seeing random "connection reset by peer" or "upstream prematurely closed connection" errors for requests Nginx was sending to the Node.js apps. On all these occasions the problem occurred on connections established by Nginx to Node. Right at the default 5-second keepAliveTimeout on the Node.js side, Nginx decided to reuse its open/established connection to the Node process and send another request (technically just outside the 5-second timeout limit on the Node side, by <2 ms). Node.js accepted this new request over the existing connection, responded with an ACK packet, then <2 ms later followed up with a RST packet closing the connection.

However, stracing the Node.js process, I could see that the app code had received the request and was processing it, but before the response could be sent, Node had already closed the connection. I would second the thought that there is a slight race condition between the point where the connection is about to be closed by Node.js and it still accepting an incoming request.

To avoid this, we simply increased the Node.js keepAliveTimeout to be higher than Nginx's, thus giving Nginx control over the keep-alive connections. http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive_timeout
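
On the Node side this is again a one-liner; the value just needs to be above Nginx's upstream keepalive_timeout, which defaults to 60 seconds (the number below is only an example):

// Nginx closes idle upstream connections after keepalive_timeout (60 s by
// default), so Node only needs a keep-alive timeout a bit higher than that.
server.keepAliveTimeout = 65 * 1000; // example value, > Nginx's keepalive_timeout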

A screenshot of a packet capture taken on the Node.js side of the connection is attached.

Wow, very interesting thread. I have a suspicion that we're facing a similar issue in AppEngine Node.js Standard: ~100 502 errors a day out of ~1M requests per day total (~0.01% of all requests).

Can confirm this is still the case in the v10.19.0 release.

We have investigated an issue affecting a very small subset of requests as well, and this was the root cause. The behaviour we see is exactly what @markfermor described; you can read even more in our investigation details. The following configuration lines did indeed solve the issue:

server.keepAliveTimeout = 76 * 1000;
server.headersTimeout = 77 * 1000;

This is what solved it for us in AppEngine:

this.server.keepAliveTimeout = 600 * 1000

@mcollina @OrKoN Is there a plan to release 10.15.4 with the fix?

Node 10.15.4 is not happening. At best, it might be backported to Node 10.20.1 or a similar release.

Would love to see it in 10.20.x or similar if it's at all feasible to port it back.

Note that this also appears to be broken in v12, not just the recent v10s.

Since #32329 was merged, I no longer need to set headersTimeout bigger than keepAliveTimeout, right?

@yokomotod if you're asking about v12, v12.19.0 contains the fix:

https://github.com/nodejs/node/compare/v12.18.4...v12.19.0#diff-feaf3339998a19f0baf3f82414762c22

https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V12.md#12.19.0
