Environment: node:carbon LTS docker image
I'd like to report a possible regression introduced to the http module between versions 8.9.4 and 8.14.0.
Sockets that are opened but over which no data is transferred are closed immediately after data is finally sent, once they have idled for more than 40 seconds.
Reproduction available here: https://github.com/timcosta/node_tcp_regression_test
We ran a tcpdump on our servers that were returning 504s, and saw that node responds with an ACK followed almost immediately by a duplicate ACK with an additional RST on the same socket.
Timeouts are set to 60 seconds on the client (AWS ELB) and 2 minutes on the node server (hapi.js).
I'm filing this as a node core issue because the error can be reproduced with both hapi and the bare node http module, as can be seen in this travis build: https://travis-ci.com/timcosta/node_tcp_regression_test/builds/94440224
The failure is intermittent on Travis for versions 8.14.0, 10.14.2, and 11.4.0, but the build consistently passes on v8.9.4, which leads me to believe there is a regression.
cc: @jtymann @dstreby
@nodejs/http
Could this be caused by https://github.com/nodejs/node/commit/eb43bc04b1?
Seems likely @lpinca, the timings and behavior match up. That change appears to break AWS ELBs with default settings fronting node.js back ends, though, since this timeout is lower than the ELB default of 60 seconds.
I see, that change was part of a security release, so it didn't go through the normal release cycle. I'm not sure why 40 seconds was chosen as the default value, but it can be customised.
Hm okay, I'd propose the default value be changed to something greater than 60 seconds, as that's the default idle timeout for ELBs, and this issue likely broke node in a default configuration behind ELBs for more than just us.
cc: @mcollina
Currently we start waiting for the headers as soon as we receive a connection; we could instead start on the first byte, which would solve the issue at hand. I'll see if I can code something up to address this.
Note that this is configurable via https://nodejs.org/api/http.html#http_server_headerstimeout, so you can increase it above 60s, solving your immediate issue.
We picked 40 seconds because it is the default Apache uses.
cc @nodejs/lts @MylesBorins
Does #26166 look related to you guys? I was just looking through other connection RST issues here.
cross posting this with #26166:
We tried setting the headers timeout to 0 on top of node:dubnium-jessie-slim (currently at v10.15.3), but to no avail. We also tried on top of node:11 and still get these pesky 504s. Setting the idle timeout to 30s on the ELB didn't seem to help either.
Furthermore, we have a similar issue using ALBs and websockets, which gives us 502s in that scenario.
Edit: we ultimately fixed it by sidecar'ing a Go single-host reverse proxy in front of the node.js process.
@thomasjungblut how did sidecaring a Go reverse proxy fix your issue? I'm not sure I understand. I've been dealing with this ELB <-> Node issue for weeks and have not been able to find a solution. For a while I thought it was an unlikely k8s bug causing an RST packet, but after applying several patches to k8s for different problems, nothing worked.
Currently, my idle timeout setup looks like this (the high numbers are recommended by Google's load balancer documentation which I used as a reference):
Which should _technically_ work, right? But it doesn't. And it doesn't make sense why I'm seeing so many 504s and connection resets. Any ideas?
@ezekg you can read a bit more about how I solved it here: https://www.timcosta.io/how-we-found-a-tcp-hangup-issue-between-aws-elbs-and-node-js/
There are code snippets in that article to help you figure out exactly when the socket timeout is occurring, which will tell you which of the timeouts you are hitting.
tldr though is that all of your server timeouts need to be above the ELB timeout, so my guess is that yes, your server timeout needs to be higher.
@ezekg apparently Go closes connections properly with a FIN, and it deals with the RST packets from node.js somewhat gracefully in that regard.
Friends have had the same problem. After working on it for 15 days, we found the source of the problem was the headers timeout; you can fix it by adding "server.headersTimeout = 7200000;".