Environment: node:carbon LTS docker image
I'd like to report a possible regression introduced to the http module between versions 8.9.4 and 8.14.0.
Sockets that are opened but over which no data is transferred are closed immediately after data is finally sent, once they have idled for more than 40 seconds.
Reproduction available here: https://github.com/timcosta/node_tcp_regression_test
We ran a tcpdump on our servers that were returning 504s, and saw that node responds with an ACK followed almost immediately by a duplicate ACK with an additional RST on the same socket.
Timeouts are set to 60 seconds on the client (AWS ELB) and 2 minutes on the node server (hapi.js).
I'm filing this as a node core issue because the error can be reproduced with both hapi and the bare node http module, as can be seen in this travis build: https://travis-ci.com/timcosta/node_tcp_regression_test/builds/94440224
The failure is intermittent on Travis for versions 8.14.0, 10.14.2, and 11.4.0, but the build consistently passes on v8.9.4, which leads me to believe there is a regression.
cc: @jtymann @dstreby
@nodejs/http
Could this be caused by https://github.com/nodejs/node/commit/eb43bc04b1?
Seems likely @lpinca, the timings and behavior match up. That change appears to break AWS ELBs with default settings fronting node.js back ends, though, since this timeout is lower than the ELB default of 60 seconds.
I see, that change was part of a security release, so it didn't go through the normal release cycle. I'm not sure why 40 seconds was chosen as the default value, but it can be customised.
Hm okay, I'd propose the default value be changed to something greater than 60 seconds, as that's the default idle timeout for ELBs, and this issue likely broke node in a default configuration behind ELBs for more than just us.
cc: @mcollina
Currently we start waiting for the headers as soon as we receive a connection; we could instead start on the first byte, which would solve the issue at hand. I'll see if I can code something up to address this.
Note that this is configurable via https://nodejs.org/api/http.html#http_server_headerstimeout, so you can increase it above 60s, solving your immediate issue.
We picked 40 seconds because it is the default Apache uses.
cc @nodejs/lts @MylesBorins
Does #26166 look related to you guys? I was just looking through other connection RST issues here.
cross posting this with #26166:
We tried setting the headers timeout to 0 on top of node:dubnium-jessie-slim (currently at v10.15.3), but to no avail. We also tried on top of node:11 and still get these pesky 504s. Setting the idle timeout to 30s on the ELB didn't seem to help either.
Furthermore, we have a similar issue using ALBs and websockets, which gives us 502s in that scenario.
Edit: we ultimately fixed it by sidecar'ing a Go single-host reverse proxy in front of the node.js process.
@thomasjungblut how did sidecaring a Go reverse proxy fix your issue? I'm not sure I understand. I've been dealing with this ELB <-> Node issue for weeks and have not been able to find a solution. For a while I thought it was an unlikely k8s bug causing an RST packet, but after applying several patches to k8s for different problems, nothing worked.
Currently, my idle timeout setup looks like this (the high numbers are recommended by Google's load balancer documentation which I used as a reference):
Which should _technically_ work, right? But it doesn't. And it doesn't make sense why I'm seeing so many 504s and connection resets. Any ideas?
@ezekg you can read a bit more about how I solved it here: https://www.timcosta.io/how-we-found-a-tcp-hangup-issue-between-aws-elbs-and-node-js/
There are code snippets in that article to help you figure out exactly when the socket timeout is occurring, which will tell you which of the timeouts you are hitting.
tldr though is that all of your server timeouts need to be above the ELB timeout, so my guess is that yes, your server timeout needs to be higher.
@ezekg apparently Go closes connections properly with a FIN, and it deals with the RST packets from node.js somewhat gracefully in that regard.
Friends have had the same problem. After working on it for 15 days, we found the source of the problem was the headers timeout; you can fix it by adding "server.headersTimeout = 7200000;".