Pm2: PM2 randomly does not ACK the nginx connection

Created on 30 Jul 2015 · 33 Comments · Source: Unitech/pm2

I see lots of nginx error log entries with "(connect timed out) while connecting to upstream".

I run PM2 in cluster mode, which starts 16 processes of my Node.js app (one per core on a 16-core server). The connect timeouts occur even when there are few requests.

I captured packets with tcpdump on both the nginx side and the PM2 side, and found that nginx sends a SYN, receives no ACK, retransmits, and then times out. On the PM2 side, tcpdump shows that the SYN from nginx arrived, but no ACK was ever sent back.
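
A capture along those lines can be taken on each host with something like the following (the port and interface are placeholders, since they are not stated here):

sudo tcpdump -i any -nn 'tcp port 3000'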

So I am confused: what is preventing PM2 from answering the SYN?

Icebox Bug

Most helpful comment

Hi, had the same problem.

Solution:
1) use keepalive connections;
2) use a backup upstream.

Configuration example:
_nginx.conf_:

# ...
http {
    # ...
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 2; # ! upstream will use this value
    # ...
}

_your-host.com.conf_:

upstream upstreamname {
    server 127.0.0.1:3000;
    server 127.0.0.1:3000 max_fails=1 fail_timeout=30s backup; # ! Backup upstream
    keepalive 64; # ! keepalive connections for PM2 upstream
}

server {
    # ...
    keepalive_timeout 30; # ! keepalive_timeout for clients
    # ...
    location / {
        # ...
        proxy_pass http://upstreamname; # ! upstream name
        proxy_http_version 1.1; # ! Important! Use HTTP/1.1 for keepalive connections
        proxy_set_header Connection ""; # ! Important! Clear the "Connection" header for keepalive connections
        proxy_set_header Host $host;
        proxy_redirect off;
        # ...
    }
    # ...
}

Also, I want to note: _very occasionally I still see a few "(connect timed out) while connecting to upstream" errors in the nginx error logs, but users don't notice a 504 error because nginx falls back to the backup upstream in those situations._

All 33 comments

Does one of your 16 workers crash at some point? (See the number of restarts.)

Could you try to run netstat -n | awk '/^tcp/ {++NS[$NF]} END {for(ns in NS) print ns, NS[ns]}' on your server and paste the results here.

@Tjatse
CLOSE_WAIT 292
TIME_WAIT 14
ESTABLISHED 359
Even when there are few users, it happens.

@jshkurti
No crash; the workers have been running fine since the last reload, 3 days ago.

I have exactly the same problem and can't figure out why.
PM2 is in cluster mode and I have nginx in front.

After a few hours, I get some "connect timeout" errors. Day after day I get more and more.

A startOrGracefulReload doesn't change anything, but if I kill PM2, it works great again.

Same thing.

@Tjatse

TIME_WAIT 452
CLOSE_WAIT 27
SYN_SENT 4
FIN_WAIT1 9
ESTABLISHED 446
FIN_WAIT2 177
SYN_RECV 2
LAST_ACK 3

@Unitech Guys, what info do you need to look into this issue? The thing is, I see these errors very often and had to schedule pm2 gracefulReload ... to save our services from them. The surprising part is that for a month I did not change anything except updating PM2 to the latest version.

I propose several steps to understand where the bug is. At every step we put load on the nginx API endpoint.

  1. Set up nginx -> pm2 -> node.js web app. I would keep the nginx configuration as is and replace my node.js web app with a simple http.createServer() server, since that is a simple example with a 99% chance of not being buggy (unless the bug is in node.js core); see the sketch after this list.
  2. If the issue persists, then I would make it just nginx -> node.js web app, without PM2 and thus without cluster mode. If the issue is eliminated, then it's our node.js application that causes the problems.
  3. If the issue still persists, then I guess it's an nginx configuration issue. If not, then we can say we have something related to PM2 (or Node.js itself).
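
A minimal sketch of the bare http.createServer() stand-in from step 1 might look like this (the port is an assumption; it should match whatever the nginx upstream points at):

// minimal-app.js - bare stand-in for the real app (step 1 above).
// The port is an assumption; make it match the nginx upstream.
var http = require('http');

var server = http.createServer(function (req, res) {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('ok\n');
});

server.listen(process.env.PORT || 3000);

Started under PM2 in cluster mode (e.g. pm2 start minimal-app.js -i 16), the only moving parts left are nginx and PM2 themselves.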

I'm going to run this experiment and write down the results here later.

P.S.: @rrfeng @Bacto did you make sure it's PM2 and not your nginx configuration or your applications? In my case there are lots of shared locks which could possibly cause timeouts.
P.P.S.: Node.js 4.1.2, PM2 0.15.5

@estliberitas Sorry for missing this issue.
We found out what the problem was. We run 16 processes on a 16-core VM, and the PM2 God process goes to 100% CPU usage; the PM2 process cannot handle more requests from nginx, so nginx throws errors.

@estliberitas I'm still investigating the issue.
Good news, I've just found a way to reproduce it.
I'll update this issue when I have more information.

@rrfeng Well, my situation is a bit different. I do not see the God daemon using much CPU; it's always less than 10%.

Also, I was not able to reproduce the bug after the first step of testing I described above, so currently I guess it's my app that blocks HTTP requests. I'm testing it with Siege in order to get timeouts. Also, I'm very interested in what @Bacto will say.

OK, so this drove me crazy, but I have some information.

I reproduced the bug with this configuration:

  • client => Nginx (port 80) => PM2 (port 3000)

PM2 was just loading a test app, with Express and a route that just responds "ok".
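
A sketch of that kind of test app, for reference (the file name is illustrative; port 3000 matches the failing setup above):

// test-app.js - Express app with a single route that just answers "ok".
var express = require('express');
var app = express();

app.get('/', function (req, res) {
  res.send('ok');
});

app.listen(process.env.PORT || 3000);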

I didn't have the problem with these kinds of configurations:

  • client => Nginx (port 80) => PM2 (port 5000)
  • client => PM2 (port 3000)
  • client => Nginx (port 80) => direct app (port 3000)

That's really strange: I have the problem only with Nginx + PM2 + port 3000. If I change anything in the equation, the problem goes away.

The server that runs PM2 hadn't been rebooted for 320 days, and I had made some PM2 updates in the past.
I had one PM2 God daemon that was running version 0.14.3, one with 0.14.7 and one with 0.15.10.
I killed the old one but it didn't change anything.

I didn't find anything related to the kernel or iptables, and nothing in the logs.

Finally I rebooted the server and now it seems to work. I will wait a few days / weeks to be sure of that.
Maybe it's a problem with old PM2 versions or a problem with the kernel. I have no idea. It's really frustrating.

Hope this helps you, @estliberitas.

@Bacto What Node.js version do you have at the moment? What Node.js version was it when you started to see timeouts?

@estliberitas I have node v4.2.2
It's working for now... Will see in a few days/weeks.

Hi, had the same problem.

Solution:
1) use keepalive connections;
2) use a backup upstream.

Configuration example:
_nginx.conf_:

# ...
http {
    # ...
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 2; # ! upstream will use this value
    # ...
}

_your-host.com.conf_:

upstream upstreamname {
    server 127.0.0.1:3000;
    server 127.0.0.1:3000 max_fails=1 fail_timeout=30s backup; # ! Backup upstream
    keepalive 64; # ! keepalive connections for PM2 upstream
}

server {
    # ...
    keepalive_timeout 30; # ! keepalive_timeout for clients
    # ...
    location / {
        # ...
        proxy_pass http://upstreamname; # ! upstream name
        proxy_http_version 1.1; # ! Important! Use HTTP/1.1 for keepalive connections
        proxy_set_header Connection ""; # ! Important! Clear the "Connection" header for keepalive connections
        proxy_set_header Host $host;
        proxy_redirect off;
        # ...
    }
    # ...
}

Also, I want to note: _very occasionally I still see a few "(connect timed out) while connecting to upstream" errors in the nginx error logs, but users don't notice a 504 error because nginx falls back to the backup upstream in those situations._

@Bacto Can you give some feedback: is your server still working without any errors?

My server that runs PM2 hasn't been rebooted for 250 days.

Right now I don't have the problem anymore.
I'm curious to see if the reboot will help you, @Peteychuk.

I don't know if this can be related.

I get some errors:
connect() failed (111: Connection refused) while connecting to upstream

Flow:

  1. Nginx got a POST request yesterday at 22:00
  2. Nginx says "connect() failed (111: Connection refused) while connecting to upstream" and returns 504 to the client.
  3. At 06:00 (8 hours later), the code is executed in Node.

Any thoughts on why Node executed the code 8 hours later?

I was running Node 5.1.1 and I don't know what version of PM2 I was using; I think it was 0.14.6.
I'm now trying 0.14.3 with Node LTS 4.3.1.

Any news about this issue? Same problem here :(

@atorgfr Node version? PM2 version?

node 4.3.1
pm2 2.0.19

I had the same problem. If the app server has not received requests for a long time, PM2 can't handle the next request that comes in. If requests keep coming, the server behaves normally.

@jontelm
Node: v4.4.7
PM2: 1.1.3

No issue with Nginx configuration:
_nginx.conf:_

# ...
http {
    # ...
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 2; # ! I think keepalive helps me decrease the number of errors.
    # ...
}

Same issue here after updating PM2 from an old version (1.1.0).
The problem is very ugly, since you may not notice that some connections are randomly dropped and the only "workaround" is to restart the PM2 processes.

Nginx: 1.10.0
Node: v6.9.2
PM2: 2.2.3 (cluster mode, 0 restarts)
OS: Ubuntu 16.04.1

The solution proposed by @Peteychuk seems to mitigate the problem, but doesn't solve it. Timeouts still occur sometimes.

@jontelm: the issue with spaces in URLs in the http core module doesn't seem to be strictly related to this one. I'm not using whitespace in URLs.

The errors found in the nginx logs are:

upstream timed out (110: Connection timed out) while reading response header from upstream

and then

no live upstreams while connecting to upstream

I have been running this in fork mode for the last few days and I don't see this problem anymore: I run multiple forks on different ports and let nginx handle the load balancing. I can post later when I have more data.

Well, I was somewhat aware that the problem occurs in cluster_mode only, but part of the reason I'm using PM2 is its embedded load-balancing system. I will switch to fork_mode + nginx load balancing only as a last resort, if no one comes up with a more comfortable solution.

I also tried a reboot, as proposed by @Bacto. Unfortunately it didn't solve the problem.

Hi, I had the same problem 2 months ago.

I don't think this is the right solution.

But I solved it with nginx load balancing and I haven't had the problem since.

  1. I used nginx load balancing

  2. I run PM2 in fork mode, NOT cluster mode

_Configuration file (nginx):_

upstream app_nodes {
    ip_hash; 
    server 127.0.0.1:8080;
    server 127.0.0.1:8081;
}
location / {
  .....
  proxy_pass http://app_nodes;
  ....
}

_pm2 configuration ( pm2_myapp.js )_

{
  "apps": [
  {
    "name": "myapp-0",
    "script": "app.js",
    "watch": false,
    "env": {
      "NODE_ENV": "production",
      "PORT": 8080
    },
    "exec_mode": "fork_mode"
  },
  {
    "name": "myapp-1",
    "script": "app.js",
    "watch": false,
    "env":{
      "NODE_ENV": "production",
      "PORT": 8081
    },
    "exec_mode": "fork_mode"
  }
  ]
}

I run pm2 start pm2_myapp.js

The solution with nginx is to use keepalive 64; in your nginx upstream block; it should work.
If you have the problem without nginx (the HTTP server stops responding to requests), see #2300

Check this nginx.conf, it works like a charm for us: https://gist.github.com/Unitech/ec40871357db0257e76d6899cb6762dc

Just modify the nginx conf a bit:

upstream backend_nodejs {
  ...
  keepalive 512;
  ...
}

and

location / {
    proxy_set_header   Connection "";
    proxy_http_version 1.1;
    proxy_pass http://backend_nodejs;
}

You can get more details from https://engineering.gosquared.com/optimising-nginx-node-js-and-networking-for-heavy-workloads
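
One related knob on the Node side, offered only as an assumption rather than something confirmed in this thread: when nginx keeps upstream connections alive, the Node server's own idle keep-alive timeout should exceed nginx's, otherwise Node may close a socket that nginx is about to reuse. On Node 8+ this can be set explicitly; a minimal sketch:

// Hypothetical Node-side tweak (requires Node >= 8): keep idle sockets open
// longer than nginx's upstream keepalive so Node doesn't close a connection
// that nginx is about to reuse. The values are examples, not from this thread.
var http = require('http');

var server = http.createServer(function (req, res) {
  res.end('ok\n');
});

server.keepAliveTimeout = 65 * 1000; // ms; make this exceed nginx's keepalive_timeout

server.listen(process.env.PORT || 3000);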

Hi, has there been any progress on this problem?

The problem still happens after PM2 has been running for a long time on our machines.

Changing the nginx config to use keepalive connections between nginx and the upstream only reduces the frequency of the 504s; only restarting the daemon actually solves the problem.

I'm trying to figure out how long it takes before it happens, but I still have no idea.

Node: 9.4.0
PM2: 3.3.1

Having similar issues; has anyone found a solution to this?

