Pm2: PM2 randomly does not ACK the nginx connection

Created on 30 Jul 2015 · 33 Comments · Source: Unitech/pm2

I see lots of nginx error log entries with "(connect timed out) while connecting to upstream".

I run PM2 in cluster mode, which starts 16 processes of my Node.js app (one per core on a 16-core server). The connect timeouts occur even when there are few requests.

I captured packets with tcpdump on both the nginx side and the PM2 side, and found that nginx sends a SYN, receives no ACK, retransmits, and then times out. On the PM2 side, tcpdump shows that the SYN from nginx arrived, but no ACK was ever sent back.
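
A capture along those lines can be taken on each host with something like the following (the port and interface are placeholders, since they are not stated here):

sudo tcpdump -i any -nn 'tcp port 3000'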

So I am confused: what is preventing PM2 from answering the SYN?

Icebox Bug

Most helpful comment

Hi, had the same problem.

Solution:
1) use keepalive connections;
2) use a backup upstream.

Configuration example:
_nginx.conf_:

# ...
http {
    # ...
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 2; # ! upstream will use this value
    # ...
}

_your-host.com.conf_:

upstream upstreamname {
    server 127.0.0.1:3000;
    server 127.0.0.1:3000 max_fails=1 fail_timeout=30s backup; # ! Backup upstream
    keepalive 64; # ! keepalive connections for PM2 upstream
}

server {
    # ...
    keepalive_timeout 30; # ! keepalive_timeout for clients
    # ...
    location / {
        # ...
        proxy_pass http://upstreamname; # ! upstream name
        proxy_http_version 1.1; # ! Important! Use HTTP/1.1 for keepalive connections
        proxy_set_header Connection ""; # ! Important! Clear the "Connection" header for keepalive connections
        proxy_set_header Host $host;
        proxy_redirect off;
        # ...
    }
    # ...
}

Also, I want to note: _very occasionally I still see a few "(connect timed out) while connecting to upstream" errors in the nginx error logs, but users don't notice a 504 error because nginx falls back to the backup upstream in those situations._

All 33 comments

Does one of your 16 workers crash at some point? (See the number of restarts.)

Could you try to run netstat -n | awk '/^tcp/ {++NS[$NF]} END {for(ns in NS) print ns, NS[ns]}' on your server and paste the results here.

@Tjatse
CLOSE_WAIT 292
TIME_WAIT 14
ESTABLISHED 359
Even when there are few users, it happens.

@jshkurti
No crash; the workers have been running fine since the last reload, 3 days ago.

I have exactly the same problem and can't figure out why.
PM2 is in cluster mode and I have nginx in front.

After a few hours, I get some "connect timeout" errors. Day after day I get more and more.

A startOrGracefulReload doesn't change anything, but if I kill PM2, it works great again.

Same thing.

@Tjatse

TIME_WAIT 452
CLOSE_WAIT 27
SYN_SENT 4
FIN_WAIT1 9
ESTABLISHED 446
FIN_WAIT2 177
SYN_RECV 2
LAST_ACK 3

@Unitech Guys, what info do you need to look into this issue? The thing is, I see these errors very often and had to schedule pm2 gracefulReload ... to save our services from them. The surprising part is that for a month I did not change anything except updating PM2 to the latest version.

I propose several steps to understand where the bug is. At every step we put load on the nginx API endpoint.

  1. Set up nginx -> pm2 -> node.js web app. I would keep the nginx configuration as is and replace my node.js web app with a simple http.createServer() server, since that is a simple example with a 99% chance of not being buggy (unless the bug is in node.js core); see the sketch after this list.
  2. If the issue persists, then I would make it just nginx -> node.js web app, without PM2 and thus without cluster mode. If the issue is eliminated, then it's our node.js application that causes the problems.
  3. If the issue still persists, then I guess it's an nginx configuration issue. If not, then we can say we have something related to PM2 (or Node.js itself).
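
A minimal sketch of the bare http.createServer() stand-in from step 1 might look like this (the port is an assumption; it should match whatever the nginx upstream points at):

// minimal-app.js - bare stand-in for the real app (step 1 above).
// The port is an assumption; make it match the nginx upstream.
var http = require('http');

var server = http.createServer(function (req, res) {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('ok\n');
});

server.listen(process.env.PORT || 3000);

Started under PM2 in cluster mode (e.g. pm2 start minimal-app.js -i 16), the only moving parts left are nginx and PM2 themselves.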

I'm going to run this experiment and write down the results here later.

P.S.: @rrfeng @Bacto did you make sure it's PM2 and not your nginx configuration or your applications? In my case there are lots of shared locks which could possibly cause timeouts.
P.P.S.: Node.js 4.1.2, PM2 0.15.5

@estliberitas Sorry for missing this issue.
We found out what the problem was. We run 16 processes on a 16-core VM, and the PM2 God process goes to 100% CPU usage; the PM2 process cannot handle more requests from nginx, so nginx throws errors.

@estliberitas I'm still investigating the issue.
Good news, I've just found a way to reproduce it.
I'll update this issue when I have more information.

@rrfeng Well, my situation is a bit different. I do not see the God daemon using much CPU; it's always less than 10%.

Also, I was not able to reproduce the bug after the first step of testing I described above, so currently I guess it's my app that blocks HTTP requests. I'm testing it with Siege in order to get timeouts. Also, I'm very interested in what @Bacto will say.

OK, so this drove me crazy, but I have some information.

I reproduced the bug with this configuration:

  • client => Nginx (port 80) => PM2 (port 3000)

PM2 was just loading a test app, with Express and a route that just responds "ok".
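
A sketch of that kind of test app, for reference (the file name is illustrative; port 3000 matches the failing setup above):

// test-app.js - Express app with a single route that just answers "ok".
var express = require('express');
var app = express();

app.get('/', function (req, res) {
  res.send('ok');
});

app.listen(process.env.PORT || 3000);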

I didn't have the problem with these kinds of configurations:

  • client => Nginx (port 80) => PM2 (port 5000)
  • client => PM2 (port 3000)
  • client => Nginx (port 80) => direct app (port 3000)

That's really strange: I have the problem only with Nginx + PM2 + port 3000. If I change anything in the equation, the problem goes away.

The server that runs PM2 hadn't been rebooted for 320 days, and I had made some PM2 updates in the past.
I had one PM2 God daemon that was running version 0.14.3, one with 0.14.7 and one with 0.15.10.
I killed the old one but it didn't change anything.

I didn't find anything related to the kernel or iptables, and nothing in the logs.

Finally I rebooted the server and now it seems to work. I will wait a few days / weeks to be sure of that.
Maybe it's a problem with old PM2 versions or a problem with the kernel. I have no idea. It's really frustrating.

Hope this helps you, @estliberitas.

@Bacto What Node.js version do you have at the moment? What Node.js version was it when you started to see timeouts?

@estliberitas I have node v4.2.2
It's working for now... Will see in a few days/weeks.

Hi, had the same problem.

Solution:
1) use keepalive connections;
2) use a backup upstream.

Configuration example:
_nginx.conf_:

# ...
http {
    # ...
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 2; # ! upstream will use this value
    # ...
}

_your-host.com.conf_:

upstream upstreamname {
    server 127.0.0.1:3000;
    server 127.0.0.1:3000 max_fails=1 fail_timeout=30s backup; # ! Backup upstream
    keepalive 64; # ! keepalive connections for PM2 upstream
}

server {
    # ...
    keepalive_timeout 30; # ! keepalive_timeout for clients
    # ...
    location / {
        # ...
        proxy_pass http://upstreamname; # ! upstream name
        proxy_http_version 1.1; # ! Important! Use HTTP/1.1 for keepalive connections
        proxy_set_header Connection ""; # ! Important! Clear the "Connection" header for keepalive connections
        proxy_set_header Host $host;
        proxy_redirect off;
        # ...
    }
    # ...
}

Also, I want to note: _very occasionally I still see a few "(connect timed out) while connecting to upstream" errors in the nginx error logs, but users don't notice a 504 error because nginx falls back to the backup upstream in those situations._

@Bacto Can you give some feedback: is your server still working without any errors?

My server that runs PM2 hasn't been rebooted for 250 days.

Right now I don't have the problem anymore.
I'm curious to see if the reboot will help you, @Peteychuk.

I don't know if this can be related.

I get some errors:
connect() failed (111: Connection refused) while connecting to upstream

Flow:

  1. Nginx got a POST request yesterday at 22:00
  2. Nginx says "connect() failed (111: Connection refused) while connecting to upstream" and returns 504 to the client.
  3. At 06:00 (8 hours later), the code is executed in Node.

Any thoughts on why Node executed the code 8 hours later?

I was running Node 5.1.1 and I don't know what version of PM2 I was using; I think it was 0.14.6.
I'm now trying 0.14.3 with Node LTS 4.3.1.

Any news about this issue? Same problem here :(

@atorgfr Node version? PM2 version?

node 4.3.1
pm2 2.0.19

I had the same problem. If the app server has not received requests for a long time, PM2 can't handle the next request that comes in. If requests keep coming, the server behaves normally.

@jontelm
Node: v4.4.7
PM2: 1.1.3

No issue with Nginx configuration:
_nginx.conf:_

# ...
http {
    # ...
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 2; # ! I think keepalive helps me decrease the number of errors.
    # ...
}

Same issue here after updating PM2 from an old version (1.1.0).
The problem is very ugly, since you may not notice that some connections are randomly dropped and the only "workaround" is to restart the PM2 processes.

Nginx: 1.10.0
Node: v6.9.2
PM2: 2.2.3 (cluster mode, 0 restarts)
OS: Ubuntu 16.04.1

The solution proposed by @Peteychuk seems to mitigate the problem, but doesn't solve it. Timeouts still occur sometimes.

@jontelm: the issue with spaces in URLs in the http core module doesn't seem to be strictly related to this one. I'm not using whitespace in URLs.

The errors found in the nginx logs are:

upstream timed out (110: Connection timed out) while reading response header from upstream

and then

no live upstreams while connecting to upstream

I have been running this in fork mode for the last few days and I don't see this problem anymore: I run multiple forks on different ports and let nginx handle the load balancing. I can post later when I have more data.

Well, I was somewhat aware that the problem occurs in cluster_mode only, but part of the reason I'm using PM2 is its embedded load-balancing system. I will switch to fork_mode + nginx load balancing only as a last resort, if no one comes up with a more comfortable solution.

I also tried a reboot, as proposed by @Bacto. Unfortunately it didn't solve the problem.

Hi, I had the same problem 2 months ago.

I don't think this is the right solution.

But I solved it with nginx load balancing and I haven't had the problem since.

  1. I used nginx load balancing

  2. I run PM2 in fork mode, NOT cluster mode

_Configuration file (nginx):_

upstream app_nodes {
    ip_hash; 
    server 127.0.0.1:8080;
    server 127.0.0.1:8081;
}
location / {
  .....
  proxy_pass http://app_nodes;
  ....
}

_pm2 configuration ( pm2_myapp.js )_

{
  "apps": [
  {
    "name": "myapp-0",
    "script": "app.js",
    "watch": false,
    "env": {
      "NODE_ENV": "production",
      "PORT": 8080
    },
    "exec_mode": "fork_mode"
  },
  {
    "name": "myapp-1",
    "script": "app.js",
    "watch": false,
    "env":{
      "NODE_ENV": "production",
      "PORT": 8081
    },
    "exec_mode": "fork_mode"
  }
  ]
}

I run pm2 start pm2_myapp.js

The solution with nginx is to use keepalive 64; in your nginx upstream block; it should work.
If you have the problem without nginx (the HTTP server stops responding to requests), see #2300

Check this nginx.conf, it works like a charm for us: https://gist.github.com/Unitech/ec40871357db0257e76d6899cb6762dc

Just modify the nginx conf a bit:

upstream backend_nodejs {
  ...
  keepalive 512;
  ...
}

and

location / {
    proxy_set_header   Connection "";
    proxy_http_version 1.1;
    proxy_pass http://backend_nodejs;
}

You can get more details from https://engineering.gosquared.com/optimising-nginx-node-js-and-networking-for-heavy-workloads
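
One related knob on the Node side, offered only as an assumption rather than something confirmed in this thread: when nginx keeps upstream connections alive, the Node server's own idle keep-alive timeout should exceed nginx's, otherwise Node may close a socket that nginx is about to reuse. On Node 8+ this can be set explicitly; a minimal sketch:

// Hypothetical Node-side tweak (requires Node >= 8): keep idle sockets open
// longer than nginx's upstream keepalive so Node doesn't close a connection
// that nginx is about to reuse. The values are examples, not from this thread.
var http = require('http');

var server = http.createServer(function (req, res) {
  res.end('ok\n');
});

server.keepAliveTimeout = 65 * 1000; // ms; make this exceed nginx's keepalive_timeout

server.listen(process.env.PORT || 3000);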

Hi, has there been any progress on this problem?

The problem still happens after PM2 has been running for a long time on our machines.

Changing the nginx config to use keepalive connections between nginx and the upstream only reduces the frequency of the 504s; only restarting the daemon actually solves the problem.

I'm trying to figure out how long it takes before it happens, but I still have no idea.

Node: 9.4.0
PM2: 3.3.1

Having similar issues; has anyone found a solution to this?

