pm2 reload causing connection timeout and downtime

Created on 14 Dec 2016  路  11Comments  路  Source: Unitech/pm2

I am investigating an issue where pm2 reload is intermittently causing connection errors (HTTP 502 from an Nginx proxy in front). When running pm2 reload during a benchmark test (to simulate traffic to the server), the connection either times out or gets reset by peer, ultimately causing a few seconds of downtime.

Expected behaviour

When running pm2 reload during an Apache Bench test, all requests should complete successfully.

Actual behaviour

Connection times out:

Brendons-MacBook-Pro:api brendonboshell$ ab -n 10000 -c 1 http://127.0.0.1:4000/v1/
This is ApacheBench, Version 2.3 <$Revision: 1748469 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 1000 requests
Completed 2000 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 2906 requests completed

In PM2 2.1.6, I very occasionally got "connection reset by peer", although I haven't been able to replicate this in PM2 2.2.1.

(The ab completes successfully if pm2 reload is not run during the test.

Steps to reproduce

Run Apache Bench:

ab -n 10000 -c 1 http://127.0.0.1:4000/v1/

While the benchmark is running, run pm2 reload:

pm2 reload api

Problem occurs ~50% of the time.

Software versions used

OS         : OS X 10.12.1
node.js    : 6.9.1
PM2        : 2.2.1

PM2 Log output

Use the command: tail --lines 50 ~/.pm2/pm2.log

2016-12-14 17:32:37: RPC closed
2016-12-14 17:32:37: PUB closed
2016-12-14 17:32:37: PM2 successfully stopped
2016-12-14 17:32:38: ===============================================================================
2016-12-14 17:32:38: --- New PM2 Daemon started ----------------------------------------------------
2016-12-14 17:32:38: Time                 : Wed Dec 14 2016 17:32:38 GMT+0000 (GMT)
2016-12-14 17:32:38: PM2 version          : 2.2.1
2016-12-14 17:32:38: Node.js version      : 6.9.1
2016-12-14 17:32:38: Current arch         : x64
2016-12-14 17:32:38: PM2 home             : /Users/brendonboshell/.pm2
2016-12-14 17:32:38: PM2 PID file         : /Users/brendonboshell/.pm2/pm2.pid
2016-12-14 17:32:38: RPC socket file      : /Users/brendonboshell/.pm2/rpc.sock
2016-12-14 17:32:38: BUS socket file      : /Users/brendonboshell/.pm2/pub.sock
2016-12-14 17:32:38: Application log path : /Users/brendonboshell/.pm2/logs
2016-12-14 17:32:38: Process dump file    : /Users/brendonboshell/.pm2/dump.pm2
2016-12-14 17:32:38: Concurrent actions   : 2
2016-12-14 17:32:38: SIGTERM timeout      : 1600
2016-12-14 17:32:38: ===============================================================================
2016-12-14 17:32:38: Starting execution sequence in -cluster mode- for app name:api id:0
2016-12-14 17:32:38: Starting execution sequence in -cluster mode- for app name:api id:1
2016-12-14 17:32:38: App name:api id:0 online
2016-12-14 17:32:38: App name:api id:1 online
2016-12-14 17:32:41: Starting execution sequence in -cluster mode- for app name:api id:0
2016-12-14 17:32:41: Starting execution sequence in -cluster mode- for app name:api id:1
2016-12-14 17:32:41: App name:api id:0 online
2016-12-14 17:32:41: App name:api id:1 online
2016-12-14 17:32:42: -reload- New worker listening
2016-12-14 17:32:42: Stopping app:api id:_old_0
2016-12-14 17:32:42: -reload- New worker listening
2016-12-14 17:32:42: Stopping app:api id:_old_1
2016-12-14 17:32:42: App name:api id:_old_0 disconnected
2016-12-14 17:32:42: App [api] with id [_old_0] and pid [40178], exited with code [0] via signal [SIGINT]
2016-12-14 17:32:42: App name:api id:_old_1 disconnected
2016-12-14 17:32:42: App [api] with id [_old_1] and pid [40179], exited with code [0] via signal [SIGINT]
2016-12-14 17:32:42: pid=40178 msg=process killed
2016-12-14 17:32:42: pid=40179 msg=process killed
2016-12-14 17:32:51: Starting execution sequence in -cluster mode- for app name:api id:0
2016-12-14 17:32:51: Starting execution sequence in -cluster mode- for app name:api id:1
2016-12-14 17:32:51: App name:api id:0 online
2016-12-14 17:32:51: App name:api id:1 online
2016-12-14 17:32:52: -reload- New worker listening
2016-12-14 17:32:52: Stopping app:api id:_old_0
2016-12-14 17:32:52: -reload- New worker listening
2016-12-14 17:32:52: Stopping app:api id:_old_1
2016-12-14 17:32:52: App name:api id:_old_0 disconnected
2016-12-14 17:32:52: App [api] with id [_old_0] and pid [40218], exited with code [0] via signal [SIGINT]
2016-12-14 17:32:52: App name:api id:_old_1 disconnected
2016-12-14 17:32:52: App [api] with id [_old_1] and pid [40219], exited with code [0] via signal [SIGINT]
2016-12-14 17:32:52: pid=40218 msg=process killed
2016-12-14 17:32:52: pid=40219 msg=process killed
Won't fix Question

Most helpful comment

var http = require("http"),
    app = require("express")();

app.use("/", function (req, res) {
  return res.send(200);
});

var server = http.createServer(app);
server.listen(4000, function () {
 process.send('ready');
});

process.on('SIGINT', function() {
   server.close(function(err) {
     process.exit(err ? 1 : 0);
   });
});

Start this with cluster mode and --wait-ready options and it should work.

All 11 comments

Are you sure that you have implemented both start and stop ? Could you provide a sample app to reproduce the issue ?

I am using the following script:

var http = require("http"),
    app = require("express")();

app.use("/", function (req, res) {
  res.send(200);
});

var server = http.createServer(app);

server.listen(4000, function () {});

I am not sure what you mean by start and stop - I assume you mean am I catching the SIGINT signal. In this case, I am not because I expect PM2 to load balance away from the process being restarted. (I hope this is a correct assumption.) In this case I have no database connections etc that I need to close.

(Furthermore, Apache Bench is not using keep alive connections.)

You need to catch the SIGINT to shutdown after your http server has stopped processing request, see the documentation. PM2 don't load balance the request, that's the cluster mode that handle this part but you need to tell when the worker has done processing request (we can't do that for you).

Since this is an extremely common use case of PM2, could you please illustrate how I should be doing this in the very simple HTTP server I posted above? I have read the documentation, but there is no suggestion I need to manage the open HTTP connections myself.

var http = require("http"),
    app = require("express")();

app.use("/", function (req, res) {
  return res.send(200);
});

var server = http.createServer(app);
server.listen(4000, function () {
 process.send('ready');
});

process.on('SIGINT', function() {
   server.close(function(err) {
     process.exit(err ? 1 : 0);
   });
});

Start this with cluster mode and --wait-ready options and it should work.

Okay, thank you for that script. It seems to have reduced the frequency of the timeout issue, but I am still very occasionally getting a timeout:

Benchmarking 127.0.0.1 (be patient)
Completed 1000 requests
Completed 2000 requests
apr_socket_recv: Operation timed out (60)
Total of 2688 requests completed

And when it doesn't completely timeout, there is often a request taking ~20 seconds or so to complete:

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      2
  98%      3
  99%      5
 100%  19698 (longest request)

(Also, is --wait-ready an argument to pm2 reload or pm2 start. I am providing it to both, but still get the same problem.)

There still are some requests that are prone to error (level of tcp) like when tcp connection are reset, but you should'nt have any read/write error on sockets. (at least we get those reports using wrk)

@vmarchaud @Unitech I am a curious about graceful start. I thought I remembered from the early days that PM2 has some magic under the hood to know when server.listen is called. Is this not so? What happens to simple HTTP servers in a cluster when we don't use --wait-ready? Is it possible that they are receiving requests before they are ready?

The clustering docs mention graceful shutdown but not graceful start.

Could you correct me on these assumptions?

  1. process.send('ready') is meaningless and has no effect unless using --wait-ready
  2. PM2_GRACEFUL_LISTEN_TIMEOUT is meaningless and has no effect unless using --wait-ready
  3. --wait-ready is an optional flag because it is only necessary for people who are doing asynchronous initialization _in addition to or instead of_ server.listen().
  4. Assuming server.listen() happens quickly and its callback is fired in < 3000ms (default PM2_GRACEFUL_LISTEN_TIMEOUT), then process.send('ready') is not required. In this case, process.send('ready') simply exists to improve performance so PM2 can recognize the process as online before the graceful listen timeout.

It definitely seems like I have some conflicting assumptions. In particular, it seems like even if PM2 is smart about server.listen(), it would still need something like PM2_GRACEFUL_LISTEN_TIMEOUT in case that was never called. So #2 must be wrong.

As a side note, I made a module called handle-quit to help with graceful shutdown.

1: Yes
2: See below
3: Yes
4: process.send('ready') is only useful in case 3

PM2 is indeed smart about the server.listen(), i through it was documented but it isnt so i will add the doc. However the process.send('ready') is easier to understand and do not rely on under-the-hood magic of PM2 so i advise to use it.

Hi @vmarchaud thanks for all of your helpful answers. One more clarification here.

Is it correct to say that when --wait-ready is used, PM2 ignores server.listen() completely? And in that case, it assumes the app is ready when either process.send('ready') is received or PM2_GRACEFUL_LISTEN_TIMEOUT is reached? In other words, if PM2_GRACEFUL_LISTEN_TIMEOUT was Infinity and process.send('ready') was never fired, it would never be considered ready? This is what I would expect.

@sholladay Totally, pm2 will wait for the ready event :)

Was this page helpful?
0 / 5 - 0 ratings