Pm2: Cluster mode/zero downtime graceful reloading can cause downtime

Created on 21 Feb 2017 · 4Comments · Source: Unitech/pm2

Expected behaviour

No downtime

Actual behaviour

≈2% downtime

Steps to reproduce

Clone the following Gist: https://gist.github.com/sebflipper/9480131095f43c7195519c677179221d
npm install
npm install -g artillery
pm2 --wait-ready --restart-delay 1000 start index.js -i max
while true; do pm2 --wait-ready --restart-delay 1000 gracefulReload index; sleep 1; done
artillery quick --duration 60 --rate 10 -n 20 http://localhost:3000

Around 2% of request sent by artillery fail with ECONNRESET or ETIMEDOUT, while 100% of request succeed while not graceful reloading.

Software versions used

OS         : Mac OS X 10.11
node.js    : v7.5.0
PM2        : 2.4.0

PM2 Log output

2017-02-21 14:47:57: pid=92140 msg=process killed
2017-02-21 14:47:57: Starting execution sequence in -cluster mode- for app name:index id:0
2017-02-21 14:47:57: pid=92141 msg=process killed
2017-02-21 14:47:57: Starting execution sequence in -cluster mode- for app name:index id:1
2017-02-21 14:47:57: App name:index id:1 online
2017-02-21 14:47:57: (node:83735) DeprecationWarning: worker.suicide is deprecated. Please use worker.exitedAfterDisconnect.
2017-02-21 14:47:57: Stopping app:index id:2
2017-02-21 14:47:57: App name:index id:0 online
2017-02-21 14:47:57: (node:83735) DeprecationWarning: worker.suicide is deprecated. Please use worker.exitedAfterDisconnect.
2017-02-21 14:47:57: Stopping app:index id:3
2017-02-21 14:47:57: App name:index id:2 disconnected
2017-02-21 14:47:57: App [index] with id [2] and pid [92144], exited with code [0] via signal [SIGINT]
2017-02-21 14:47:57: App name:index id:3 disconnected
2017-02-21 14:47:57: App [index] with id [3] and pid [92145], exited with code [0] via signal [SIGINT]
2017-02-21 14:47:57: pid=92144 msg=process killed
2017-02-21 14:47:57: Starting execution sequence in -cluster mode- for app name:index id:2
2017-02-21 14:47:57: pid=92145 msg=process killed
2017-02-21 14:47:57: Starting execution sequence in -cluster mode- for app name:index id:3
2017-02-21 14:47:57: App name:index id:2 online
2017-02-21 14:47:57: (node:83735) DeprecationWarning: worker.suicide is deprecated. Please use worker.exitedAfterDisconnect.
2017-02-21 14:47:57: Stopping app:index id:4
2017-02-21 14:47:57: App name:index id:3 online
2017-02-21 14:47:57: (node:83735) DeprecationWarning: worker.suicide is deprecated. Please use worker.exitedAfterDisconnect.
2017-02-21 14:47:57: Stopping app:index id:5
2017-02-21 14:47:57: App name:index id:4 disconnected
2017-02-21 14:47:57: App [index] with id [4] and pid [92148], exited with code [0] via signal [SIGINT]
2017-02-21 14:47:57: App name:index id:5 disconnected
2017-02-21 14:47:57: App [index] with id [5] and pid [92149], exited with code [0] via signal [SIGINT]
2017-02-21 14:47:57: pid=92148 msg=process killed
2017-02-21 14:47:57: Starting execution sequence in -cluster mode- for app name:index id:4
2017-02-21 14:47:57: pid=92149 msg=process killed
2017-02-21 14:47:57: Starting execution sequence in -cluster mode- for app name:index id:5
2017-02-21 14:47:57: App name:index id:4 online
2017-02-21 14:47:57: (node:83735) DeprecationWarning: worker.suicide is deprecated. Please use worker.exitedAfterDisconnect.
2017-02-21 14:47:57: Stopping app:index id:6
2017-02-21 14:47:57: App name:index id:5 online
2017-02-21 14:47:57: (node:83735) DeprecationWarning: worker.suicide is deprecated. Please use worker.exitedAfterDisconnect.
2017-02-21 14:47:57: Stopping app:index id:7
2017-02-21 14:47:57: App name:index id:6 disconnected
2017-02-21 14:47:57: App [index] with id [6] and pid [92152], exited with code [0] via signal [SIGINT]
2017-02-21 14:47:57: App name:index id:7 disconnected
2017-02-21 14:47:57: App [index] with id [7] and pid [92153], exited with code [0] via signal [SIGINT]
2017-02-21 14:47:57: pid=92152 msg=process killed
2017-02-21 14:47:57: Starting execution sequence in -cluster mode- for app name:index id:6
2017-02-21 14:47:57: pid=92153 msg=process killed
2017-02-21 14:47:57: Starting execution sequence in -cluster mode- for app name:index id:7
2017-02-21 14:47:57: App name:index id:6 online
2017-02-21 14:47:57: (node:83735) DeprecationWarning: worker.suicide is deprecated. Please use worker.exitedAfterDisconnect.
2017-02-21 14:47:57: App name:index id:7 online
2017-02-21 14:47:57: (node:83735) DeprecationWarning: worker.suicide is deprecated. Please use worker.exitedAfterDisconnect.

cc @vmarchaud

Question

Source

sebflipper

All 4 comments

Something to start with is that you don't have the good exit strategy (see my code example in #2573), you need to listen for SIGINT to be able to ask to the http server to close the connections.

PS : Next time please make search before opening an issue please (it avoid spending time to re-answer to question that are already answered), thanks !

vmarchaud on 21 Feb 2017

Ah good spot! I've added a listen on SIGINT and rerun commands, but I'm still seeing around 2% downtime on the example app.

This seems to be much higher (around 36% downtime) on a larger/slower apps that makes API calls and are running over HTTPS etc.

sebflipper on 21 Feb 2017

You need to take care of PM2_LISTEN_TIMEOUT for slower apps, please read the whole #2573 issue.
You will read that ECONRESET are inherent of the TCP stack implementation and we can't control them with nodejs. The main metric for downtine are the read/write socket error that must be at 0 with reloading.

vmarchaud on 21 Feb 2017

👍1

@vmarchaud thanks for that! 👍 Setting PM2_GRACEFUL_TIMEOUT and PM2_KILL_TIMEOUT has brought down the chance of getting an error to 0.8%, which I think is like you say to do with the TCP stack.

sebflipper on 22 Feb 2017

👍1

Was this page helpful?

0 / 5 - 0 ratings