Hello,
I'm working on setting up a health check within our system, and have noticed that our Postgres implementation doesn't seem to handle connection timeouts, a dropped network connection, or a db failover event very gracefully (at least in the way it behaves with a multi-AZ deployment in Amazon RDS). The issue is very similar to one that was fixed in the mysql package a while back -- https://github.com/mysqljs/mysql/issues/821.
We effectively have a class that looks similar to this:
```javascript
const Promise = require('bluebird');
const _ = require('lodash'); // assumed; `_.merge` is used below
const pg = require('pg');

// `Service` and `Errors` are application-level classes, elided here.
class PostgresSQLInterface extends Service {
  constructor(dbConnectionOptions, logger) {
    super(); // arguments to the Service superclass elided

    const defaults = {
      host: 'localhost',
      port: '5432',
      user: '',
      password: null,
      max: 10, // pool size
      min: 10,
      keepAlive: true,
      idleTimeoutMillis: 10000
    };

    this.state = {
      config: _.merge({}, defaults, dbConnectionOptions)
    };

    this.createDatabasePool();
  }

  createDatabasePool() {
    this.state.pool = new pg.Pool(this.state.config);

    this.state.pool.on('connect', () => {
      this.state.healthy = true;
      console.log('HEALTHY');
    });

    this.state.pool.on('error', () => {
      this.state.healthy = false;
      console.log('UNHEALTHY');
    });
  }

  * getDatabaseConnection() {
    try {
      const connection = yield this.state.pool.connect();
      return connection;
    }
    catch (error) {
      throw new Errors.DatabaseError('Could not connect to Postgres', {
        originalError: error
      });
    }
  }
}
```
In a normal environment, when our application connects, I see it print "HEALTHY" once for each connection it makes, as expected. However, there are a few issues:
1.) If the connection is severed (I turn off my wireless, kill my VPN, or trigger a reboot on the RDS instance), no error events are raised even though the documentation for pg-pool states an error is "Emitted whenever an idle client in the pool encounters an error. This is common when your PostgreSQL server shuts down, reboots, or a network partition otherwise causes it to become unavailable while your pool has connected clients." My expectation would be to see a string of "UNHEALTHY" statements.
2.) Similarly, if I initiate a failover in RDS, no error events are raised but I am unable to run queries. This may be due to how AWS handles the DNS entries around multi-az deployments, but I cannot test this while issue #1 remains unresolved.
3.) Calling getDatabaseConnection when no replies are received from Postgres (`sudo iptables -A INPUT -p tcp --source-port 5432 -j DROP`; `sudo iptables -F` afterwards to restore) hangs on grabbing a connection, as there is no timeout setting.
Am I missing a setting in my configuration, or is this a situation that just hasn't come up yet for someone else? If it's the latter, I'd be more than happy to create a fix for the issue, but want to make sure I'm not overlooking something first.
Node version: 6.2.x
PG version: 6.0.2
Postgres version: 9.5.x
Thanks!
Hey - thanks for bringing this up! There are some tests around the pool emitting an error event on idle connections, but it looks like there are situations where the event isn't being emitted properly - that's unfortunate! I don't think you're missing anything in your config - I'm interested in any more research & fix you find about this issue. If there's anything I can do to help out on the fix or diagnose more or answer questions please let me know!
Thanks Brian! I'll dive into it a little today and let you know if I run into any questions.
Is this an issue with the new pool?
I did very thorough stress-testing for lost connectivity with versions 4.x - 5.1, and it would never fall over, as long as you provided `pg.on('error', handler)`.
After diving in some, this doesn't seem to be an issue with pool itself. The default pg.Client connection isn't catching the problem and emitting the error event -- which makes sense, because in at least one case the problem occurs when the server goes away without sending an RST packet. It's difficult to respond to an error that isn't sent to you.
The options to account for this are basically:
1.) Ping the server at regular intervals to make sure it's still there/reachable
2.) Implement a timeout when attempting to acquire a connection
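Option 2 can be sketched generically, independent of pg — a hypothetical `withTimeout` helper (not part of any library) that races a connection attempt against a timer:

```javascript
// Hypothetical helper (not part of pg): race a promise against a timer, so a
// hung pool.connect() rejects instead of blocking forever.
function withTimeout(promise, ms, message) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(message)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); }
    );
  });
}
```

e.g. `withTimeout(pool.connect(), 5000, 'Timed out acquiring a connection')`; option 1 (a periodic ping query) could be layered on top of the same helper.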
@jessedpate when the server is gone, you are getting a connection error right away, so there is no need for a timeout in this case.
And when you are executing queries through the pool, you can just log the connection errors. Doing it separately can be an unnecessary overhead, since every query would report the connectivity issue.
You would think that would be the case!
```javascript
var Pool = require('pg').Pool; // assumed; the original snippet omitted the require

describe.only('events', function () {
  this.timeout(20000)

  it('emits error when a connection is dropped', function (done) {
    console.log('creating pool')
    var pool = new Pool({
      min: 1,
      host: 'some.remote.host',
      user: 'username',
      database: 'db_name',
      password: 'some_password'
    })
    console.log('pool created')

    pool.on('connect', function () {
      console.log('received pool connect emit')
    })

    pool.on('error', function () {
      done()
    })
  })
})
```
The above test times out if I turn off my network connection, or if I use iptables to drop return packets on 5432. Running on a local server, it did what I expected when I shut down the database (since it received a hang up from the server).
Now obviously, in a production environment, a full network outage is not likely to happen (and if it does, we've probably got bigger problems). However, the second scenario (dropping replies on 5432) very closely simulates what happens when an Amazon RDS instance is rebooted or fails over. When that happens queries (and therefore requests from users) end up hanging until the upstream server decides it has waited long enough.
I tried the Pool and started my app on localhost with Postgres off, and couldn't get an error in this case.
My code is very simple:
```javascript
import { Pool } from 'pg';

var pool;
try {
  pool = new Pool();
}
catch (e) {
  console.error('Error creating pool', e);
}

pool.on('error', function (error, client) {
  console.error('got pool error', error);
});

pool.query('select NOW()')
  .then(res => {
    console.log('Got db res', res.rows[0]);
  })
  .catch(err => {
    // note: the original logged `e.message`/`e.stack`, but `e` is out of
    // scope here -- the caught error is `err`
    console.error('query error', err.message, err.stack);
  });
```
I'm bumping up against this too. If we're looking for a solution, I'd like to offer a suggestion to look at allowing the connection to have a "statement_timeout" configuration option:
https://www.postgresql.org/docs/current/static/runtime-config-client.html#GUC-STATEMENT-TIMEOUT
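For reference, `statement_timeout` can already be set per session with plain SQL (the value here is illustrative) — the suggestion above is about exposing it as a connection option:

```sql
-- Abort any statement in this session that runs longer than 5 seconds.
SET statement_timeout = '5s';

-- Revert to the server default.
RESET statement_timeout;
```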
I believe this issue was somehow introduced with the new Pool, because up to 5.1 there was no problem handling database disconnection - that's a fact. One could just provide a handler on the library itself, `pg.on('error', handler)`, and it always worked.
At least with Redshift, I do not get any error events from either Pool or Client instances.
I'm definitely open to a pull request to address this - I haven't experienced this problem in my own code (yet). Likewise, if you can put together a repeatable, self-contained test file that reproduces the issue, that might help me get to this sooner. Unfortunately, modifying iptables or shutting down the network connection won't really work in an automated test environment, so if there's some way you can cause the same error without modifying things externally, then we're in business.
I've found a few ways to mock this out in a test environment. I am currently trying to unblock a few people at work, but I should be able to look at putting in a PR for this soon.
awesoooooooome
I'm not sure the problem I have now is really related to this thread. We have an nginx proxy sitting in front of our middle tier, and its timeout is, I suspect, shorter than the default client timeout on a connection to Postgres. That's why I wanted to propose the ability to configure the client connection timeout. I'd like to set the client timeout shorter than the nginx timeout, to ensure we're not sending back 500s to clients when a transaction just took too long to complete. Is this worthy/welcome of starting another issue?
Ultimately, I think the solution to this issue will end up solving your use case as well -- when you're dealing with (potentially) no response from a server, the best way to handle it is by allowing for the setting of a timeout.
We have the same issue.
AWS with Postgres in HA.
The client opens a connection pool to the database.
When the database fails over, connections are still established, but all queries fail.
@jessedpate did you make any further progress on this issue?
I would be surprised. As many times as I've tried, the steps for reproducing the issue are super unreliable.
This is why we've made no progress here. Sometimes it takes me 20 attempts at physically killing the connection to be able to see the problem.
It seems that simply terminating the connections from PostgreSQL itself doesn't quite do the trick...
By "from PostgreSQL" I'm referring to running this command:

```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE
  -- don't kill my own connection!
  pid <> pg_backend_pid()
  -- don't kill the connections to other databases
  AND datname = 'database_name';
```
You really have to kill the connection physically to be able to see the problem.
Something tells me it may have a bearing on Node.js freaking out about a dead port... but then again, the issue is related to version 5.2+ only, so probably not...
P.S. I'm not giving up, I will try to reproduce it again when I find time.
Hey there,
I had the very same issue until I found out why.
On the Debian server, the TCP keepalive parameters were set to their default values, i.e.:

```
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
```

I updated them to something more web-server-friendly:

```
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 2
net.ipv4.tcp_keepalive_time = 60
```

(The worst-case detection time for a dead peer is roughly tcp_keepalive_time + tcp_keepalive_probes × tcp_keepalive_intvl: about 7200 + 9 × 75 ≈ 7875 seconds with the defaults, versus 60 + 2 × 15 = 90 seconds with the values above.)

So far so good, my issue was fixed. I hope it can help you folks! (documentation)
This issue is critical. Are there any established workarounds until the fix is ready?
@diegoperini If you update the sysctl configuration with

```
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 2
net.ipv4.tcp_keepalive_time = 60
```

and use the db connection pool, everything should be OK.
@xsellier That workaround is useful, but it affects a lot of other components on the same system. It's not really a substitute for the client library handling this case appropriately.
Hi!
I have a way to reproduce the issue with RDS failover in an AWS setting (using RDS and EC2). The core of the problem is that the network interface of the primary RDS instance stops responding to packets as soon as the failover is initiated, without sending RST packets. Therefore the connection will be stuck in the ESTABLISHED state for a long time.
To work around it, I have patched this library to set an idle timeout on the underlying socket. The downside of this is that it affects long-running queries as well. It works for us since we don't rely on long-running queries in our production environment.
Would the library maintainers be open to accepting a PR exposing an option to set the idle timeout on the socket to solve this issue?
That is the approach other libraries have implemented (mongo, mysql, elasticsearch, et cetera). There really isn't another way around the issue, as best as I've ever been able to see.
An alternative solution, which we could realise with the native client, is to set the keepalives_idle, keepalives_interval, and keepalives_count settings to more aggressive values (or expose them as options) by passing them as query parameters in the connection string.
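With the native (libpq-based) client, that would look something like the following (parameter names per the libpq documentation; host, credentials, and values are illustrative):

```
postgresql://user:password@db.example.com/mydb?keepalives=1&keepalives_idle=30&keepalives_interval=10&keepalives_count=3
```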
@srijs this all doesn't explain what exactly broke this, as connections worked perfectly up till version 5.1
Adding keep-alive configuration & better network partition handling to the 7.0 milestone.
Using keep-alive to control the connection isn't a perfect solution to this problem, since it affects other applications in the same OS environment.
I'd vote for a client connection timeout. I've implemented a patch for it on our production systems, which seems to solve our HA problem.
I'm very willing to send the PR, but I need some time to familiarize myself with the contribution rules here.
@okischuang send the patch on over! I'd love to take a look at it.
Also - what do you mean by keep-alive affecting other applications in the same OS environment?
Re-posting from another issue here, as it is relevant here...
From my recent conversation with a developer who was testing restoring connections through the driver...
Simply by disrupting TCP packets on the network, he managed to kill the process where node-postgres 6.2.3 was used, while this didn't happen with v5.1.
So the issue with restoring broken connections is 100% there, just needs to be found.
I think I've found the fix for this. When the backend server disconnects or the socket is closed unexpectedly, the client would emit an end event, but not an error event. This resulted in silent disconnections, pools full of disconnected clients, inability to handle db failover, etc. I've put a patch out at [email protected] - please check it out & lemme know if it fixes your issue!
Closing for now as I think this has been taken care of! :dancer:
Hey, sorry for reviving such an old issue, but from my own testing attempting to recover from Multi-AZ RDS failover scenarios (which I'm simulating by issuing a docker pause on a db instance and then updating a DNS record to point to another), I think this is actually not solved. I believe (I'm currently attempting to get confirmation from AWS) that what @jessedpate claims here is right, which would mean we actually need a way to hook in a timeout at the socket level.
Our current workaround amounts to adding the following to lib/connection.js:

```javascript
stream.setTimeout(20000);
stream.on('timeout', () => {
  stream.end();
  connection.emit('end');
});
```
This is actually possible with the current implementation, since the library allows passing an already-existing socket object in the Client configuration, but some libraries (knex) don't actually expose hook points to modify the creation of the connection (at least from what I could see), and we had to jump through some hoops to achieve the same behaviour (afterCreate in tarn's config).
All in all, I think it'd be worthwhile to allow configuration of this property, since, as mentioned above, it's available in other drivers.