Hello,
I'm working on setting up a health check within our system, and have noticed that our Postgres implementation doesn't seem to handle connection timeouts, a dropped network connection, or a db failover event very gracefully (at least in the way it behaves with a multi-AZ deployment in Amazon RDS). The issue is very similar to one that was fixed in the mysql package a while back -- https://github.com/mysqljs/mysql/issues/821.
We effectively have a class that looks similar to this:
```javascript
const Promise = require('bluebird');
const _ = require('lodash'); // assumed; `_.merge` is used below
const pg = require('pg');

// `Service` and `Errors` are application-level classes, elided here.
class PostgresSQLInterface extends Service {
  constructor(dbConnectionOptions, logger) {
    super(); // arguments to the Service superclass elided

    const defaults = {
      host: 'localhost',
      port: '5432',
      user: '',
      password: null,
      max: 10, // pool size
      min: 10,
      keepAlive: true,
      idleTimeoutMillis: 10000
    };

    this.state = {
      config: _.merge({}, defaults, dbConnectionOptions)
    };

    this.createDatabasePool();
  }

  createDatabasePool() {
    this.state.pool = new pg.Pool(this.state.config);

    this.state.pool.on('connect', () => {
      this.state.healthy = true;
      console.log('HEALTHY');
    });

    this.state.pool.on('error', () => {
      this.state.healthy = false;
      console.log('UNHEALTHY');
    });
  }

  * getDatabaseConnection() {
    try {
      const connection = yield this.state.pool.connect();
      return connection;
    }
    catch (error) {
      throw new Errors.DatabaseError('Could not connect to Postgres', {
        originalError: error
      });
    }
  }
}
```
In a normal environment, when our application connects, I see it print "HEALTHY" once for each connection it makes, as expected. However, there are a few issues:
1.) If the connection is severed (I turn off my wireless, kill my VPN, or trigger a reboot on the RDS instance), no error events are raised even though the documentation for pg-pool states an error is "Emitted whenever an idle client in the pool encounters an error. This is common when your PostgreSQL server shuts down, reboots, or a network partition otherwise causes it to become unavailable while your pool has connected clients." My expectation would be to see a string of "UNHEALTHY" statements.
2.) Similarly, if I initiate a failover in RDS, no error events are raised but I am unable to run queries. This may be due to how AWS handles the DNS entries around multi-az deployments, but I cannot test this while issue #1 remains unresolved.
3.) Calling getDatabaseConnection when no replies are received from Postgres (`sudo iptables -A INPUT -p tcp --source-port 5432 -j DROP`; `sudo iptables -F` afterwards to restore) hangs on grabbing a connection, as there is no timeout setting.
Am I missing a setting in my configuration, or is this a situation that just hasn't come up yet for someone else? If it's the latter, I'd be more than happy to create a fix for the issue, but want to make sure I'm not overlooking something first.
Node version: 6.2.x
PG version: 6.0.2
Postgres version: 9.5.x
Thanks!
Hey - thanks for bringing this up! There are some tests around the pool emitting an error event on idle connections, but it looks like there are situations where the event isn't being emitted properly - that's unfortunate! I don't think you're missing anything in your config - I'm interested in any more research & fix you find about this issue. If there's anything I can do to help out on the fix or diagnose more or answer questions please let me know!
Thanks Brian! I'll dive into it a little today and let you know if I run into any questions.
Is this an issue with the new pool?
I did very thorough stress-testing for lost connectivity with versions 4.x - 5.1, and it would never fall over, as long as you provided `pg.on('error', handler)`.
After diving in some, this doesn't seem to be an issue with pool itself. The default pg.Client connection isn't catching the problem and emitting the error event -- which makes sense, because in at least one case the problem occurs when the server goes away without sending an RST packet. It's difficult to respond to an error that isn't sent to you.
The options to account for this are basically:
1.) Ping the server at regular intervals to make sure it's still there/reachable
2.) Implement a timeout when attempting to acquire a connection
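Option 2 can be sketched generically, independent of pg — a hypothetical `withTimeout` helper (not part of any library) that races a connection attempt against a timer:

```javascript
// Hypothetical helper (not part of pg): race a promise against a timer, so a
// hung pool.connect() rejects instead of blocking forever.
function withTimeout(promise, ms, message) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(message)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); }
    );
  });
}
```

e.g. `withTimeout(pool.connect(), 5000, 'Timed out acquiring a connection')`; option 1 (a periodic ping query) could be layered on top of the same helper.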
@jessedpate when the server is gone, you are getting a connection error right away, so there is no need for a timeout in this case.
And when you are executing queries through the pool, you can just log the connection errors. Doing it separately can be an unnecessary overhead, since every query would report the connectivity issue.
You would think that would be the case!
```javascript
var Pool = require('pg').Pool; // assumed; the original snippet omitted the require

describe.only('events', function () {
  this.timeout(20000)

  it('emits error when a connection is dropped', function (done) {
    console.log('creating pool')
    var pool = new Pool({
      min: 1,
      host: 'some.remote.host',
      user: 'username',
      database: 'db_name',
      password: 'some_password'
    })
    console.log('pool created')

    pool.on('connect', function () {
      console.log('received pool connect emit')
    })

    pool.on('error', function () {
      done()
    })
  })
})
```
The above test times out if I turn off my network connection, or if I use iptables to drop return packets on 5432. Running on a local server, it did what I expected when I shut down the database (since it received a hang up from the server).
Now obviously, in a production environment, a full network outage is not likely to happen (and if it does, we've probably got bigger problems). However, the second scenario (dropping replies on 5432) very closely simulates what happens when an Amazon RDS instance is rebooted or fails over. When that happens queries (and therefore requests from users) end up hanging until the upstream server decides it has waited long enough.
I tried the Pool and started my app on localhost with Postgres off, and couldn't get an error in this case.
My code is very simple:
```javascript
import { Pool } from 'pg';

var pool;
try {
  pool = new Pool();
}
catch (e) {
  console.error('Error creating pool', e);
}

pool.on('error', function (error, client) {
  console.error('got pool error', error);
});

pool.query('select NOW()')
  .then(res => {
    console.log('Got db res', res.rows[0]);
  })
  .catch(err => {
    // note: the original logged `e.message`/`e.stack`, but `e` is out of
    // scope here -- the caught error is `err`
    console.error('query error', err.message, err.stack);
  });
```
I'm bumping up against this too. If we're looking for a solution, I'd like to offer a suggestion to look at allowing the connection to have a "statement_timeout" configuration option:
https://www.postgresql.org/docs/current/static/runtime-config-client.html#GUC-STATEMENT-TIMEOUT
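For reference, `statement_timeout` can already be set per session with plain SQL (the value here is illustrative) — the suggestion above is about exposing it as a connection option:

```sql
-- Abort any statement in this session that runs longer than 5 seconds.
SET statement_timeout = '5s';

-- Revert to the server default.
RESET statement_timeout;
```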
I believe this issue was somehow introduced with the new Pool, because up to 5.1 there was no problem handling database disconnection - that's a fact. One could just provide a handler on the library itself, `pg.on('error', handler)`, and it always worked.
At least with Redshift, I do not get any error events from either Pool or Client instances.
I'm definitely open to a pull request to address this - I haven't experienced this problem in my own code (yet). Likewise, if you can put together a repeatable, self-contained test file that reproduces the issue, that might help me get to this sooner. Unfortunately, modifying iptables or shutting down the network connection won't really work in an automated test environment, so if there's some way you can cause the same error without modifying things externally, then we're in business.
I've found a few ways to mock this out in a test environment. I am currently trying to unblock a few people at work, but I should be able to look at putting in a PR for this soon.
awesoooooooome
I'm not sure the problem I have now is really related to this thread. We have an nginx proxy sitting in front of our middle tier, and its timeout is, I suspect, shorter than the default client timeout on a connection to Postgres. That's why I wanted to propose the ability to configure the client connection timeout. I'd like to set the client timeout shorter than the nginx timeout, to ensure we're not sending back 500s to clients when a transaction just took too long to complete. Is this worthy/welcome of starting another issue?
Ultimately, I think the solution to this issue will end up solving your use case as well -- when you're dealing with (potentially) no response from a server, the best way to handle it is by allowing for the setting of a timeout.
We have the same issue.
AWS with Postgres in HA.
The client opens a connection pool to the database.
When the database fails over, connections are still established, but all queries fail.
@jessedpate did you make any further progress on this issue?
I would be surprised. As many times as I've tried, the steps for reproducing the issue are super unreliable.
This is why we've made no progress here. Sometimes it takes me 20 attempts at physically killing the connection to be able to see the problem.
It seems that simply terminating the connections from PostgreSQL itself doesn't quite do the trick...
By "from PostgreSQL" I'm referring to running this command:

```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE
  -- don't kill my own connection!
  pid <> pg_backend_pid()
  -- don't kill the connections to other databases
  AND datname = 'database_name';
```
You really have to kill the connection physically to be able to see the problem.
Something tells me it may have a bearing on Node.js freaking out about a dead port... but then again, the issue is related to version 5.2+ only, so probably not...
P.S. I'm not giving up, I will try to reproduce it again when I find time.
Hey there,
I had the very same issue until I found out why.
On the Debian server, the TCP keepalive parameters were set to their default values, i.e.:

```
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
```

I updated them to something more web-server-friendly:

```
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 2
net.ipv4.tcp_keepalive_time = 60
```

(The worst-case detection time for a dead peer is roughly tcp_keepalive_time + tcp_keepalive_probes × tcp_keepalive_intvl: about 7200 + 9 × 75 ≈ 7875 seconds with the defaults, versus 60 + 2 × 15 = 90 seconds with the values above.)

So far so good, my issue was fixed. I hope it can help you folks! (documentation)
This issue is critical. Are there any established workarounds until the fix is ready?
@diegoperini If you update the sysctl configuration with

```
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 2
net.ipv4.tcp_keepalive_time = 60
```

and use the db connection pool, everything should be OK.
@xsellier That workaround is useful, but it affects a lot of other components on the same system. It's not really a substitute for the client library handling this case appropriately.
Hi!
I have a way to reproduce the issue with RDS failover in an AWS setting (using RDS and EC2). The core of the problem is that the network interface of the primary RDS instance stops responding to packets as soon as the failover is initiated, without sending RST packets. Therefore the connection will be stuck in the ESTABLISHED state for a long time.
To work around it, I have patched this library to set an idle timeout on the underlying socket. The downside of this is that it affects long-running queries as well. It works for us since we don't rely on long-running queries in our production environment.
Would the library maintainers be open to accepting a PR exposing an option to set the idle timeout on the socket to solve this issue?
That is the approach other libraries have implemented (mongo, mysql, elasticsearch, et cetera). There really isn't another way around the issue, as best as I've ever been able to see.
An alternative solution, which we could realise with the native client, is to set the keepalives_idle, keepalives_interval, and keepalives_count settings to more aggressive values (or expose them as options) by passing them as query parameters in the connection string.
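With the native (libpq-based) client, that would look something like the following (parameter names per the libpq documentation; host, credentials, and values are illustrative):

```
postgresql://user:password@db.example.com/mydb?keepalives=1&keepalives_idle=30&keepalives_interval=10&keepalives_count=3
```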
@srijs this all doesn't explain what exactly broke this, as connections worked perfectly up till version 5.1
Adding keep-alive configuration & better network partition handling to the 7.0 milestone.
Using keep-alive to control the connection isn't a perfect solution to this problem, since it affects other applications in the same OS environment.
I'd vote for a client connection timeout. I've implemented a patch for it on our production systems, which seems to solve our HA problem.
I'm very willing to send the PR, but I need some time to familiarize myself with the contribution rules here.
@okischuang send the patch on over! I'd love to take a look at it.
Also - what do you mean by keep-alive affecting other applications in the same OS environment?
Re-posting from another issue here, as it is relevant here...
From my recent conversation with a developer who was testing restoring connections through the driver...
Simply by disrupting TCP packets on the network, he managed to kill the process where node-postgres 6.2.3 was used, while this didn't happen with v5.1.
So the issue with restoring broken connections is 100% there, just needs to be found.
I think I've found the fix for this. When the backend server disconnects or the socket is closed unexpectedly, the client would emit an end event, but not an error event. This resulted in silent disconnections, pools full of disconnected clients, inability to handle db failover, etc. I've put a patch out at [email protected] - please check it out & lemme know if it fixes your issue!
Closing for now as I think this has been taken care of! :dancer:
Hey, sorry for reviving such an old issue, but from my own testing attempting to recover from Multi-AZ RDS failover scenarios (which I'm simulating by issuing a docker pause on a db instance and then updating a DNS record to point to another), I think this is actually not solved. I believe (I'm currently attempting to get confirmation from AWS) that what @jessedpate claims here is right, which would mean we actually need a way to hook in a timeout at the socket level.
Our current workaround amounts to adding the following to lib/connection.js:

```javascript
stream.setTimeout(20000);
stream.on('timeout', () => {
  stream.end();
  connection.emit('end');
});
```
This is actually possible with the current implementation, since the library allows passing an already-existing socket object in the Client configuration, but some libraries (knex) don't actually expose hook points to modify the creation of the connection (at least from what I could see), and we had to jump through some hoops to achieve the same behaviour (afterCreate in tarn's config).
All in all, I think it'd be worthwhile to allow configuration of this property, since, as mentioned above, it's available in other drivers.