Node-postgres: TypeError: Cannot read property 'name' of null

Created on 9 Aug 2016 · 64 Comments · Source: brianc/node-postgres

First off, thanks for the library! Overall it's been incredibly performant and straightforward to use. I am running into one issue which may be an issue with the library, though the protocol is a bit hard for me to follow, so my usage may be off as well.

I'm running into an uncaught error during the parseComplete handler in client.js (https://github.com/brianc/node-postgres/blob/v6.0.1/lib/client.js#L131).

I am doing a direct connect against the pool and then utilizing pg-query-stream:

pool.connect().then(function (client) {
  const stream = client.query(new QueryStream(query, null));

  stream.on('error', function (streamError) {
    stream.close(function (closeError) {
      // What to do with a close error? At this point stream is already hosed :shrug:
      client.release(streamError);
    });
  });

  stream.on('close', function () {
    client.release();
  });
});

At some point during my app's life cycle, I get an uncaught error:

TypeError: Cannot read property 'name' of null
    at null.<anonymous> (/var/app/current/node_modules/pg/lib/client.js:139:26)
    at emit (events.js:107:17)
    at Socket.<anonymous> (/var/app/current/node_modules/pg/lib/connection.js:121:12)
    at Socket.emit (events.js:107:17)
    at readableAddChunk (_stream_readable.js:163:16)
    at Socket.Readable.push (_stream_readable.js:126:10)
    at TCP.onread (net.js:540:20)

So obviously activeQuery is becoming null, and I am trying to narrow down why I would receive a parseComplete event when the active query is null.

According to my logs, the stream is _not_ emitting an error, and seems to be closing normally. Therefore it seems like the connection client is getting a readyForQuery or end event, and then right after that getting the parseComplete event. Any idea why this would be happening, or see any issues with my usage? Thanks for any help you can give! I'll keep looking into it as well.

Most helpful comment

Started getting this error after using the query_timeout config parameter, pg 7.10.0

Repro steps for me, using this config:
{ query_timeout: 10 * 1000, max: 32, min: 8 }

Steps: lock a table in the database and then issue queries against that table, so that they all time out; doesn't have to be very fast, 1-2 per second will do. I get the crash very quickly, somewhere after around 10 to 100 query timeouts.

Other timing information, if it's relevant: the database and node-postgres servers are on separate machines but the ping rtt is very low, 0.5 - 1ms.

All 64 comments

I wonder if the same error would occur within pg-promise, if you follow the example...

At least this would help rule out some of the reasons why this is happening...

@greghart interesting error - any way you could include the query & some data so I could try to reproduce it here? Maybe using generate_series or something if you need lots of data?

I'm having a similar issue. For me it seems to be related to parameterized queries.

.query('SELECT $1', ['a'])

Do you have any further findings, @greghart?

I may have found something promising. I tried to use a new client object with the original in the prototype chain. I did this because I didn't want client.release() to be called from child queries within a transaction.

I did something like this: (using lodash _.create)

const new_client = _.create(client, {
   release: function () { /* noop */ }
})

new_client.query('SELECT 1')

I did this because I didn't want client.release() to be called from child queries within a transaction.

Call client.release() in a predictable manner, then? Bluebird’s resource management might help.
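One way to make release predictable is a small acquire/run/release helper, sketched below. This is not a pg API, just a pattern; `pool` is assumed to expose a pg-Pool-style `connect()` that resolves to a client with `release()`.

```javascript
// Sketch: acquire a client, run the work, and release exactly once in a
// finally block, instead of wiring release() into individual query callbacks.
async function withClient(pool, fn) {
  const client = await pool.connect()
  try {
    return await fn(client)
  } finally {
    client.release() // runs exactly once, whether fn resolves or throws
  }
}
```

Inside `fn` an entire transaction can run without child queries ever needing to touch `release()`.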

I've been having the same issue but only on my live site, once every few hours.
My logs would always show that I received 2 requests to run the same query within about 5 ms - always 1 after another (no other queries in between).
I don't know if the parameters are the same because I don't log them.

One would then get executed and the second would crash.
It would usually happen again within a minute if I attempted to restart that application without restarting all the others that communicated with it. Basically, some combination of requests caused the issue to appear.

I haven't been able to replicate this offline.
To me at this point, it looks like a race condition and maybe some kind of reference error (destroying the activeQuery on the wrong client).

I switched to pg-native and the problem appears to have stopped (over 12 hours with no issues).

Having this issue. I had switched away from pg-native because it was giving headaches as well.

Only happens on first start. The second time pg is used, the error goes away. The "race" condition is that it always happens at first.

So part of our production push right now is to hit the servers ourselves first before we let users use them so they don't get this error. Which is a hack of a fix.

Anyone was able to fix this?

Just had this issue occur with connect-pg-simple configured pretty much as described in the advanced example of https://github.com/voxpelli/node-connect-pg-simple#usage.

Do any of y'all have a small script I can use that will likely reproduce this? Also are you using pg or pg-native? This is really perplexing but really crappy & I want to get it fixed if possible. I've used node-postgres for years on many systems in production and never seen this...so I'm not sure what exactly it could be.

@nemo gah that fix is gross - sorry you have to do that. You're using normal pg and what exact version of node are you using?

My application (very rarely) has experienced this problem as well. We have unfortunately been unable to distill it down to a simple script/app that can reliably reproduce, but general observation _so far_ seems to indicate this happens when there is a connection issue with the database.

@brianc : I can completely understand the need to be able to reproduce in order to fix - but what are your thoughts on a short-term fix to at least check that activeQuery is a valid object before trying to reference the name property? Given that client applications cannot handle this error in any way, and given its drastic consequences, it would be great if some form of incremental improvement could be delivered...

https://github.com/brianc/node-postgres/blob/master/lib/client.js#L241 indicates that access to the name attribute lets the client avoid re-parsing the query. I would much rather take _extraneous re-parsing_ when the alternative is an _uncaught exception_ 😇

@andyatmiami - what version of pg & pg-pool are you using? Are you using pg-cursor or pg-query-stream?

"pg": "6.1.2"
"pg-pool": "1.6.0"
"node": "6.11.1"

We use parameterized queries exclusively.

FWIW:
I have a dev I am working with this afternoon who has had minimal success reproducing this - not yet reliable and not yet distilled into a simple app to run. At a high level (which I am working on refining) - we have 1 wildly expensive query (takes seconds to run), and then hammer a high load of _other queries_ which are not long-running in parallel. Sometimes (and this is with Postgres completely stable) we see the activeQuery error fire...

Yah that's wretched that it's happening. I really hope you can figure out a reproducible test case. I'd also suggest trying with the most recent version of pg & pg-pool. They both received pretty substantial internals upgrades from 6.x -> 7.x & 1.x -> 2.x respectively.

@brianc : Thanks for responsiveness in issue - will post back any updates/learnings as they come available!

I get the same error, and also sometimes a different stack which I am guessing points to the same root cause:

    self.activeQuery.handleCommandComplete(msg, con)
                    ^

TypeError: Cannot read property 'handleCommandComplete' of null
    at Connection.<anonymous> (/home/app/node_modules/pg/lib/client.js:239:21)
    at emitOne (events.js:96:13)
    at Connection.emit (events.js:188:7)
    at Socket.<anonymous> (/home/app/node_modules/pg/lib/connection.js:118:12)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
    at readableAddChunk (_stream_readable.js:176:18)
    at Socket.Readable.push (_stream_readable.js:134:10)
    at TCP.onread (net.js:547:20)
events.js:160
      throw er; // Unhandled 'error' event

I see there was a pull request from @spollack (https://github.com/brianc/node-postgres/pull/961) around this. @spollack, @jshepard, it was mentioned in the PR that there was another library causing this; do you mind sharing your findings?

@vedant15 I think our issue was caused by a connection leak in pre-2.0 versions of node-any-db-pool.
We ended up backporting the fix from 2.0 in a fork here

@vedant15 sorry i wish i had written down better notes on this. I can tell you however that we have been running PR #961 in prod over the past year, and as far as i recall, have had no recurrences of this crash.

@spollack @jshepard thanks for the quick response.

Were you guys also seeing this issue only during high loads and with statement timeouts?

@vedant15 pretty sure we only saw it during statement timeouts.

ok @jshepard, I am able to reproduce this issue locally (not always), but only when I put a considerable load on the application running queries with statement timeouts.

Thanks for all the info!

If statement timeouts are causing it I might be able to simulate them locally

@brianc I added some log statements on my local and whenever I am able to reproduce either of the 2 flavors of this issue, the log statements show the following:
1) activeQuery for a given client is set to null in Client.prototype._pulseQueryQueue
2) Right after this, either commandComplete or parseComplete is fired for the same client

quick question: from the comments in the code, it seems parseComplete is fired only for prepared statements; is this assumption correct?

Also, what do you think about checking whether activeQuery is null in commandComplete and parseComplete? Do you see any issues with that? I know you had some reservations about this based on the comment in #961

quick question: from the comments in the code, it seems parseComplete is fired for prepared statements; is this assumption correct?

Yah - any query w/ parameters has a parse phase which the backend returns a parseComplete on.

Also, what do you think about checking whether activeQuery is null in commandComplete and parseComplete? Do you see any issues with that? I know you had some reservations about this based on the comment in #961

I feel like that's similar to doing a try { something() } catch(e) { /* no big deal */ } and papers over what is a larger issue. I'd __strongly__ prefer to have a reproducible test case & fix the underlying issue. Otherwise we're hiding an error and as far as the protocol operation is concerned possibly putting the connection into a bad/unknown state.

If you can send over a script that smashes a local database w/ lots of load & uses statement timeouts that might be all I need. I've gotten pretty decent at triangulating issues in the connection after all these years. :stuck_out_tongue_closed_eyes:

something like:

const { Pool } = require('pg')
const async = require('async')

const pool = new Pool()

async.times(100000000, function (i, next) {
  pool.query('SELECT pg_sleep(100)')
  pool.query('SELECT $1::text as name', ['foo'], next)
})

or something? I'm missing where to define the statement timeout and if that reproduces the issue in your environment though :interrobang:
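For what it's worth, `statement_timeout` is plain SQL, so a load script can set it per connection with `SET`. A hedged sketch (the helper name and the reset-to-0 behavior are my own; `client` is assumed to expose a pg-style promise-returning `query(text)`):

```javascript
// Sketch: run fn's queries under a statement timeout, then reset.
// statement_timeout is set via a plain SET statement; 0 means no timeout.
async function withStatementTimeout(client, ms, fn) {
  await client.query(`SET statement_timeout = ${ms}`)
  try {
    return await fn(client)
  } finally {
    await client.query('SET statement_timeout = 0') // back to the default
  }
}

// Usage sketch: a query slower than the timeout should fail with Postgres
// error 57014 (query_canceled):
// await withStatementTimeout(client, 100, c => c.query('SELECT pg_sleep(2)'))
```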

@brianc let me see if I can come up with a script which can reliably reproduce this (reliably being the keyword 😛). I will keep you posted

sweet - even if i need to run the script for a long time or over and over to reproduce the issue I should be able to work with it and narrow down where things are going amiss

So .. my initial attempts to recreate it via a script have failed. I will try a couple of ideas to see if I can make it work .. 😞

I believe the code I posted above (6 months ago) reproduced the error. Working backward from that may find the root cause.

@joeandaverde I appreciate what you put above, but it's not quite enough for me to repro easily over here. can you post up a more complete/executable example? neither of your examples above are executable "out of the box" and I'm not sure I follow what "working backwards" entails.

@brianc Since I am not able to reproduce this using a script, and my production app is crashing because of this. I came up with a band-aid approach to stop production from crashing while more investigation is done into the root cause.

In my fork, I have a change (similar changes for all other places where this issue is happening)

  con.on('parseComplete', function (msg) {
    if (!self.activeQuery) {
      self.emit('invalidActiveQuery');
    } else if (self.activeQuery.name) {
      con.parsedStatements[self.activeQuery.name] = true
    }
  })

Here, I am emitting an invalidActiveQuery event. In my application, I listen for this event and, if it fires, I do the following:
1) Drain the pool by calling end
2) Re-create the pool

Do you see any potential issues with this approach?
Again, I am not suggesting that this is the solution to this problem; it's just a work-around.
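The drain-and-recreate part of that work-around might look something like this (a sketch; `createPool` stands in for whatever factory builds your pg `Pool`):

```javascript
// Sketch of the work-around: keep the current pool behind a holder, and on
// the application-defined invalidActiveQuery signal, swap in a fresh pool
// before draining the old one, so callers never grab a pool that is ending.
function makePoolHolder(createPool) {
  const holder = { pool: createPool() }
  holder.recycle = async function () {
    const old = holder.pool
    holder.pool = createPool() // new pool first...
    await old.end()            // ...then drain the old one
  }
  return holder
}
```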

This reproduces the error.

```javascript
const { Pool } = require('pg')

const pool = new Pool({
  host: 'localhost',
  user: 'postgres',
  database: 'phoenix_dev',
})

pool.connect()
  .then(client => {
    const new_client = Object.create(client, {
      release: function () { /* noop */ }
    })

    // const new_client = client // not manipulating the object works, this may be a hint as to where the problem is

    new_client.query({
      name: 'query-name',
      text: 'SELECT $1::int',
      values: [ 1 ],
    })
      .then((data) => {
        console.log(data)
      })
  })
```

@joeandaverde Thanks for your example! I looked at it. For your case specifically: don't do that. When you connect a client it wires itself up internally to a connection object for parsing and sets a whole bunch of event listeners. When you clone the object the scope of the event listeners is still bound to the old instance. Using Object.create on an active client is not likely something that will ever work. That doesn't mean the underlying issue is fixed, just that your example code probably isn't what's happening in the other cases the error is cropping up, but I'll keep it in mind when I try some other ways to reproduce this!
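If the goal is just a client whose `release()` is a no-op inside a transaction, a plain delegating wrapper avoids the prototype trick entirely. A sketch (this is a pattern, not a pg API):

```javascript
// Sketch: forward query() to the real client, so its internal state and
// event wiring stay intact, while making release() a no-op.
function nonReleasingClient(client) {
  return {
    query: (...args) => client.query(...args), // `this` stays the real client
    release: () => { /* no-op: the transaction owner releases */ },
  }
}
```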

Here's the environment I'm using to try to reproduce:

11:34:42 ~/src/pg-active-query-crash
$ node --version
v6.11.1

11:34:47 ~/src/pg-active-query-crash
$ npm ls
[email protected] /home/bmc/src/pg-active-query-crash
├─┬ [email protected]
│ └── [email protected]
├─┬ [email protected]
│ ├── [email protected]
│ ├── [email protected]
│ ├── [email protected]
│ ├─┬ [email protected]
│ │ └── [email protected]
│ ├─┬ [email protected]
│ │ ├── [email protected]
│ │ ├── [email protected]
│ │ ├── [email protected]
│ │ └─┬ [email protected]
│ │   └── [email protected]
│ ├─┬ [email protected]
│ │ └─┬ [email protected]
│ │   └── [email protected]
│ └── [email protected]
├─┬ [email protected]
│ ├── [email protected]
│ └── [email protected]
└─┬ [email protected]
  └── [email protected]

and here's the script I'm running:

const cluster = require('cluster')
const async = require('async')
const { Pool } = require('pg')
const QueryStream = require('pg-query-stream')

if (cluster.isMaster) {
  cluster.setupMaster({ silent: false })

  for (var i = 0; i < 8; i++) {
    cluster.fork()
  }
  return
}

const pool = new Pool()

function runQuey(i, cb) {
  pool.query('SELECT $1::text as name', ['brianc'], (err) => {
    if (err) {
      throw err
    }
    console.log('query', i)
    cb()
  })
}

function runStream(i, cb) {
  pool.connect((err, client) => {
    if (err) {
      throw err
    }

    const stream = client.query(new QueryStream('SELECT $1::text as name', ['brianc']))
    stream.on('data', () => { })

    stream.on('error', (streamError) => {
      stream.close(function (closeError) {
        // What to do with a close error? At this point stream is already hosed :shrug:
        client.release(streamError);
      })
    });

    stream.on('close', function () {
      console.log('stream', i)
      cb()
      client.release();
    });
  })
}

function runSlow(i, cb) {
  pool.query('SELECT pg_sleep(2), $1::text as name', ['brianc'], (err) => {
    if (err) {
      throw err
    }
    console.log('slow', i)
    cb()
  })
}

function loop(i, cb) {
  if (!(i % 80)) {
    return runSlow(i, cb)
  }
  if (!(i % 10)) {
    return runStream(i, cb)
  }
  return runQuey(i, cb)
}

function end(err) {
  if (err) throw err
  pool.end()
}

async.times(1000000, loop, end)

So far this hasn't reproduced the issue...if anyone has any ideas I'm all ears! Really want to get this sorted out. :pray:

Yesterday we began receiving this error ~2 times/hour. For years, we had no issues.

Error: Cannot read property 'handleDataRow' of null (activeQuery is null)
Stack: /app/node_modules/pg/lib/client.js:117
pg version: 6.1.5
node version: 8

So we upgraded to pg 7.4.1 but it didn't help. Our code looks like:

const connectionString = "postgres://whatever:[email protected]:10476/whatever?ssl=true";
const pool = new pg.Pool({connectionString:connectionString});

var query = function (sql, callback) {
  pool.connect(function (err, client, done) {
    if (err) {
      if (typeof done === "function") { done(); }
      return callback(err);
    }

    client.query(sql, function (err, result){
      done();
      callback(err, result);
    });
  });
};

query("select * from foo", function(err, results){ ... })

We added logging and found that connectedErrorHandler sets activeQuery = null because of SSL3_GET_RECORD:decryption failed or bad record:

07:14:00: Client.constructor
07:14:00: Client.connect

07:14:01: Client.query: SELECT id, form_id, whatever, latest FROM response WHERE customer_id = 'whatever' AND form_id in ('1234', '5678') AND deleted = false AND cast(latest->>'submitted_on' as bigint) >= 1520488800000 ORDER BY cast(latest->>'submitted_on' as bigint) DESC

07:14:02: Client.connectedErrorHandler: Error: 140314433615680:error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac:../deps/openssl/openssl/ssl/s3_pkt.c:535:

07:14:03: Con.on.dataRow: Cannot read property 'handleDataRow' of null (see it).

The msg parameter of handleDataRow looks corrupt. See it here. The beginning of the field is fine but then it turns into valuesοΏ½V\"οΏ½GMοΏ½5\u0000\u0000\u0000\u0000....

In our case, these are never parameterized queries and seem to be running in under a few seconds (e.g. not timeouts).

Summary

Seems like the root cause is SSL3_GET_RECORD:decryption failed or bad record.

But also, shouldn't the error trigger a callback instead of allowing handleDataRow to run and hitting a null exception? Totally guessing here.

I'm happy to gather more logs to help debug.

We reached out to our data host (Compose) and they replied:

It looks like this has been an issue with the OpenSSL implementation for some time.

I would recommend reaching out to the client driver developers, or checking into this issue I found for a workaround:
https://github.com/requests/requests/issues/1906

But also, shouldn't the error trigger a callback instead of allowing handleDataRow to run and hitting a null exception?

@bendytree Yes, and I don’t see any reason it wouldn’t, looking at the code. Does ... check for an error and throw it right away? Apart from that, pg can’t deal with corrupt messages usefully; your host is going to have to fix its bug.

Also, that query function is equivalent to pool.query.

@charmander Con.on.dataRow throws a null reference exception because connectedErrorHandler set activeQuery to null. Not sure if we receive a callback but we're not throwing any errors ourselves. The issue is that dataRow is allowed to run after an error has happened. Seems like either it shouldn't run or there should be a null check on activeQuery.

Update

We tried disabling SSL but now we get errors like SyntaxError: Unexpected token m in JSON in Result.addFields

Strangely, the JSON is like [{"foo":"bar"m]. If we re-run the same query, it works fine.

It's almost like a bit is getting switched. The "m" in our example should have been a "}" - which happens to differ by only one bit!

m is 1101101
} is 1111101

If a bit was switched on an SSL response, it would likely generate a gibberish tail and trigger a decryption error.

So what in the world could be switching a bit between Compose and node-postgres? 😆
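The one-bit claim checks out directly: XOR of the two character codes leaves a single set bit.

```javascript
// 'm' (0x6D, 1101101) vs '}' (0x7D, 1111101): XOR isolates the differing bits.
const diff = 'm'.charCodeAt(0) ^ '}'.charCodeAt(0)
console.log(diff.toString(2))          // '10000': exactly one bit set
console.log((diff & (diff - 1)) === 0) // true: diff is a power of two
```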

Ends up, root cause was a bad portal on Compose. We created a new portal and everything is fine now.

Regarding the error, node-postgres called back with an error like SSL3_GET_RECORD:decryption failed or bad record. But moments later, a null reference exception would get thrown inside node-postgres because con.dataRow got called and connectedErrorHandler had set activeQuery to null.

Basically it's like the query continued to be processed despite the error. That caused an exception which made it impossible for us to recover gracefully.

@bendytree #1503 might fix that.

I'm using 7.5.0 with pg-promise 8.5.0 and when I run my tests with parallelism of 4, I get the following error almost every time:

TypeError: Cannot read property 'handleCommandComplete' of null
    at Connection.<anonymous> (/usr/src/app/lib/servercore/node_modules/pg/lib/client.js:232:22)
    at emitOne (events.js:116:13)  
    at Connection.emit (events.js:211:7)
    at Socket.<anonymous> (/usr/src/app/lib/servercore/node_modules/pg/lib/connection.js:123:12)
    at emitOne (events.js:116:13)  
    at Socket.emit (events.js:211:7)
    at addChunk (_stream_readable.js:263:12)
    at readableAddChunk (_stream_readable.js:250:11)
    at Socket.Readable.push (_stream_readable.js:208:10)
    at TCP.onread (net.js:597:20) 

I've been able to reduce the tests run to what I believe is a minimum and the last 4 calls that happen before the error are:
1) DELETE FROM a table
2) SELECT FROM table that doesn't exist (raises an error)
3) DELETE FROM other table
4) DELETE FROM another table

What else can I do to help resolve this issue?

I was able to reproduce the issue with parallelism of 3 on my Linux machine (I usually run tests on a Mac), and the same SELECT FROM a table that doesn't exist happened just before this error, so I believe that's the cause of this issue.

@daveisfera Assuming that’s specific to pg 7.5.0, could you open a new bug?

@daveisfera Also, are the parallel tests all using the same client?

It happened with 7.4.3 as well and yes, the parallelism is with 4 requests being made from an external test suite against this single client servicing those requests.

Is there anything I can do to help get this issue resolved? We had to turn off parallel execution of our tests because this issue causes intermittent failures.

@daveisfera If you could create a failing test against current master, that would be super helpful.

I have the same problem:
Cannot read property 'handleCommandComplete' of null

pg node: 7.6.1

@pankleks when?

@charmander hard to tell, but probably related to timeout from db.

After having stumbled into this in our production code, I tried to generate a little script to reproduce. I put the script in https://github.com/timotm/pg-bug . It seems to have something to do with timing as CPU load affects the reproducibility, and on a slower machine it reproduces more often than on a faster machine. (adjusting the number of yes commands in the background might thus have an effect)

Running it three times in a row on my machine, I got three different errors:

✔ 13:34 ~/pg-bug [master|✚ 1…1294] $ npm start

> [email protected] start /Users/metsala/pg-bug
> createdb pg-bug-local-test-db 2>/dev/null; yes > /dev/null & yes > /dev/null & yes > /dev/null & yes > /dev/null & yes > /dev/null & node index.js; killall yes

Initialized, now looping..

/Users/metsala/pg-bug/node_modules/bluebird/js/release/using.js:12
        setTimeout(function(){throw e;}, 0);
                              ^
error: canceling statement due to statement timeout
    at Connection.parseE (/Users/metsala/pg-bug/node_modules/pg/lib/connection.js:555:11)
    at Connection.parseMessage (/Users/metsala/pg-bug/node_modules/pg/lib/connection.js:380:19)
    at Socket.<anonymous> (/Users/metsala/pg-bug/node_modules/pg/lib/connection.js:120:22)
    at emitOne (events.js:116:13)
    at Socket.emit (events.js:211:7)
    at addChunk (_stream_readable.js:263:12)
    at readableAddChunk (_stream_readable.js:250:11)
    at Socket.Readable.push (_stream_readable.js:208:10)
    at TCP.onread (net.js:597:20)
sh: line 1: 47917 Terminated: 15          yes > /dev/null
sh: line 1: 47918 Terminated: 15          yes > /dev/null
sh: line 1: 47919 Terminated: 15          yes > /dev/null
sh: line 1: 47920 Terminated: 15          yes > /dev/null
sh: line 1: 47921 Terminated: 15          yes > /dev/null
✔ 13:34 ~/pg-bug [master|✚ 1…1294] $ npm start

> [email protected] start /Users/metsala/pg-bug
> createdb pg-bug-local-test-db 2>/dev/null; yes > /dev/null & yes > /dev/null & yes > /dev/null & yes > /dev/null & yes > /dev/null & node index.js; killall yes

Initialized, now looping..
/Users/metsala/pg-bug/node_modules/pg/lib/client.js:278
    if (self.activeQuery.name) {
                         ^

TypeError: Cannot read property 'name' of null
    at Connection.<anonymous> (/Users/metsala/pg-bug/node_modules/pg/lib/client.js:278:26)
    at emitOne (events.js:116:13)
    at Connection.emit (events.js:211:7)
    at Socket.<anonymous> (/Users/metsala/pg-bug/node_modules/pg/lib/connection.js:125:12)
    at emitOne (events.js:116:13)
    at Socket.emit (events.js:211:7)
    at addChunk (_stream_readable.js:263:12)
    at readableAddChunk (_stream_readable.js:250:11)
    at Socket.Readable.push (_stream_readable.js:208:10)
    at TCP.onread (net.js:597:20)
sh: line 1: 47979 Terminated: 15          yes > /dev/null
sh: line 1: 47980 Terminated: 15          yes > /dev/null
sh: line 1: 47981 Terminated: 15          yes > /dev/null
sh: line 1: 47982 Terminated: 15          yes > /dev/null
sh: line 1: 47983 Terminated: 15          yes > /dev/null
✔ 13:35 ~/pg-bug [master|✚ 1…1294] $ npm start

> [email protected] start /Users/metsala/pg-bug
> createdb pg-bug-local-test-db 2>/dev/null; yes > /dev/null & yes > /dev/null & yes > /dev/null & yes > /dev/null & yes > /dev/null & node index.js; killall yes

Initialized, now looping..
/Users/metsala/pg-bug/node_modules/pg/lib/client.js:271
    self.activeQuery.handleCommandComplete(msg, con)
                     ^

TypeError: Cannot read property 'handleCommandComplete' of null
    at Connection.<anonymous> (/Users/metsala/pg-bug/node_modules/pg/lib/client.js:271:22)
    at emitOne (events.js:116:13)
    at Connection.emit (events.js:211:7)
    at Socket.<anonymous> (/Users/metsala/pg-bug/node_modules/pg/lib/connection.js:125:12)
    at emitOne (events.js:116:13)
    at Socket.emit (events.js:211:7)
    at addChunk (_stream_readable.js:263:12)
    at readableAddChunk (_stream_readable.js:250:11)
    at Socket.Readable.push (_stream_readable.js:208:10)
    at TCP.onread (net.js:597:20)
sh: line 1: 48046 Terminated: 15          yes > /dev/null
sh: line 1: 48047 Terminated: 15          yes > /dev/null
sh: line 1: 48048 Terminated: 15          yes > /dev/null
sh: line 1: 48049 Terminated: 15          yes > /dev/null
sh: line 1: 48050 Terminated: 15          yes > /dev/null

With console.log debug technology, it seemed that the flow in the crash was something like:

  1. open transaction
  2. set local statement timeout
  3. execute prepared INSERT statement
  4. the above INSERT statement times out
  5. pg receives CommandCompleted
  6. pg sends Sync
  7. pg receives ErrorResponse
  8. pg sends Sync
  9. pg receives ReadyForQuery with status E (for Sync after CommandCompleted)
  10. client dequeues the next statement (in this case ROLLBACK), setting self.activeQuery
  11. pg receives ReadyForQuery with status E (for Sync after ErrorResponse)
  12. client dequeues the next statement, setting self.activeQuery (or sets to null if no next statement)
  13. pg receives CommandCompleted for ROLLBACK but self.activeQuery points to the wrong query or null

I tried ignoring the first ReadyForQuery and only reacting to the second one and could no longer reproduce the null crashes.

EDIT: forgot to include server version: PostgreSQL 9.5.6 on x86_64-apple-darwin16.4.0, compiled by Apple LLVM version 8.0.0 (clang-800.0.42.1), 64-bit on my machine and PostgreSQL 9.6.11 on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit on the production machine

An interesting observation: after having upgraded the production machine to PostgreSQL 11.2 (Debian 11.2-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit, the problem no longer exists.

Also, on a colleague's machine running PostgreSQL 11.1 on x86_64-apple-darwin17.7.0, compiled by Apple LLVM version 10.0.0 (clang-1000.11.45.5), 64-bit we couldn't reproduce this with the script.

Started getting this error after using the query_timeout config parameter, pg 7.10.0

Repro steps for me, using this config:
{ query_timeout: 10 * 1000, max: 32, min: 8 }

Steps: lock a table in the database and then issue queries against that table, so that they all time out; doesn't have to be very fast, 1-2 per second will do. I get the crash very quickly, somewhere after around 10 to 100 query timeouts.

Other timing information, if it's relevant: the database and node-postgres servers are on separate machines but the ping rtt is very low, 0.5 - 1ms.
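A minimal sketch of that repro loop, under assumptions: `pool` behaves like a pg `Pool` created with the config above, and another session keeps `locked_table` locked so every query hits `query_timeout` (the table name, pacing, and attempt count here are illustrative, not from the report):

```javascript
// Sketch of the repro: fire a query roughly twice per second against a table
// that another session holds locked, counting the resulting timeouts.
async function hammerLockedTable(pool, attempts, delayMs = 500) {
  let timeouts = 0
  for (let i = 0; i < attempts; i++) {
    try {
      await pool.query('SELECT * FROM locked_table') // hypothetical locked table
    } catch (err) {
      timeouts++ // with the lock held, each attempt should hit query_timeout
    }
    await new Promise(resolve => setTimeout(resolve, delayMs))
  }
  return timeouts
}

// Usage sketch: hammerLockedTable(pool, 100); the crash reportedly shows up
// somewhere between 10 and 100 timeouts.
```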

I'm also using query_timeout:

[2019-07-11 09:36:36] [ERROR] [pid:8718] TypeError: Cannot read property 'name' of null
TypeError: Cannot read property 'name' of null
at Connection.<anonymous> (/opt/acn/4be1e1fb/api/node_modules/pg/lib/client.js:313:26)
at emitOne (events.js:116:13)
at Connection.emit (events.js:211:7)
at Socket.<anonymous> (/opt/acn/4be1e1fb/api/node_modules/pg/lib/connection.js:126:12)
at emitOne (events.js:116:13)
at Socket.emit (events.js:211:7)
at addChunk (_stream_readable.js:263:12)
at readableAddChunk (_stream_readable.js:250:11)
at Socket.Readable.push (_stream_readable.js:208:10)
at TCP.onread (net.js:597:20)

I was experiencing this issue with pg-query-stream and downgrading to [email protected] fixed it. Not sure what changed but maybe that is something to look into?

I'm also using [email protected] with [email protected]. I tried downgrading pg-promise to a version that uses [email protected] based on the previous comment, but it did not fix the issue.

Unfortunately, I only encounter this issue when running some automated tests in a docker container (CentOS 7.6.1810). Things run noticeably slower in the container vs locally (MacOS 10.14.6) so maybe it's timing related as mentioned above. In all cases, the database is running in a separate container (PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5), 64-bit).

If it's at all helpful, here's the test code that fails:

let errorEncountered = false;
let results;
try {
    results = await runQuery('SELECT * FROM nonexistant_table');
} catch (err) {
    errorEncountered = true;
    expect(err).to.exist;
    expect(err.message).to.equal('relation "nonexistant_table" does not exist');
}

expect(errorEncountered).to.be.true;

// Reaches here
results = await runQuery('SELECT * FROM good_table WHERE id >= 18 ORDER BY id DESC;');
// Does not reach here

expect(results).to.have.lengthOf(4);
expect(results[2].id).to.equal(19);

runQuery() sends the query to the server via a web socket request and receives the data in chunks. The function resolves when the query has completed.

Here's a much simplified version of my server code:

// Simplified: `connection` is a pg-promise database object; `queryStream`
// and `outputStream` are created elsewhere by the caller.
function executeQuery(query) {
    return new Promise((resolve, reject) => {
        connection.task(async tx => {
            await tx.none('SET <custom_config_key>=<custom_config_value>');

            return tx.stream(queryStream, function (stream) {
                stream.pipe(outputStream);

                stream.on('error', err => {
                    outputStream.emit('error', err);
                });

                // Resolve now that the stream has been initiated.
                resolve();
            });
        }).catch(err => {
            reject(err);
        });
    });
}

function webSocketRequestHandler(query) {
    const outputStream = new PassThrough({ objectMode: true });

    // Add some stream listeners

    executeQuery(query).then(() => {
        // Send data back to the client in chunks as data comes through the stream
    });
}

The failure happens reliably every time I run the tests, so if there's some places I can add log statements to better track down this problem, I am happy to assist.

After reading this comment and some more experimentation, I seem to have gotten around this problem by making sure the stream is closed before releasing the connection back to the pool.

function executeQuery(query) {
    return new Promise((resolve, reject) => {
        connection.task(async tx => {
            await tx.none('SET <custom_config_key>=<custom_config_value>');

            return new Promise(connResolve => {
                tx.stream(queryStream, function (stream) {
                    stream.on('close', () => {
                        // Make sure the stream is closed before releasing the connection back to the connection pool
                        connResolve();
                    });
                    stream.on('error', err => {
                        outputStream.emit('error', err);
                    });
                    stream.pipe(outputStream);

                    // Resolve now that the stream has been initiated.
                    resolve();
                });
            });
        }).catch(err => {
            reject(err);
        });
    });
}

I'm not sure where the right place to make the change would be, but it seems like maybe this safeguard could be added to one of the involved libraries (pg-pool or pg-promise?) so a close listener wouldn't be necessary externally...?

Also, I feel like it's worth mentioning that I initially got around this error by commenting out this line in pg-cursor/index.js. Probably not the best idea to do that, but it helped me figure things out.

For those of you catching this with statement timeouts: I was able to reproduce this error consistently when statement_timeout and query_timeout were set to the same value. Increasing query_timeout relative to statement_timeout seems to reduce the frequency at which we see this error.
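For reference, both timeouts can be set in the node-postgres client/pool config under those exact option names. The values below are illustrative only, with the client-side query_timeout kept comfortably above the server-side statement_timeout as suggested above:

```javascript
// Illustrative values only: statement_timeout cancels the statement on the
// server; query_timeout is enforced client-side by node-postgres. Keeping
// them equal makes the two timeouts race, which is when the error appeared.
const config = {
  connectionString: 'postgres://user:pass@localhost:5432/db', // hypothetical
  statement_timeout: 5000, // ms, server-side
  query_timeout: 10000,    // ms, client-side, larger than statement_timeout
};

// Passed to e.g. `new Pool(config)` from pg.
console.log(config.query_timeout > config.statement_timeout); // true
```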

I have a PR up for this: #2367. The fix feels dirty to me, but I think what's happening here might be a race condition within postgres itself we're working around.

FYI @brianc we did see this on postgres 12.3, not 9.x

@pvatterott that's good to know, but a bummer, as my hunch was a race condition in older versions of Postgres. I've been going back and forth with some Postgres maintainers and think this might fix it. I can confirm it fixes the issue on Postgres 9.x and 10.x, and it might fix it on 12.3 as well. The change pipelines the sync command in with the other extended-query commands all at once. It also improves my benchmarks by up to 10% in some cases, which is a nice side effect.

Thank you @brianc !

@pckilgore my pleasure! Let me know if it still happens... it was quite a monster to track down, but @timotm is the real hero here. The steps to reproduce and deep analysis helped me know where to focus. I regret it took me so long to circle back to this issue.
