Node-oracledb: Error NJS-040: connection timeout error under large load

Created on 13 Aug 2018 · 41Comments · Source: oracle/node-oracledb

Hi,

we're having a problem where the pool seems to not have any available connections after a long time of inactivity. When that happens, normally after poolTimeout it throws an error NJS-40. I'm trying to generically handle this error by checking for its errorNum (40?) and closing and recreating the pool. However, it seems that NJS-40 doesn't have an errorNum. Can anyone confirm that this is a bug, or is this working as designed? The docs say errors should have an errorNum.

 async executeQuery(query, vars, resultSet = false) {
    let connection;
    try {
      connection = await oracledb.getConnection(this.config.poolAlias);
      const result = await connection.execute(query, vars || [], {
        outFormat: oracledb.OBJECT,
        autoCommit: true,
        maxRows: 1000,
      });
     return result;
    } catch (error) {
      console.log('Error executing query: ', error.errorNum);
     // logs Error executing query: undefined
      const pool = oracledb.getPool(this.config.poolAlias);
      pool._logStats();
    } finally {
      if (!resultSet) await this.closeConnection(connection);
    }
  }

Answer the following questions:

What is your Node.js version? Is it 64-bit or 32-bit? Run version.js from https://github.com/oracle/node-oracledb/blob/master/examples/version.js
64-bit
What is your node-oracledb version?
2.3.0
What is your Oracle client (e.g. Instant Client) version? Is it 64-bit or 32-bit? How was it installed? Where is it installed?
64-bit, it's contained within the docker image. The Dockerfile of the image used:
https://github.com/CollinEstes/docker-node-oracle/blob/8/Dockerfile
What is your Oracle Database version?
What is your OS and version?
MacOS High Sierra 10.13.5 but in the Docker image:
PRETTY_NAME="Debian GNU/Linux 8 (jessie)"
NAME="Debian GNU/Linux"
VERSION_ID="8"
VERSION="8 (jessie)"
ID=debian

question

Source

Ventis

Most helpful comment

@Ventis
Doing commits more often than not can lead you to many problems, ORA-1555 snapshot too old to name one, there are many problems that can crop up with too frequent committing, good advice is to commit when needed, not in select queries for example :)

Also, on the array instead of object question, the simple answer is that the code does more when delivering objects back to the JS layer:

Check how when the outformat is not rows array, the layer does more things to return the code back.

for (uint32_t row = 0; row < baton->rowsFetched; row++) {
        if (baton->outFormat == NJS_ROWS_ARRAY)
            rowAsArray = Nan::New<Array>(resultSet->numQueryVars);
        else rowAsObj = Nan::New<Object>();
        for (uint32_t col = 0; col < resultSet->numQueryVars; col++) {
            var = &resultSet->queryVars[col];
            if (!njsConnection::GetScalarValueFromVar(baton, var, &var->buffer,
                    row, val))
                return;
            if (baton->outFormat == NJS_ROWS_ARRAY)
                Nan::Set(rowAsArray, col, val);
            else {
                keyVal = Nan::New<String>(var->name).ToLocalChecked();
                Nan::Set(rowAsObj, keyVal, val);
            }
        }
        if (baton->outFormat == NJS_ROWS_ARRAY)
            Nan::Set(rows, row, rowAsArray);
        else Nan::Set(rows, row, rowAsObj);
    }

The complex answer might have to do with internals and ODPI stuff, which I'd rather not talk about because I'm not an expert on it.

danilohgds on 16 Aug 2018

👍2

All 41 comments

@Ventis define 'seems to not'. How are you checking? Have you traced your connection open and close calls to see if they are matched? Have you checked connectionsInUse and connectionsOpen at various point in the app?? Is your app getting connections timedout from a firewall or the DBA killing sessions - are you getting errors using connections?

errorNum is for Oracle client and DB errors like ORA and TNS, not node-oracledb errors like NJS or DPI. The manual says "This value is undefined for non-Oracle errors and for messages prefixed with NJS or DPI.". PR's welcome if you want to change the wording or behavior.

[FWIW does https://github.com/CollinEstes/docker-node-oracle/blob/8/Dockerfile really need buildessentials? Or to set LD_LIBRARY_PATH since it has already used ldconfig?]

cjbj on 13 Aug 2018

Hey @cjbj,

Yes, I'm using pool._logStats(); at various points in the app and the timeouts happened before when the avg waiting time for queries in the pool queue exceeded the queueTimeout when a large number of simultaneous queries are triggered in quick succession. I minimised the risk of that by increasing the number of threads (and poolMin and Max).
The connectionsInUse and connectionsOpen goes up and down accordingly.

Now it mainly seems to happen when there haven't been new queries for a long time (after a weekend of inactivity for example).

I must have missed the part about non-Oracle errors having the errorNum as undefined. Wording is ok I guess, though the error is thrown by the node-oracledb lib makes it an "Oracle" error as far as I'm concerned but I get the distinction.

Ventis on 13 Aug 2018

What are the pool parameters e.g. pooMin & poolMax & UV_THREADPOOL_SIZE? If all the sessions in the pool have been killed by the DBA or resource limits or firewall expiries overnight, is it possible that the pool is having to grow again from poolMin? Since poolIncrement sessions will be created, and that will take some time, is there the timeout happening before the sessions have been created and become usable?

Note that the best practice is to keep poolMin = poolMax (and poolIncrement = 0 with 12.2 client or earlier) to avoid connection 'storms'.

The pool stats can't tell you which sessions (aka connections) in the pool are actually usable - they just show the numbers of session structures in the pool, not the viability of their network connection to the DB, or whether the DBA has killed the DB-end of any connection. This really needs diagrams and timelines to explain!

One more tip: the 18c Oracle client has some internal pool improvements. Give it a whirl. It will connect to the same DB versions as 12.2 client does.

cjbj on 13 Aug 2018

poolMin and poolMax are the same as UV_THREADPOOL_SIZE: 32.
All of these these situations you mention are out of our hands unfortunately because we're working with an external partner's DB. Is there a way to detect that the pool's connections have become unusable and to recreate the pool? I currently have something like this but it doesn't work of course since I can't detect if an error is the NJS-40 one:

/**
   * Initializes a connection pool to an Oracle db
   * @param {Object} config A different db configuration or use the default one (from an env file)
   */
  async init(config) {
    try {
      if (!this.config) this.config = {
      connectionString: `${process.env.DB_HOST}:${process.env.DB_PORT}/uspsde`,
      password: process.env.DB_PASSWORD,
      user: process.env.DB_USER,
      _enableStats: true,
      poolAlias: 'digPool',
      poolMax: 32,
      poolMin: 32,
      poolIncrement: 0,
      poolTimeout: 0,
    };
      await oracledb.createPool(this.config);
    } catch (err) {
      throw new Error('Error initializing db connection: ', err);
    }
  }

/**
   * Closes the connection pool
   */
  async closePool() {
    try {
      const pool = await oracledb.getPool(this.config.poolAlias);
      await pool.close();
    } catch (err) {
      console.log('Error destroying pool:', err);
    }
  }

  /**
   * If the error thrown is a timeout error, reinit the connection pool
   * @param {Error} err Error with errorNum property
   */
  async timeoutError(err) {
    console.log('Timeout errorNum:', err.errorNum);
    if (err.errorNum === NJS40) {
      await this.closePool();
      await this.init();
    }
  }

Ventis on 13 Aug 2018

More comments:

It's easy to check for error using err.message.startsWith('NJS-040:').

It's worthwhile keeping an open mind as to the cause, since I don't fully know the behaviors you are seeing. I'm not sure creating 32 connections would hit the queueTimeout threshold and trigger NJS-040. If you keep making requests to the app when you see NJS-040, do some connect?

Keeping poolMin == poolMax won't stop connection storms if the underlying network transport has timed out (e.g via firewall sniping) or if the DBA is killing idle sessions. If this is the problem you are seeing, then restarting the pool will have the same problem as recreating each connection in the existing connection pool does, won't it? I wonder if this is a scenario where it's useful to NOT have poolMin==poolMax and have a small poolIncrement so that a few connections can be quickly created ?

For valid performance reasons, the pool doesn't know if connections are usable until it tries to use or ping them. Overall the DBAs and system owners need to recognize the application requirements and provide a 'proper' configuration to let it do its work. Can you find out if the connections are being killed by the network or DBAs - you haven't mentioned any errors that would make believe they are.

What if you set up a heartbeat ping? This assumes the problem is sessions being killed.

cjbj on 14 Aug 2018

I thought the default poolPingInterval is 60 seconds? Shouldn't that be enough to check aliveness of the connections in the pool? I can make it shorter of course.

The queueTimeout being hit when a large amount of updates are happening (200k queries in 2,5 minutes) seems logical and should be fixed by us by buffering or merging queries together. After it starts throwing NJS-040, none of the following queries connect though, which seems strange.

The NJS-040 errors when the connection has been inactive for a long time are maybe fixable with the poolIncrement solution you mentioned. I'll try to play around with that.

Ventis on 14 Aug 2018

poolPingInterval is ignored in node-oracledb 2 when you are using 12.2 client libraries. We'll reenable it with node-oracledb 3 because it helps the case when DBAs kill sessions, but I don't think this has a bearing on your problem. Are you seeing any ORA errors? Check the doc on poolPingInterval because your wording isn't what I would have used.

What is the relationship between 200k queries and the number of connections - it sounds like 1-to-1, which seems inefficient? Everytime you get or release a connection the pool has to lock. There have been some improvements in 18c client.

Are you sure you're not deadlocking?

If you're doing non-DB work in Node, why not bump UV_THREADPOOL_SIZE bigger than maxPool?

Instead of dumping the whole logStats, try logging connectionsInUse and connectionsOpen and triple checking that closes are occurring for each getConnection().

cjbj on 14 Aug 2018

I am getting an ORA-03113 error:
Gist

Ventis on 14 Aug 2018

@Ventis Going to back to the code in your original question, I see this line:

if (!resultSet) await this.closeConnection(connection);

Granted, you don't seem to be using resultSet within the function (do any callers set it?), but how would you close the connections if resultSet was true? It seems the connection variable would be out of scope, no?

Also, why do you set autoCommit to true?

Have you tried the 12.1 or 18.3 client libraries? The 12.1 will give you back the built-in pool ping feature and Chris has mentioned that 18.3 client has some other improvements that can help.

If you don't want to try different clients, then with the current client you can try:

Setting the poolMin to 0. This will allow idle connections that pass the poolTimeout to be recreated.
Implementing a manual pool ping. Make a getConnection function that gets a connection an executes something like select 1 from dual. If it succeeds, then return the connection. If it fails you can have some retry logic there. Granted, you incur the full round trip, but maybe that's best for your situation for now.

Note that poolTimeout and queueTimeout are different. The NJS-40 error should occur when the queueTimeout has been exceeded. However, the Gist you showed indicates that you're not exceeding the default of 60 seconds (your max time in queue is 25.6 seconds).

Something isn't adding up. Perhaps you're making changes to the pool config and we're seeing stats from a previous config.

Chris also mentioned that the ratio of connections to queries seems inefficient. Can you tell us more about the app? Does it only execute select statements or does it do DML as well? How often is resultSet = true? Are you running a REST API or doing something different? Please tell us more about your scenario.

dmcghan on 14 Aug 2018

Hi @dmcghan ,

Regarding the resultSet, I handle the closeConnection when it's a resultSet separately but in all honesty, it's never used. I use simple queries for small amounts of data or updates and streaming queries for large selects.

To give you an idea of our use case (which is pretty complex to explain, but I'll try): we're working on a city simulation engine. Basically, we're talking to different calculation models (air quality, traffic, noise,...) from our node backend and proxying the calculations they make to our client (which is a Unity application).

The calculation models query the Oracle DB directly and inform each other (and our backend) through socket messages of the change they made and what other data it impacted. Think of it like this: I can close down a road in the city, so I do an update to the road table and inform all the calculation models of the change by posting a message on the "road" socket they are listening to.

The air model start recalculating all the impacted air quality measurements, updates the data in the Oracle DB and starts posting the changed records one by one on an 'air' socket. Our backend is subscribed to this air socket, and we want to get the new air data (the air model only posts the record id) from the DB and send it to our Unity client and visualise it.

Now, if this is an important road (like a highway), one simple update from our client (1 query) can trigger over 200k updates from the air quality model (inside of 2 minutes). In all honesty, when we started we didn't assume the amount of updates would be that large so for every message we receive, we do one query. So 200k message becomes 200k select queries. I'm improving this at the moment so we batch the messages we get and do larger select queries (which take longer but less need for more connections).

Ventis on 14 Aug 2018

@Ventis Wow, 200k queries for 1 update? I think you've identified the most important part of the app to focus on! :)

I answered a related question here, maybe it can help:
https://stackoverflow.com/questions/49430967/how-to-control-the-number-of-oracles-request-in-node-js/49682340#49682340

There are two options there. The first populates a temp table with id values and then uses the temp table in a subsequent where clause to get what's needed. The second option skips the temp table and uses a nested table instead, but the concept is the same.

You mentioned that this is an "external partner's DB" so hopefully they will let you create an object or two so that you can do what needs to be done most efficiently. A PL/SQL cursor for loop could be used if needed, but the other options may be better.

Let me know if you'd like some code more specific to your situation. I'd need some more details, but you can keep it fairly generic.

dmcghan on 14 Aug 2018

The gist has:

stream 'error' event: Error: ORA-03113: end-of-file on communication channel

Was the error after your system idle period?

You seem to have two issues

errors under load
errors after an extended idle period

Which of the two situations did the ORA-3113 occur in?

Related to what @dmcghan mentioned, are you sure you're closing connections in all error cases?

Did you find out if the network/firewall is killing idle sessions? For connections killed by the DBA or from a resource profile limit, I would expect different error numbers. An ORA-3113 should be identified by the connection liveness check always done by the session pool in 12.2 client libraries, so I don't think trying 12.1 client will give you any benefit from poolPingInterval.

Can you create & check Oracle client trace files to see if there is more information about the ORA-3113? For tracing, I typically I create a sqlnet.ora file like:

ADR_BASE=/tmp/mylogs
DIAG_ADR_ENABLED=ON

and run mkdir /tmp/mylogs.

Alternatively with Instant Client zip files I run:

rm -rf $HOME/instantclient/log/diag/clients && mkdir -p $HOME/instantclient/log/diag/clients

With big systems the volume of tracing might be prohibitive.

The gist doesn't show the 'pool connections in use' value after the error. That might be interesting - although other pool users may affect the value and negate the usefulness of the before/after error connection comparison.

From the gist, the max queue length of 15721 is pretty big, and the max time in the queue is large. In a system that was getting & closing connections correctly, I would look at reducing the number of getConnection calls, or increasing the pool size, or getting a faster machine, or making sure the app was very efficient etc.

You're in the best position to revisit the overall architecture, but here are some random micro thoughts: use oracledb.ARRAY not oracledb.OBJECT. Keep an eye on over committing - do you really need to turn on oracledb.autoCommit? If you are only fetching single rows, keep the relevant size parameters e.g. maxRows small. Do more in the DB with SQL or PL/SQL, as @dmcghan mentioned. Maybe even explore features available in Oracle Spatial and Graph?

PS details matter, so be careful to say 'query' only when you mean SELECT.

cjbj on 15 Aug 2018

After some more investigating, it's apparently very common that the firewall the DB is behind kills idle sessions.

What's the downside of using autocommit when doing SELECT queries, and what's the upside of using ARRAY instead of OBJECT as outFormat?

Ventis on 16 Aug 2018

Also, on the array instead of object question, the simple answer is that the code does more when delivering objects back to the JS layer:

Check how when the outformat is not rows array, the layer does more things to return the code back.

for (uint32_t row = 0; row < baton->rowsFetched; row++) {
        if (baton->outFormat == NJS_ROWS_ARRAY)
            rowAsArray = Nan::New<Array>(resultSet->numQueryVars);
        else rowAsObj = Nan::New<Object>();
        for (uint32_t col = 0; col < resultSet->numQueryVars; col++) {
            var = &resultSet->queryVars[col];
            if (!njsConnection::GetScalarValueFromVar(baton, var, &var->buffer,
                    row, val))
                return;
            if (baton->outFormat == NJS_ROWS_ARRAY)
                Nan::Set(rowAsArray, col, val);
            else {
                keyVal = Nan::New<String>(var->name).ToLocalChecked();
                Nan::Set(rowAsObj, keyVal, val);
            }
        }
        if (baton->outFormat == NJS_ROWS_ARRAY)
            Nan::Set(rows, row, rowAsArray);
        else Nan::Set(rows, row, rowAsObj);
    }

The complex answer might have to do with internals and ODPI stuff, which I'd rather not talk about because I'm not an expert on it.

danilohgds on 16 Aug 2018

👍2

What's the downside of using autocommit when doing SELECT queries

commit is used to end a transaction, which starts with manipulating data, not querying it. At best, you're doing unnecessary work and at worst, you're ending transactions prematurely/unexpectedly when callers of executeQuery only want to execute a query.

On the other hand, there's no upside to committing a query unless it was used to verify something as the very last step of a transaction - in that case, you save a round-trip.

what's the upside of using ARRAY instead of OBJECT as outFormat

I'm not sure I agree with this one (though maybe @cjbj will convince me). I believe the data comes from the database to Node.js the same way in both cases, so it just has to do with how rows are exposed in Node.js. Perhaps it's a bit more work, but I prefer the OBJECT format myself. I should probably run a perf test to compare...

dmcghan on 16 Aug 2018

@danilohgds Thanks for looking that up! If it is just a little extra work, I'm okay with that. I think it helps the code read a little easier and you have the advantage of being able to re-order columns without affecting other parts of the code. Just my two cents.

dmcghan on 16 Aug 2018

👍1

@Ventis what's the current status of your problem(s)? I'm curious.

cjbj on 17 Aug 2018

@cjbj I made some optimisations. Firstly, I now batch incoming websocket messages with better-queue (the ones that required me to do a select query) per 1k, turning 200k individual select queries into 200 WHERE IN queries. That should help a lot on the "errors under load" side and reduce the queue length.

I also changed some of the config setting, like keeping a small poolIncrement, and not keeping poolMax and poolMin the same. Also, I increased UV_THREADPOOL_SIZE to something above poolMax so node doesn't get deadlocked.

The 'errors after and extended idle period' problem though I don't have many solutions for. I'm going to keep it running through the weekend and see on Monday if it's still working.

The auto-commit idea is something I will try out next.

Ventis on 17 Aug 2018

@Ventis Sounds like things are moving in the right direction! 👍 Just curious, how did you approach the "in" clause? Temp table, nested table type, other?

Please let us know if your perf goals are being met after your updates.

dmcghan on 17 Aug 2018

Hey, quick update. So the 'errors under load' are definitely fixed. Haven't seen any queue timeouts anymore and the performance is definitely improved. I created what is probably the most inefficient way to do a where in clause, building a list of id's in-memory and constructing a querystring from it by joining the array with commas (since I read that you currently can't bind_in arrays).

The timeout of the connections in the pool after a long period of inactivity is still happening though. Left the backend up this weekend and on monday morning I couldn't execute any queries anymore without restarting the backend.
As evidenced by the gist here, the pool is definitely not out of available connections. But as you guys pointed out that doesn't mean the "open" connections are valid.
gist

Ventis on 20 Aug 2018

@Ventis Glad to hear the perf is better!

since I read that you currently can't bind_in arrays

Actually, you can bind in arrays of number, string, and date. But you can't bind in arrays of objects with multiple fields yet. See this section of the doc.

Since Oracle 12c it has been possible to use associative arrays with the TABLE operator. See this recent issue for an example.

I plan to do a blog post that covers the various ways you can do "where in". Here's what I have so far:

Hardcoded or dynamically generated list of bind variables: where in (:val1, :val2, :val3, ...)
Use of TABLE operator on an associative array for 12c and above: where in (select column_value from table(:associative_array_in)
Use of TABLE operator on nested table type or pipelined function for earlier versions of Oracle: where in (select column_value from table(:nested_table_type_in) or where in (select column_value from table(pipelined_table_function(:some_values_to_parse_in))
Use of temporary table: where in (select c from t)

dmcghan on 20 Aug 2018

@Ventis I doubt we'll ever be able to bind an array to an IN clause. The current node-oracledb documentation on working with IN clauses is at Binding Multiple Values to a SQL WHERE IN Clause.

In any discussion of big data, I feel obliged to remind everyone to use use executeMany() when inserting lots of data into the DB. This might come into play, for example, if you use a global temporary table.

For the inactivity issue, try either the 12.1 client, the patch in src/dpiConn.c https://github.com/oracle/odpi/commit/219dfbb2f2d97db40b7a15e1d1014e8e5714bf76, or wait for node-oracledb 3. Don't forget that network dropouts can occur at any time, so your normal execute error handling should probably be enhanced to retry connecting if an ORA-3113 is seen.

cjbj on 21 Aug 2018

Hi @cjbj I'm trying out the 18.3 client like you suggested now. No advantage to using that?

Ventis on 21 Aug 2018

@Ventis nope. In fact, unless you have poolMin = poolMax, in node-oracledb 2 you will hit the 18.3 behavior change with OCI_SPOOL_ATTRVAL_NOWAIT mentioned in https://blogs.oracle.com/opal/some-new-features-of-oracle-instant-client-183 Roll on node-oracledb 3, where we will change the internal code & mode so there should be no user visible behavior change in this area. But this is all unrelated to dead session detection.

I still believe you should be implementing your own error checking, since network failures could occur at anytime and you want your application to be resilient. Our ping tweaks in node-oracledb 3 are "nice" to users but can't protect against everything.

cjbj on 21 Aug 2018

What's the timeline on node-oracledb 3 anyway?

Ventis on 21 Aug 2018

I'll know better at the end of this week.

cjbj on 21 Aug 2018

Hey @cjbj, I'm actually doing just that, but I'm having problems trying to recover from the "dead session" situation.

Hypothetically speaking, if I detect a NJS-040 or ORA-03113 or any other error that makes the open connections in the pool unusable, I should recreate the pool right? But the way I'm doing it doesn't seem to work very well:

/**
   * Initializes a connection pool to an Oracle db
   * @param {String} poolAlias Name of the pool to init
   */
  async init(poolAlias) {
    try {
      switch (poolAlias) {
        case DIGITAL_TWIN_POOL:
          await oracledb.createPool(this.config);
          break;
        case DIGITAL_TWIN_POOL_EDIT:
          await oracledb.createPool({
            connectionString: `${process.env.DB_HOST}:${process.env.DB_PORT}/uspsde`,
            password: process.env.DB_PASSWORD,
            user: process.env.DB_USER,
            poolAlias: DIGITAL_TWIN_POOL_EDIT,
          });
          break;
      }
    } catch (err) {
      throw new Error('Error initializing db connection: ', err);
    }
  }

  /**
   * Closes the connection pool
   * @param {String} poolAlias The pool to close
   */
  async closePool(poolAlias) {
    try {
      console.log('Destroy pool:', poolAlias);
      const pool = await oracledb.getPool(poolAlias);
      await pool.close();
    } catch (err) {
      console.log('Error destroying pool:', err);
    }
  }

  /**
   * If the error thrown is a timeout error, reinit the connection pool
   * @param {Error} err Error with errorNum property
   * @param {String} poolAlias The pool that gives the error
   */
  async timeoutError(err, poolAlias) {
    console.log('errorNum:', err.errorNum);
    if (err.message.startsWith(NJS40) || err.errorNum === 3113) {
      await this.closePool(poolAlias);
      await this.init(poolAlias);
    }
  }

Ventis on 21 Aug 2018

Hypothetically speaking, if I detect a NJS-040 or ORA-03113 or any other error that makes the open connections in the pool unusable, I should recreate the pool right?

For errors like ORA_3114, you just need to close the connection and call getConnection() again. If you get NJS-040 then you can wait and retry the getConnection. In neither case should it be necessary to close the pool.

I should have restated that you could try the 12.1 client, which may help your idle connection re-establishment problem

cjbj on 21 Aug 2018

Ok, thx for all your help @cjbj , @dmcghan and @danilohgds! I think I can take it from here, if I run into any more issues I'll be back :)

Ventis on 21 Aug 2018

@Ventis If you care to build node-oracledb from source code, I've pushed the latest set of node-oracledb 3 changes to the GitHub master branch. This picks up the change regarding poolPingInterval, and the internal change regarding 18.3 session pooling so you can use any Oracle client libraries.

I'll close this issue since you seem to be on your way.

cjbj on 22 Aug 2018

👍1

Hello, I am experiencing the same problem as @Ventis with the NJS-40 Error after a period of inactivity. For a little more context incase the errors are not entirely the same, the error only happens after a load test that is followed by a period if inactivity. If the load is kept more or less consistent for the life of the pod, then there is no error. Additionally the first error that is seen is ORA-03113 which seems to trigger the NJS-40 error some time later.

@cjbj I was wondering if the solution of changing the client version ended up being the solution or if there were any additional steps?

brennanclark on 10 Oct 2019

@brennanclark It would be interesting to know details - versions, connection counts, how long the 'period of inactivity' is.

My general suggestion for connection timeouts is to use a multi-prong approach: use Oracle client libraries >= 12.2; make sure that UV_THREADPOOL_SIZE is big enough; make sure connections are being released correctly; make sure the connection pool isn't oversized; stop the network from dropping connections and stop the DBA (or resource manager) from killing them. In your specific case, the ORA-3113 would indicate the network is dropping connections. Try EXPIRE_TIME to keep them alive, see https://github.com/oracle/node-oracledb/issues/1115#issuecomment-513098962.

cjbj on 10 Oct 2019

@cjbj Thanks for replying!
my specs are:
Client version - 18.3
Pool Details:
poolMax - 10
poolMin - 10
poolIncrement - 0
ThreadPool size - 14*

The period of inactivity is 1 hour, then the ORA-3113 occurs.

*I have created a sqlnet.ora where I have set SQLNET.EXPIRE_TIME=10

*In the dockerfile I have set the following environment variables:
ENV TNS_ADMIN /src/oracleConfig
ENV UV_THREADPOOL_SIZE=14

I've followed your example for setting up a pool so I release the connections like this:
try { // Returns the connection to the pool. await conn.close(); logger.info('Oracle connection closed'); } catch (error) { wrapError('Oracle query closing err', error); }

Something to note: The error has persisted; however after adding the sqlnet.ora param the NJS error still occurred and after it did a series of 6 connections were closed at once and the service started responding as expected again.

brennanclark on 10 Oct 2019

If you set EXPIRE_TIME on the Node.js server side of the network, I was told it needs to be in the connection string (e.g. tnsnames.ora) (as mentioned in the issue I linked to).

cjbj on 11 Oct 2019

After some time testing, I have seen the ORA-3113 error resolved by adding the Expire Time; however I am still seeing the NJS-040 error after an hour long idle period.

brennanclark on 16 Oct 2019

I just wanted to pop back in to confirm that you were right, the expire_time has to be set in the tnsnames.ora file not the sqlnet.ora file as the other threads had mentioned. In my case I put it directly into the connectString in the config.

Thanks!

brennanclark on 22 Oct 2019

@brennanclark did that solve your NJS-040s?

cjbj on 23 Oct 2019

@cjbj yeah, that solved it!

brennanclark on 23 Oct 2019

@brennanclark that's great news. However, NJS-040 is a sign of a connection storm or too-small connection pool. Keep an eye on the system and try to identify why the error might have been occurring. If it is/was because connections were being killed by something like a firewall, then investigate whether the firewall can be 'fixed'.

cjbj on 24 Oct 2019

I just wanted to pop back in to confirm that you were right, the expire_time has to be set in the tnsnames.ora file not the sqlnet.ora file as the other threads had mentioned. In my case I put it directly into the connectString in the config.

Thanks!

Hello @brennanclark , I've experiencing with this problem also. It would be nice if you provide your connectionString and options of createPool method? Thanks in advance :)