We are seeing repeated "event loop blocked" reports in our LoopBack application.
checking client timeout
connecting new client
new client connected
pulse queue
no queued requests
Event loop blocked for 36.808244995117185 milliseconds [ ' at Socket.connect (net.js:918:7)',
' at Connection.connect (/Users/davidknapp/loopback-sandbox/node_modules/pg/lib/connection.js:54:17)',
' at Client._connect (/Users/davidknapp/loopback-sandbox/node_modules/pg/lib/client.js:102:9)',
' at Client.connect (/Users/davidknapp/loopback-sandbox/node_modules/pg/lib/client.js:288:10)',
' at Pool.newClient (/Users/davidknapp/loopback-sandbox/node_modules/pg-pool/index.js:208:12)',
' at Pool.connect (/Users/davidknapp/loopback-sandbox/node_modules/pg-pool/index.js:183:10)',
' at PostgreSQL.executeSQL (/Users/davidknapp/loopback-sandbox/node_modules/loopback-connector-postgresql/lib/postgresql.js:166:13)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-connector/lib/sql.js:651:10',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:259:9',
' at doNotify (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)',
' at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)',
' at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)',
' at PostgreSQL.ObserverMixin.notifyObserversAround (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:228:15)',
' at PostgreSQL.SQLConnector.execute (/Users/davidknapp/loopback-sandbox/node_modules/loopback-connector/lib/sql.js:650:8)',
' at PostgreSQL.find [as all] (/Users/davidknapp/loopback-sandbox/node_modules/loopback-connector/lib/sql.js:1460:8)',
' at invokeConnectorMethod (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/dao.js:172:21)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/dao.js:1755:7',
' at doNotify (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)',
' at doNotify (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)',
' at doNotify (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)',
' at doNotify (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)',
' at Function.ObserverMixin._notifyBaseObservers (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)',
' at Function.ObserverMixin.notifyObserversOf (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)',
' at Function.ObserverMixin._notifyBaseObservers (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:176:15)',
' at Function.ObserverMixin.notifyObserversOf (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)',
' at Function.ObserverMixin._notifyBaseObservers (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:176:15)',
' at Function.ObserverMixin.notifyObserversOf (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)',
' at Function.ObserverMixin._notifyBaseObservers (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:176:15)',
' at Function.ObserverMixin.notifyObserversOf (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)',
' at Function.find (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/dao.js:1753:10)',
' at SharedMethod.invoke (/Users/davidknapp/loopback-sandbox/node_modules/strong-remoting/lib/shared-method.js:270:27)',
' at HttpContext.invoke (/Users/davidknapp/loopback-sandbox/node_modules/strong-remoting/lib/http-context.js:297:12)',
' at phaseInvoke (/Users/davidknapp/loopback-sandbox/node_modules/strong-remoting/lib/remote-objects.js:695:9)',
' at runHandler (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/lib/phase.js:135:5)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:3110:16',
' at replenish (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:1011:17)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:1016:9',
' at eachLimit$1 (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:3196:24)',
' at Object.
' at runHandlers (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/lib/phase.js:144:13)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:3110:16',
' at replenish (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:1011:17)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:1016:9',
' at eachLimit$1 (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:3196:24)',
' at Object.
' at Phase.run (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/lib/phase.js:148:9)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/lib/phase-list.js:256:13',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:3110:16',
' at replenish (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:1011:17)',
' at iterateeCallback (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:995:17)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:969:16',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:473:16',
' at replenish (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:1006:25)',
' at iterateeCallback (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:995:17)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:969:16',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:473:16',
' at replenish (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:1006:25)',
' at iterateeCallback (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:995:17)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:969:16',
' at /Users/davidknapp/loopback-sandbox/node_modules/strong-remoting/lib/remote-objects.js:671:7',
' at process._tickCallback (internal/process/next_tick.js:61:11)' ]
remove idle client
We see numerous "event loop blocked" messages in our large LoopBack application. We were able to reproduce the issue in the minimal sandbox app attached.
At first, we believed that the issue was with the pg library, but we could not reproduce it in a standalone Node.js app.
Then we added a custom "list" method on the supplier model that calls the connector directly for the GET, and we did not get any blocking messages. We believe this indicates that the issue lies elsewhere in LoopBack.
The blocking appears to be related to getting a new connection from the pool. If you click "try it out" multiple times, you may not see the error on the second and subsequent clicks. If you then wait for the connection to be closed after the default 10-second idle timeout, you will likely see the blocked message again.
In this small app on my machine the blocks are 30-40 ms. In our complex app, we see blocks reach the 100-200 ms range. We also see these blocks when running in Cloud Foundry with Compose databases, so we do not believe this is OS or machine related.
Loopback should not block the event loop.
https://github.com/Apex-BPM/loopback-sandbox
node -e 'console.log(process.platform, process.arch, process.versions.node)':
darwin x64 10.16.0
npm ls --prod --depth 0 | grep loopback:
[email protected] /Users/davidknapp/loopback-sandbox
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
I have been using LoopBack with Postgres for a long time and never experienced such an issue. I cloned the repo and ran it against my Postgres installation, and I didn't find any of the issues you described. I am using Postgres 9.6 on my local machine.
@sujeshthekkepatt Thanks for trying to reproduce the issue. We have been using and loving Loopback with Postgres for almost five years as well :). But we only recently added blocked-at to our app to check for thread blocking issues in dev and test. The issue could have been there for some time before that (how would we know?).
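For reference, wiring up blocked-at looks roughly like this; a minimal sketch, and the 20 ms threshold is my assumption rather than the sandbox's actual value:

```js
// Report any event-loop block longer than the threshold, together with the
// stack of the operation that started the blocking work.
const blockedAt = require('blocked-at');

blockedAt((time, stack) => {
  console.log(`Event loop blocked for ${time} milliseconds`, stack);
}, { threshold: 20 }); // threshold in ms; assumed value for illustration
```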
I am sorry that you could not reproduce the issue. Perhaps there is some difference in our environments or the testing approach. Here are some additional details on my environment:
node version: v10.16.0
postgres version: PostgreSQL 11.2 (Debian 11.2-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit
I would be happy to jump on a zoom when we are both available to see if we can figure out what is different.
In the meantime, I added database environment variables to the datasource and pushed the app to Cloud Foundry on Pivotal. I set it up to connect to a Compose.io Postgres database. The results were actually a little worse: there was blocking during the load of the suppliers as well as three times during the GET. There is still no blocking with the custom list endpoint.
I have included the logs in the repo here: https://github.com/Apex-BPM/loopback-sandbox/blob/master/cloud_foundry_logs
The cloud foundry instance was running on:
node: v13.3.0
postgresql: PostgreSQL 9.6.12 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4, 64-bit
The fact that there is no blocking on the custom endpoint which uses the connector directly makes me think that the issue is not in the connector or the pg library but instead somewhere else in Loopback.
If it would help reproduce the issue more quickly, I would be happy to prepare a Dockerfile and a docker-compose file including the app and the database so that we can run the exact same versions.
From the log, I feel it's something related to Postgres connection management. I am not jumping to any conclusions; I will try to reproduce it. A docker/docker-compose setup would be great :). It would also be convenient if someone from the core team took a look. @bajtos
@sujeshthekkepatt I have added the dockerfile and docker-compose. You should be able to just do docker-compose up and it will run. I tested and had the same results.
@dknapp47 This seems like a Postgres configuration issue. You may need to modify the timeout value, etc. Could you update datasources.json with the following and verify?
"min": 2,
"max": 5,
"idleTimeoutMillis": 60000,
"ssl": false
@sujeshthekkepatt I updated the settings and added logging to confirm that the settings took effect. I also measured the time until the "remove idle client" messages and they match the 60 second settings. Unfortunately it did not fix it. I still see the Event loop blocked messages. I have attached the logs from the container.
logs.txt
@sujeshthekkepatt While testing on my side, I noticed that things had improved but it turned out that somewhere along the way, the postgres database was reset so I had to reload the suppliers.json. After doing that, I am back to consistent blocking reports.
I have refined the event loop blocking output and also included the logs in the latest commit:
https://github.com/Apex-BPM/loopback-sandbox/blob/master/logs.txt
Line 195 shows that async id 27, which is a TCPWRAP, took 89.4 ms.
It is interesting that the same async id was called multiple times before that. You can see those on lines 137, 143, 153, 163, 174, and 190. Those ranged from a low of 2.5 ms to a high of 6.4 ms.
It is also interesting that the same async id is called two more times after the slow one (see lines 242 and 247), and those were very fast (0.2 ms and 0.0 ms).
I think it happens when a new connection is made to the Postgres server, which accounts for the longer times. I don't think it's related to LoopBack. From the log you can see that the longest time is taken when a new connection to the server is created; before that, no new connection is made, which results in shorter times. But still, I am not familiar with the pg lib. I think it's better if the core team takes a look, @dhmlau
@sujeshthekkepatt I have made more progress this morning. It appears that the issue is in the callbacks after the executeSQL has results. I need to add more logging but I think that something about the way datasource-juggler handles the 1000 row result may be blocking the event loop. I probably won't be able to dig deeper until Thursday.
@sujeshthekkepatt Here is an update. I have been working locally adding timing and changes to eliminate the blocking aspects. Here is a summary of what I have done.
Here is the baseline run output:
blocked for 60ms
at Socket.connect (net.js:918:7)
at Connection.connect (/Users/davidknapp/block-test/node_modules/pg/lib/connection.js:54:17)
at Client._connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:102:9)
at Client.connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:288:10)
at Pool.newClient (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:208:12)
at Pool.connect (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:183:10)
at PostgreSQL.executeSQL (/Users/davidknapp/block-test/node_modules/loopback-connector-postgresql/lib/postgresql.js:166:13)
at /Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:651:10
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:259:9
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)
at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)
at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)
at PostgreSQL.ObserverMixin.notifyObserversAround (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:228:15)
at PostgreSQL.SQLConnector.execute (/Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:650:8)
I then added console.time to the allCb in DataAccessObject.find:
SQLConnector.prototype.all async.mapSeries: 0.959ms
DataAccessObject.find allCb: 44.354ms
blocked for 58ms
at Socket.connect (net.js:918:7)
at Connection.connect (/Users/davidknapp/block-test/node_modules/pg/lib/connection.js:54:17)
at Client._connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:102:9)
at Client.connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:288:10)
at Pool.newClient (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:208:12)
at Pool.connect (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:183:10)
at PostgreSQL.executeSQL (/Users/davidknapp/block-test/node_modules/loopback-connector-postgresql/lib/postgresql.js:166:13)
at /Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:651:10
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:259:9
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)
at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)
at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)
at PostgreSQL.ObserverMixin.notifyObserversAround (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:228:15)
at PostgreSQL.SQLConnector.execute (/Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:650:8)
allCb accounted for most of the blocking time. I then added console.time around data.map in SQLConnector.prototype.all and got these results:
SQLConnector.prototype.all data.map: 9.226ms
DataAccessObject.find allCb: 44.010ms
blocked for 55ms
at Socket.connect (net.js:918:7)
at Connection.connect (/Users/davidknapp/block-test/node_modules/pg/lib/connection.js:54:17)
at Client._connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:102:9)
at Client.connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:288:10)
at Pool.newClient (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:208:12)
at Pool.connect (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:183:10)
at PostgreSQL.executeSQL (/Users/davidknapp/block-test/node_modules/loopback-connector-postgresql/lib/postgresql.js:166:13)
at /Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:651:10
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:259:9
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)
at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)
at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)
at PostgreSQL.ObserverMixin.notifyObserversAround (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:228:15)
at PostgreSQL.SQLConnector.execute (/Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:650:8)
So those two account for all the blocking time. I then added console.time around async.map in allCb in DataAccessObject.find:
SQLConnector.prototype.all data.map: 9.404ms
DataAccessObject.find allCb async.map: 43.528ms
DataAccessObject.find allCb: 43.692ms
blocked for 56ms
at Socket.connect (net.js:918:7)
at Connection.connect (/Users/davidknapp/block-test/node_modules/pg/lib/connection.js:54:17)
at Client._connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:102:9)
at Client.connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:288:10)
at Pool.newClient (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:208:12)
at Pool.connect (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:183:10)
at PostgreSQL.executeSQL (/Users/davidknapp/block-test/node_modules/loopback-connector-postgresql/lib/postgresql.js:166:13)
at /Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:651:10
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:259:9
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)
at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)
at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)
at PostgreSQL.ObserverMixin.notifyObserversAround (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:228:15)
at PostgreSQL.SQLConnector.execute (/Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:650:8)
That surprised me. I would not have expected the async.map to be blocking. I dug deeper in the code and found a callback that was made without process.nextTick (line 155 in observer.js). I added it and got these results:
SQLConnector.prototype.all data.map: 9.707ms
DataAccessObject.find allCb async.map: 409.118ms
DataAccessObject.find allCb: 409.528ms
blocked for 420ms
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:57)
at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)
at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)
at cbForWork (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:240:14)
at /Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:661:9
at Query.<anonymous> (/Users/davidknapp/block-test/node_modules/loopback-connector-postgresql/lib/postgresql.js:146:7)
at Query.handleReadyForQuery (/Users/davidknapp/block-test/node_modules/pg/lib/query.js:128:10)
at Connection.<anonymous> (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:247:19)
at Connection.emit (events.js:198:13)
at Socket.<anonymous> (/Users/davidknapp/block-test/node_modules/pg/lib/connection.js:128:12)
at Socket.emit (events.js:198:13)
That also surprised me. The time got much worse, but notice that the pg library is no longer blocked. I looked through more of the code and added a process.nextTick on the callback inside buildResult:
SQLConnector.prototype.all data.map: 12.456ms
DataAccessObject.find allCb async.map: 394.342ms
DataAccessObject.find allCb: 394.572ms
blocked for 408ms
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:57)
at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)
at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)
at cbForWork (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:240:14)
at /Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:661:9
at Query.<anonymous> (/Users/davidknapp/block-test/node_modules/loopback-connector-postgresql/lib/postgresql.js:146:7)
at Query.handleReadyForQuery (/Users/davidknapp/block-test/node_modules/pg/lib/query.js:128:10)
at Connection.<anonymous> (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:247:19)
at Connection.emit (events.js:198:13)
at Socket.<anonymous> (/Users/davidknapp/block-test/node_modules/pg/lib/connection.js:128:12)
at Socket.emit (events.js:198:13)
Slightly less bad but not good. I changed the async.map to an async.mapSeries in allCb (may not be the right solution but I'm just trying to track down the blocking code):
SQLConnector.prototype.all data.map: 10.615ms
DataAccessObject.find allCb async.mapSeries: 0.434ms
DataAccessObject.find allCb: 0.613ms
blocked for 12ms
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:57)
at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)
at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)
at cbForWork (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:240:14)
at /Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:661:9
at Query.<anonymous> (/Users/davidknapp/block-test/node_modules/loopback-connector-postgresql/lib/postgresql.js:146:7)
at Query.handleReadyForQuery (/Users/davidknapp/block-test/node_modules/pg/lib/query.js:128:10)
at Connection.<anonymous> (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:247:19)
at Connection.emit (events.js:198:13)
at Socket.<anonymous> (/Users/davidknapp/block-test/node_modules/pg/lib/connection.js:128:12)
at Socket.emit (events.js:198:13)
blocked for 16ms
at buildResult (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1705:19)
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1719:13
at process._tickCallback (internal/process/next_tick.js:63:19)
That's much better. I then added a process.nextTick on observer.js line 178:
SQLConnector.prototype.all data.map: 13.822ms
DataAccessObject.find allCb async.mapSeries: 0.363ms
DataAccessObject.find allCb: 0.507ms
blocked for 15ms
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:57)
at process._tickCallback (internal/process/next_tick.js:61:11)
blocked for 15ms
at buildResult (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1705:19)
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1719:13
at process._tickCallback (internal/process/next_tick.js:63:19)
Next, I went after the time in the data.map by converting it to async.map in SQLConnector.prototype.all:
SQLConnector.prototype.all async.map: 212.241ms
blocked for 213ms
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:57)
at process._tickCallback (internal/process/next_tick.js:61:11)
DataAccessObject.find allCb async.mapSeries: 0.415ms
DataAccessObject.find allCb: 0.565ms
SQLConnector.prototype.all cb: 0.738ms
blocked for 19ms
at buildResult (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1705:19)
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1719:13
at process._tickCallback (internal/process/next_tick.js:63:19)
The time got worse but it confirmed that it is the source of the first block. I changed it to async.mapSeries:
SQLConnector.prototype.all async.mapSeries: 0.959ms
DataAccessObject.find allCb async.mapSeries: 0.334ms
DataAccessObject.find allCb: 0.520ms
SQLConnector.prototype.all cb: 0.702ms
blocked for 18ms
at buildResult (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1705:19)
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1719:13
at process._tickCallback (internal/process/next_tick.js:63:19)
Now we are down to just one block. I guessed that the last one might be "above" DAO in the call stack. I added console.time around cb in allCb async.mapSeries:
SQLConnector.prototype.all async.mapSeries: 1.058ms
DataAccessObject.find allCb async.mapSeries: 0.331ms
DataAccessObject.find allCb: 0.514ms
SQLConnector.prototype.all cb: 0.653ms
DataAccessObject.find allCb async.mapSeries cb: 18.830ms
blocked for 19ms
at buildResult (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1705:19)
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1719:13
at process._tickCallback (internal/process/next_tick.js:63:19)
I continued by adding console.time a couple more layers up:
SQLConnector.prototype.all async.mapSeries: 0.954ms
DataAccessObject.find allCb async.mapSeries: 0.293ms
DataAccessObject.find allCb: 0.461ms
SQLConnector.prototype.all cb: 0.627ms
HttpContext.prototype.invoke fn: 16.769ms
SharedMethod.prototype.invoke cb: 16.934ms
DataAccessObject.find allCb async.mapSeries cb: 17.427ms
blocked for 18ms
at buildResult (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1705:19)
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1719:13
at process._tickCallback (internal/process/next_tick.js:63:19)
I ran out of time to follow those further up the call stack today.
So my summary is that when there are a large number of results (I was using 4000 rows here), there are parts of the package that are blocking the thread. Most of the changes were the simple addition of process.nextTick on callbacks that were made immediately. I am not sure if moving to async.mapSeries is the right answer. Maybe there is a better way to make those two parts not block the thread.
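To illustrate the pattern being described (a generic sketch, not the actual observer.js or dao.js code): a callback invoked synchronously keeps extending the current tick, while deferring it with process.nextTick at least releases the current stack:

```js
// Generic illustration: defer an immediately-invoked callback to a later tick.
function notifyObservers(observers, context, callback) {
  if (!observers || observers.length === 0) {
    // Before: callback(null, context); the callback ran inside the caller's
    // tick, so any heavy work in it extended the same tick.
    return process.nextTick(() => callback(null, context));
  }
  // ...otherwise iterate over the observers and invoke `callback` when done...
}
```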
Let me know what you think. If needed, I can package up what I have so that someone else can reproduce it.
@bajtos @raymondfeng @strongloop/sq-lb-apex, could you please help with this? Thanks.
Whoa, thank you @dknapp47 for such detailed analysis!
So my summary is that when there are a large number of results (I was using 4000 rows here), there are parts of the package that are blocking the thread.
Based on my past experience, your summary looks like a correct conclusion to me.
Let me try to reproduce the issue on my machine and then investigate what options we have to fix this problem.
I wanted to reproduce the issue while running the application locally, using Docker for the PostgreSQL database server only. I ran docker-compose -f postgresql.yml up but the application was not able to connect with the current configuration.
So first of all, I updated server/datasources.json as follows:
{
  "db": {
    "name": "db",
    "connector": "memory"
  },
  "postgres": {
    "name": "postgres",
    "connector": "postgresql",
    "host": "127.0.0.1",
    "port": "5432",
    "user": "postgres",
    "database": "postgres"
  }
}
With this new config, I was able to start the app, post sample supplier data to create 4000 records (I executed the POST request four times).
I found the logs introduced by https://github.com/Apex-BPM/loopback-sandbox/commit/38c5b0526d547a6b79256b393550933dc5d562d2 to be too verbose, so I commented out the async_hooks-based logging and restored blocked-at based reporting.
With this change in place, I was able to make a query for 4k records in the API Explorer and observe a warning printed to the console (as expected).
Here is my current understanding of what's happening under the hood:
The code converting the result from pg to an HTTP response is fully synchronous right now; we perform all the work in a single tick of the event loop. While this is probably a good thing when the number of records (array items) is low, it does not scale well as the size of the data increases.
I verified the claim above by adding three console.log statements into the allCb function:
- before async.map is called
- inside the callback of async.map (the callback invoked after all records are processed)
- after async.map returns (within the same tick)

I think this explains the changes in timing data observed after changing notifyObserversOf from sync to async and replacing map with mapSeries. Depending on the changes made, some parts of the workload were split into multiple ticks. But that comes with its own overhead, which shows up elsewhere.
To further test my hypothesis, I tested the impact of response size.
This reinforces my opinion that we are not doing anything costly inside LoopBack; it's just that the more records we return, the more work we have to do, and at some point we hit the threshold for how long a tick may take.
In my opinion, the right solution for this problem is to limit the number of records we process within a single tick and periodically yield control back to the event loop. For example, in our Supplier case we can process 4000 rows in 4 batches of 1000 records, yielding control back after every 1000 records processed.
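A minimal sketch of that idea (not LoopBack code): process a fixed-size batch synchronously, then hand control back to the event loop with setImmediate before continuing:

```js
// Process `rows` in synchronous batches of `batchSize`, yielding to the event
// loop between batches so no single tick handles the whole result set.
function processInBatches(rows, batchSize, processRow, done) {
  let index = 0;
  (function runBatch() {
    const end = Math.min(index + batchSize, rows.length);
    for (; index < end; index++) {
      processRow(rows[index]); // synchronous per-row work
    }
    if (index < rows.length) {
      setImmediate(runBatch); // let timers and I/O callbacks run before the next batch
    } else {
      done();
    }
  })();
}

// Example: 4000 rows in batches of 1000 spread the work over 4 ticks.
processInBatches(
  Array.from({ length: 4000 }, (_, i) => ({ id: i })),
  1000,
  (row) => JSON.stringify(row),
  () => console.log('all rows processed')
);
```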
Now the question is how to decide how often to yield. I don't think a single number of records will work for all use cases: for some of them, yielding in allCb may not be needed at all, or perhaps it's enough to yield less often.
BTW, if your response contains many rows, then JSON.stringify can become a bottleneck too! LoopBack depends on res.json() provided by Express. My colleagues at IBM have written a module called yieldable-json to address this problem; see their statistics for timing data.
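A small sketch of that idea, assuming yieldable-json's stringifyAsync(obj, callback) API:

```js
// Serialize a large payload without one long synchronous JSON.stringify call;
// yieldable-json splits the serialization work across ticks internally.
const yj = require('yieldable-json');

const payload = Array.from({ length: 4000 }, (_, i) => ({ id: i, name: `supplier-${i}` }));

yj.stringifyAsync(payload, (err, json) => {
  if (err) throw err;
  console.log('serialized', json.length, 'characters');
});
```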
On second thought (and also because I was not able to find a module that would make it easy to yield control back every N iterations), this change in lib/dao.js seems to remove the blocked-at warning completely:
- async.map(data, function(item, next) {
+ async.mapLimit(data, 10, async.ensureAsync(function(item, next) {
By wrapping the iterating function in async.ensureAsync, we ensure it always defers the callback to the next tick. The implementation is clever enough to skip setImmediate if the callback was already called outside of the initial tick.
This change converts fully sync processing into fully async, where each iteration is started on the first tick and finishes on the second tick. As you have observed, this significantly increased the time the loop is blocked (from ~40 ms to ~400 ms). I am assuming this is the overhead added by parallelization.
By limiting the number of iterating function calls running in parallel, we spread the overhead over multiple ticks, keeping each tick within the threshold.
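A self-contained illustration of the pattern in the diff above (the sample data and the limit of 10 are arbitrary, not values taken from the juggler code):

```js
const async = require('async');

const rows = Array.from({ length: 4000 }, (_, i) => ({ id: i }));

async.mapLimit(
  rows,
  10, // at most 10 items are in flight at any time
  async.ensureAsync(function(row, next) {
    // Purely synchronous work per row; ensureAsync defers `next` to a later
    // tick, so the whole map no longer completes inside a single tick.
    next(null, { id: row.id, processed: true });
  }),
  function(err, results) {
    if (err) throw err;
    console.log('processed', results.length, 'rows');
  }
);
```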
However, while this may remove the problem of blocked event loop, I am concerned that it may decrease the overall performance (time to serve the request).
Further research is required here.
Maybe we could use a sensible default and allow it to be overridden with something in options. That would let us set the option for queries that return large result sets for example:
Suppliers.find({where:{status:'Active'}},{resultChunkSize:100},callback);
For backward compatibility, we could set the resultChunkSize to 0 indicating that there is no chunking.
However, while this may remove the problem of blocked event loop, I am concerned that it may decrease the overall performance (time to serve the request).
I ran a quick benchmark using autocannon and the following command:
$ autocannon http://localhost:3000/api/Suppliers?filter=%7B%22limit%22%3A4000%7D
Using vanilla juggler code, I observed average latency 356.68 ms and 27.5 requests per second.
After I applied the patch, the performance has significantly degraded to average latency 8225.8 ms and about 1 request per second.
Maybe we could use a sensible default and allow it to be overridden with something in options. That would let us set the option for queries that return large result sets for example:
Suppliers.find({where:{status:'Active'}},{resultChunkSize:100},callback);
I agree that a new feature flag (a new option) may be helpful. I am afraid it's not enough to implement it in find options only, because these options cannot be provided by HTTP clients. I am thinking about exposing this configuration at model level.
For backward compatibility, we could set the resultChunkSize to 0 indicating that there is no chunking.
I see two views here:
Use a sensible non-zero default, because this is likely to improve performance for all LoopBack users. In (hopefully rare) cases where the performance degrades after the change, users can set the value to 0 where needed.
Prioritize backwards compatibility over everything else and don't process query results in chunks unless explicitly configured by the app developer.
@strongloop/loopback-3x-maintainers thoughts?
I managed to create a working prototype that processes query results in chunks of 500 rows, does not trigger blocked-at warnings and mostly preserves the performance:
- A blocked-at warning is triggered on the first run; I think this is caused by the fact that V8 hadn't had a chance to optimize the code yet.
- Average latency: 363.96 ms (2.0% worse than 356.68 ms)
- 27 requests per second on average (1.8% worse than 27.5)

The source code is here: https://github.com/strongloop/loopback-datasource-juggler/compare/3.x...feat/chunked-all-cb?expand=1&w=1 The changes are best reviewed with whitespace changes ignored. The first commit is a refactoring, the second contains the relevant changes.
I'd like to discuss this topic with colleagues more experienced with performance fine-tuning, before going forward to open any pull requests.
@bajtos Can't we use a worker thread? Just a quick question
Can't we use a worker thread? Just a quick question
I am not familiar with worker threads enough to be able to tell how easy or difficult it would be to make LoopBack work on worker threads and what would be the overhead of passing the data between the main loop and the worker thread. I suspect it would require significant design changes inside the framework too.
Can't we use a worker thread? Just a quick question
I am not familiar with worker threads enough to be able to tell how easy or difficult it would be to make LoopBack work on worker threads and what would be the overhead of passing the data between the main loop and the worker thread. I suspect it would require significant design changes inside the framework too.
That I agree. But offloading work is the better choice when dealing with CPU-intensive ops, right? Batching improves the performance but that too has a limitation, I think. Just brainstorming. @bajtos
That I agree. But offloading work is the better choice when dealing with CPU-intensive ops, right? Batching improves the performance but that too has a limitation, I think. Just brainstorming.
Yes, offloading to a worker thread is often a good choice. There is no free lunch though: you pay extra costs to pass the data between the main thread and the worker thread. So if you are processing only a few records (e.g. 50), then the cost of worker synchronization is almost certainly going to outweigh the benefits from higher concurrency.
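To make that cost concrete, here is a tiny worker_threads sketch (illustrative only; the data shapes are made up). Every row is structured-cloned on the way to the worker and again on the way back:

```js
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

if (isMainThread) {
  const rows = Array.from({ length: 50 }, (_, i) => ({ id: i, name: `supplier-${i}` }));
  // The rows are copied into the worker here...
  const worker = new Worker(__filename, { workerData: rows });
  worker.on('message', (processed) => {
    // ...and copied back again here. For 50 records the two copies can easily
    // cost more than doing the processing on the main thread.
    console.log('got', processed.length, 'rows back');
  });
} else {
  parentPort.postMessage(workerData.map((row) => ({ ...row, processed: true })));
}
```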
Anyhow, I don't think offloading to worker threads is something to implement inside LoopBack. IMO, this belongs in the user application, i.e. the app can decide to run the entire find method in a worker thread.
To be honest, I think we should step back and ask what's the most efficient way to process a larger query result. IMO, it's rather wasteful to wait until all rows are returned from the database and stored in an array, then convert that array to another array of model instances, then build a large JSON string that is finally sent in chunks to the HTTP response. A much better approach is to leverage streams:
- read the rows from the database as a stream (e.g. pg.Cursor),
- convert each row into a model instance in a stream.Transform,
- serialize the response incrementally with JSONStream.stringify().

In this light, I am not sure if the additional complexity of processing large query results in chunks (as shown in my PoC) is actually worth the limited benefits it brings.
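For concreteness, a rough sketch of that streaming pipeline, assuming the pg-query-stream and JSONStream packages; toModelInstance is a hypothetical stand-in for the row-to-model conversion, not an existing LoopBack function:

```js
const { Pool } = require('pg');
const QueryStream = require('pg-query-stream');
const JSONStream = require('JSONStream');
const { Transform, pipeline } = require('stream');

const pool = new Pool();

async function streamSuppliers(req, res) {
  const client = await pool.connect();

  // Convert one raw row at a time, so no single tick processes the whole result.
  const toModel = new Transform({
    objectMode: true,
    transform(row, _enc, cb) {
      cb(null, toModelInstance(row)); // hypothetical per-row conversion
    },
  });

  res.setHeader('Content-Type', 'application/json');
  pipeline(
    client.query(new QueryStream('SELECT * FROM supplier WHERE status = $1', ['Active'])),
    toModel,
    JSONStream.stringify(), // emits "[", the items separated by ",", then "]"
    res,
    (err) => {
      client.release();
      if (err) console.error('streaming the response failed', err);
    }
  );
}
```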
Thoughts?
/cc @raymondfeng
@bajtos My thoughts mirror your own in the comment above.
To minimize CPU cost per record, it can be worthwhile to get a large set of query rows as a single result, operate on it synchronously, and JSON.stringify() it to the requestor.
But this trades cost per record against latency. For low latency, fast loops, and minimal interaction between requests, it would be better to stream records, use SQL cursors, or some other mechanism to handle a record per loop, or some batch of records per loop.
The tradeoff depends on the app, and in this case doesn't it also depend on how much work the app does per record? The per-record processing overhead is only partially LoopBack; it's calling out to application code to handle each record.
That makes it even harder for LoopBack to choose how to trade off performance against time per loop.
Your suggestion pushes the choice to the user, enabling them to take the approach that is right for them by controlling the batch size that will be requested synchronously in each pg query.
Wrt workers: you can't make JS calls between workers, just exchange data, but in this case it looks like the "processing" is mostly calling out to user-provided code: https://github.com/strongloop/loopback-datasource-juggler/blob/d8ca3827ab5231ad6653dd5bac9a7805da0b3760/lib/dao.js#L1764
That can't happen from worker to main thread, so unless I misunderstand the nature of processQueryResults(), it doesn't look like a candidate for workerization, even if it was a good idea for LB to be making these tradeoffs internally.
Thank you @sam-github for chiming in.
We have discussed this topic offline and decided that we are not going to fix it in LoopBack 3 because it would require too much effort.
I opened a feature request for LoopBack 4 to implement "proper" support for handling large query results in a way that preserves good time-per-loop values, see https://github.com/strongloop/loopback-next/issues/4486
Thank you to everyone who worked on this and we look forward to seeing the large query handling in Loopback 4 :).