We are seeing repeated "event loop blocked" reports in our LoopBack application.
checking client timeout
connecting new client
new client connected
pulse queue
no queued requests
Event loop blocked for 36.808244995117185 milliseconds [ ' at Socket.connect (net.js:918:7)',
' at Connection.connect (/Users/davidknapp/loopback-sandbox/node_modules/pg/lib/connection.js:54:17)',
' at Client._connect (/Users/davidknapp/loopback-sandbox/node_modules/pg/lib/client.js:102:9)',
' at Client.connect (/Users/davidknapp/loopback-sandbox/node_modules/pg/lib/client.js:288:10)',
' at Pool.newClient (/Users/davidknapp/loopback-sandbox/node_modules/pg-pool/index.js:208:12)',
' at Pool.connect (/Users/davidknapp/loopback-sandbox/node_modules/pg-pool/index.js:183:10)',
' at PostgreSQL.executeSQL (/Users/davidknapp/loopback-sandbox/node_modules/loopback-connector-postgresql/lib/postgresql.js:166:13)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-connector/lib/sql.js:651:10',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:259:9',
' at doNotify (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)',
' at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)',
' at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)',
' at PostgreSQL.ObserverMixin.notifyObserversAround (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:228:15)',
' at PostgreSQL.SQLConnector.execute (/Users/davidknapp/loopback-sandbox/node_modules/loopback-connector/lib/sql.js:650:8)',
' at PostgreSQL.find [as all] (/Users/davidknapp/loopback-sandbox/node_modules/loopback-connector/lib/sql.js:1460:8)',
' at invokeConnectorMethod (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/dao.js:172:21)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/dao.js:1755:7',
' at doNotify (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)',
' at doNotify (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)',
' at doNotify (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)',
' at doNotify (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)',
' at Function.ObserverMixin._notifyBaseObservers (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)',
' at Function.ObserverMixin.notifyObserversOf (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)',
' at Function.ObserverMixin._notifyBaseObservers (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:176:15)',
' at Function.ObserverMixin.notifyObserversOf (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)',
' at Function.ObserverMixin._notifyBaseObservers (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:176:15)',
' at Function.ObserverMixin.notifyObserversOf (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)',
' at Function.ObserverMixin._notifyBaseObservers (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:176:15)',
' at Function.ObserverMixin.notifyObserversOf (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)',
' at Function.find (/Users/davidknapp/loopback-sandbox/node_modules/loopback-datasource-juggler/lib/dao.js:1753:10)',
' at SharedMethod.invoke (/Users/davidknapp/loopback-sandbox/node_modules/strong-remoting/lib/shared-method.js:270:27)',
' at HttpContext.invoke (/Users/davidknapp/loopback-sandbox/node_modules/strong-remoting/lib/http-context.js:297:12)',
' at phaseInvoke (/Users/davidknapp/loopback-sandbox/node_modules/strong-remoting/lib/remote-objects.js:695:9)',
' at runHandler (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/lib/phase.js:135:5)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:3110:16',
' at replenish (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:1011:17)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:1016:9',
' at eachLimit$1 (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:3196:24)',
' at Object.
' at runHandlers (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/lib/phase.js:144:13)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:3110:16',
' at replenish (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:1011:17)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:1016:9',
' at eachLimit$1 (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:3196:24)',
' at Object.
' at Phase.run (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/lib/phase.js:148:9)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/lib/phase-list.js:256:13',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:3110:16',
' at replenish (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:1011:17)',
' at iterateeCallback (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:995:17)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:969:16',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:473:16',
' at replenish (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:1006:25)',
' at iterateeCallback (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:995:17)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:969:16',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:473:16',
' at replenish (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:1006:25)',
' at iterateeCallback (/Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:995:17)',
' at /Users/davidknapp/loopback-sandbox/node_modules/loopback-phase/node_modules/async/dist/async.js:969:16',
' at /Users/davidknapp/loopback-sandbox/node_modules/strong-remoting/lib/remote-objects.js:671:7',
' at process._tickCallback (internal/process/next_tick.js:61:11)' ]
remove idle client
We see numerous "event loop blocked" messages in our large LoopBack application. We were able to reproduce the issue in the minimal sandbox app attached.
At first, we believed that the issue was with the pg library, but we could not reproduce it in a standalone Node.js app.
Then we added a custom "list" method on the supplier model that calls the connector directly for the GET, and we did not get any blocking messages. We believe this indicates that the issue lies elsewhere in LoopBack.
The blocking appears to be related to getting a new connection from the pool. If you click "try it out" multiple times, you may not see the error on the second and subsequent clicks. If you then wait for the connection to be closed after the default 10-second idle timeout, you will likely see the blocked message again.
In this small app on my machine the blocks are 30-40 ms. In our complex app, we see blocks reach the 100-200 ms range. We also see these blocks when running in Cloud Foundry with Compose databases, so we do not believe this is OS or machine related.
Loopback should not block the event loop.
https://github.com/Apex-BPM/loopback-sandbox
node -e 'console.log(process.platform, process.arch, process.versions.node)':
darwin x64 10.16.0
npm ls --prod --depth 0 | grep loopback:
[email protected] /Users/davidknapp/loopback-sandbox
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
I have been using LoopBack with Postgres for a long time and never experienced such an issue. I cloned the repo and ran it against my Postgres installation, and I didn't find any of the issues you described. I am using Postgres 9.6 on my local machine.
@sujeshthekkepatt Thanks for trying to reproduce the issue. We have been using and loving Loopback with Postgres for almost five years as well :). But we only recently added blocked-at to our app to check for thread blocking issues in dev and test. The issue could have been there for some time before that (how would we know?).
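For reference, wiring up blocked-at looks roughly like this; a minimal sketch, and the 20 ms threshold is my assumption rather than the sandbox's actual value:

```js
// Report any event-loop block longer than the threshold, together with the
// stack of the operation that started the blocking work.
const blockedAt = require('blocked-at');

blockedAt((time, stack) => {
  console.log(`Event loop blocked for ${time} milliseconds`, stack);
}, { threshold: 20 }); // threshold in ms; assumed value for illustration
```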
I am sorry that you could not reproduce the issue. Perhaps there is some difference in our environments or the testing approach. Here are some additional details on my environment:
node version: v10.16.0
postgres version: PostgreSQL 11.2 (Debian 11.2-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit
I would be happy to jump on a zoom when we are both available to see if we can figure out what is different.
In the meantime, I added database environment variables to the datasource and pushed the app to Cloud Foundry on Pivotal. I set it up to connect to a Compose.io Postgres database. The results were actually a little worse: there was blocking during the load of the suppliers as well as three times during the GET. There is still no blocking with the custom list endpoint.
I have included the logs in the repo here: https://github.com/Apex-BPM/loopback-sandbox/blob/master/cloud_foundry_logs
The cloud foundry instance was running on:
node: v13.3.0
postgresql: PostgreSQL 9.6.12 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4, 64-bit
The fact that there is no blocking on the custom endpoint which uses the connector directly makes me think that the issue is not in the connector or the pg library but instead somewhere else in Loopback.
If it would help reproduce the issue more quickly, I would be happy to prepare a Dockerfile and a docker-compose file including the app and the database so that we can run the exact same versions.
From the log, I feel it's something related to Postgres connection management. I am not jumping to any conclusions; I will try to reproduce it. A docker/docker-compose setup would be great :). It would also be convenient if someone from the core team took a look. @bajtos
@sujeshthekkepatt I have added the dockerfile and docker-compose. You should be able to just do docker-compose up and it will run. I tested and had the same results.
@dknapp47 This seems like a Postgres configuration issue. You may need to modify the timeout value, etc. Could you update datasources.json with the following and verify?
"min": 2,
"max": 5,
"idleTimeoutMillis": 60000,
"ssl": false
@sujeshthekkepatt I updated the settings and added logging to confirm that the settings took effect. I also measured the time until the "remove idle client" messages and they match the 60 second settings. Unfortunately it did not fix it. I still see the Event loop blocked messages. I have attached the logs from the container.
logs.txt
@sujeshthekkepatt While testing on my side, I noticed that things had improved but it turned out that somewhere along the way, the postgres database was reset so I had to reload the suppliers.json. After doing that, I am back to consistent blocking reports.
I have refined the event loop blocking output and also included the logs in the latest commit:
https://github.com/Apex-BPM/loopback-sandbox/blob/master/logs.txt
Line 195 shows that async id 27, which is a TCPWRAP, took 89.4 ms.
It is interesting that the same async id was called multiple times before that. You can see those on lines 137, 143, 153, 163, 174, and 190. Those ranged from a low of 2.5 ms to a high of 6.4 ms.
It is also interesting that the same async id is called two more times after the slow one (see lines 242 and 247), and those were very fast (0.2 ms and 0.0 ms).
I think it happens when a new connection is made to the Postgres server, which accounts for the longer times. I don't think it's related to LoopBack. From the log you can see that the longest time is taken when a new connection to the server is created; before that, no new connection is made, which results in shorter times. But still, I am not familiar with the pg lib. I think it's better if the core team takes a look, @dhmlau
@sujeshthekkepatt I have made more progress this morning. It appears that the issue is in the callbacks after the executeSQL has results. I need to add more logging but I think that something about the way datasource-juggler handles the 1000 row result may be blocking the event loop. I probably won't be able to dig deeper until Thursday.
@sujeshthekkepatt Here is an update. I have been working locally adding timing and changes to eliminate the blocking aspects. Here is a summary of what I have done.
Here is the baseline run output:
blocked for 60ms
at Socket.connect (net.js:918:7)
at Connection.connect (/Users/davidknapp/block-test/node_modules/pg/lib/connection.js:54:17)
at Client._connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:102:9)
at Client.connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:288:10)
at Pool.newClient (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:208:12)
at Pool.connect (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:183:10)
at PostgreSQL.executeSQL (/Users/davidknapp/block-test/node_modules/loopback-connector-postgresql/lib/postgresql.js:166:13)
at /Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:651:10
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:259:9
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)
at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)
at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)
at PostgreSQL.ObserverMixin.notifyObserversAround (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:228:15)
at PostgreSQL.SQLConnector.execute (/Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:650:8)
I then added console.time to the allCb in DataAccessObject.find:
SQLConnector.prototype.all async.mapSeries: 0.959ms
DataAccessObject.find allCb: 44.354ms
blocked for 58ms
at Socket.connect (net.js:918:7)
at Connection.connect (/Users/davidknapp/block-test/node_modules/pg/lib/connection.js:54:17)
at Client._connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:102:9)
at Client.connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:288:10)
at Pool.newClient (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:208:12)
at Pool.connect (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:183:10)
at PostgreSQL.executeSQL (/Users/davidknapp/block-test/node_modules/loopback-connector-postgresql/lib/postgresql.js:166:13)
at /Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:651:10
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:259:9
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)
at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)
at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)
at PostgreSQL.ObserverMixin.notifyObserversAround (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:228:15)
at PostgreSQL.SQLConnector.execute (/Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:650:8)
allCb accounted for most of the blocking time. I then added console.time around data.map in SQLConnector.prototype.all and got these results:
SQLConnector.prototype.all data.map: 9.226ms
DataAccessObject.find allCb: 44.010ms
blocked for 55ms
at Socket.connect (net.js:918:7)
at Connection.connect (/Users/davidknapp/block-test/node_modules/pg/lib/connection.js:54:17)
at Client._connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:102:9)
at Client.connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:288:10)
at Pool.newClient (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:208:12)
at Pool.connect (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:183:10)
at PostgreSQL.executeSQL (/Users/davidknapp/block-test/node_modules/loopback-connector-postgresql/lib/postgresql.js:166:13)
at /Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:651:10
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:259:9
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)
at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)
at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)
at PostgreSQL.ObserverMixin.notifyObserversAround (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:228:15)
at PostgreSQL.SQLConnector.execute (/Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:650:8)
So those two account for all the blocking time. I then added console.time around async.map in allCb in DataAccessObject.find:
SQLConnector.prototype.all data.map: 9.404ms
DataAccessObject.find allCb async.map: 43.528ms
DataAccessObject.find allCb: 43.692ms
blocked for 56ms
at Socket.connect (net.js:918:7)
at Connection.connect (/Users/davidknapp/block-test/node_modules/pg/lib/connection.js:54:17)
at Client._connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:102:9)
at Client.connect (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:288:10)
at Pool.newClient (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:208:12)
at Pool.connect (/Users/davidknapp/block-test/node_modules/pg-pool/index.js:183:10)
at PostgreSQL.executeSQL (/Users/davidknapp/block-test/node_modules/loopback-connector-postgresql/lib/postgresql.js:166:13)
at /Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:651:10
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:259:9
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:49)
at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)
at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)
at PostgreSQL.ObserverMixin.notifyObserversAround (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:228:15)
at PostgreSQL.SQLConnector.execute (/Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:650:8)
That surprised me. I would not have expected the async.map to be blocking. I dug deeper in the code and found a callback that was made without process.nextTick (line 155 in observer.js). I added it and got these results:
SQLConnector.prototype.all data.map: 9.707ms
DataAccessObject.find allCb async.map: 409.118ms
DataAccessObject.find allCb: 409.528ms
blocked for 420ms
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:57)
at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)
at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)
at cbForWork (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:240:14)
at /Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:661:9
at Query.<anonymous> (/Users/davidknapp/block-test/node_modules/loopback-connector-postgresql/lib/postgresql.js:146:7)
at Query.handleReadyForQuery (/Users/davidknapp/block-test/node_modules/pg/lib/query.js:128:10)
at Connection.<anonymous> (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:247:19)
at Connection.emit (events.js:198:13)
at Socket.<anonymous> (/Users/davidknapp/block-test/node_modules/pg/lib/connection.js:128:12)
at Socket.emit (events.js:198:13)
That also surprised me. The time got much worse, but notice that the pg library is no longer blocked. I looked through more of the code and added a process.nextTick on the callback inside buildResult:
SQLConnector.prototype.all data.map: 12.456ms
DataAccessObject.find allCb async.map: 394.342ms
DataAccessObject.find allCb: 394.572ms
blocked for 408ms
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:57)
at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)
at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)
at cbForWork (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:240:14)
at /Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:661:9
at Query.<anonymous> (/Users/davidknapp/block-test/node_modules/loopback-connector-postgresql/lib/postgresql.js:146:7)
at Query.handleReadyForQuery (/Users/davidknapp/block-test/node_modules/pg/lib/query.js:128:10)
at Connection.<anonymous> (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:247:19)
at Connection.emit (events.js:198:13)
at Socket.<anonymous> (/Users/davidknapp/block-test/node_modules/pg/lib/connection.js:128:12)
at Socket.emit (events.js:198:13)
Slightly less bad but not good. I changed the async.map to an async.mapSeries in allCb (may not be the right solution but I'm just trying to track down the blocking code):
SQLConnector.prototype.all data.map: 10.615ms
DataAccessObject.find allCb async.mapSeries: 0.434ms
DataAccessObject.find allCb: 0.613ms
blocked for 12ms
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:57)
at PostgreSQL.ObserverMixin._notifyBaseObservers (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:178:5)
at PostgreSQL.ObserverMixin.notifyObserversOf (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:153:8)
at cbForWork (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:240:14)
at /Users/davidknapp/block-test/node_modules/loopback-connector/lib/sql.js:661:9
at Query.<anonymous> (/Users/davidknapp/block-test/node_modules/loopback-connector-postgresql/lib/postgresql.js:146:7)
at Query.handleReadyForQuery (/Users/davidknapp/block-test/node_modules/pg/lib/query.js:128:10)
at Connection.<anonymous> (/Users/davidknapp/block-test/node_modules/pg/lib/client.js:247:19)
at Connection.emit (events.js:198:13)
at Socket.<anonymous> (/Users/davidknapp/block-test/node_modules/pg/lib/connection.js:128:12)
at Socket.emit (events.js:198:13)
blocked for 16ms
at buildResult (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1705:19)
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1719:13
at process._tickCallback (internal/process/next_tick.js:63:19)
That's much better. I then added a process.nextTick on observer.js line 178:
SQLConnector.prototype.all data.map: 13.822ms
DataAccessObject.find allCb async.mapSeries: 0.363ms
DataAccessObject.find allCb: 0.507ms
blocked for 15ms
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:57)
at process._tickCallback (internal/process/next_tick.js:61:11)
blocked for 15ms
at buildResult (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1705:19)
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1719:13
at process._tickCallback (internal/process/next_tick.js:63:19)
Next, I went after the time in the data.map by converting it to async.map in SQLConnector.prototype.all:
SQLConnector.prototype.all async.map: 212.241ms
blocked for 213ms
at doNotify (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/observer.js:155:57)
at process._tickCallback (internal/process/next_tick.js:61:11)
DataAccessObject.find allCb async.mapSeries: 0.415ms
DataAccessObject.find allCb: 0.565ms
SQLConnector.prototype.all cb: 0.738ms
blocked for 19ms
at buildResult (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1705:19)
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1719:13
at process._tickCallback (internal/process/next_tick.js:63:19)
The time got worse but it confirmed that it is the source of the first block. I changed it to async.mapSeries:
SQLConnector.prototype.all async.mapSeries: 0.959ms
DataAccessObject.find allCb async.mapSeries: 0.334ms
DataAccessObject.find allCb: 0.520ms
SQLConnector.prototype.all cb: 0.702ms
blocked for 18ms
at buildResult (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1705:19)
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1719:13
at process._tickCallback (internal/process/next_tick.js:63:19)
Now we are down to just one block. I guessed that the last one might be "above" DAO in the call stack. I added console.time around cb in allCb async.mapSeries:
SQLConnector.prototype.all async.mapSeries: 1.058ms
DataAccessObject.find allCb async.mapSeries: 0.331ms
DataAccessObject.find allCb: 0.514ms
SQLConnector.prototype.all cb: 0.653ms
DataAccessObject.find allCb async.mapSeries cb: 18.830ms
blocked for 19ms
at buildResult (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1705:19)
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1719:13
at process._tickCallback (internal/process/next_tick.js:63:19)
I continued by adding console.time a couple more layers up:
SQLConnector.prototype.all async.mapSeries: 0.954ms
DataAccessObject.find allCb async.mapSeries: 0.293ms
DataAccessObject.find allCb: 0.461ms
SQLConnector.prototype.all cb: 0.627ms
HttpContext.prototype.invoke fn: 16.769ms
SharedMethod.prototype.invoke cb: 16.934ms
DataAccessObject.find allCb async.mapSeries cb: 17.427ms
blocked for 18ms
at buildResult (/Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1705:19)
at /Users/davidknapp/block-test/node_modules/loopback-datasource-juggler/lib/dao.js:1719:13
at process._tickCallback (internal/process/next_tick.js:63:19)
I ran out of time to follow those further up the call stack today.
So my summary is that when there are a large number of results (I was using 4000 rows here), there are parts of the package that are blocking the thread. Most of the changes were the simple addition of process.nextTick on callbacks that were made immediately. I am not sure if moving to async.mapSeries is the right answer. Maybe there is a better way to make those two parts not block the thread.
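To illustrate the pattern being described (a generic sketch, not the actual observer.js or dao.js code): a callback invoked synchronously keeps extending the current tick, while deferring it with process.nextTick at least releases the current stack:

```js
// Generic illustration: defer an immediately-invoked callback to a later tick.
function notifyObservers(observers, context, callback) {
  if (!observers || observers.length === 0) {
    // Before: callback(null, context); the callback ran inside the caller's
    // tick, so any heavy work in it extended the same tick.
    return process.nextTick(() => callback(null, context));
  }
  // ...otherwise iterate over the observers and invoke `callback` when done...
}
```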
Let me know what you think. If needed, I can package up what I have so that someone else can reproduce it.
@bajtos @raymondfeng @strongloop/sq-lb-apex, could you please help with this? Thanks.
Whoa, thank you @dknapp47 for such detailed analysis!
So my summary is that when there are a large number of results (I was using 4000 rows here), there are parts of the package that are blocking the thread.
Based on my past experience, your summary looks like a correct conclusion to me.
Let me try to reproduce the issue on my machine and then investigate what options we have to fix this problem.
I wanted to reproduce the issue while running the application locally, using Docker for the PostgreSQL database server only. I ran docker-compose -f postgresql.yml up but the application was not able to connect with the current configuration.
So first of all, I updated server/datasources.json as follows:
{
  "db": {
    "name": "db",
    "connector": "memory"
  },
  "postgres": {
    "name": "postgres",
    "connector": "postgresql",
    "host": "127.0.0.1",
    "port": "5432",
    "user": "postgres",
    "database": "postgres"
  }
}
With this new config, I was able to start the app, post sample supplier data to create 4000 records (I executed the POST request four times).
I found the logs introduced by https://github.com/Apex-BPM/loopback-sandbox/commit/38c5b0526d547a6b79256b393550933dc5d562d2 to be too verbose, so I commented out the async_hooks-based logging and restored blocked-at based reporting.
With this change in place, I was able to make a query for 4k records in the API Explorer and observe a warning printed to the console (as expected).
Here is my current understanding of what's happening under the hood:
The code converting the result from pg to an HTTP response is fully synchronous right now; we perform all the work in a single tick of the event loop. While this is probably a good thing when the number of records (array items) is low, it does not scale well as the size of the data increases.
I verified the claim above by adding three console.log statements into the allCb function:
- before async.map is called
- inside the callback of async.map (the callback invoked after all records are processed)
- after async.map returns (within the same tick)

I think this explains the changes in timing data observed after changing notifyObserversOf from sync to async and replacing map with mapSeries. Depending on the changes made, some parts of the workload were split into multiple ticks. But that comes with its own overhead, which shows up elsewhere.
To further test my hypothesis, I tested the impact of response size.
This reinforces my opinion that we are not doing anything costly inside LoopBack; it's just that the more records we return, the more work we have to do, and at some point we hit the threshold for how long a tick may take.
In my opinion, the right solution for this problem is to limit the number of records we process within a single tick and periodically yield control back to the event loop. For example, in our Supplier case we can process 4000 rows in 4 batches of 1000 records, yielding control back after every 1000 records processed.
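A minimal sketch of that idea (not LoopBack code): process a fixed-size batch synchronously, then hand control back to the event loop with setImmediate before continuing:

```js
// Process `rows` in synchronous batches of `batchSize`, yielding to the event
// loop between batches so no single tick handles the whole result set.
function processInBatches(rows, batchSize, processRow, done) {
  let index = 0;
  (function runBatch() {
    const end = Math.min(index + batchSize, rows.length);
    for (; index < end; index++) {
      processRow(rows[index]); // synchronous per-row work
    }
    if (index < rows.length) {
      setImmediate(runBatch); // let timers and I/O callbacks run before the next batch
    } else {
      done();
    }
  })();
}

// Example: 4000 rows in batches of 1000 spread the work over 4 ticks.
processInBatches(
  Array.from({ length: 4000 }, (_, i) => ({ id: i })),
  1000,
  (row) => JSON.stringify(row),
  () => console.log('all rows processed')
);
```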
Now the question is how to decide how often to yield. I don't think a single number of records will work for all use cases: for some of them, yielding in allCb may not be needed at all, or perhaps it's enough to yield less often.
BTW, if your response contains many rows, then JSON.stringify can become a bottleneck too! LoopBack depends on res.json() provided by Express. My colleagues at IBM have written a module called yieldable-json to address this problem; see their statistics for timing data.
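A small sketch of that idea, assuming yieldable-json's stringifyAsync(obj, callback) API:

```js
// Serialize a large payload without one long synchronous JSON.stringify call;
// yieldable-json splits the serialization work across ticks internally.
const yj = require('yieldable-json');

const payload = Array.from({ length: 4000 }, (_, i) => ({ id: i, name: `supplier-${i}` }));

yj.stringifyAsync(payload, (err, json) => {
  if (err) throw err;
  console.log('serialized', json.length, 'characters');
});
```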
On second thought (and also because I was not able to find a module that would make it easy to yield control back every N iterations), this change in lib/dao.js seems to remove the blocked-at warning completely:
- async.map(data, function(item, next) {
+ async.mapLimit(data, 10, async.ensureAsync(function(item, next) {
By wrapping the iterating function in async.ensureAsync, we ensure it always defers the callback to the next tick. The implementation is clever enough to skip setImmediate if the callback was already called outside of the initial tick.
This change converts fully sync processing into fully async, where each iteration is started on the first tick and finishes on the second tick. As you have observed, this significantly increased the time the loop is blocked (from ~40 ms to ~400 ms). I am assuming this is the overhead added by parallelization.
By limiting the number of iterating function calls running in parallel, we spread the overhead over multiple ticks, keeping each tick within the threshold.
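A self-contained illustration of the pattern in the diff above (the sample data and the limit of 10 are arbitrary, not values taken from the juggler code):

```js
const async = require('async');

const rows = Array.from({ length: 4000 }, (_, i) => ({ id: i }));

async.mapLimit(
  rows,
  10, // at most 10 items are in flight at any time
  async.ensureAsync(function(row, next) {
    // Purely synchronous work per row; ensureAsync defers `next` to a later
    // tick, so the whole map no longer completes inside a single tick.
    next(null, { id: row.id, processed: true });
  }),
  function(err, results) {
    if (err) throw err;
    console.log('processed', results.length, 'rows');
  }
);
```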
However, while this may remove the problem of blocked event loop, I am concerned that it may decrease the overall performance (time to serve the request).
Further research is required here.
Maybe we could use a sensible default and allow it to be overridden with something in options. That would let us set the option for queries that return large result sets for example:
Suppliers.find({where:{status:'Active'}},{resultChunkSize:100},callback);
For backward compatibility, we could set the resultChunkSize to 0 indicating that there is no chunking.
However, while this may remove the problem of blocked event loop, I am concerned that it may decrease the overall performance (time to serve the request).
I ran a quick benchmark using autocannon and the following command:
$ autocannon http://localhost:3000/api/Suppliers?filter=%7B%22limit%22%3A4000%7D
Using vanilla juggler code, I observed average latency 356.68 ms and 27.5 requests per second.
After I applied the patch, the performance has significantly degraded to average latency 8225.8 ms and about 1 request per second.
Maybe we could use a sensible default and allow it to be overridden with something in options. That would let us set the option for queries that return large result sets for example:
Suppliers.find({where:{status:'Active'}},{resultChunkSize:100},callback);
I agree that a new feature flag (a new option) may be helpful. I am afraid it's not enough to implement it in find options only, because these options cannot be provided by HTTP clients. I am thinking about exposing this configuration at model level.
For backward compatibility, we could set the resultChunkSize to 0 indicating that there is no chunking.
I see two views here:
Use a sensible non-zero default, because this is likely to improve performance for all LoopBack users. In (hopefully rare) cases where the performance degrades after the change, users can set the value to 0 where needed.
Prioritize backwards compatibility over everything else and don't process query results in chunks unless explicitly configured by the app developer.
@strongloop/loopback-3x-maintainers thoughts?
I managed to create a working prototype that processes query results in chunks of 500 rows, does not trigger blocked-at warnings and mostly preserves the performance:
- A blocked-at warning is triggered on the first run; I think this is caused by the fact that V8 hadn't had a chance to optimize the code yet.
- Average latency: 363.96 ms (2.0% worse than 356.68 ms)
- 27 requests per second on average (1.8% worse than 27.5)

The source code is here: https://github.com/strongloop/loopback-datasource-juggler/compare/3.x...feat/chunked-all-cb?expand=1&w=1 The changes are best reviewed with whitespace changes ignored. The first commit is a refactoring, the second contains the relevant changes.
I'd like to discuss this topic with colleagues more experienced with performance fine-tuning, before going forward to open any pull requests.
@bajtos Can't we use a worker thread? Just a quick question
Can't we use a worker thread? Just a quick question
I am not familiar with worker threads enough to be able to tell how easy or difficult it would be to make LoopBack work on worker threads and what would be the overhead of passing the data between the main loop and the worker thread. I suspect it would require significant design changes inside the framework too.
Can't we use a worker thread? Just a quick question
I am not familiar with worker threads enough to be able to tell how easy or difficult it would be to make LoopBack work on worker threads and what would be the overhead of passing the data between the main loop and the worker thread. I suspect it would require significant design changes inside the framework too.
That I agree. But offloading work is the better choice when dealing with CPU-intensive ops, right? Batching improves the performance but that too has a limitation, I think. Just brainstorming. @bajtos
That I agree. But offloading work is the better choice when dealing with CPU-intensive ops, right? Batching improves the performance but that too has a limitation, I think. Just brainstorming.
Yes, offloading to a worker thread is often a good choice. There is no free lunch though: you pay extra costs to pass the data between the main thread and the worker thread. So if you are processing only a few records (e.g. 50), then the cost of worker synchronization is almost certainly going to outweigh the benefits from higher concurrency.
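To make that cost concrete, here is a tiny worker_threads sketch (illustrative only; the data shapes are made up). Every row is structured-cloned on the way to the worker and again on the way back:

```js
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

if (isMainThread) {
  const rows = Array.from({ length: 50 }, (_, i) => ({ id: i, name: `supplier-${i}` }));
  // The rows are copied into the worker here...
  const worker = new Worker(__filename, { workerData: rows });
  worker.on('message', (processed) => {
    // ...and copied back again here. For 50 records the two copies can easily
    // cost more than doing the processing on the main thread.
    console.log('got', processed.length, 'rows back');
  });
} else {
  parentPort.postMessage(workerData.map((row) => ({ ...row, processed: true })));
}
```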
Anyhow, I don't think offloading to worker threads is something to implement inside LoopBack. IMO, this belongs in the user application, i.e. the app can decide to run the entire find method in a worker thread.
To be honest, I think we should step back and ask what's the most efficient way to process a larger query result. IMO, it's rather wasteful to wait until all rows are returned from the database and stored in an array, then convert that array to another array of model instances, then build a large JSON string that is finally sent in chunks to the HTTP response. A much better approach is to leverage streams:
- read the rows from the database as a stream (e.g. pg.Cursor),
- convert each row into a model instance in a stream.Transform,
- serialize the response incrementally with JSONStream.stringify().

In this light, I am not sure if the additional complexity of processing large query results in chunks (as shown in my PoC) is actually worth the limited benefits it brings.
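For concreteness, a rough sketch of that streaming pipeline, assuming the pg-query-stream and JSONStream packages; toModelInstance is a hypothetical stand-in for the row-to-model conversion, not an existing LoopBack function:

```js
const { Pool } = require('pg');
const QueryStream = require('pg-query-stream');
const JSONStream = require('JSONStream');
const { Transform, pipeline } = require('stream');

const pool = new Pool();

async function streamSuppliers(req, res) {
  const client = await pool.connect();

  // Convert one raw row at a time, so no single tick processes the whole result.
  const toModel = new Transform({
    objectMode: true,
    transform(row, _enc, cb) {
      cb(null, toModelInstance(row)); // hypothetical per-row conversion
    },
  });

  res.setHeader('Content-Type', 'application/json');
  pipeline(
    client.query(new QueryStream('SELECT * FROM supplier WHERE status = $1', ['Active'])),
    toModel,
    JSONStream.stringify(), // emits "[", the items separated by ",", then "]"
    res,
    (err) => {
      client.release();
      if (err) console.error('streaming the response failed', err);
    }
  );
}
```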
Thoughts?
/cc @raymondfeng
@bajtos My thoughts mirror your own in the comment above.
To minimize CPU cost per record, it can be worthwhile to get a large set of query rows as a single result, operate on it synchronously, and JSON.stringify() it to the requestor.
But this trades cost per record against latency. For low latency, fast loops, and minimal interaction between requests, it would be better to stream records, use SQL cursors, or some other mechanism to handle a record per loop, or some batch of records per loop.
The tradeoff depends on the app, and in this case doesn't it also depend on how much work the app does per record? The per-record processing overhead is only partially LoopBack; it's calling out to application code to handle each record.
That makes it even harder for LoopBack to choose how to trade off performance against time per loop.
Your suggestion pushes the choice to the user, enabling them to take the approach that is right for them by controlling the batch size that will be requested synchronously in each pg query.
Wrt workers: you can't make JS calls between workers, just exchange data, but in this case it looks like the "processing" is mostly calling out to user-provided code: https://github.com/strongloop/loopback-datasource-juggler/blob/d8ca3827ab5231ad6653dd5bac9a7805da0b3760/lib/dao.js#L1764
That can't happen from worker to main thread, so unless I misunderstand the nature of processQueryResults(), it doesn't look like a candidate for workerization, even if it was a good idea for LB to be making these tradeoffs internally.
Thank you @sam-github for chiming in.
We have discussed this topic offline and decided that we are not going to fix it in LoopBack 3 because it would require too much effort.
I opened a feature request for LoopBack 4 to implement "proper" support for handling large query results in a way that preserves good time-per-loop values, see https://github.com/strongloop/loopback-next/issues/4486
Thank you to everyone who worked on this and we look forward to seeing the large query handling in Loopback 4 :).