Node-sqlite3: Make #stmt.each async, and use stream/pipe

Created on 10 Aug 2016 · 16Comments · Source: mapbox/node-sqlite3

I want to handle 10,000,000 rows.
Each row will take 1 second to complete.

I have two question:

Is node-sqlite3 load all queries into memory, or it use stream/pipe from hard-disk row after row, every time the callback is called?
What do you think, about implementation an async version for callback. When there are some async tasks, it needs to wait until the task if finished:

      stmt.each('select * from very_big_table',function(err,rows,next){
          setTimeout(function doAsyncTask(){
                  console.log('When the task will complete (2 seconds), I want to move to the next records?
          },2000)
      })

by the way, Congratulations::
I have permissions, and created new node-sqlite3 tag in StackOverflow:

http://stackoverflow.com/questions/tagged/node-sqlite3

Source

Aminadav

👍2

Most helpful comment

Sounds great I would like it to be part of official package functionality. Maybe instead of separate package you can provide a PR that author can accept. Why streaming should not be part of sqlite package?

pleerock on 27 Jun 2017

👍2

All 16 comments

Is node-sqlite3 load all queries into memory, or it use stream/pipe from hard-disk row after row, every time the callback is called?

.each will give you rows one-by-one. It loads each row as it calls each callback.

Happy to accept a PR implementing a next() callback.

tmcw on 3 Oct 2016

How about if when .each calls the function, if that function returns a thenable object, it will wait for the thenable (Promise) to resolve before calling each again?

db.each('...', function(err, item) {
  return new Promise((resolve, reject){
    ... do async stuff and resolve at the end...
  });
})

would also work as an async function/fatarrow handler.

db.each('...', async (err, item) => {
  if (err) throw err;
}).then(()=>console.log('success'),(err)=>console.error(err));

This assumes each returns a Promise, and handles waiting in each iterable.

tracker1 on 7 Oct 2016

Fine with either as a PR, the difference between expressing as async, promises, and callbacks here seems slim. node-sqlite3 currently supports v0.10 (which doesn't support promises), but we're dropping compatibility at the end of the month.

tmcw on 7 Oct 2016

Is anyone working on a PR for this? otherwise I might have a go at it.

It has not been mentioned here, so I just wanted to point out that you can create a wrapper using Database.prepare, and Statement.get to create your own each function that supports promises/done callback (I have used a similar strategy for creating a stream) but the performance of using Statement.get seems to be about five times slower than Database.each (see https://gist.github.com/erikman/a494925ae6ce95869dd56076fb810831 for my tests), so hopefully an async each will be much faster. :)

erikman on 31 Oct 2016

I couldn't stop myself and had a look through the sources yesterday, and it seems more complicated to implement than what I first anticipated. Here are my findings, please correct any mistakes I've made in my reasoning (I have not looked at libuv before either, but I do know my way around general threading issues):

The each function works by starting a worker thread that iterates through the results and posts the results back to the main thread. When an each is executing no other sql-statements will run they will just be scheduled to run later. If we make each asynchronous, then we need to be able to stop the worker and resume it later, otherwise one can create a deadlock by making a callback that waits for another sql-statement.

However breaking the each worker into multiple worker segments may introduce other hazards, because then code like

stmt.each(....).finalize()

will have problems to properly schedule all the each callbacks before the finalize call (I guess there are ways around it but it feels error prone).

So I don't know what your feelings are, but to remain backwards compatible I don't think modifying each is the best way forward (it will be quite different behaviors depending on if there is an async callback or not).

I created a new getMultiple(maxResults, cb) that works like a mix between get and all to share the synchronization cost over multiple rows here: https://github.com/erikman/node-sqlite3/tree/get-multiple and using this I can create a stream that is closer to the performance of each (bit not quite there, 120ms vs 150ms for iterating over 50000 elements from a memory backend). I'll put up the code as an example later, it needs some refactoring as I have some Promise wrappers that are not really needed.

What are your thoughts on this idea?

erikman on 1 Nov 2016

Is node-sqlite3 load all queries into memory, or it use stream/pipe from hard-disk row after row, every time the callback is called?

From my own testing, stmt.each loads all results into memory. As soon as I step over stmt.each, my memory spikes up to 2.2 GB (and then throws a fatal process out of memory error). So while the API allows you to "step" through the rows, AFAICT the data is loaded at once.

ghost on 17 Dec 2016

👍1

@paolo-tanium It depends on how fast your javascript handler is and the time allocation between the threads. each will kick off a worker thread that loads rows as fast as possible from the db storage, and then sends back the results as (batched, asynchronous) messages to the javascript thread where the handler is running.

So if you handler is fast enough (faster than loading rows from db) then you shouldn't see a memory spike. On the other hand if the handler is slow then the rows will wait in memory until the handler can process them (and we see a memory spike).

At least in theory, in practice the OS might decide to give lots of time to the loader thread before the handler is invoked and then everything will be loaded into memory even with a fast handler. So you are right that when you have a huge number of rows then you can't use each for processing them without risking that all rows will get loaded into memory before they are processed, specially if you have a memory backend.

erikman on 18 Dec 2016

👍2

@erikman was there any further progress on getMultiple? I'm experimenting with your fork which seems to work.

mhat on 3 Mar 2017

@mhat I have not received any comments on it so I don't really know what to do with it. It works fine for me and I get a performance boost compared to using get. I have used it to create both a stream and an each function that works with promises.

Performance is somewhat slower than each, my assumption is that some time is lost when getMultiple waits for the initial array to be filled before returning them (I have done multiple tests with kicking off the next set of getMultiple immediately to minimize the latency, and also varying the size of the fetched blocks).

Thus I don't know if getMultiple is the best name, and also if (and where) I should add my JavaScript eachAsync implementation (or stream).

erikman on 15 Mar 2017

👍1

are you guys talking about streaming support here? I can't believe this package do not support streaming

pleerock on 21 Jun 2017

@pleerock yes, tangentally :) You can already make your own streams using the Statement#get method, the #getMultiple method discussed above improves performance. I just published my repo for working with sqlite: https://github.com/erikman/sqlutil which contains a sqlite-stream (and some other stuff you probably aren't interested in ;). It uses #getMultiple if available and otherwise falls back to ordinary #get.

erikman on 26 Jun 2017

pleerock on 27 Jun 2017

👍2

I agree that it is functionality that should be provided by the official package. When I started the conversation in this thread I was looking for architectural input from the maintainers and have not received any yet, so I won't put time into making a PR at this time.

erikman on 30 Jun 2017

I would like a async iterator.

for await (let row of db.each(query)) {
  // code
}
// done

you can be in control of requesting more data when needed if it would return a async generator

class x {
  async * [Symbol.asyncIterator]() {
    while (have_more) {
      yield await get_next
    }
  }

  async [Symbol.iterator]() {
    return this.values()
  }

  async * values() {
    return result_array.values()
  }
}

jimmywarting on 8 Jun 2019

You can make a wrapper with a generator that uses db.each to do this… Note
that node-sqlite can't stop the .each call.

I use a different approach: I use a wrapper that does a LIMIT 100 .all()
and then processes those results and then it repeats until it processed all
data. It is not atomic but I don't need it to be in my case.
https://github.com/StratoKit/strato-db/blob/28c9ef581074e249e050840fa1dc5960be7c88ed/src/JsonModel/JsonModel.js#L644-L650

wmertens on 8 Jun 2019

node-sqlite3 is just a nodejs binding for sqlite. Even the sqlite program itself doesn't offer an interface for streaming SELECT. A binding should have the same interfaces as the original program.

It's a fact that streaming SELECT is important and common-used, but it should be implemented in an upper-level library which uses the binding as a dependency, rather than implemented by binding itself.