I want to handle 10,000,000 rows.
Each row will take 1 second to complete.
I have two question:
node-sqlite3 load all queries into memory, or it use stream/pipe from hard-disk row after row, every time the callback is called?callback. When there are some async tasks, it needs to wait until the task if finished: stmt.each('select * from very_big_table',function(err,rows,next){
setTimeout(function doAsyncTask(){
console.log('When the task will complete (2 seconds), I want to move to the next records?
},2000)
})
by the way, Congratulations::
I have permissions, and created new node-sqlite3 tag in StackOverflow:
Is node-sqlite3 load all queries into memory, or it use stream/pipe from hard-disk row after row, every time the callback is called?
.each will give you rows one-by-one. It loads each row as it calls each callback.
Happy to accept a PR implementing a next() callback.
How about if when .each calls the function, if that function returns a thenable object, it will wait for the thenable (Promise) to resolve before calling each again?
db.each('...', function(err, item) {
return new Promise((resolve, reject){
... do async stuff and resolve at the end...
});
})
would also work as an async function/fatarrow handler.
db.each('...', async (err, item) => {
if (err) throw err;
}).then(()=>console.log('success'),(err)=>console.error(err));
This assumes each returns a Promise, and handles waiting in each iterable.
Fine with either as a PR, the difference between expressing as async, promises, and callbacks here seems slim. node-sqlite3 currently supports v0.10 (which doesn't support promises), but we're dropping compatibility at the end of the month.
Is anyone working on a PR for this? otherwise I might have a go at it.
It has not been mentioned here, so I just wanted to point out that you can create a wrapper using Database.prepare, and Statement.get to create your own each function that supports promises/done callback (I have used a similar strategy for creating a stream) but the performance of using Statement.get seems to be about five times slower than Database.each (see https://gist.github.com/erikman/a494925ae6ce95869dd56076fb810831 for my tests), so hopefully an async each will be much faster. :)
I couldn't stop myself and had a look through the sources yesterday, and it seems more complicated to implement than what I first anticipated. Here are my findings, please correct any mistakes I've made in my reasoning (I have not looked at libuv before either, but I do know my way around general threading issues):
The each function works by starting a worker thread that iterates through the results and posts the results back to the main thread. When an each is executing no other sql-statements will run they will just be scheduled to run later. If we make each asynchronous, then we need to be able to stop the worker and resume it later, otherwise one can create a deadlock by making a callback that waits for another sql-statement.
However breaking the each worker into multiple worker segments may introduce other hazards, because then code like
stmt.each(....).finalize()
will have problems to properly schedule all the each callbacks before the finalize call (I guess there are ways around it but it feels error prone).
So I don't know what your feelings are, but to remain backwards compatible I don't think modifying each is the best way forward (it will be quite different behaviors depending on if there is an async callback or not).
I created a new getMultiple(maxResults, cb) that works like a mix between get and all to share the synchronization cost over multiple rows here: https://github.com/erikman/node-sqlite3/tree/get-multiple and using this I can create a stream that is closer to the performance of each (bit not quite there, 120ms vs 150ms for iterating over 50000 elements from a memory backend). I'll put up the code as an example later, it needs some refactoring as I have some Promise wrappers that are not really needed.
What are your thoughts on this idea?
Is node-sqlite3 load all queries into memory, or it use stream/pipe from hard-disk row after row, every time the callback is called?
From my own testing, stmt.each loads all results into memory. As soon as I step over stmt.each, my memory spikes up to 2.2 GB (and then throws a fatal process out of memory error). So while the API allows you to "step" through the rows, AFAICT the data is loaded at once.
@paolo-tanium It depends on how fast your javascript handler is and the time allocation between the threads. each will kick off a worker thread that loads rows as fast as possible from the db storage, and then sends back the results as (batched, asynchronous) messages to the javascript thread where the handler is running.
So if you handler is fast enough (faster than loading rows from db) then you shouldn't see a memory spike. On the other hand if the handler is slow then the rows will wait in memory until the handler can process them (and we see a memory spike).
At least in theory, in practice the OS might decide to give lots of time to the loader thread before the handler is invoked and then everything will be loaded into memory even with a fast handler. So you are right that when you have a huge number of rows then you can't use each for processing them without risking that all rows will get loaded into memory before they are processed, specially if you have a memory backend.
@erikman was there any further progress on getMultiple? I'm experimenting with your fork which seems to work.
@mhat I have not received any comments on it so I don't really know what to do with it. It works fine for me and I get a performance boost compared to using get. I have used it to create both a stream and an each function that works with promises.
Performance is somewhat slower than each, my assumption is that some time is lost when getMultiple waits for the initial array to be filled before returning them (I have done multiple tests with kicking off the next set of getMultiple immediately to minimize the latency, and also varying the size of the fetched blocks).
Thus I don't know if getMultiple is the best name, and also if (and where) I should add my JavaScript eachAsync implementation (or stream).
are you guys talking about streaming support here? I can't believe this package do not support streaming
@pleerock yes, tangentally :) You can already make your own streams using the Statement#get method, the #getMultiple method discussed above improves performance. I just published my repo for working with sqlite: https://github.com/erikman/sqlutil which contains a sqlite-stream (and some other stuff you probably aren't interested in ;). It uses #getMultiple if available and otherwise falls back to ordinary #get.
Sounds great I would like it to be part of official package functionality. Maybe instead of separate package you can provide a PR that author can accept. Why streaming should not be part of sqlite package?
I agree that it is functionality that should be provided by the official package. When I started the conversation in this thread I was looking for architectural input from the maintainers and have not received any yet, so I won't put time into making a PR at this time.
I would like a async iterator.
for await (let row of db.each(query)) {
// code
}
// done
you can be in control of requesting more data when needed if it would return a async generator
class x {
async * [Symbol.asyncIterator]() {
while (have_more) {
yield await get_next
}
}
async [Symbol.iterator]() {
return this.values()
}
async * values() {
return result_array.values()
}
}
You can make a wrapper with a generator that uses db.each to do this… Note
that node-sqlite can't stop the .each call.
I use a different approach: I use a wrapper that does a LIMIT 100 .all()
and then processes those results and then it repeats until it processed all
data. It is not atomic but I don't need it to be in my case.
https://github.com/StratoKit/strato-db/blob/28c9ef581074e249e050840fa1dc5960be7c88ed/src/JsonModel/JsonModel.js#L644-L650
>
node-sqlite3 is just a nodejs binding for sqlite. Even the sqlite program itself doesn't offer an interface for streaming SELECT. A binding should have the same interfaces as the original program.
It's a fact that streaming SELECT is important and common-used, but it should be implemented in an upper-level library which uses the binding as a dependency, rather than implemented by binding itself.
Most helpful comment
Sounds great I would like it to be part of official package functionality. Maybe instead of separate package you can provide a PR that author can accept. Why streaming should not be part of sqlite package?