I tested summing a field over 10 million documents with mongoose's stream() and with the native mongodb driver's stream(). The result is really disappointing. Here are my results and my code.
Result:
My code:
var mongoose = require('mongoose');

var db = mongoose.createConnection('localhost', 'Yanxin');
var schema = mongoose.Schema({data: mongoose.Schema.Types.Mixed}, {_id: false});
var collection = 'fiverecord';
var model = db.model(collection, schema);

var sum = 0;
var timeBegin = new Date().getTime();
var stream = model.find({'_id.rdv': '/test/', 'data.number': {$exists: true}}, {'_id': 0}).stream();

stream.on('data', function(doc) {
  sum += doc.data.number;
}).on('error', function(err) { // the stream emits 'error', not 'err'
  console.log('<<< err is: ' + err);
}).on('close', function() {
  console.log('----------sum is: ' + sum + '--------------');
  console.log('----------Time is: ' + (new Date().getTime() - timeBegin) + '--------------');
  db.close();
});
var Db = require('mongodb').Db,
    Server = require('mongodb').Server, // Server was missing from the original requires
    assert = require('assert');

var db1 = new Db('DBname', new Server('127.0.0.1', 27017, {auto_reconnect: false, poolSize: 5}), {w: 0, native_parser: false});
var timeBegin = new Date().getTime();

db1.open(function(err, db) {
  db.createCollection('CollectionNames', function(err, collection) {
    assert.equal(null, err);
    var stream = collection.find({'_id.rdv': '/test/', 'data.number': {$exists: true}}, {_id: 0, data: 1}).stream();
    var sum = 0;
    stream.on('data', function(item) {
      sum += item.data.number;
    });
    stream.on('error', function(err) {
      console.log(err);
    });
    stream.on('close', function() {
      console.log('----------sum is: ' + sum + '--------------');
      console.log('----------Time is: ' + (new Date().getTime() - timeBegin) + '--------------');
      db.close();
    });
  });
});
I don't really understand why it causes such a big difference. Can anyone explain it?
A few things. First, this is not surprising. Mongoose is an Object Document Mapper: it wraps each document returned from MongoDB in a custom object decorated with getters, setters, hooked methods, validation, etc. This has a cost.
I ran my test on a collection with 4,485,326 documents.
With the default settings, on average mongoose ran 3.x+ slower than the raw driver.
The first thing to do to tweak performance is to adjust the batchSize option. With mongoose this is exposed through the query.batchSize(1000) method. On the driver you pass it as an option to collection.find(criteria, fields, { batchSize: 1000 }).
Here are my results running with batchSize set to 1000.
----------sum is: 10059072420475--------------
----------Time is: 101807--------------
running native driver test...
native
----------sum is: 10059072420475--------------
----------Time is: 29238--------------
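A back-of-envelope way to see why batchSize matters: each getMore round trip returns one batch, so the number of round trips for a full scan is roughly the document count divided by the batch size. The sketch below is a simplification (it ignores the driver's smaller first batch and the reply size cap):

```javascript
// Rough estimate of server round trips for a full collection scan:
// one getMore per batch (simplification: uniform batches, no size caps).
function roundTrips(docCount, batchSize) {
  return Math.ceil(docCount / batchSize);
}

// For the 4,485,326-document test collection above, batchSize 1000
// means on the order of ~4,486 round trips for the whole cursor.
```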
The next thing to do is enable lean reads with mongoose. This bypasses the document-mapper part of mongoose and returns the raw documents directly from the driver. It is enabled by calling query.lean(). The final mongoose code looks like:
var stream = A.find({'data.number': {$exists: true}}, {'_id': 0, data: 1}).lean().batchSize(1000).stream();
The results:
----------sum is: 10059072420475--------------
----------Time is: 21731--------------
running native driver test...
native
----------sum is: 10059072420475--------------
----------Time is: 25689--------------
That mongoose ran faster than the driver is a fluke, the driver cannot be faster than itself :) but you get the idea.
@aheckmann, I'm having a similar concern. I have written something very similar to your example with lean, batchSize and stream. I understand that batchSize doesn't affect the streamed data but rather the size of each getMore request. Nonetheless, it would feel natural for the stream's data to correspond to what the server sent. In your example you specify a batchSize of 1000, but the data event still processes one document at a time. Since those 1000 records are already in memory, why not just emit them as an array in one go?
My current workaround is to aggregate them back (!), pause the stream when the buffer exceeds a limit (which is in fact the batchSize), and resume it once my processing is finished (it consists of sending the data through a socket to the client, where they are processed further, so I can't send them one by one). But this unnecessary work of splitting (in mongoose) and re-aggregating (on my side) is surely costly. Maybe there is already an option I've missed that does what I'd like?
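The workaround described above can be sketched as a small helper (makeBatcher is not a mongoose API, just an illustration) that regroups single-document data events into arrays of a chosen size:

```javascript
// Hypothetical helper (not part of mongoose): regroups single-document
// 'data' events into arrays of `size`, roughly mirroring the batches the
// server already sent in each getMore reply.
function makeBatcher(size, onBatch) {
  var buf = [];
  return {
    push: function (doc) { // call from the stream's 'data' handler
      buf.push(doc);
      if (buf.length >= size) {
        onBatch(buf);
        buf = [];
      }
    },
    flush: function () { // call from the 'close' handler for the last partial batch
      if (buf.length > 0) {
        onBatch(buf);
        buf = [];
      }
    }
  };
}
```

Wiring it up would look something like `stream.on('data', batcher.push).on('close', batcher.flush)`, with onBatch doing the socket send; pausing and resuming the stream around onBatch would still be needed if the consumer is slow.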
No option to do that AFAIK. However, getting docs one at a time is a pretty fundamental part of how cursors work in MongoDB, so mongoose would have a hard time supporting that behavior because we'd have to reach into the driver's cursor abstraction.
So you mean that MongoDB hands mongoose one document at a time anyhow, regardless of the batchSize? Does it work the same way when we run a query without streaming (just a promise)?
Still, it's mongoose that is emitting potentially tens of thousands of events, one per document (or do these events come directly from MongoDB?). This is subjective without a way to compare, but I really have a feeling there is room for performance improvement here.
No, the driver internally loads batchSize docs from the database; the "one at a time" behavior is enforced by the driver. Even when you don't stream, mongoose just repeatedly calls next() on the cursor to get all the docs. Without changing the mongodb driver's cursor API, the only improvement we could really make is to have mongoose aggregate the docs for you rather than letting you do it, which wouldn't help much with performance. You can always bypass the driver entirely and handle calling getMore() yourself if you're really itching for the extra performance boost.
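The "repeatedly calls next()" pattern described above can be sketched against a stub cursor. A real driver cursor exposes the same next(callback) shape, calling back with a null document once the cursor is exhausted:

```javascript
// Sketch of the pattern: pull documents one at a time by calling next()
// until the cursor signals exhaustion with a null document.
function sumCursor(cursor, done) {
  var sum = 0;
  (function loop() {
    cursor.next(function (err, doc) {
      if (err) return done(err);
      if (doc == null) return done(null, sum); // cursor exhausted
      sum += doc.data.number;
      loop();
    });
  })();
}

// Stub cursor over an in-memory array, for illustration only; a real
// mongodb cursor calls back asynchronously after each internal getMore.
function stubCursor(docs) {
  var i = 0;
  return {
    next: function (cb) {
      cb(null, i < docs.length ? docs[i++] : null);
    }
  };
}
```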
Thank you for the explanation Valeri, I may want to avoid reinventing the wheel myself :)