I tested summing a field over 10 million documents with mongoose's stream() and with the native mongodb driver's stream(). The result is really disappointing. Here are my results and my code.
Result:
My code:
var mongoose = require('mongoose');

var db = mongoose.createConnection('localhost', 'Yanxin');
var schema = mongoose.Schema({data: mongoose.Schema.Types.Mixed}, {_id: false});
var collection = 'fiverecord';
var model = db.model(collection, schema);

var sum = 0;
var timeBegin = new Date().getTime();
var stream = model.find({'_id.rdv': '/test/', 'data.number': {$exists: true}}, {'_id': 0}).stream();

stream.on('data', function(doc) {
  sum += doc.data.number;
}).on('error', function(err) { // the stream emits 'error', not 'err'
  console.log('<<< err is: ' + err);
}).on('close', function() {
  console.log('----------sum is: ' + sum + '--------------');
  console.log('----------Time is: ' + (new Date().getTime() - timeBegin) + '--------------');
  db.close();
});
var Db = require('mongodb').Db,
    Server = require('mongodb').Server, // Server was missing from the original requires
    assert = require('assert');

var db1 = new Db('DBname', new Server('127.0.0.1', 27017, {auto_reconnect: false, poolSize: 5}), {w: 0, native_parser: false});
var timeBegin = new Date().getTime();

db1.open(function(err, db) {
  db.createCollection('CollectionNames', function(err, collection) {
    assert.equal(null, err);
    var stream = collection.find({'_id.rdv': '/test/', 'data.number': {$exists: true}}, {_id: 0, data: 1}).stream();
    var sum = 0;
    stream.on('data', function(item) {
      sum += item.data.number;
    });
    stream.on('error', function(err) {
      console.log(err);
    });
    stream.on('close', function() {
      console.log('----------sum is: ' + sum + '--------------');
      console.log('----------Time is: ' + (new Date().getTime() - timeBegin) + '--------------');
      db.close();
    });
  });
});
I don't really understand why it causes such a big difference. Can anyone explain it?
A few things. First, this is not surprising. Mongoose is an Object Document Mapper: it wraps each document returned from MongoDB in a custom object decorated with getters, setters, hooked methods, validation, etc. This has a cost.
I ran my test on a collection with 4,485,326 documents.
With the default settings, on average mongoose ran 3.x+ slower than the raw driver.
The first thing to do to tweak performance is to adjust the batchSize option. With mongoose this is exposed through the query.batchSize(1000) method. On the driver you pass it as an option to collection.find(criteria, fields, { batchSize: 1000 }).
Here are my results running with batchSize set to 1000.
----------sum is: 10059072420475--------------
----------Time is: 101807--------------
running native driver test...
native
----------sum is: 10059072420475--------------
----------Time is: 29238--------------
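A back-of-envelope way to see why batchSize matters: each getMore round trip returns one batch, so the number of round trips for a full scan is roughly the document count divided by the batch size. The sketch below is a simplification (it ignores the driver's smaller first batch and the reply size cap):

```javascript
// Rough estimate of server round trips for a full collection scan:
// one getMore per batch (simplification: uniform batches, no size caps).
function roundTrips(docCount, batchSize) {
  return Math.ceil(docCount / batchSize);
}

// For the 4,485,326-document test collection above, batchSize 1000
// means on the order of ~4,486 round trips for the whole cursor.
```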
The next thing to do is enable lean reads with mongoose. This bypasses the document-mapper part of mongoose and returns the raw documents directly from the driver. It is enabled by calling query.lean(). The final mongoose code looks like:
var stream = A.find({'data.number': {$exists: true}}, {'_id': 0, data: 1}).lean().batchSize(1000).stream();
The results:
----------sum is: 10059072420475--------------
----------Time is: 21731--------------
running native driver test...
native
----------sum is: 10059072420475--------------
----------Time is: 25689--------------
That mongoose ran faster than the driver is a fluke, the driver cannot be faster than itself :) but you get the idea.
@aheckmann, I'm having a similar concern. I have written something very similar to your example with lean, batchSize and stream. I understand that batchSize doesn't affect the streamed data but rather the size of each getMore request. Nonetheless, it would feel natural for the stream's data to correspond to what the server sent. In your example you specify a batchSize of 1000, but the data event still processes one document at a time. Since those 1000 records are already in memory, why not just emit them as an array in one go?
My current workaround is to aggregate them back (!), pause the stream when the buffer exceeds a limit (which is in fact the batchSize), and resume it once my processing is finished (it consists of sending the data through a socket to the client, where they are processed further, so I can't send them one by one). But this unnecessary work of splitting (in mongoose) and re-aggregating (on my side) is surely costly. Maybe there is already an option I've missed that does what I'd like?
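The workaround described above can be sketched as a small helper (makeBatcher is not a mongoose API, just an illustration) that regroups single-document data events into arrays of a chosen size:

```javascript
// Hypothetical helper (not part of mongoose): regroups single-document
// 'data' events into arrays of `size`, roughly mirroring the batches the
// server already sent in each getMore reply.
function makeBatcher(size, onBatch) {
  var buf = [];
  return {
    push: function (doc) { // call from the stream's 'data' handler
      buf.push(doc);
      if (buf.length >= size) {
        onBatch(buf);
        buf = [];
      }
    },
    flush: function () { // call from the 'close' handler for the last partial batch
      if (buf.length > 0) {
        onBatch(buf);
        buf = [];
      }
    }
  };
}
```

Wiring it up would look something like `stream.on('data', batcher.push).on('close', batcher.flush)`, with onBatch doing the socket send; pausing and resuming the stream around onBatch would still be needed if the consumer is slow.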
No option to do that AFAIK. However, getting docs one at a time is a pretty fundamental part of how cursors work in MongoDB, so mongoose would have a hard time supporting that behavior because we'd have to reach into the driver's cursor abstraction.
So you mean that MongoDB hands mongoose one document at a time anyhow, regardless of the batchSize? Does it work the same way when we run a query without streaming (just a promise)?
Still, it's mongoose that is emitting potentially tens of thousands of events, one per document (or do these events come directly from MongoDB?). This is subjective without a way to compare, but I really have a feeling there is room for performance improvement here.
No, the driver internally loads batchSize docs from the database; the "one at a time" behavior is enforced by the driver. Even when you don't stream, mongoose just repeatedly calls next() on the cursor to get all the docs. Without changing the mongodb driver's cursor API, the only improvement we could really make is to have mongoose aggregate the docs for you rather than letting you do it, which wouldn't help much with performance. You can always bypass the driver entirely and handle calling getMore() yourself if you're really itching for the extra performance boost.
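The "repeatedly calls next()" pattern described above can be sketched against a stub cursor. A real driver cursor exposes the same next(callback) shape, calling back with a null document once the cursor is exhausted:

```javascript
// Sketch of the pattern: pull documents one at a time by calling next()
// until the cursor signals exhaustion with a null document.
function sumCursor(cursor, done) {
  var sum = 0;
  (function loop() {
    cursor.next(function (err, doc) {
      if (err) return done(err);
      if (doc == null) return done(null, sum); // cursor exhausted
      sum += doc.data.number;
      loop();
    });
  })();
}

// Stub cursor over an in-memory array, for illustration only; a real
// mongodb cursor calls back asynchronously after each internal getMore.
function stubCursor(docs) {
  var i = 0;
  return {
    next: function (cb) {
      cb(null, i < docs.length ? docs[i++] : null);
    }
  };
}
```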
Thank you for the explanation Valeri, I may want to avoid reinventing the wheel myself :)