Mongoose: Convert populate to use $lookup

Created on 17 Dec 2015 · 31 comments · Source: Automattic/mongoose

Before:

lib.db.model('Stock').find(stockQuery).populate('product') // ...

After:

            lib.db.model('Stock').collection.aggregate([
              { $match: stockQuery },
              {
                $lookup: {
                  from: 'products',
                  localField: 'product',
                  foreignField: '_id',
                  as: 'product'
                }
              },
              {
                $project: {
                  product: {
                    $arrayElemAt: [ '$product', 0 ]
                  }
                }
              },
              { $sort: sort },
              { $limit: req.query.limit },
              { $skip: req.offset }
            ], callback);

Can we normalize this functionality for population using aggregate and $lookup into a rewrite of the populate function (so it would still function the same, but it'd use $lookup instead)?

discussion


All 31 comments

By default $lookup returns an array of results, which is why, for a single-document DBRef, I've added the $arrayElemAt operator.

Note that $lookup is only available in MongoDB 3.2

It's worth considering, but not for the foreseeable future, because I'd imagine the majority of mongoose users don't use MongoDB 3.2. Also, the only rationale I can think of for doing this would be "it improves performance considerably." While that might be the case, we'd have to run a benchmark to see.

+1 for using $lookup internally.

Also, how about implementing this conditionally? This way mongoose users who work with MongoDB 3.2+ could already benefit from the (suspected) performance improvements.

Possible, but it's not really a high priority atm. The difficult part is that $lookup only works with the aggregation framework, so .find().populate() needs to get converted into an aggregation query, which seems unwieldy.
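To make the difficulty concrete, the conversion could look roughly like this pure helper. This is a sketch, not mongoose internals: `populateToPipeline` is a hypothetical name, it assumes a single non-array ref path, and the caller must supply the foreign collection name (which mongoose would derive from the ref model).

```javascript
// Hypothetical sketch: translate a find() filter plus one populate() path
// into an equivalent aggregation pipeline. Not a mongoose API.
function populateToPipeline (filter, path, foreignCollection) {
  return [
    { $match: filter },
    {
      $lookup: {
        from: foreignCollection, // collection name of the ref'd model
        localField: path,
        foreignField: '_id',
        as: path
      }
    },
    // $lookup always produces an array; unwrap it for a single ref
    { $addFields: { [path]: { $arrayElemAt: ['$' + path, 0] } } }
  ]
}
```

Even this simple case shows the unwieldiness: sort, skip, limit, and field selection from the original query would all need to be appended as additional stages in the right order.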

+1 for $lookup.

Are there real benefits to using $lookup instead of populate? That's the point; however, MongoDB 3.2 is still too new for the majority of users.

Performance in some cases - populate() requires 2 distinct queries to mongodb and thus 2 sequential round trips over the network, $lookup only takes 1. However, there's ways to structure your schema so populate-like operations can be done in parallel, so $lookup doesn't help as much as you might think.
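One way to read "structure your schema so populate-like operations can be done in parallel": if both collections can be queried off a key you already hold (rather than needing the first query's results to build the second), the two round trips overlap. A minimal sketch, assuming a hypothetical shared `warehouse` key on both collections; `findStocks`/`findProducts` stand in for model queries:

```javascript
// Sketch: both queries are keyed off data we already have (warehouseId),
// so they run concurrently instead of sequentially like populate().
async function populateInParallel (findStocks, findProducts, warehouseId) {
  const [stocks, products] = await Promise.all([
    findStocks({ warehouse: warehouseId }),
    findProducts({ warehouse: warehouseId })
  ])
  // join in memory, like populate() does after its second query
  const byId = new Map(products.map(p => [String(p._id), p]))
  return stocks.map(s => ({ ...s, product: byId.get(String(s.product)) }))
}
```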

Another thing to consider about $lookup within arrays

http://stackoverflow.com/questions/34967482/lookup-on-objectids-in-an-array

Ah yeah that would be tricky to implement - we'd need to be smart enough to build an aggregation pipeline that includes $unwind for every nested array.
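The array-field workaround from that Stack Overflow thread looks roughly like the stages below. Collection and field names here are hypothetical; the point is the $unwind/$group sandwich around the $lookup:

```javascript
// Sketch: $lookup on an array of ObjectIds requires unwinding first,
// then regrouping to reassemble the populated array.
const arrayLookupPipeline = [
  { $unwind: '$items' },            // one doc per array element
  {
    $lookup: {
      from: 'products',
      localField: 'items',
      foreignField: '_id',
      as: 'items'
    }
  },
  { $unwind: '$items' },            // unwrap each 1-element lookup result
  { $group: { _id: '$_id', items: { $push: '$items' } } } // rebuild the array
]
```

A populate rewrite would have to emit this pattern automatically for every array path, including nested ones.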

+1

+1 on principle, but does anyone have any good benchmarking data?
Good news, though: it looks like the issue with $lookup on arrays will be resolved as of mongo 3.3.4: https://jira.mongodb.org/browse/SERVER-22881

+1

populate works across multiple databases. How would $lookup work then?

@sidgate that's a great point, it actually would not work at all and that's a pretty core feature for mongoose. I'm gonna close this issue because neither mongodb 3.2 nor 3.4 are going to support cross-db $lookup...

For anyone interested in an ES7 version of this, here you go!

// get paginated result of job postings using $lookup
// <https://github.com/Automattic/mongoose/issues/3683>
let [ positions, itemCount ] = await Promise.all([
  Positions.aggregate([
    { $match },
    {
      $lookup: {
        from: 'users',
        localField: 'user',
        foreignField: '_id',
        as: 'user'
      }
    },
    {
      $project: {
        user: {
          $arrayElemAt: [ '$user', 0 ]
        }
      }
    },
    { $sort },
    { $limit: ctx.query.limit },
    { $skip: ctx.paginate.skip }
  ]),
  Positions.count($match).exec()
]);

Per https://github.com/crocodilejs/crocodile-node-mvc-framework

I'm also using koa-ctx-paginate with this at https://github.com/koajs/ctx-paginate

Good stuff @niftylettuce but async/await is not ES6 :p

Typo :laughing: :rocket:

Hopefully it'll make it into ES2017 :rocket:

This doesn't change the complications of implementing this, but just a comment on how $lookup could "improve performance considerably":

Using populate on a cursor tends to be unusably slow due to the amount of single queries being performed.

One way around this could be to build a batched cursor, populate per batch, and then unpack each document in that batch and provide it as a readable stream.
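A minimal sketch of that batched-cursor idea. Everything here is hypothetical naming: `populateInBatches` collects cursor docs into fixed-size batches and `fetchRefs` stands in for the one $in query per batch that populate() would issue:

```javascript
// Populate one batch with a single query, like populate() does,
// then merge the refs back onto the docs in memory.
async function populateBatch (batch, fetchRefs) {
  const refs = await fetchRefs(batch.map(d => d.ref))
  const byId = new Map(refs.map(r => [String(r._id), r]))
  return batch.map(d => ({ ...d, ref: byId.get(String(d.ref)) }))
}

// Drain a cursor (any async iterable) in batches, populating each batch
// with one secondary query, and yield populated docs as a stream.
async function* populateInBatches (cursor, batchSize, fetchRefs) {
  let batch = []
  for await (const doc of cursor) {
    batch.push(doc)
    if (batch.length === batchSize) {
      yield* await populateBatch(batch, fetchRefs)
      batch = []
    }
  }
  if (batch.length) yield* await populateBatch(batch, fetchRefs)
}
```

This caps memory at one batch of docs while keeping populate's one-query-per-batch behavior.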

That could work. Unfortunately, $lookup still executes a separate query for each individual doc, so you're saving a round trip by using $lookup but using approximately the same db processing time and hogging the connection. I like the cursor batch idea though :+1:

@vkarpov15 What is the state of allowing population by $lookup? Can we use it whenever possible and fall back to normal population whenever necessary?

@lykmapipo not planning on doing this currently because $lookup doesn't support lookups between different databases, so this would break the populate API. Yeah we could fall back to doing conventional populate in cross-db cases but that would add a lot of code and be more work than it's really worth IMO.

Can we revisit this? Why do we need to do cross database lookups?

Seems like closing this issue due to cross-db issues is a cheap excuse, the feature would be excellent.

Per my chat with @vkarpov15 on Slack, can someone in the community create a performance test with these requirements to compare $lookup vs populate?

6 concurrent $lookups with 2 fast findOne() queries VS. 6 populates with 2 fast findOne() queries
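A harness for that kind of test could be as simple as the sketch below (a hypothetical helper, matching the "median of N replicates" reporting used in the benchmark that follows): run `concurrency` copies of an async operation at once and take the median wall time.

```javascript
// Micro-benchmark sketch: median latency (ms) of `replicates` runs,
// each run firing `concurrency` copies of `op` concurrently.
async function benchMedian (op, concurrency, replicates) {
  const samples = []
  for (let r = 0; r < replicates; ++r) {
    const start = process.hrtime.bigint()
    await Promise.all(Array.from({ length: concurrency }, () => op()))
    samples.push(Number(process.hrtime.bigint() - start) / 1e6)
  }
  samples.sort((a, b) => a - b)
  return samples[Math.floor(samples.length / 2)]
}
```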

Here's a handful of benchmarks with MongoDB 3.4 (v3.2 was similar). Numbers are the median of 15 replicates.

findOne

| query | 1x concurrency | 6x | 15x |
| --- | --- | --- | --- |
| primary query | 0.911 ms | 9.226 ms | 19.55 ms |
| primary query, then 2nd query for relationship (~.populate()) | 1.201 ms | 5.782 ms* | 11.44 ms |
| $lookup | 0.598 ms | 1.763 ms | 4.575 ms |

*Note: not sure why this is faster than first row; might be cached at some layer, but it's consistent.

find returning array of 140 docs

| query | 1x concurrency |
| --- | --- |
| primary query | 3.401 ms |
| primary query, then 2nd query for relationship (~.populate()) | 6.872 ms |
| $lookup | 8.705 ms |

Code: https://gist.github.com/zbjornson/746a477ee715685fb60963d9e3c5d2d1. The foreignField was indexed, and I verified that the index was used. The collection had about 60k documents in it. I used the mongodb driver directly for the benchmark to show the best possible case.

Looks like it helps with latency for individual documents, but is detrimental for arrays (low-throughput). In the MongoDB profiler it looks like each $lookup goes out as a separate query, so that makes sense.

There are things I found that work with populate but not with $lookup. I think a model that looks like the following doesn't work well with $lookup (I can't find the $lookup I wrote for it, though).

Schema({
  nestedObject: {
    item1: { name: 'tuna' },
    list: [{ type: ObjectId, ref: 'AnotherModel' }]
  }
});

I will try to find a more concrete example, but that was my recollection of my attempts at replacing a nested populate.

@tunatoksoz yeah if you want to use $lookup with an array you need to add a $unwind.

@zbjornson yeah that doesn't surprise me too much. $lookup does end up as a separate query for each doc because of how aggregation works, so the performance impact isn't quite as cut-and-dried. I'd imagine the concurrency impact is even bigger on .find()...

Seems like closing this issue due to cross-db issues is a cheap excuse, the feature would be excellent.

Just wanted to weigh in that the cross-db feature has been a time saver for us, and I think it's one of those things that really makes mongoose an ODM: an extra layer on top of the base driver with additional features.

For future reference, here's an example of where $lookup gets super slow:

  const db = await mongodb.MongoClient
    .connect('mongodb://localhost:27017/test', {
      useNewUrlParser: true,
      poolSize: 1 // Only 1 operation can run at a time
    })
    .then(client => client.db())

  await db.dropDatabase()

  for (let i = 0; i < 5000; ++i) {
    await db.collection('Foo').insertOne({ _id: i })
  }
  console.log('Inserted foo docs')

  for (let i = 0; i < 5000; ++i) {
    await db.collection('Bar').insertOne({ _id: i, fooId: i })
  }
  console.log('Inserted bar docs')

  // This aggregation takes 150ms on my laptop. That's because it executes one full
  // collection scan of 'Bar' for every document in 'Foo': 5000 collection scans in total
  const promise = db.collection('Foo').aggregate([
    { $lookup: { from: 'Bar', localField: '_id', foreignField: 'fooId', as: 'bars' } }
  ]).toArray()

Versus what populate does:

  let startTime = Date.now()
  const foos = await db.collection('Foo').find().toArray()
  // "Found foos in 28ms"
  console.log(`Found foos in ${Date.now() - startTime}ms`)

  startTime = Date.now()
  const bars = await db.collection('Bar').find({ fooId: { $in: foos.map(f => f._id) } }).toArray()
  // "Found bars in 15ms"
  console.log(`Found bars in ${Date.now() - startTime}ms`)
  const barMap = bars.reduce((map, b) => map.set(b.fooId.toString(), b), new Map())

  for (const foo of foos) {
    foo.bar = barMap.get(foo._id.toString())
  }

$lookup compounds index misses; populate() minimizes their impact because it executes only one query per populate(), regardless of the number of docs being populated.
