Litedb: Improve performance by only populating properties requested in select statement

Created on 9 Mar 2018 · 16 Comments · Source: mbdavid/LiteDB

Hi,

I've noticed that when querying very large data sets (I'm querying and retrieving 60000 docs), it can take quite a while for the data to be retrieved.

I tried to address this by selecting only the particular properties of those documents that I am interested in, but when I checked the returned document in the debugger I could see that all of its properties were populated, even though I was not using them in the select.

I can only assume that LiteDB would be much faster if these unused properties were not deserialized.
If I understand the BSON format correctly, it should be possible to deserialize only the needed properties.
Would it be possible to restrict which parts of the document are deserialized from the db?

Because a property on the deserialized object could depend on any part of the underlying document, it is not always possible to determine automatically which data needs to be deserialized. But if an expression lambda were supported instead, complex property calls would not be allowed, so the required fields could be determined automatically.

The other option would be an optional "property deserialize" whitelist on the find query, or something similar.

Is this sort of thing possible? Or already available?

I think it would radically improve LiteDB's performance.

anthonypcole

Most helpful comment

Hi @toneo, just to let you know, I implemented this partial document deserialization in v5. It was fantastic... it works as I expected. Now, if you query documents using an expression to select (or transform) the result, LiteDB will read only the data needed to deserialize those fields (only document root fields are supported). So, a query like this:

```C#
var q = db.Query("customer")
    .Where("_id = @0", 100)
    .Select("{ _id, name: UPPER(name), age: YEAR(DATE()) - YEAR(birthday) }")
    .ToList();
```

will read and deserialize only the `_id`, `name` and `birthday` fields. If your `customer` document contains 100 fields (including subdocuments), only these three fields will be loaded. As a next step, the same selection will be possible on an `Include` DbRef document, like:

```C#
var q = db.Query("order")
    .Include("customer", "_id", "name")
    .Where("customer.name = @0", "John")
    .ToList();
```

Ever since the first version I made, I always thought this change would never be possible because it was too complex to implement and there was no need for it. But with big documents it makes all the difference, and performance improves a lot. Again, thanks for contributing.

mbdavid on 22 Mar 2018

👍2

All 16 comments

Hi @toneo, in the BSON format it's not possible to read/populate a partial document, because when you deserialize you have no information about where each field is located (unlike in a relational, tabular database).

If you need to query/retrieve 60,000 documents and each document has too many fields, you can create another collection containing only the subset of fields you want.
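The workaround above could be sketched roughly as follows. This is a minimal sketch, not something taken from the thread: the class and method names (`SlimCollection.Build`) and the collection and field names in the usage line are hypothetical, and it assumes the LiteDB NuGet package is referenced. It loads every source document once, so it is meant to run as an occasional maintenance step rather than on each query.

```C#
using System.Linq;
using LiteDB;

public static class SlimCollection
{
    // Hypothetical helper: copies only the listed root fields of every
    // document in `source` into `target`, keeping the original _id so the
    // slim documents can still be matched to the full ones.
    public static int Build(LiteDatabase db, string source, string target, params string[] fields)
    {
        var full = db.GetCollection(source);
        var slim = db.GetCollection(target);
        int copied = 0;

        // Materialize the read first so it has finished before the writes start.
        foreach (var doc in full.FindAll().ToList())
        {
            var subset = new BsonDocument { ["_id"] = doc["_id"] };
            foreach (var field in fields) subset[field] = doc[field];
            slim.Upsert(subset); // insert, or replace an older copy with the same _id
            copied++;
        }
        return copied;
    }
}
```

A call might look like `SlimCollection.Build(db, "customer", "customer_slim", "name", "birthday")`, after which the large result set is queried from `customer_slim` instead of `customer`. The slim collection has to be rebuilt or kept in sync whenever the source documents change.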

In the new v5 version it will be possible to create an index with only the fields you want and to query the index only (the current version has no support for index-only queries).

mbdavid on 10 Mar 2018

Hi David,

Thanks for the response, and I also want to mention what a great lib you've made!

I will try making a sub document just as you mentioned. That should work nicely.

However, I just read that the BSON format supports fast traversability, so you can go directly to the part you want, basically because of the tree map thing they have.

Given a list of paths/properties, couldn't a custom deserializer go to only the paths specified?

If this would work, you would have to build a much more complex deserializer, of course. From looking at the BSON code in .NET, it looks like it might be possible.

anthonypcole on 10 Mar 2018

Hi @toneo, yes, you are right. It's possible to read only selected fields when deserializing BSON. I always thought of a BSON document as key-value pairs with "simple" values only. In that case, there is no difference between reading a value (like an int) and skipping it. But it makes complete sense when the choice is between reading another subdocument (or array) and skipping all of its content.

But to implement this I will need to change how data is fetched from the datafile. Today, when you read binary data (before deserialization), all of it is loaded into a byte[]. That makes no sense if you have a large document (allocating many extend pages) and load all of it into memory just to deserialize a few fields. This read must be changed to return some IEnumerable or Stream data.
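The idea in the two paragraphs above can be illustrated with a small stand-alone sketch. It is not LiteDB's implementation: the `PartialBson` class and its helpers are hypothetical, it handles only a handful of BSON types, and it assumes a seekable `Stream`. It relies on the BSON layout, where strings, subdocuments and arrays carry a size prefix, so unwanted values can be jumped over without being read.

```C#
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class PartialBson
{
    // Reads only the requested root fields from a seekable stream.
    // Every other value is skipped by its size and never materialized.
    public static Dictionary<string, object> ReadFields(Stream bson, ISet<string> wanted)
    {
        var result = new Dictionary<string, object>();
        var reader = new BinaryReader(bson); // little-endian, like BSON
        reader.ReadInt32();                  // total document size, unused here
        while (result.Count < wanted.Count)  // stop once everything is found
        {
            byte type = reader.ReadByte();
            if (type == 0x00) break;         // document terminator
            string name = ReadCString(reader);
            if (wanted.Contains(name)) result[name] = ReadScalar(reader, type);
            else Skip(reader, type);
        }
        return result;
    }

    static string ReadCString(BinaryReader r)
    {
        var bytes = new List<byte>();
        for (byte b = r.ReadByte(); b != 0; b = r.ReadByte()) bytes.Add(b);
        return Encoding.UTF8.GetString(bytes.ToArray());
    }

    static object ReadScalar(BinaryReader r, byte type)
    {
        switch (type)
        {
            case 0x01: return r.ReadDouble();
            case 0x10: return r.ReadInt32();
            case 0x02:
                int length = r.ReadInt32();  // byte count including trailing 0x00
                return Encoding.UTF8.GetString(r.ReadBytes(length), 0, length - 1);
            default:
                throw new NotSupportedException($"Type 0x{type:X2} is not read by this sketch");
        }
    }

    static void Skip(BinaryReader r, byte type)
    {
        long size;
        switch (type)
        {
            case 0x01: case 0x12: size = 8; break;    // double, int64
            case 0x10: size = 4; break;               // int32
            case 0x08: size = 1; break;               // boolean
            case 0x0A: size = 0; break;               // null
            case 0x02: size = r.ReadInt32(); break;   // string: length prefix
            case 0x03: case 0x04:                     // document/array:
                size = r.ReadInt32() - 4; break;      // size counts its own 4 bytes
            default:
                throw new NotSupportedException($"Type 0x{type:X2} is not skipped by this sketch");
        }
        r.BaseStream.Seek(size, SeekOrigin.Current);
    }
}

public static class Program
{
    static void Key(BinaryWriter w, byte type, string name) { w.Write(type); w.Write(Encoding.UTF8.GetBytes(name)); w.Write((byte)0); }
    static void Str(BinaryWriter w, string s) { w.Write(Encoding.UTF8.GetByteCount(s) + 1); w.Write(Encoding.UTF8.GetBytes(s)); w.Write((byte)0); }

    // Wraps the written elements with the int32 size prefix and 0x00 terminator.
    static byte[] Doc(Action<BinaryWriter> elements)
    {
        var body = new MemoryStream();
        var bw = new BinaryWriter(body);
        elements(bw);
        bw.Flush();
        var doc = new MemoryStream();
        var w = new BinaryWriter(doc);
        w.Write((int)body.Length + 5); w.Write(body.ToArray()); w.Write((byte)0);
        w.Flush();
        return doc.ToArray();
    }

    public static void Main()
    {
        byte[] address = Doc(w => { Key(w, 0x02, "city"); Str(w, "Somewhere"); });
        byte[] customer = Doc(w =>
        {
            Key(w, 0x10, "_id"); w.Write(100);
            Key(w, 0x03, "address"); w.Write(address); // skipped as one block
            Key(w, 0x02, "name"); Str(w, "John");
        });
        var fields = PartialBson.ReadFields(new MemoryStream(customer), new HashSet<string> { "_id", "name" });
        Console.WriteLine($"{fields["_id"]} {fields["name"]}"); // prints "100 John"
    }
}
```

In this sketch the `address` subdocument is passed over with a single `Seek`, which is where the saving comes from. Skipping an int costs about the same as reading it, as noted above. Because `ReadFields` takes a `Stream` rather than a `byte[]`, a large document would not have to be fully loaded before the wanted fields are found.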

Thanks for this, it made me think about the problem and how to implement it. I think it's quite possible to implement. I will start on this in my v5 branch.

mbdavid on 10 Mar 2018

Hi David,

That's awesome!

I will try it out. Yeah, I think this will be a great improvement.
As far as I can tell, your project is the only local, no-setup-required NoSQL database. Adding these sorts of improvements really makes it a good solution for all sorts of large projects.

I'm actually using LiteDB as a client-side cache for a cloud MongoDB data source. None of the other solutions really gave me the control I needed.

Thanks so much for the great project!

anthonypcole on 31 Mar 2018

Hi @toneo, thanks a lot for this! Writing LiteDB is at its best when I read comments like this. Thanks!

mbdavid on 1 Apr 2018

@mbdavid
Amazing to use with GraphQL. Just a simple FieldAst to Query translator and you would not only request less data, no, there is also less data to process.. sweet as!

Cannot wait till v5 :D

PascalSenn on 20 Apr 2018

👍1

@PascalSenn, yes, partial BSON deserialization will improve performance on large documents. In v5, there is no more "Query" class (OK, it will still exist, just to keep compatibility), but everything is now a BsonExpression. It will be much easier to implement this translator.

mbdavid on 22 Apr 2018

Glad you said this. I've actually written a very basic translator from my GraphQL filter wrapper to the "Query" classes. I was going to do some more work on it in the future, but now I'll wait. 😄

PascalSenn on 23 Apr 2018

Hello,
First of all, thank you for your great library!
I'm also really interested in this HUGE partial deserialization improvement. When will v5 (or a preview) be available on NuGet?
Thank you!

Gerardo-Sista on 27 Mar 2019

Hi @Gerardo-Sista, I'm back to writing v5 after 4 months away. The new v5 is a huge step from v4, and I'm rewriting almost everything to improve performance and use fewer resources...
I have no release date yet, but I will need at least 3 months to review and write unit tests, so I think Q3 is more likely.

mbdavid on 27 Mar 2019

Thank you.
Can't wait till v5 is ready! :)

Gerardo-Sista on 31 Mar 2019

@mbdavid Now that v5 is released, has this feature been implemented?

I'm trying something like `col.Query().Select("{ _id }").ToDocuments().Select(d => d["_id"].AsString).ToArray()` to get only the stored ids. Will that be an optimized index query?

viceice on 7 Sep 2020

Hi @viceice, yes, this query will get only the _id field from the index; there is no document deserialization. You can check this with EXPLAIN in LiteDB Studio:

EXPLAIN SELECT _id FROM col

You will see that the document lookup is based on the index only.

mbdavid on 9 Sep 2020

👍1

@mbdavid Any hint on how to execute such a SQL-like string on a database / collection?

viceice on 10 Sep 2020

@viceice You can use `db.Execute("EXPLAIN SELECT _id FROM col").ToList()`.
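For anyone wanting to try this outside LiteDB Studio, a minimal sketch might look like the following. It assumes LiteDB v5 from NuGet; the `ExplainDemo` class name and the sample document are made up for illustration, and the database lives in a `MemoryStream` so no file is created.

```C#
using System;
using System.IO;
using LiteDB;

public static class ExplainDemo
{
    public static void Main()
    {
        // In-memory database so the sketch leaves no file behind.
        using (var db = new LiteDatabase(new MemoryStream()))
        {
            var col = db.GetCollection("col");
            col.Insert(new BsonDocument { ["_id"] = "a1", ["name"] = "first" });

            // Execute returns a data reader; ToList() drains it into BsonValue items.
            foreach (var plan in db.Execute("EXPLAIN SELECT _id FROM col").ToList())
            {
                // Print the query plan as JSON to inspect how the lookup is done.
                Console.WriteLine(JsonSerializer.Serialize(plan));
            }
        }
    }
}
```

The printed plan is what LiteDB Studio shows for the same statement, so it can be used to check whether the lookup is index-only.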