I'm using LiteCollection's Find method to make some complex queries:
IEnumerable
Using this particular method works well since it allows me to dynamically create complex expressions and pass them as a parameter, but I'm not able to get this working with pagination at all. If I call:
myCol.Find(predicate, skip, limit).OrderBy(x => x.name)
It seems liteDB is first returning the collection already trimmed and only then apply the sorting to it which obviously causes incorrect items to be returned.
For instance if I have:
{id: 1; name : "AA"}
{id: 2; name : "AC"}
{id: 3; name : "AB"}
{id: 4; name : "AD"}
And I call Find(predicate, 0, 2) (assuming predicate will not filter anything in this case) I will get:
{id: 1; name : "AA"}
{id: 2; name : "AC"}
Instead of:
{id: 1; name : "AA"}
{id: 3; name : "AB"}
Is there any way to get the correct behavior? For this to work the sorting should be applied before the engine returns the data. Otherwise using OrderBy doesn't seem to have any real use to me unless you don't use skip/limit at all.
Thanks
Hi @cardosso, LiteDB has no built-in order by implementation, so you need use Linq-to-Object order by extensions from IEnumerable (as you did). But, in this case, you already use Skip/Limit from internal query, so your results will be not as you want.
In this case, do not use skip/limit parameter and use Take/Skip extension methods after your OrderBy method.
Thanks for the reply, but using Take/Skip will mean all results will be returned by the engine, marshalled into objects and only then the sort is applied am I correct?
For very large datasets this will have a huge performance (and memory) impact. If the user is doing text search and only types 2 letters this will return easily tens of thousands of items and converting all these to actual objects would be very costly.
Ok, it's exactly what I thought and data will be converted to objects before sorting so this won't work for me.
Would using queries instead of expressions work?
This seem to work correctly:
Query query = query = Query.And(Query.All("name", Query.Ascending), Query.Contains("name", filterData.name));
Would this make only the filtered results to be converted to objects instead of all of them? Apparently yes, but I'm not sure about the implementation details so I'd rather ask. The only place to use sorting seems to be in Query.All method and this is why I'm having some doubts on how exactly this works.
If so, this could be a solution, but I'd need to spend quite some time refactoring my dynamic filter generation to use query statements instead of expressions which I'm not sure I will be able to do them all. But if it does work this way, it would be the only option for me performance wise.
Thanks
When you search using an index, your result will be return sorted according this index. In you example, if you have an index in name (using EnsureIndex) you can only do:
Query.Contains("name", filterData.name)
You result will be name sorted. But, if you want another field to search, than you can add Query.All("name"). It's not best solution too, because you are not using index to search, only to order. Using this Query.All() as left Query.And parameter, you will get all documents in name order, than, for each one will be deserialize and apply another filter (like Query.Contains).
I'm modeling new LiteDB 5, and the major feature will be run in server process and access via REST/TCP protocol. In this case, I will need implement OrderBy in server. I'm studying more about this.
Thanks for the reply David.
So that means if my collection has say 5 million documents it will deserialize them all before filtering. That might be problematic for me then.
I want everything to be sorted by "name", but my query is dynamic so it might not include the "name" field at all, I suppose my only option would be using the approach I'm using with Query.All("name") right?
Please bare with me for a bit longer only so I can get the grasp of how queries work exactly.
1) What if I change the query order in the AND? Would it make a difference? ie
Query query = query = Query.And(Query.Contains("name", filterData.name), Query.All("name", Query.Ascending));
Would this first filter by name and only then apply sorting to the filtered subset instead of all documents?
2) How does the automatic index sorting work if you use multiple conditions? I now understand that doing Query.Contains("name", filterData.name) results will be sorted by name, but what order will be applied if I have something like:
Query.And(Query.Contains("name", filterData.name), Query.EQ("age", 20))
Will it be sorted by the field in the first condition? It won't have any defined order?
3) How can I filter on inner document fields? This doesn't seem to work:
Query.EQ("address.street", filterData.address)
Using a predicate expression instead of queries it works fine:
x => x.address.street.Equals(filterData.address)
So is this done differently when using queries?
Thanks
Hi,
If you want your results ordered by name you need deserialize all results documents (ignoring skip/limit) to first order and then apply skip/limit. If you are not using name index, there is no way to avoid this. If you collection has 5 millions documents but you query returns only 5k using another index (like age field), this 5k need to be deserialized (to order by name).
1) You don't need use And with Contains + All in same query. When engine use first name index to get contains all documents are came in name order (asc).
2) Left first. If your And contains 2 indexes, will be used Left query. It麓s not best solution, but it麓s easy to know. Look here: https://github.com/mbdavid/LiteDB/blob/master/LiteDB/Engine/Query/QueryAnd.cs#L41
3) Is this field have an index? Because your predicate expression will be converted in same result you wrote:
x => x.address.street.Equals(filterData.address)
=
Query.EQ("address.street", filterData.address)
I will write an unit test this.
Hi David,
Regarding 1. that was just a simple example, as I said my filters are dynamic so they are normally much more complex than that. Let me give you a better example:
Query.And(Query.EQ("age", 20), Query.All("name", Query.Ascending));
So name field isn't part of the query so it won't be sorted by it, but what will be the behavior here? Will ALL documents in the collection be deserialized and then the age filter applied to them?
Or will the documents first be filtered by age=20 and only that sub-set of documents will be deserialized?
Is there any behavior difference from the above query and this one:
Query.And(Query.EQ("age", 20), Query.All("name", Query.Ascending));
Regarding 2. my sorting always need to be done by the "name" field, but this field might not be present in the query since the filter is dynamic. Based on your answer and that the results will always be sorted by the left query index, what if I always include Query.Contains("name", "") as the leftmost query? Would this be a workaround to force a sort on name column instead of needing to use Query.All?
Regarding 3. I figured out the problem it wasn't exactly what I though, but not sure why this happens. I was actually also filtering on one id column. So my document object id property is declared as:
public String id { get; set; }
But if I use say Query.EQ("address.id", id) this won't work, but using Query.EQ("address._id", id) does work. Is there any way I can get it to work with id instead of _id?
I wasn't using EnsureIndex in the id column since I thought it would be applied by default, but seems this is done to the virtual _id column instead. I'm guessing the identifier field name will always be named _id regardless the property name in the object is this correct?
Thanks
Hi @cardosso, sometimes my poor english is a big problem :)
1) Well, when you run a query, this search can be executed in 2 ways: indexed or full scan. Indexed query run over only keys-values in a skip-list structure. Do not deserialize documents to check. Full scan, otherwise, needs deserialize each document to "filter" your to check if match your value if document value.
When you use AND, only one index are used. Over this results, LiteDB will deserialize and apply filter for other. Lets see:
```C#
db.col.ensureIndex name
db.col.ensureIndex age
// use single index, find direct
Query.EQ("name", "john")
Query.StartsWith("name", "j")
// use index-scan (need visit all INDEX NODES - but not deserialize all documents)
Query.Contains("name", "hn")
// now, "age" has index, so will find (in index) only documents with Age=20.
// For those documents, will "filter-all" by "name".... it's like in SQL use "1 = 1".
// No index will be used and no Query.Ascending will be respected.
// Documents will be returned by age order (because use index)
Query.And(Query.EQ("age", 20), Query.All("name", Query.Ascending));
// Let invert and it's change everyhing
// Now, engine will find all documents by name (no filter, will deserialize all
// documents using name order) and than will be test each document
// with age=20 (no index will be used). This is the case is: both left/right
// has index and left has preference.
Query.And(Query.All("name", Query.Ascending), Query.EQ("age", 20));
if you want see "explain plain" of how your query will be executed, you can test in shell tool (using debug 16):
debug 16
db.col1.find name = "john" and age = 20
```
2) If you use Query.All("name") or Query.Contains("name", "") in left side of any Query.And you will deserialize all documents (in name order) to after this filter by your right query. Bad idea if you have too many documents. To this, i recommend you to do: use your filter with no skip/limit parameter. Get all data, apply OrderBy (from LinqToObject) and than use Skip/Take. If your user can pass like empty string in name, so all documents will be returned, use something like "limit 1000" in Find method. This will avoid deserialize all documents (ok, will not return all documents).
Using Skip/Limit with OrderBy are very expensive solution in any database. I was reading this article and agree with him (because I have same issue in litedb).
http://use-the-index-luke.com/no-offset
3) Yes, it's like a bug.... if your class contains an "Id" LiteDB auto convert to "_id"... make sense if you class will be used as a document, but in this case it's a sub-document. I guess you can rename this using [BsonField("Id")] attribute or using fluent api (BsonMapper.Global.Entity<YourSubDocument>().Field(x => x.Id, "Id").
There is no default index for sub-documents. Use EnsureIndex to use index (works with no index, but using full scan as I mentioned before).
Thank you for the explanation David and your English is perfectly fine :)
I think I have a much better understanding of how things work now, I didn't know the "debug 16" command, that's very useful. Is there any extra info about it?
Am I assuming correctly that in the query explanation "Seek" means an index search and "Filter" full scan?
Regarding ID vs _ID, it's not a big deal now that I'm aware of it, I wasn't aware the id index wasn't created automatically for sub-documents though. My db was created on v3 so since I was using them in my queries they ended up being created automatically so I never realized that.
If I annotate my id property with [BsonField("Id")] I don't need to call EnsureIndex right? I was never sure about that and always call EnsureIndex anyway.
Yes, "Seek" will find using index. "Filter" will do a full scan in documents and "Scan" will scan over index only (like in Query.Content)
About sub document, you still need use EnsureIndex in Id even if you use as BsonField("Id") attribute. In v4 only document _id is auto-created index. If you use shell, type db.cols.indexes to show all collection indexes. Slot 0 is always for _id document (root). All others must be manual created (using EnsureIndex).
I recommend you have a "Initialize" method that create index every time you change your schema. Like here:
I'm already doing that, just wasn't creating the id ones for sub-documents because I (wrongly) assumed they would always be created automatically. It actually makes perfect sense they aren't otherwise you would run out of indexes very soon (16 limit), it was pretty obvious actually I just never thought about it.
16 indexes per collection was a v1/2 limitation because each document must keep all pointers to indexes nodes (even if not used). In v3/4 it changes but I keep this limit because if you use more indexes, more expensive are to insert/update/delete. MongoDB uses 64 indexes per collection limit. Maybe LiteDB can support 32 (I need to some calcs if all fit in a single CollectionPage structure).
Yes, that would be very useful to me as I keep hitting the 16 limit (this is too short for any medium sized document allowing dynamic filters). I opened a issue about it earlier this year: #615
Hi David,
I was doing some experiences and I found the major bottleneck on my app isn't related to the actual queries, but iterating over the enumerations return by it and I really don't understand what's going on behind the scenes because this is really slow.
So for instance I do:
IEnumerable
This query will only return 50 elements and is pretty fast, less than 1 second, but then I need to populate an ObservableCollection with the results to display them on a DataGrid so:
foreach (ItemData item in list)
{
items.Add(item);
}
And this takes around 5 seconds(!) for 50 records only which doesn't make much sense to me, and doesn't make much difference if I limit the query to 50 or 1000 records. Is there any reason for this to happen?
I was looking at the code and I see you are using a yield return which could be the cause of the issue since you are forcing the creation of the state machines over and over, why not return the list directly?
See what I mean here:
https://stackoverflow.com/a/3970047
Thanks
EDIT: I made a test querying the engine directly and returning a list instead of using the Find method of the collection and the performance is good so the problem is definitely related with the yield calls.
Hi @cardosso, the problem is not only yield, but yield with long resultset an then an OrderBy. Where you call:
IEnumerable list = col.Find(query).OrderBy(x => x.name).Skip(0).Take(50);
You are not running nothing until your do a foreach loop or ToList(). You are running orderby over IEnumerable that always fetch values (not from memory, from disk). If you use this:
IEnumerable list = col.Find(query).ToList().OrderBy(x => x.name).Skip(0).Take(50).ToList();
Will run much more fast, because order by will run over in-memory documents.
But........ if you have a large collection, this can consume lot of memory (because ToList before Take/Skip).
So, I still thinking in how resolve this for next v5 - when I will implement internal order by. For now, I made this extensions that can "help" (still not best solution).
```C#
public static class PagingExtensions
{
public static List
{
var tmp = "tmp_" + Guid.NewGuid().ToString().Substring(0, 5);
var engine = db.Engine;
// insert unsorted result inside a temp collection
engine.InsertBulk(tmp, engine.Find(col.Name, query));
// create an index to order by this result
engine.EnsureIndex(tmp, "orderBy", orderByExpr);
var skip = pageIndex * pageSize;
// now, get all documents in temp orderBy expr with skip/limit
var sorted = engine.Find(tmp, Query.All("orderBy", order), skip, pageSize);
// convert docs to T entity
var list = new List<T>(sorted.Select(x => db.Mapper.ToObject<T>(x)));
// drop temp collection
engine.DropCollection(tmp);
return list;
}
}
```
This will order by in disk, avoiding consume lots of memory. I made an example in LiteDB.Demo project here https://github.com/mbdavid/LiteDB/commit/6054697855ca0f63535a768d864f4a837ed9a6b3
Hi David,
That solution partially solves the memory issue of loading all records into memory, but it introduces another big overhead which is the bulk insert and collection drop for each query.
On my personal scenario this is too high to live with, it seems pagination and litedb is very problematic with the current version due to the way things were designed so I guess we will have to wait and see if things will be better with v5. I think the lack of internal index sorting is the main problem, if you can address that most of the issues would go away.
One question about your extension method, wouldn't it be more perfomant if you move the EnsureIndex call to before the InsertBulk?
Right now I think the only option is to always read the full collection into memory on each query and then apply Skip/Take on it, this will be a memory hog unfortunately.
Also if the query returns a lot of records it will be somewhat slow and then every time you move to the next page you'll have to call that same query again which makes navigation not user friendly at all.
Loading everything into memory also solves this since then page navigation will never hit the db again, the drawback being the memory usage of course.
Hi @cardosso,
First, yes, it's better create index in collection before bulk insert. It's avoid read twice. I fixed in my last commit. I also add another way to do paging, without coping all document (copy only specified order column).
About sort results, I was reading about other databases in how they implement this. There are many solution and must be used according each scenario. If you has few documents, using in-memory sort are best solution. If you already has an index to sort, it's even better. But, if you have lots of documents to sort, you must use in-disk solution. SQL Server uses TempDB for that.
My idea in v5 is how detect when is better use index, in-memory or in-disk sort.
In my example, you can use memory disk and copy only sorted field, like this:
```C#
public static List
{
var tmp = "tmp_" + Guid.NewGuid().ToString().Substring(0, 5);
var expr = new BsonExpression(orderByExpr);
var disk = new StreamDiskService(new MemoryStream(), true);
// create in-memory database with large cache area
using (var engine = new LiteEngine(disk, cacheSize: 200000))
{
// create index in tmp collection on orderBy column
engine.EnsureIndex(tmp, "orderBy", orderByExpr);
// insert unsorted result inside a temp collection - only _id value and orderBy column
engine.InsertBulk(tmp, db.Engine.Find(col.Name, query).Select(x => new BsonDocument
{
["_id"] = x["_id"],
["orderBy"] = expr.Execute(x, true).First()
}));
var skip = pageIndex * pageSize;
// now, get all documents in temp orderBy expr with skip/limit
var sorted = engine.Find(tmp, Query.All("orderBy", order), skip, pageSize);
// convert docs to T entity
var list = new List<T>(sorted.Select(x => db.Engine.FindById(col.Name, x["_id"]))
.Select(x => db.Mapper.ToObject<T>(x)));
return list;
}
}
```
Hi @cardosso, I created a branch Find_Sort to create a much better solution. My new solution use a new Temp (memory/disk) media engine that support fast way to sort. Take a look on LiteDB.Demo project that show how use it and some compares with SQLite.
In 100.000 customer collection (fields: id, name, age), filtering by age = 22 (results 4730 documents - using index in age field) it's possible sort/paging results in 13ms (in SQLite is 8ms). If works fine I will add this in v4.1 version
Hi David,
What VS is needed to build LiteDB? I'm not able to open the project in VS2012 or VS2015 I get an error stating "The default XML namespace of the project must be the MSBuild XML namespace" which seems to point that VS2017 is needed?
I'm happy to test your new solution if you can build the DLL for me since I can't.
Also on your other solution this doesn't seem to compile:
engine.InsertBulk(tmp, db.Engine.Find(col.Name, query).Select(x => new BsonDocument
{
["_id"] = x["_id"],
["orderBy"] = expr.Execute(x, true).First()
}));
Problem with the new BsonDocument initialization, seems this requires .NET 4.5? I must stick with 4.0 due to XP support.
Hi @cardosso, to build LiteDB you need VS2017... because VS2017 has a better support for multi target framework compiler.
I attach here net40 version to test.
The method is FindSort and you can use like this:
C#
var result = _engine.FindSort(
"collectionName",
Query.EQ("Age", 22),
"$.name", // sort column
Query.Ascending,
skip,
limit);
It's a experimental version yet and I will change API to final release. For now are only implemented to BsonDocument. If you want convert to your class, you can use BsonMapper.ToObject . I'm thinking in release this in v.4.1 because v5 will be take a long time to be release.
Hi David,
Was just testing this and there performance is a lot better with this new method so I would say this would be a great addition since it will be a big improvement of what currently exists.
Unfortunately for my use case it still takes longer than what I needed so I think I will have to rely on loading everything into memory, the query duration now is reasonable, but having to wait that same amount of time every time you move to the next or previous page doesn't provide a good experience to the users.
My documents are rather large and with several nested documents so I'm sure this also takes a big role on how much time the loading takes to deserialize them. I don't really need to read all fields at this point though (only 4 or 5), but I don't think it's possible to only partially read a document right?
Hi @cardosso, in json it's not possible read partially documents. You always need full document... that's why is important try keeps document small as possible.
What size average is your document? Can you post a json example of 1? I will try in others databases to test how fast they can implement. I want test it.
I'm afraid I can't post an example since this is confidential data from a client and I don't want to risk having troubles, but it has over 30 fields and around 10 nested documents (It doesn't go deeper than one level though).
I was using MongoDB before and queries were pretty fast on it, but I needed an embedded solution so mongo wouldn't be an option and LiteDB is still the best solution.
Also due to the index limitation I ended up creating one extra field which is an array that will contain information for a lot of other fields so I can then filter on that array otherwise I wouldn't be able to perform the queries since 16 indexes are way too short, this will also have an impact obviously since I will be duplicating data.
It's not too bad to load everything into memory at this point, but it will probably start getting problematic once the amount of data grows.
Hi @cardosso, no problem about data.
This weekend I work exclusive to detect/fix performance issues about insert/query. I made more and more fixes to improve performance. And get great results. Insert processes was running in 4800ms (100.000 documents) and was possible reduce to 1500ms. Query now are running under 10ms.
But off course, it's nothing compare to mongo. I realise that my skip-list index structure are much more "expensive" to create (insert) than a comom b+tree. But, to upgrade no b-tree will be a huge step forward. Maybe next major version... but I don't know, b-tree are super hard to implement...
About your case: keep data in memory is not a big deal if you are using small range of documents and a single process. You can limit your results to not exceed some number.
Hi David, for my particular use case insert performance isn't an issue I can live with slow inserts since documents are mostly read only and inserts are done in batches at pre-defined times so it doesn't really matter if they take a long time to run.
Obviously it's great that you managed to improve the performance that much, because for other use cases that can be very important.
I'm intrigued how you are getting so fast queries though, 10ms is super fast. How many documents are you retrieving and how many fields do they have? I think at this point the performance bottleneck may very well be in reading the JSON data into memory (so bigger documents => larger times)
Hi @cardosso, in this example, I got 10ms with 100.000 documents inside a collection, filtering (using index) where age=22 that returns 4700 documents. After get this 4700, order all by Name field and get first 10 results with no offset/skip. My new computer is an i7-7gen with SSD disk.
But off course, larger documents (like yours) resulting more than 4700 after indexed filter will need much more time. Also, if you need skip lots of document, this will consume much more resources, because when you return only 10 documents with no skip, you need only 10 slots to order results. But if you skip 4000 documents, your slot must have 4010 spaces to order. In my example, skips 4000 rows increase time from 10ms to 45ms.
Hi @mbdavid
Just noticed that you recently released an alpha version of v5, congratulations on reaching the alpha milestone! I suppose v5 should solve my performance sorting issues since it supports in-memory sorting right?
If that's the case I'm looking forward to test it, although seems I won't be able to test it on my real life scenario yet because due to the dynamic nature of my queries I make extensive use of LinqKit's PredicateBuilder and according to the docs: "Currently not completely compatible with LinqKit PredicateBuilder"
Could you please provide a bit more info on this? Is it not supported at all? Only partially supported? And is full support for it currently on your roadmap?
Thank you in advance
Hi Cardoso, v5 still in alpha (will be beta soon) and will support PredicateBuilder on next release.
V5 support native sort, select, group by and include operations also a Sql language for queries
Thank you for your reply, I eagerly await for the next release then so I can test it and compare the performance against my current implementation using v4.
Hi @mbdavid,
Is there any documentation yet for v5?
I am interested to use the "native sort" capabilities you mentioned and I am using the v5 -beta
@freever, not yet... I'm writing this right now.... You can check ILiteQueryable api for now.
@freever, not yet... I'm writing this right now.... You can check
ILiteQueryableapi for now.
Thanks, it looks fantastic!
Hi @cardosso, sometimes my poor english is a big problem :)
- Well, when you run a query, this search can be executed in 2 ways: indexed or full scan. Indexed query run over only keys-values in a skip-list structure. Do not deserialize documents to check. Full scan, otherwise, needs deserialize each document to "filter" your to check if match your value if document value.
When you use AND, only one index are used. Over this results, LiteDB will deserialize and apply filter for other. Lets see:db.col.ensureIndex name db.col.ensureIndex age // use single index, find direct Query.EQ("name", "john") Query.StartsWith("name", "j") // use index-scan (need visit all INDEX NODES - but not deserialize all documents) Query.Contains("name", "hn") // now, "age" has index, so will find (in index) only documents with Age=20. // For those documents, will "filter-all" by "name".... it's like in SQL use "1 = 1". // No index will be used and no Query.Ascending will be respected. // Documents will be returned by age order (because use index) Query.And(Query.EQ("age", 20), Query.All("name", Query.Ascending)); // Let invert and it's change everyhing // Now, engine will find all documents by name (no filter, will deserialize all // documents using name order) and than will be test each document // with age=20 (no index will be used). This is the case is: both left/right // has index and left has preference. Query.And(Query.All("name", Query.Ascending), Query.EQ("age", 20));if you want see "explain plain" of how your query will be executed, you can test in shell tool (using debug 16):
> debug 16 > db.col1.find name = "john" and age = 20
- If you use Query.All("name") or Query.Contains("name", "") in left side of any Query.And you will deserialize all documents (in name order) to after this filter by your right query. Bad idea if you have too many documents. To this, i recommend you to do: use your filter with no skip/limit parameter. Get all data, apply OrderBy (from LinqToObject) and than use Skip/Take. If your user can pass like empty string in name, so all documents will be returned, use something like "limit 1000" in
Findmethod. This will avoid deserialize all documents (ok, will not return all documents).Using Skip/Limit with OrderBy are very expensive solution in any database. I was reading this article and agree with him (because I have same issue in litedb).
http://use-the-index-luke.com/no-offset
- Yes, it's like a bug.... if your class contains an "Id" LiteDB auto convert to "_id"... make sense if you class will be used as a document, but in this case it's a sub-document. I guess you can rename this using [BsonField("Id")] attribute or using fluent api (
BsonMapper.Global.Entity<YourSubDocument>().Field(x => x.Id, "Id").
There is no default index for sub-documents. Use EnsureIndex to use index (works with no index, but using full scan as I mentioned before).
This weirdness is a deal breaker. Software shouldn't act weirdly, that is First. Second, Software shouldn't act against common sense and usual rules, and so X (And) Y should equal the statement Y (And) X