In the fullTextSearch.js example, searching is done using this query:
db.emails.where("messageWords").startsWithIgnoreCase("v").distinct().toArray(function (a) {
However, it is only possible to search for one word in messageWords with this query. Is there a way to use this for multi-word search queries? I'm trying to search very large documents, so Collection.filter is not fast enough.
Just use startsWithAnyOf(["word1", "word2", ...]) or startsWithAnyOfIgnoreCase(["word1", "word2", ...]). And if you want, you can populate the messageWords field with the words from multiple properties instead of just a single one. That way you can have a single query search for all documents containing any of the searched words in any of the attributes.
Sorry for not being more clear, I'm trying to limit the result set to items that contain all of the words, so that I could run something like db.emails.where("messageWords").allOf(["a", "b", "c"]).distinct().toArray() and find records where messageWords contains the items "a", "b", and "c". Is there a way to do this?
Ok. Since your documents are large, filter() is not ideal as you've also concluded. I would suggest the following:
function find (prefixes) {
return db.transaction('r', db.docs, function*() {
// Parallell search for all prefixes - just select resulting primary keys
const results = yield Dexie.Promise.all (prefixes.map(prefix =>
db.docs
.where('words')
.startsWith(prefix)
.primaryKeys()));
// Intersect result set of primary keys
const reduced = results
.reduce ((a, b) => {
const set = new Set(b);
return a.filter(k => set.has(k));
});
// Finally select entire documents from intersection
return yield db.docs.where(':id').anyOf (reduced).toArray();
});
}
// Sample use of the function:
find(['animal', 'food', 'nutrition']).then(docs => {
console.table(docs);
}).catch (err => {
console.error (err.stack || err);
});
This sample will query on indexes and prinary keys only and take out the intersection of all searched words. Finally when the intersection is known, the actual documents are accessed. It also utilize the new indexeddb getAllKeys () api method so it should be maximum performant. You could change startsWith to startsWithIgnoreCase also, but then getAllKeys() wont be possible to use internally. Don't know how much this would matter performance-wise, but I believe it's better to stick to just startsWith() or equals() and instead make sure you are storing 'words' in lowercase version and always to toLowerCase() before calling find().
Note that the result will be ordered by primary key. If primary key is incremented, this means you'll receive the oldest documents first. To reverse that, just add .reverse() to the final document selection query. Optionally with a limit() as well.
Thanks, that's really useful!
@dfahlander still recommended approach?
@negamaxi It's still a logical and databaseish way of searching for multiple words in a full-text manner. There are other strategies as well and I haven't yet found out which one is most performant. It might be the case that it's better to avoid roundtrips as much as possible, even though we're talking to to a local database (as event dispatching interrupts the GUI thread). Or, it could be best to do these request-intensive stuff from a Worker thread. I'm planning to create some performance tests for various scenarios (join operations, anyOf() and full-text) to find out which strategy is the best. This will hopefully result in some recommendations and / or maybe some new built-in features or addons.
One thing that I will be going to test, is whether it's faster to just query them all like this:
const docs = flatten(await Promise.all(
prefixes.map(prefix =>
db.docs.where('words').startsWith(prefix)
)
));
..and from there, filter in-memory out the result. Might occupy more memory though.
To add on to this, I've been using this approach for a while now, and it seems to work pretty well. For ~3k documents, with ~1500 tokens per document (so around 4 million tokens total), most searches take a couple hundred ms, which seems like a reasonable amount of time. When I originally implemented this, equals turned out to be significantly faster than startsWith, so you might want to try that as well (although it's possible that it's changed since then).
@dfahlander What is that flatten() function you are using in your comment above? Is that the same as flat() as described here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/flat
yes
How to implement a search not for a prefix or suffix, but for a pattern in middle of a string, like relational databases do it with %pattern% ?
We would like to make this full text search through 3 different stores like: categories, filters, filterOptions
A database engine that allows %pattern% needs to do a full table scan to find the result. In Dexie you can do
db.friends.filter(f => /pattern/.test(f.name))
or
db.friends.toArray().then(result => result.filter(f => /pattern/.test(f.name))
...which is basically the same but a bit faster since the raw query can use IDBObjectStore.getAll() rather than IDBObjectStore.openCursor() which is slower.
But if you really need performance on searches in large tables, there's an algorithm for that trigram search. There are people that have built trigram index on top of dexie but of my knowledge no ready-to-use library or addon. The idea is to use a multiEntry index and instead of indexing all the words as we are discussing in this issue, index each three-letter combination.
To eliminate the few false positives that may occur in the result from a trigram search based on multiEntry index in Dexie, you can add a manual filter on the result.
Some databases have built-in support for this, or modules that does it (like pg_trm for postgresql)
I'm grateful for your response @dfahlander . It helps so much!!
That's what I was looking for!
I think the both approachs will be sufficient in terms of performance, because our db is not so big, it's like 30k of items.
Most helpful comment
Ok. Since your documents are large, filter() is not ideal as you've also concluded. I would suggest the following:
This sample will query on indexes and prinary keys only and take out the intersection of all searched words. Finally when the intersection is known, the actual documents are accessed. It also utilize the new indexeddb getAllKeys () api method so it should be maximum performant. You could change
startsWithtostartsWithIgnoreCasealso, but then getAllKeys() wont be possible to use internally. Don't know how much this would matter performance-wise, but I believe it's better to stick to just startsWith() or equals() and instead make sure you are storing 'words' in lowercase version and always to toLowerCase() before calling find().Note that the result will be ordered by primary key. If primary key is incremented, this means you'll receive the oldest documents first. To reverse that, just add
.reverse()to the final document selection query. Optionally with alimit()as well.