No plans to support full text search?
There's nothing on the roadmap.
I'm more than a little interested in DGraph as an alternative to PostgreSQL as the back-end of application state, but I recall the days of struggling to integrate something like Solr for full-text search at the application-level, and don't wish to go back there.
Any plans (or community initiative) to integrate a full-text search-engine?
Hey @mindplay-dk,
Does term matching not work for you?
https://wiki.dgraph.io/Query_Language#Term_matching
What sort of queries do you intend to run?
What sort of queries do you intend to run?
Well, full-text search against article body content - presumably that would require a full-text search engine with language support and features such as stemming, stopwords, synonyms, etc. and special index-types.
It's not clear to me how indexing works - simply adding @index doesn't specify an index-type, so there's currently no way to specify index-types to optimize for different access patterns and query-types etc.?
You have just one type of index for all data-types? How does that work?
@mindplay-dk:
We have different tokenizers for different data types. We chose those tokenizers automatically. This is very basic, in the sense that you only get one tokenizer per data type.
We have now changed that to allow multiple tokenizers for the same data type, so you can specify the tokenizer you want. It will be part of the release we're doing today. Currently, the string tokenizers we have are for breaking a phrase into terms and for doing equality matches.
We could add more tokenizers to handle English language better with stemming, stopwords, etc. But then that won't scale to other languages.
We could allow a way by which the users can write their own tokenizer, specify that in the index, and we can use that. How does that sound?
We could allow a way by which the users can write their own tokenizer, specify that in the index, and we can use that. How does that sound?
Full text search is not only a matter of tokenization, the search itself requires processing of the search terms. Stemming, removal of stop words, dictionary replacements, and so on.
If you want to provide real text search, you should consider integrating a real FTS engine - there are, for example, numerous C ports of Lucene. We would need a language-annotation that can be applied at the field level.
Attempting to roll your own likely only means we'll still need to integrate with a stand-alone search package (Solr, etc.) at the application level to get the search quality users expect today, which is really clumsy. FTS is a really big and complex domain and, in my opinion, not where you should be investing your time; picking an available open-source library and integrating that instead is a much better use of your time.
Full text search is not only a matter of tokenization, the search itself requires processing of the search terms.
We do that. The tokenizer is applied on both the data and the search query. Otherwise, good point.
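A minimal sketch of what "applied on both the data and the search query" means, assuming a very simple tokenizer. The names `analyze` and `matchesAll` are hypothetical and are not Dgraph's API; the point is only that one function processes both sides, so stored values and search terms end up in the same normalized form.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// analyze is a hypothetical stand-in for a tokenizer: it splits text on
// anything that is not a letter or digit and lowercases each term.
func analyze(text string) []string {
	terms := strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	for i, t := range terms {
		terms[i] = strings.ToLower(t)
	}
	return terms
}

// matchesAll reports whether every analyzed query term also occurs among the
// analyzed terms of the stored value. Both sides go through analyze.
func matchesAll(stored, query string) bool {
	indexed := make(map[string]bool)
	for _, t := range analyze(stored) {
		indexed[t] = true
	}
	for _, t := range analyze(query) {
		if !indexed[t] {
			return false
		}
	}
	return true
}

func main() {
	body := "Full-Text Search in Graph Databases"
	fmt.Println(matchesAll(body, "graph SEARCH")) // true
	fmt.Println(matchesAll(body, "solr"))         // false
}
```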
@tzdybal: Can you look into Bleve and see if we could use certain packages from it? We don't want to use their data storage layer, only the layer which analyzes the languages, generates the tokens, and finally does the same at the query level. This would also allow us to cut out ICU and go native Go.
I would also suggest bleve be looked at.
Couchbase's FTS engine uses bleve to provide cluster-ready FTS.
Bleve can use numerous data stores too.
There are two ways to approach this, too. You can run bleve separately from dgraph.
Then whenever you mutate data in dgraph pass it to bleve to do whatever index mapping you need. So when a FTS is required you call into bleve and it will return matching record IDs stored in dgraph.
Either way, I have found bleve to be fantastic, and it is well supported by Couchbase too.
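The "run it separately" approach could be sketched roughly as below. Everything here is hypothetical: `System`, `Mutate` and `Search` are illustrative names, a plain map stands in for dgraph, and a tiny in-memory postings map stands in for bleve. It only shows the flow of forwarding each mutation to the index and getting record IDs back from a search.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// System is a hypothetical application-level wrapper: records stands in for
// the primary store (dgraph), postings for the external search index (bleve).
type System struct {
	records  map[uint64]string
	postings map[string]map[uint64]bool
}

func NewSystem() *System {
	return &System{
		records:  map[uint64]string{},
		postings: map[string]map[uint64]bool{},
	}
}

// Mutate writes the record, then forwards it to the search index.
// Removing stale terms when an existing ID is updated is omitted here.
func (s *System) Mutate(id uint64, body string) {
	s.records[id] = body
	for _, term := range strings.Fields(strings.ToLower(body)) {
		if s.postings[term] == nil {
			s.postings[term] = map[uint64]bool{}
		}
		s.postings[term][id] = true
	}
}

// Search asks the index for matching record IDs only; the caller then loads
// the records themselves from the primary store.
func (s *System) Search(term string) []uint64 {
	var ids []uint64
	for id := range s.postings[strings.ToLower(term)] {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	s := NewSystem()
	s.Mutate(1, "Graph databases store relationships")
	s.Mutate(2, "Relational databases store rows")
	s.Mutate(3, "Full text search needs an index")
	for _, id := range s.Search("databases") {
		fmt.Println(id, s.records[id])
	}
}
```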
So when a FTS is required you call into bleve and it will return matching record IDs stored in dgraph.
This is what I would call a high-level integration.
I'd suggest a more low-level integration - you should be able to plan the net query better, if you can assess in advance the dimensions of other indices (of other fields) involved in the query, etc.
While high-level and low-level integrations will likely provide the same convenience and client-facing features, a high-level integration will likely have net performance similar to an application-level integration with an external FTS service, whereas a low-level integration might be able to make some optimizations we can't make at the application level.
Bleve would need to be integrated in a way, where we still control how the data gets stored. We use our own mechanism for data storage, and all we need from Bleve is to do the tokenization for us, taking into account porter stemming, stop words, what not. So, we need the library part of Bleve; not the storage part.
Running anything outside of Dgraph is out of the question.
@manishrjain @mindplay-dk
I agree fully, and it would be awesome. Dgraph needs only the library part.
I would be a happy chappy if the facets part of bleve is included too. It's very powerful.
In terms of GUI, it's an amazingly useful way to search for data when you have a ton of it.
@joeblew99 facets would be killer, but doesn't need to arrive with the first feature release :-)
@mindplay-dk
It's actually a tiny amount of code:
https://github.com/blevesearch/bleve/tree/master/search/facet
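For readers unfamiliar with facets, here is a rough sketch of the idea behind a terms facet: count how many search hits carry each value of a field and report the most common values. This is not bleve's implementation or API; `Hit`, `TermCount` and `termsFacet` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a hypothetical search result carrying one facet field value.
type Hit struct {
	ID       uint64
	Category string
}

// TermCount pairs a facet value with the number of hits that carry it.
type TermCount struct {
	Term  string
	Count int
}

// termsFacet counts hits per category and returns the top n values, ordered
// by descending count and then alphabetically so the result is deterministic.
func termsFacet(hits []Hit, n int) []TermCount {
	counts := make(map[string]int)
	for _, h := range hits {
		counts[h.Category]++
	}
	out := make([]TermCount, 0, len(counts))
	for term, c := range counts {
		out = append(out, TermCount{Term: term, Count: c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Term < out[j].Term
	})
	if n >= 0 && len(out) > n {
		out = out[:n]
	}
	return out
}

func main() {
	hits := []Hit{
		{1, "news"}, {2, "blog"}, {3, "news"},
		{4, "docs"}, {5, "news"}, {6, "blog"},
	}
	for _, f := range termsFacet(hits, 2) {
		fmt.Printf("%s: %d\n", f.Term, f.Count)
	}
}
```

In a GUI, counts like these are what typically drive the "narrow by category" sidebar next to the result list.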
I looked through the code of bleve. The separation of concerns is clear, packages are very fine-grained, API seems reusable. It's easy to select only some of the functionalities.
All we need is tokenizer (probably Unicode) and some token filters. Natural candidates are: Lowercase, Stemmer, Stop Token.
@manishrjain: Replacing current ICU tokenizer with bleve based solution (tokenizer+filters) should be straightforward.
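To illustrate the tokenizer-plus-filters shape described above, here is a self-contained sketch that uses only the Go standard library rather than bleve's packages. The `TokenFilter` type, the stop-word list and `naiveStem` are all assumptions made for the example; in particular, `naiveStem` is a crude placeholder and not a real stemming algorithm.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// TokenFilter transforms a token stream; filters are applied in order.
type TokenFilter func([]string) []string

// tokenize splits on any rune that is not a Unicode letter or digit.
func tokenize(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// lowercase normalizes the case of every token.
func lowercase(tokens []string) []string {
	out := make([]string, len(tokens))
	for i, t := range tokens {
		out[i] = strings.ToLower(t)
	}
	return out
}

// stopTokens returns a filter that drops the listed words.
func stopTokens(words ...string) TokenFilter {
	stop := make(map[string]bool, len(words))
	for _, w := range words {
		stop[w] = true
	}
	return func(tokens []string) []string {
		out := make([]string, 0, len(tokens))
		for _, t := range tokens {
			if !stop[t] {
				out = append(out, t)
			}
		}
		return out
	}
}

// naiveStem strips a couple of English suffixes. It is a placeholder for a
// real stemmer and will get many words wrong.
func naiveStem(tokens []string) []string {
	out := make([]string, len(tokens))
	for i, t := range tokens {
		for _, suffix := range []string{"ing", "s"} {
			if len(t) > len(suffix)+2 && strings.HasSuffix(t, suffix) {
				t = strings.TrimSuffix(t, suffix)
				break
			}
		}
		out[i] = t
	}
	return out
}

// analyze runs the tokenizer, then each filter in sequence.
func analyze(text string, filters ...TokenFilter) []string {
	tokens := tokenize(text)
	for _, f := range filters {
		tokens = f(tokens)
	}
	return tokens
}

func main() {
	filters := []TokenFilter{lowercase, stopTokens("the", "a", "of", "is"), naiveStem}
	fmt.Println(analyze("The indexing of articles", filters...)) // [index article]
	fmt.Println(analyze("index the ARTICLE", filters...))        // [index article]
}
```

Because the stored text and the query reduce to the same tokens, a term lookup on the analyzed form finds the article even though the surface words differ.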
@joeblew99 @mindplay-dk
From the code-level perspective, integration of facets building logic is definitely possible.
This looks awesome. Let's get on it.
Bleve integrated for full text search. Changes available in master.
Implemented features:
For more FTS-related features, please feel free to open new GitHub issues.
Exciting news! :-)
Documentation updates pending?
Sorry, didn't notice this question. Here're the docs:
https://docs.dgraph.io/v0.7.5/query-language/#full-text-search
Fantastic!