Dgraph: Full text search?

Created on 15 Feb 2017 · 17Comments · Source: dgraph-io/dgraph

No plans to support full text search?

There's nothing on the roadmap.

I'm more than a little interested in DGraph as an alternative to PostgreSQL as the back-end of application state, but I recall the days of struggling to integrate something like Solr for full-text search at the application-level, and don't wish to go back there.

Any plans (or community initiative) to integrate a full-text search-engine?

mindplay-dk

All 17 comments

Hey @mindplay-dk,

Does term matching not work for you?
https://wiki.dgraph.io/Query_Language#Term_matching

What sort of queries do you intend to run?

manishrjain on 16 Feb 2017

What sort of queries do you intend to run?

Well, full-text search against article body content - presumably that would require a full-text search engine with language support and features such as stemming, stopwords, synonyms, etc. and special index-types.

It's not clear to me how indexing works - simply adding @index doesn't specify an index-type, so there's currently no way to specify index-types to optimize for different access patterns and query-types etc.?

You have just one type of index for all data-types? How does that work?

mindplay-dk on 16 Feb 2017

@mindplay-dk:

We have different tokenizers for different data types. We chose those tokenizers automatically. This is very basic, in the sense that you only get one tokenizer per data type.

Now we have changed that to allow multiple tokenizers for the same data type, so you can specify the tokenizer you want. This will be part of the release we're doing today. Currently, the string tokenizers we have are for breaking a phrase into terms and doing equality matches.

We could add more tokenizers to handle English language better with stemming, stopwords, etc. But then that won't scale to other languages.

We could allow a way by which the users can write their own tokenizer, specify that in the index, and we can use that. How does that sound?

manishrjain on 21 Feb 2017

We could allow a way by which the users can write their own tokenizer, specify that in the index, and we can use that. How does that sound?

Full text search is not only a matter of tokenization, the search itself requires processing of the search terms. Stemming, removal of stop words, dictionary replacements, and so on.

If you want to provide real text search, you should consider integrating a real FTS engine - there are, for example, numerous C ports of Lucene. We would need a language-annotation that can be applied at the field level.

Attempting to roll your own likely only means we'll need to integrate with a stand-alone search package (Solr, etc.) at the application level to get the search quality users expect today, which is really clumsy. FTS is a really big and complex domain and, in my opinion, not where you should be investing your time; picking an available open source library and integrating that instead is a much better use of your time.

mindplay-dk on 21 Feb 2017

Full text search is not only a matter of tokenization, the search itself requires processing of the search terms.

We do that. The tokenizer is applied on both the data and the search query. Otherwise, good point.

@tzdybal: Can you look into Bleve and see if we could use certain packages from it? We don't want to use their data storage layer, only the layer which analyzes the languages, generates the tokens, and finally does the same at the query level. This would also allow us to cut out ICU and go native Go.

manishrjain on 21 Feb 2017
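
To illustrate the point about applying the same processing to both the stored data and the search query, here is a minimal Go sketch. It is not Dgraph or Bleve code: `analyze`, `toyStem`, `matches`, and the stop-word list are hypothetical names invented for this example, and the suffix stripping only stands in for a real stemmer.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopWords is a tiny illustrative list, not a real language resource.
var stopWords = map[string]bool{"the": true, "a": true, "of": true, "and": true}

// toyStem strips a few English suffixes. It is a placeholder for a real stemmer.
func toyStem(term string) string {
	for _, suffix := range []string{"ing", "es", "s"} {
		if len(term) > len(suffix)+2 && strings.HasSuffix(term, suffix) {
			return strings.TrimSuffix(term, suffix)
		}
	}
	return term
}

// analyze lowercases the text, splits it on non-letter/non-digit runes,
// drops stop words, and stems what is left.
func analyze(text string) []string {
	fields := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
	var terms []string
	for _, f := range fields {
		if stopWords[f] {
			continue
		}
		terms = append(terms, toyStem(f))
	}
	return terms
}

// matches runs the same analyze function over the document and the query,
// then checks that every query term occurs among the document terms.
func matches(doc, query string) bool {
	have := map[string]bool{}
	for _, t := range analyze(doc) {
		have[t] = true
	}
	for _, t := range analyze(query) {
		if !have[t] {
			return false
		}
	}
	return true
}

func main() {
	doc := "The indexing of graphs"
	fmt.Println(analyze(doc))                  // [index graph]
	fmt.Println(matches(doc, "Graph Indexes")) // true
}
```

Because both sides pass through `analyze`, "Indexes" in the query and "indexing" in the document reduce to the same term; if only the stored text were analyzed, the raw query term would not match.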

I would also suggest bleve be looked at.
The FTS engine of Couchbase uses bleve to make a cluster-ready FTS.

Bleve can use numerous data stores too.

There are two ways to approach this too. You can run bleve separate from dgraph.
Then whenever you mutate data in dgraph pass it to bleve to do whatever index mapping you need. So when a FTS is required you call into bleve and it will return matching record IDs stored in dgraph.

Either way, I have found bleve to be fantastic and highly supported by Couchbase too.

joeblew99 on 21 Feb 2017
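
The "run it separately" approach described above can be sketched roughly as follows. This is a toy in-memory index written against the Go standard library only, not the Bleve API: `SidecarIndex`, `OnMutation`, and `Search` are hypothetical names, and tokenization is a plain lowercase whitespace split.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// SidecarIndex is a hypothetical stand-in for an external full-text index.
// It maps each term to the set of record IDs whose text contains it.
type SidecarIndex struct {
	postings map[string]map[uint64]bool
}

func NewSidecarIndex() *SidecarIndex {
	return &SidecarIndex{postings: map[string]map[uint64]bool{}}
}

// OnMutation is what the application would call after each write to the
// primary store, passing the record ID and the text to index.
func (s *SidecarIndex) OnMutation(id uint64, text string) {
	for _, term := range strings.Fields(strings.ToLower(text)) {
		if s.postings[term] == nil {
			s.postings[term] = map[uint64]bool{}
		}
		s.postings[term][id] = true
	}
}

// Search returns the sorted IDs of records that contain every query term.
// The caller would then fetch those records from the primary store.
func (s *SidecarIndex) Search(query string) []uint64 {
	terms := strings.Fields(strings.ToLower(query))
	if len(terms) == 0 {
		return nil
	}
	var ids []uint64
	for id := range s.postings[terms[0]] {
		ok := true
		for _, t := range terms[1:] {
			if !s.postings[t][id] {
				ok = false
				break
			}
		}
		if ok {
			ids = append(ids, id)
		}
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	idx := NewSidecarIndex()
	idx.OnMutation(1, "Graph databases store edges")
	idx.OnMutation(2, "Full text search in graph databases")
	fmt.Println(idx.Search("graph databases")) // [1 2]
	fmt.Println(idx.Search("text search"))     // [2]
}
```

The sketch never removes old terms when a record is updated or deleted, which hints at the bookkeeping an application-level integration has to take on to keep the two stores in sync.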

So when a FTS is required you call into bleve and it will return matching record IDs stored in dgraph.

This is what I would call a high-level integration.

I'd suggest a more low-level integration - you should be able to plan the net query better, if you can assess in advance the dimensions of other indices (of other fields) involved in the query, etc.

While a high-level and a low-level integration will likely provide the same convenience and client-facing features, a high-level integration will likely have net performance similar to an application-level integration with an external FTS service, whereas a low-level integration might be able to make some optimizations we can't make at the application level.

mindplay-dk on 21 Feb 2017

❤1

Bleve would need to be integrated in a way, where we still control how the data gets stored. We use our own mechanism for data storage, and all we need from Bleve is to do the tokenization for us, taking into account porter stemming, stop words, what not. So, we need the library part of Bleve; not the storage part.

Running anything outside of Dgraph is out of the question.

manishrjain on 21 Feb 2017

@manishrjain @mindplay-dk
I agree fully, and it would be awesome. DGraph needs only the library part.

I would be a happy chappy if the facets part of bleve is included too. It's very powerful.
In terms of GUI, it's an amazingly useful way to search for data when you have a ton of it.

joeblew99 on 21 Feb 2017

@joeblew99 facets would be killer, but doesn't need to arrive with the first feature release :-)

mindplay-dk on 21 Feb 2017

@mindplay-dk

It's actually a tiny amount of code:
https://github.com/blevesearch/bleve/tree/master/search/facet

https://github.com/blevesearch/bleve/search?p=1&q=facet

joeblew99 on 21 Feb 2017

❤1

I looked through the code of bleve. The separation of concerns is clear, the packages are very fine-grained, and the API seems reusable. It's easy to select only some of the functionality.

All we need is a tokenizer (probably Unicode) and some token filters. Natural candidates are: Lowercase, Stemmer, Stop Token.

@manishrjain: Replacing the current ICU tokenizer with a bleve-based solution (tokenizer+filters) should be straightforward.

@joeblew99 @mindplay-dk
From the code-level perspective, integration of facets building logic is definitely possible.

tzdybal on 22 Feb 2017
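
The "tokenizer plus token filters" shape mentioned above could look something like the sketch below. It uses only the Go standard library and does not reproduce Bleve's actual types; `Analyzer`, `TokenFilter`, `letterTokenizer`, `lowercase`, and `stopFilter` are hypothetical names, and the stop-word lists are tiny placeholders. A stemmer would be one more `TokenFilter` in the chain.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// TokenFilter transforms a token stream. Filters run in the order listed.
type TokenFilter func(tokens []string) []string

// Analyzer pairs one tokenizer with a chain of token filters.
type Analyzer struct {
	Tokenize func(text string) []string
	Filters  []TokenFilter
}

// Analyze tokenizes the text and passes the tokens through each filter.
func (a Analyzer) Analyze(text string) []string {
	tokens := a.Tokenize(text)
	for _, f := range a.Filters {
		tokens = f(tokens)
	}
	return tokens
}

// letterTokenizer splits on any rune that is not a letter or a digit.
func letterTokenizer(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
}

// lowercase is a filter that lowercases every token.
func lowercase(tokens []string) []string {
	out := make([]string, len(tokens))
	for i, t := range tokens {
		out[i] = strings.ToLower(t)
	}
	return out
}

// stopFilter builds a filter that drops the given stop words.
func stopFilter(stop ...string) TokenFilter {
	set := map[string]bool{}
	for _, s := range stop {
		set[s] = true
	}
	return func(tokens []string) []string {
		var out []string
		for _, t := range tokens {
			if !set[t] {
				out = append(out, t)
			}
		}
		return out
	}
}

func main() {
	// One analyzer per language: same tokenizer, different stop-word filter.
	en := Analyzer{letterTokenizer, []TokenFilter{lowercase, stopFilter("the", "of", "a")}}
	da := Analyzer{letterTokenizer, []TokenFilter{lowercase, stopFilter("og", "i", "en")}}
	fmt.Println(en.Analyze("The Roadmap of Dgraph")) // [roadmap dgraph]
	fmt.Println(da.Analyze("Søgning i en graf"))     // [søgning graf]
}
```

Keeping the filters as independent functions means per-language behaviour comes from swapping entries in the chain rather than from changing the tokenizer. Filter order matters here: the stop filter expects tokens that have already been lowercased.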

😄2

This looks awesome. Let's get on it.

manishrjain on 23 Feb 2017

👍3

Bleve integrated for full text search. Changes available in master.
Implemented features:

new functions for FTS matching
tokenization, UTF-normalization, stemming, stop words
support for multiple languages (stemmers and stop words lists)

For more FTS-related features, please feel free to open new GitHub issues.

tzdybal on 17 Mar 2017

👍3

Exciting news! :-)

Documentation updates pending?

mindplay-dk on 17 Mar 2017

Sorry, didn't notice this question. Here're the docs:
https://docs.dgraph.io/v0.7.5/query-language/#full-text-search

manishrjain on 21 Apr 2017

👍1

Fantastic!

dahankzter on 30 May 2017
