Elasticsearch-dsl-py: Can't understand "word_delimiter" token filter

Created on 19 Nov 2018 · 7 Comments · Source: elastic/elasticsearch-dsl-py

elasticsearch-dsl==6.2.1
elasticsearch==6.3.1

I have this analyzer:

test_analyzer = analyzer(
    'test_analyzer',
    tokenizer=tokenizer('trigram', 'edge_ngram', min_gram=3, max_gram=10),
    filter=['lowercase', 'word_delimiter']
)

This analyzer is used in a Document:

class TestIndex(Document):
    name = Text(analyzer=test_analyzer)
    id = Integer()

    class Index:
        name = 'test-index'

Objects in my index have the following pattern in their name:

word1:word2:word3:word4

(there are a lot of ":" characters in the name)

As I understand the ES docs, this analyzer should let me search my objects by sub-words (like word2 in my example), but in practice the search only works when the query contains almost the full name.

My Search request is:

search = Search(
    index='test-index'
).query(
    "multi_match",
    query="word2",
    fields=['name'],
    fuzziness='AUTO'
)

(returns nothing)

Most helpful comment

The problem is with your tokenizer, which produces just edge n-grams of length 3-10 from the original input (word1:word2:word3). In your case:

wor
word
word1
word1:
word1:w
word1:wo
word1:wor
word1:word

which is not particularly useful, I believe. I would recommend you play around with the _analyze API (0) to find an analyzer that does what you want. In this case I believe you want the simple_pattern_split tokenizer (1) instead:

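from elasticsearch import Elasticsearch
from elasticsearch_dsl import tokenizer

es = Elasticsearch()  # assumes a node reachable at the client's default address

# run only the tokenizer against sample text to inspect the tokens it emits
es.indices.analyze(body={
    'text': 'word1:word2:word3',
    'tokenizer': tokenizer('split_words', 'simple_pattern_split', pattern=':').get_definition()
})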

Hope this helps!

0 - https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
1 - https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simplepatternsplit-tokenizer.html
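
To see why a query for word2 finds nothing with the original tokenizer, both tokenizers can be approximated in plain Python. This is only a rough sketch: `edge_ngrams` and `split_on` are hypothetical helper names, not part of Elasticsearch or elasticsearch-dsl. It ignores token filters, assumes the edge_ngram tokenizer treats the whole input as a single chunk (no token_chars configured), and assumes empty pieces are dropped when splitting.

```python
def edge_ngrams(text, min_gram=3, max_gram=10):
    """Prefixes of `text` whose length is between min_gram and max_gram."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def split_on(text, separator=":"):
    """Split on a literal separator, dropping empty pieces (assumed behaviour)."""
    return [piece for piece in text.split(separator) if piece]


name = "word1:word2:word3"
prefix_tokens = edge_ngrams(name)
split_tokens = split_on(name)

print(prefix_tokens)             # every token is a prefix of the full name
print("word2" in prefix_tokens)  # is word2 a token of its own?
print(split_tokens)
print("word2" in split_tokens)
```

Every prefix token starts at the beginning of the name, so word2 never shows up as a token of its own, whereas splitting on ':' yields it directly.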

All 7 comments

1. I created this analyzer with simple_pattern_split.
[screenshot 2018-11-19 at 23 27 27]

2. Then I inserted an object with ':' in its name.
[screenshot 2018-11-19 at 23 27 57]

3. I tried _analyze with my analyzer and, I suppose, it works.
[screenshot 2018-11-19 at 23 28 19]

4. But when I try to search for the object by its second part, I get nothing.
[screenshot 2018-11-19 at 23 28 40]

5. Maybe my search is wrong, and it's impossible to combine fuzziness with such an analyzer?

It is very much possible; in fact, it is impossible to use fuzziness without an analyzer.

This works just fine for me:

from elasticsearch_dsl import analyzer, tokenizer, connections, Document, Text

# connect
connections.create_connection()

test_analyzer = analyzer(
    'testing-1061',
    tokenizer=tokenizer('split_words', 'simple_pattern_split', pattern=':'),
    filter=['lowercase']
)

class D(Document):
    content = Text(analyzer=test_analyzer)
    class Index:
        name = 'test-1061'

# create the index
D.init()

D(content='word1:word2:word3').save(refresh=True)

assert 1 == D.search().query('match', content='word2').count()
assert 0 == D.search().query('match', content='wordX').count()
assert 1 == D.search().query('multi_match', query='wordX', fields=['content'], fuzziness=1).count()

Can you perhaps share your complete code? Are you creating your mappings and attaching the correct analyzer?
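
The third assertion above passes because wordX is a single edit away from each indexed term. As a rough illustration of what fuzziness=1 tolerates, here is a plain Levenshtein distance in Python. `edit_distance` is a hypothetical helper, not an Elasticsearch API, and it is a simplification: by default Elasticsearch also counts a swap of two adjacent characters as one edit, which this sketch does not.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


# terms produced by splitting 'word1:word2:word3' on ':'
indexed_terms = ["word1", "word2", "word3"]
distances = {term: edit_distance("wordX", term) for term in indexed_terms}
print(distances)
```

A query term is a fuzzy match when its distance to an indexed term does not exceed the allowed fuzziness, which is why the exact match query for wordX finds nothing while the fuzzy one finds the document.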

Oh, after posting my comment I realized that I hadn't created the mapping. Now it works, at least in Postman; next I'll try to use it in Python. Thank you so much, you've saved me a day! <3

Sorry for re-opening, but it is still behaving strangely in Python...

I copy-pasted your code in full, except for the connection: I have connections.create_connection(hosts=['localhost:32770'], timeout=20) with ES (6.4.3) in Docker, and the first test is failing... I even deleted the index with DELETE localhost:32770/test-1061*. After that I ran Django, and this is what I get in the console: AssertionError on assert 1 == D.search().query('match', content='word2').count()

Can you just print out the result instead of the assert? My guess would be that the code is accidentally run twice, so the document is duplicated. Or just print out the hits instead of just a count().

You are right about duplicates, sorry :( Now everything is fine in Python too.
