Hi!
I have the same issue as the one described here: #12819 (closed due to the lack of feedback requested by @clintongormley from the author @apanimesh061).
I discussed it here on Stack Overflow because I thought I had made a mistake in my parameters.
When using a shingle token filter after stopword removal, why is there no parameter to completely ignore the removed stopwords?
There would be two huge benefits for me:
I should add that those duplicated tokens are not removed by a "unique" token filter... and I don't understand why.
Here are my analysis settings (I kept only the "test" analyzer, which I used to generate the examples, but I have a few others):
"settings": {
"index": {
"analysis": {
"filter": {
"fr_stop": {
"ignore_case": "true",
"remove_trailing": "true",
"type": "stop",
"stopwords": "_french_"
},
"fr_worddelimiter": {
"type": "word_delimiter",
"language": "french"
},
"fr_snowball": {
"type": "snowball",
"language": "french"
},
"custom_nGram": {
"type": "ngram",
"min_gram": "3",
"max_gram": "10"
},
"custom_shingles": {
"max_shingle_size": "4",
"min_shingle_size": "2",
"token_separator": " ",
"output_unigrams": "true",
"filler_token":"",
"type": "shingle"
},
"custom_unique": {
"type": "unique",
"only_on_same_position": "true"
},
"fr_elision": {
"type": "elision",
"articles": [
"l",
"m",
"t",
"qu",
"n",
"s",
"j",
"d",
"c",
"jusqu",
"quoiqu",
"lorsqu",
"puisqu",
"parce qu",
"parcequ",
"entr",
"presqu",
"quelqu"
]
}
},
"charfilter": "html_strip",
"analyzer": {
"test": {
"filter": [
"asciifolding",
"lowercase",
"fr_stop",
"fr_elision",
"custom_shingles"
"trim",
"custom_unique",
],
"type": "custom",
"tokenizer": "standard"
}
},
"tokenizer": {
"custom_edgeNGram": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "3",
"type": "edgeNGram",
"max_gram": "20"
},
"custom_nGram": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "3",
"type": "nGram",
"max_gram": "20"
}
}
}
}
}
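For anyone who wants to reproduce this without the full configuration, here is a stripped-down sketch of how such settings can be applied at index creation time, keeping only the filters that matter for the problem (the index name is the one used in the _analyze call further down):

curl -XPUT 'localhost:9200/test_marches_publics' -d '
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "fr_stop": {
            "type": "stop",
            "stopwords": "_french_",
            "ignore_case": "true"
          },
          "custom_shingles": {
            "type": "shingle",
            "min_shingle_size": "2",
            "max_shingle_size": "4",
            "output_unigrams": "true",
            "filler_token": ""
          }
        },
        "analyzer": {
          "test": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "fr_stop",
              "custom_shingles"
            ]
          }
        }
      }
    }
  }
}'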
Hi @vchalmel
Could you provide a complete (and minimal) curl recreation showing exactly what you're doing, the results you're getting, and how they differ from what you expect?
Just use the analyze API:
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/indices-analyze.html
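For example, the request-body form looks roughly like this (a generic sketch using the built-in standard analyzer; to test a custom analyzer, target the index that defines it, as in the next comment):

curl -XGET 'localhost:9200/_analyze?pretty' -d '
{
  "analyzer": "standard",
  "text": "this is a test"
}'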
On Tue, 17 May 2016 at 12:06 CHALMEL [email protected] wrote:
Hi @clintongormley
I would be glad to... but I'm afraid I don't know how to get the detail of the generated tokens using curl; I always use kopf to test my analysis...
Hi @clintongormley
curl -XGET 'localhost:9200/test_marches_publics/_analyze' -d '
{
  "analyzer": "test",
  "text": "Scandale au pays de Candy"
}'
gives me, with the default "filler_token" ("_"):
{"tokens":[{"token":"scandale","start_offset":0,"end_offset":8,"type":"<ALPHANUM>","position":0},
{"token":"scandale _","start_offset":0,"end_offset":12,"type":"shingle","position":0},
{"token":"scandale _ pays","start_offset":0,"end_offset":16,"type":"shingle","position":0},
{"token":"scandale _ pays _","start_offset":0,"end_offset":20,"type":"shingle","position":0},
{"token":"_ pays","start_offset":12,"end_offset":16,"type":"shingle","position":1},
{"token":"_ pays _","start_offset":12,"end_offset":20,"type":"shingle","position":1},
{"token":"_ pays _ candy","start_offset":12,"end_offset":25,"type":"shingle","position":1},
{"token":"pays","start_offset":12,"end_offset":16,"type":"<ALPHANUM>","position":2},
{"token":"pays _","start_offset":12,"end_offset":20,"type":"shingle","position":2},
{"token":"pays _ candy","start_offset":12,"end_offset":25,"type":"shingle","position":2},
{"token":"_ candy","start_offset":20,"end_offset":25,"type":"shingle","position":3},
{"token":"candy","start_offset":20,"end_offset":25,"type":"<ALPHANUM>","position":4}]}
and with "filler_token":""
{"tokens":[{"token":"scandale","start_offset":0,"end_offset":8,"type":"<ALPHANUM>","position":0},
{"token":"scandale pays","start_offset":0,"end_offset":16,"type":"shingle","position":0},
{"token":"pays","start_offset":12,"end_offset":16,"type":"shingle","position":1},
{"token":"pays candy","start_offset":12,"end_offset":25,"type":"shingle","position":1},
{"token":"pays","start_offset":12,"end_offset":16,"type":"<ALPHANUM>","position":2},
{"token":"pays candy","start_offset":12,"end_offset":25,"type":"shingle","position":2},
{"token":"candy","start_offset":20,"end_offset":25,"type":"shingle","position":3},
{"token":"candy","start_offset":20,"end_offset":25,"type":"<ALPHANUM>","position":4}]}
I would want, for example (among other duplicates), only one "pays" token, plus a "scandale pays candy" token, which currently cannot be produced because the "ghosts" of the deleted stopwords push it past the max shingle size.
Another effect of those deleted-stopword "ghosts": the shingles contain several consecutive spaces (one per deleted stopword) instead of a single space.
What I would want:
{"tokens":[{"token":"scandale","start_offset":0,"end_offset":8,"type":"<ALPHANUM>","position":0},
{"token":"scandale pays","start_offset":0,"end_offset":16,"type":"shingle","position":0},
{"token":"scandale pays candy","start_offset":0,"end_offset":25,"type":"shingle","position":0}
{"token":"pays","start_offset":12,"end_offset":16,"type":"shingle","position":1},
{"token":"pays candy","start_offset":12,"end_offset":25,"type":"shingle","position":1},
{"token":"candy","start_offset":20,"end_offset":25,"type":"<ALPHANUM>","position":4}]}
N.B. I made a mistake (now edited) in my first post: in "custom_unique", "only_on_same_position" is set to true.
If this parameter is false, the "trim"/"custom_unique" filters do get rid of the duplicated tokens, but I don't want to delete identical tokens coming from distinct parts of the original text, and every other issue remains.
@vchalmel positions are used for phrase queries, offsets are used for highlighting. It's a mistake to try to use a shingled field for either of these purposes. A shingled field is used for finding associated words, and that's it. You should use a shingled field in a bool.should clause to boost the relevance of docs with matching fields.
Also, setting both min and max shingle size to 2 is usually all you need. Remove the unigrams.
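For example, a typical setup along those lines might look like this (a sketch; the field names "text" and "text.shingles" are hypothetical, the main field being analyzed normally and the sub-field with the shingle analyzer):

curl -XGET 'localhost:9200/test_marches_publics/_search?pretty' -d '
{
  "query": {
    "bool": {
      "must": {
        "match": { "text": "scandale au pays de candy" }
      },
      "should": {
        "match": { "text.shingles": "scandale au pays de candy" }
      }
    }
  }
}'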
I don't understand your objection... Is this only a use-case issue? Or am I deeply wrong about the role of those tokens that differ only by offset or position?
You also seem to consider that shingles should only be used in a complementary field to boost the score when associated words are found, but that is simply not my use case... I want an index with all these tokens, both unigrams and shingles of at least up to three words, because I will use them in a text-mining algorithm...
On the other hand, with min = max shingle size = 2, "fox to be quick" won't generate the shingle "fox quick".
And what about the issue with the extra spaces?
If I remove the unigrams and search on a single word, will it match?
No. But if you're using shingles as a secondary field for boosting, then the single word will match in the primary field you're searching on.
If "pays" is duplicated three times here, are you saying that a search on "pays" won't get a higher score on this document than on a document where it is not surrounded by stopwords and hence is not duplicated because tokens differing only by positions and offset are only affecting highlighting and phrase queries ?
Yes it will get a higher score. But then it is used more frequently, so it should get a higher score. If you don't want it to match when it is combined with different stopwords, then leave the stopwords there (or use a different filler token).
You also seem to consider that shingles should only be used in a complementary field to boost the score when associated words are found, but that is simply not my use case... I want an index with all these tokens, both unigrams and shingles of at least up to three words, because I will use them in a text-mining algorithm...
Ok... then perhaps your requirements are different, but I don't understand what your exact requirements are. All I'm saying is that this is working correctly as is. Token filters are not allowed to change positions or offsets. That job belongs solely to the tokenizer. This comes directly from Lucene and isn't something we can change in Elasticsearch.
Thanks for your explanations!
Token filters are not allowed to change positions or offsets. That job belongs solely to the tokenizer.
Oh that is something I failed to take into account...
That would prevent us from recalculating the size of the shingle to ignore deleted stopwords with the current formula, am I right?... But how about a parameter to prevent the creation of shingles that start or end with a deleted stopword, like these:
{"token":"pays","start_offset":12,"end_offset":16,"type":"shingle","position":1}, /*virtually start with deleted stopword "au"*/
{"token":"pays candy","start_offset":12,"end_offset":25,"type":"shingle","position":1},/*virtually start with deleted stopword "au"*/
{"token":"pays","start_offset":12,"end_offset":16,"type":"<ALPHANUM>","position":2}, /*this unigram is duplicate with first token because due to stopwords I have shingles of size 2 containing only 1 word*/
{"token":"pays candy","start_offset":12,"end_offset":25,"type":"shingle","position":2}, /*duplicate with second token*/
How can a shingle whose size differs from the number of words it actually contains be "working correctly as is"?
Hi,
I would like to bring this issue back, as I think one of the main reasons for wanting the stopwords completely removed has not been taken into account.
The main impact of not ignoring the stopword positions shows up when you have to aggregate on a shingle-analyzed field: you either end up with underscores in your results or with several whitespaces, so results that should be equal are treated as different values.
In fact, I don't see the point of being able to configure the token used to replace the stopword if the spaces are kept when you set "" as the filler token.
Let's take an easy example of a nice functionality that could be achieved via shingles and aggregations, but cannot be done properly because stopword positions are not ignored. If I want a tag cloud of the words and groups of words appearing in a text field, I could analyze the field with shingles (say, sizes 1-3) and then run an aggregation on that field (setting fielddata=true), as sketched below. However, I don't need the stopwords in a tag cloud, and removing them means that, for example, "word1 stopword word2" and "word1 word2" produce different aggregation values ("word1  word2" and "word1 word2"), which may look the same but one has two whitespaces and the other just one.
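Something like this, for instance (a sketch; the index name "my_index" and the field "description.shingles" are made up and stand for whatever shingle-analyzed field you aggregate on, with fielddata enabled):

curl -XGET 'localhost:9200/my_index/_search?pretty' -d '
{
  "size": 0,
  "aggs": {
    "tag_cloud": {
      "terms": {
        "field": "description.shingles",
        "size": 50
      }
    }
  }
}'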
I'm facing this same problem while trying to implement an Amazon-like autocomplete based on an n-gram search and an aggregation over a shingle-analyzed field. The only reasonable way I found to achieve what I need is to generate an additional field, with the stopwords already removed, before sending the document to Elasticsearch, so that they are not taken into account at all. And that is only because, when I remove the stopwords in the analyzer, they are removed but their positions are still taken into account when generating the aggregation results.
Therefore, since there is already a "filler_token" parameter, I think it makes sense that setting "" as the filler should also remove the multiple spaces resulting from the stopword removal.
Hi,
you can remove the redundant whitespaces by using a pattern_replace token filter like so:
"whitespace_remove": {
  "type": "pattern_replace",
  "pattern": " +",
  "replacement": " "
}
However, you'll still have the same issue @vchalmel has.
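To be clear about the placement: the pattern_replace token filter has to run after the shingle filter, otherwise there is nothing to collapse yet. With the "test" analyzer from the first post, the chain would look roughly like this (a sketch):

"test": {
  "type": "custom",
  "tokenizer": "standard",
  "filter": [
    "asciifolding",
    "lowercase",
    "fr_stop",
    "fr_elision",
    "custom_shingles",
    "whitespace_remove",
    "trim",
    "custom_unique"
  ]
}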
Thanks for the suggestion @fhinze. I would still prefer something like the "preserve_position_increments=false" option from the completion suggester, but the "pattern_replace" filter combined with a "trim" filter worked perfectly to remove the whitespaces left by the stopwords.
@vchalmel Were you able to find a workaround for this issue? I am facing the same issue when using shingles and stop_words together and want to dedupe shingles based on start_offset.
Dear all,
I am facing the same issue. I need to remove all of the shingles that contain fillers. How can I do this?
The multiple-spaces problem can be solved using a pattern_replace filter:
"space_filter": {
  "type": "pattern_replace",
  "pattern": " +",
  "replacement": " "
}
This will collapse multiple spaces into a single space.