Hi!
I have the same issue as the one described here: #12819 (closed due to the lack of feedback requested by @clintongormley from the author @apanimesh061).
I discussed it here on Stack Overflow because I thought I had made a mistake in my parameters.
When using a shingle token filter after stopword removal, why is there no parameter to completely ignore the removed stopwords?
There would be two huge benefits for me:
I should add that those duplicated tokens are not removed by a "unique" token filter... and I don't understand why.
Here are my analysis settings (I kept only the "test" analyzer, which I used to generate the examples, but I have a few others):
"settings": {
"index": {
"analysis": {
"filter": {
"fr_stop": {
"ignore_case": "true",
"remove_trailing": "true",
"type": "stop",
"stopwords": "_french_"
},
"fr_worddelimiter": {
"type": "word_delimiter",
"language": "french"
},
"fr_snowball": {
"type": "snowball",
"language": "french"
},
"custom_nGram": {
"type": "ngram",
"min_gram": "3",
"max_gram": "10"
},
"custom_shingles": {
"max_shingle_size": "4",
"min_shingle_size": "2",
"token_separator": " ",
"output_unigrams": "true",
"filler_token":"",
"type": "shingle"
},
"custom_unique": {
"type": "unique",
"only_on_same_position": "true"
},
"fr_elision": {
"type": "elision",
"articles": [
"l",
"m",
"t",
"qu",
"n",
"s",
"j",
"d",
"c",
"jusqu",
"quoiqu",
"lorsqu",
"puisqu",
"parce qu",
"parcequ",
"entr",
"presqu",
"quelqu"
]
}
},
"charfilter": "html_strip",
"analyzer": {
"test": {
"filter": [
"asciifolding",
"lowercase",
"fr_stop",
"fr_elision",
"custom_shingles"
"trim",
"custom_unique",
],
"type": "custom",
"tokenizer": "standard"
}
},
"tokenizer": {
"custom_edgeNGram": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "3",
"type": "edgeNGram",
"max_gram": "20"
},
"custom_nGram": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "3",
"type": "nGram",
"max_gram": "20"
}
}
}
}
}
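For anyone who wants to reproduce this without the full configuration, here is a stripped-down sketch of how such settings can be applied at index creation time, keeping only the filters that matter for the problem (the index name is the one used in the _analyze call further down):

curl -XPUT 'localhost:9200/test_marches_publics' -d '
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "fr_stop": {
            "type": "stop",
            "stopwords": "_french_",
            "ignore_case": "true"
          },
          "custom_shingles": {
            "type": "shingle",
            "min_shingle_size": "2",
            "max_shingle_size": "4",
            "output_unigrams": "true",
            "filler_token": ""
          }
        },
        "analyzer": {
          "test": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "fr_stop",
              "custom_shingles"
            ]
          }
        }
      }
    }
  }
}'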
Hi @vchalmel
Could you provide a complete (and minimal) curl recreation showing exactly what you're doing, the results you're getting, and how they differ from what you expect?
Just use the analyze API:
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/indices-analyze.html
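For example, the request-body form looks roughly like this (a generic sketch using the built-in standard analyzer; to test a custom analyzer, target the index that defines it, as in the next comment):

curl -XGET 'localhost:9200/_analyze?pretty' -d '
{
  "analyzer": "standard",
  "text": "this is a test"
}'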
On Tue, 17 May 2016 at 12:06 CHALMEL [email protected] wrote:
Hi @clintongormley
I would be glad to... but I'm afraid I don't know how to get the detail of the generated tokens using curl; I always use kopf to test my analysis...
Hi @clintongormley
curl -XGET 'localhost:9200/test_marches_publics/_analyze' -d '
{
  "analyzer": "test",
  "text": "Scandale au pays de Candy"
}'
gives me, with the default "filler_token" ("_"):
{"tokens":[{"token":"scandale","start_offset":0,"end_offset":8,"type":"<ALPHANUM>","position":0},
{"token":"scandale _","start_offset":0,"end_offset":12,"type":"shingle","position":0},
{"token":"scandale _ pays","start_offset":0,"end_offset":16,"type":"shingle","position":0},
{"token":"scandale _ pays _","start_offset":0,"end_offset":20,"type":"shingle","position":0},
{"token":"_ pays","start_offset":12,"end_offset":16,"type":"shingle","position":1},
{"token":"_ pays _","start_offset":12,"end_offset":20,"type":"shingle","position":1},
{"token":"_ pays _ candy","start_offset":12,"end_offset":25,"type":"shingle","position":1},
{"token":"pays","start_offset":12,"end_offset":16,"type":"<ALPHANUM>","position":2},
{"token":"pays _","start_offset":12,"end_offset":20,"type":"shingle","position":2},
{"token":"pays _ candy","start_offset":12,"end_offset":25,"type":"shingle","position":2},
{"token":"_ candy","start_offset":20,"end_offset":25,"type":"shingle","position":3},
{"token":"candy","start_offset":20,"end_offset":25,"type":"<ALPHANUM>","position":4}]}
and with "filler_token":""
{"tokens":[{"token":"scandale","start_offset":0,"end_offset":8,"type":"<ALPHANUM>","position":0},
{"token":"scandale pays","start_offset":0,"end_offset":16,"type":"shingle","position":0},
{"token":"pays","start_offset":12,"end_offset":16,"type":"shingle","position":1},
{"token":"pays candy","start_offset":12,"end_offset":25,"type":"shingle","position":1},
{"token":"pays","start_offset":12,"end_offset":16,"type":"<ALPHANUM>","position":2},
{"token":"pays candy","start_offset":12,"end_offset":25,"type":"shingle","position":2},
{"token":"candy","start_offset":20,"end_offset":25,"type":"shingle","position":3},
{"token":"candy","start_offset":20,"end_offset":25,"type":"<ALPHANUM>","position":4}]}
I would want, for example (among other duplicates), only one "pays" token, plus a "scandale pays candy" token, which currently cannot be produced because the "ghosts" of the deleted stopwords push it past the max shingle size.
Another effect of those deleted-stopword "ghosts": the shingles contain several consecutive spaces (one per deleted stopword) instead of a single space.
What I would want:
{"tokens":[{"token":"scandale","start_offset":0,"end_offset":8,"type":"<ALPHANUM>","position":0},
{"token":"scandale pays","start_offset":0,"end_offset":16,"type":"shingle","position":0},
{"token":"scandale pays candy","start_offset":0,"end_offset":25,"type":"shingle","position":0}
{"token":"pays","start_offset":12,"end_offset":16,"type":"shingle","position":1},
{"token":"pays candy","start_offset":12,"end_offset":25,"type":"shingle","position":1},
{"token":"candy","start_offset":20,"end_offset":25,"type":"<ALPHANUM>","position":4}]}
N.B. I made a mistake (now edited) in my first post: in "custom_unique", "only_on_same_position" is set to true.
If this parameter is false, the "trim"/"custom_unique" filters do get rid of the duplicated tokens, but I don't want to delete identical tokens coming from distinct parts of the original text, and every other issue remains.
@vchalmel positions are used for phrase queries, offsets are used for highlighting. It's a mistake to try to use a shingled field for either of these purposes. A shingled field is used for finding associated words, and that's it. You should use a shingled field in a bool.should clause to boost the relevance of docs with matching fields.
Also, setting both min and max shingle size to 2 is usually all you need. Remove the unigrams.
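For example, a typical setup along those lines might look like this (a sketch; the field names "text" and "text.shingles" are hypothetical, the main field being analyzed normally and the sub-field with the shingle analyzer):

curl -XGET 'localhost:9200/test_marches_publics/_search?pretty' -d '
{
  "query": {
    "bool": {
      "must": {
        "match": { "text": "scandale au pays de candy" }
      },
      "should": {
        "match": { "text.shingles": "scandale au pays de candy" }
      }
    }
  }
}'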
I don't understand your objection... Is this only a use-case issue? Or am I deeply wrong about the role of those tokens that differ only by offset or position?
You also seem to consider that shingles should only be used in a complementary field to boost the score when associated words are found, but that is simply not my use case... I want an index with all these tokens, both unigrams and shingles of at least up to three words, because I will use them in a text-mining algorithm...
On the other hand, with min = max shingle size = 2, "fox to be quick" won't generate the shingle "fox quick".
And what about the issue with the extra spaces?
If I remove the unigrams and search on a single word, will it match?
No. But if you're using shingles as a secondary field for boosting, then the single word will match in the primary field you're searching on.
If "pays" is duplicated three times here, are you saying that a search on "pays" won't get a higher score on this document than on a document where it is not surrounded by stopwords and hence is not duplicated because tokens differing only by positions and offset are only affecting highlighting and phrase queries ?
Yes it will get a higher score. But then it is used more frequently, so it should get a higher score. If you don't want it to match when it is combined with different stopwords, then leave the stopwords there (or use a different filler token).
You also seem to consider that shingles should only be used in a complementary field to boost the score when associated words are found, but that is simply not my use case... I want an index with all these tokens, both unigrams and shingles of at least up to three words, because I will use them in a text-mining algorithm...
Ok... then perhaps your requirements are different, but I don't understand what your exact requirements are. All I'm saying is that this is working correctly as is. Token filters are not allowed to change positions or offsets. That job belongs solely to the tokenizer. This comes directly from Lucene and isn't something we can change in Elasticsearch.
Thanks for your explanations!
Token filters are not allowed to change positions or offsets. That job belongs solely to the tokenizer.
Oh that is something I failed to take into account...
That would prevent us from recalculating the size of the shingle to ignore deleted stopwords with the current formula, am I right?... But how about a parameter to prevent the creation of shingles that start or end with a deleted stopword, like these:
{"token":"pays","start_offset":12,"end_offset":16,"type":"shingle","position":1}, /*virtually start with deleted stopword "au"*/
{"token":"pays candy","start_offset":12,"end_offset":25,"type":"shingle","position":1},/*virtually start with deleted stopword "au"*/
{"token":"pays","start_offset":12,"end_offset":16,"type":"<ALPHANUM>","position":2}, /*this unigram is duplicate with first token because due to stopwords I have shingles of size 2 containing only 1 word*/
{"token":"pays candy","start_offset":12,"end_offset":25,"type":"shingle","position":2}, /*duplicate with second token*/
How can a shingle whose size differs from the number of words it actually contains be "working correctly as is"?
Hi,
I would like to bring this issue back, as I think one of the main reasons for wanting the stopwords completely removed has not been taken into account.
The main impact of not ignoring the stopword positions shows up when you have to aggregate on a shingle-analyzed field: you either end up with underscores in your results or with several whitespaces, so results that should be equal are treated as different values.
In fact, I don't see the point of being able to configure the token used to replace the stopword if the spaces are kept when you set "" as the filler token.
Let's take an easy example of a nice functionality that could be achieved via shingles and aggregations, but cannot be done properly because stopword positions are not ignored. If I want a tag cloud of the words and groups of words appearing in a text field, I could analyze the field with shingles (say, sizes 1-3) and then run an aggregation on that field (setting fielddata=true), as sketched below. However, I don't need the stopwords in a tag cloud, and removing them means that, for example, "word1 stopword word2" and "word1 word2" produce different aggregation values ("word1  word2" and "word1 word2"), which may look the same but one has two whitespaces and the other just one.
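Something like this, for instance (a sketch; the index name "my_index" and the field "description.shingles" are made up and stand for whatever shingle-analyzed field you aggregate on, with fielddata enabled):

curl -XGET 'localhost:9200/my_index/_search?pretty' -d '
{
  "size": 0,
  "aggs": {
    "tag_cloud": {
      "terms": {
        "field": "description.shingles",
        "size": 50
      }
    }
  }
}'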
I'm facing this same problem while trying to implement an Amazon-like autocomplete based on an n-gram search and an aggregation over a shingle-analyzed field. The only reasonable way I found to achieve what I need is to generate an additional field, with the stopwords already removed, before sending the document to Elasticsearch, so that they are not taken into account at all. And that is only because, when I remove the stopwords in the analyzer, they are removed but their positions are still taken into account when generating the aggregation results.
Therefore, since there is already a "filler_token" parameter, I think it makes sense that setting "" as the filler should also remove the multiple spaces resulting from the stopword removal.
Hi,
you can remove the redundant whitespaces by using a pattern_replace token filter like so:
"whitespace_remove": {
  "type": "pattern_replace",
  "pattern": " +",
  "replacement": " "
}
However, you'll still have the same issue @vchalmel has.
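To be clear about the placement: the pattern_replace token filter has to run after the shingle filter, otherwise there is nothing to collapse yet. With the "test" analyzer from the first post, the chain would look roughly like this (a sketch):

"test": {
  "type": "custom",
  "tokenizer": "standard",
  "filter": [
    "asciifolding",
    "lowercase",
    "fr_stop",
    "fr_elision",
    "custom_shingles",
    "whitespace_remove",
    "trim",
    "custom_unique"
  ]
}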
Thanks for the suggestion @fhinze. I would still prefer something like the "preserve_position_increments=false" option from the completion suggester, but the "pattern_replace" filter combined with a "trim" filter worked perfectly to remove the whitespaces left by the stopwords.
@vchalmel Were you able to find a workaround for this issue? I am facing the same issue when using shingles and stop_words together and want to dedupe shingles based on start_offset.
Dear all,
I am facing the same issue. I need to remove all of the shingles that contain fillers. How can I do this?
The multiple-spaces problem can be solved using a pattern_replace filter:
"space_filter": {
  "type": "pattern_replace",
  "pattern": " +",
  "replacement": " "
}
This will collapse multiple spaces into a single space.