Currently the options for the ngram and shingle tokenizers/token filters allow the user to set min_size and max_size to any values. This is dangerous as users can set values which produce huge numbers of terms, at best bloating their index and at worst causing problems such as https://github.com/elastic/elasticsearch/issues/25841.
I think we should add soft (and/or maybe hard) limits so that neither min_size nor max_size can be more than, say, 6 and so that the difference between min_size and max_size can't be more than 2 or 3 (we may even want to make this limit 1).
Note that this does not apply to edge_ngrams, where it is useful to have higher values and a larger difference between min and max values. We should probably decide whether there should be different limits here though.
I think this is worth having a look at. A soft limit will catch anyone just playing around and give back an error so they know this can be problematic.
We discussed this during fixit friday and we all agreed that there should be a soft limit for the difference between the min_size and max_size settings. In the case of the ngram tokenizer / token filter the soft limit should be 0, and the shingle token filter's soft limit should be 3. For the edge_ngram tokenizer / token filter there should be no soft limit.
Hello guys,
I was having a look at this issue.
Should a new IllegalArgumentException be raised in case the conditions are not satisfied?
Is the constructor of ShingleTokenFilter a good place to do these validations?
I guess we will raise an IllegalArgumentException if the conditions are not satisfied.
I was thinking we can do a check at the level of org.elasticsearch.index.analysis.ShingleTokenFilterFactory and org.elasticsearch.index.analysis.NGramTokenizerFactory.
@mayya-sharipova If we change the settings in those classes to use org.elasticsearch.common.settings.Setting.intSetting(String, int, int, int, Property...) instead of Settings.getAsInt(), then we should be able to define and validate the min and max values using the Settings infrastructure, and the exception will be consistent with other similar validation errors thrown from settings.
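As an illustration, a minimal sketch of what that could look like (the setting keys match the tokenizer options, but the bounds of 1..6 are purely illustrative, taken from the limits floated above, and the class name is made up):

```java
// Sketch only: declaring min_gram / max_gram as bounded settings via
// Setting.intSetting(key, default, min, max, properties...) so that
// out-of-range values are rejected by the Settings infrastructure itself.
// The 1..6 bounds are illustrative, not actual Elasticsearch defaults.
import org.elasticsearch.common.settings.Setting;

public final class BoundedGramSettings {
    public static final Setting<Integer> MIN_GRAM =
        Setting.intSetting("min_gram", 1, 1, 6);

    public static final Setting<Integer> MAX_GRAM =
        Setting.intSetting("max_gram", 2, 1, 6);
}
```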
Analyzers use a group setting (analysis.*) to register their settings. We don't check the inner settings for analyzers, which is why we have to use Settings.getAs.... We should at some point move the analyzer options to checked settings (org.elasticsearch.common.settings.Setting) but that would be a breaking change for plugin devs and is not low-hanging fruit.
For this issue I think it's fine to check the conditions in the factories directly and throw an IllegalArgumentException if conditions are not met as @mayya-sharipova suggests.
I'll open an issue to discuss how we could validate inner settings for analyzers but that's not low-hanging fruit ;).
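To illustrate that kind of factory-level check for the shingle case, here is a minimal sketch; the limit of 3 is the soft limit agreed above, while the helper class and message wording are illustrative, not the actual implementation:

```java
// Sketch of validating the shingle sizes directly in the factory, as
// suggested, and throwing an IllegalArgumentException when the difference
// exceeds the agreed soft limit of 3. Names and message are illustrative.
final class ShingleDiffValidator {
    private static final int MAX_SHINGLE_DIFF = 3;

    static void validate(int minShingleSize, int maxShingleSize) {
        int diff = maxShingleSize - minShingleSize;
        if (diff > MAX_SHINGLE_DIFF) {
            throw new IllegalArgumentException(
                "the difference between max_shingle_size and min_shingle_size must be <= ["
                    + MAX_SHINGLE_DIFF + "] but was [" + diff + "]");
        }
    }
}
```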
@jimczi @colings86 Do we want to create a new setting for the difference between min_size and max_size? If yes, where do we keep this setting?
If not, does it mean that we will always throw an exception, for example when the ngram min_gram != max_gram?
When we say "the soft limit should be 0", what do we mean? Does it mean always throwing an exception when the difference is not 0?
@mayya-sharipova I think the soft limit for the difference between min_size and max_size should be a setting that is set on the index as a whole rather than a setting on each ngram token filter individually, and its value should default to 0. We should then check this in the constructor of NGramTokenizerFactory and throw an IllegalArgumentException if the difference exceeds the setting's value.
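Roughly what that proposal could look like, as a sketch only: the setting key mirrors the one discussed later in this thread, but the default of 0, the helper class, and the message wording are illustrative (the merged change ended up defaulting to 1):

```java
// Sketch of an index-level soft limit for the ngram case, read from the
// index settings and checked at factory construction time. The default of 0
// follows the proposal above; everything else here is illustrative.
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Setting.Property;
import org.elasticsearch.common.settings.Settings;

final class NGramDiffLimit {
    static final Setting<Integer> MAX_NGRAM_DIFF =
        Setting.intSetting("index.max_ngram_diff", 0, 0, Property.IndexScope);

    // Would be called from NGramTokenizerFactory's constructor.
    static void check(Settings indexSettings, int minGram, int maxGram) {
        int maxDiff = MAX_NGRAM_DIFF.get(indexSettings);
        if (maxGram - minGram > maxDiff) {
            throw new IllegalArgumentException(
                "the difference between max_gram and min_gram must be <= ["
                    + maxDiff + "] but was [" + (maxGram - minGram) + "]");
        }
    }
}
```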
I feel that this is going to cause issues because of the following:
If I need to do a contains search over something like "AA-PP-001-002" I will be restricted by it. For instance, with min ngram = 3 and max ngram = 4
this would result in AA-, A-P, ..... AA-P, A-PP, ....
I would not find a string like AA-PP or P-001.
And I don't see any suggested alternatives. I saw the extreme case of max = 6000. Clearly, this should be at the developer's discretion. Deprecating without an alternative is not a solution IMHO.
Clearly, this should be at the developer's discretion
The solution implemented just adds a setting to limit the allowed diff between min_gram and max_gram to 1 by default. You can change this setting at your discretion ;)
Note that, as detailed in the linked PR (#27411), these limits are backed by an index-level setting that can be changed if needed. In the case of ngrams the setting is index.max_ngram_diff, which controls what the maximum difference between max_gram and min_gram can be. The intention with these limits is to help cluster administrators enforce limits on the usage of their cluster that are sensible for the resources available to it. Use cases with a wide difference between min and max ngrams are few and far between, and those use cases can still modify the setting from its default.
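For reference, a hedged sketch of raising that limit programmatically (the value of 3 is arbitrary; the same key can also be supplied in the index settings of a create-index request):

```java
// Sketch: building index settings that allow a max_gram - min_gram
// difference of up to 3 instead of the default of 1.
import org.elasticsearch.common.settings.Settings;

public class RaiseNGramDiff {
    public static void main(String[] args) {
        Settings indexSettings = Settings.builder()
            .put("index.max_ngram_diff", 3)
            .build();
        System.out.println(indexSettings.get("index.max_ngram_diff"));
    }
}
```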
Also note that in your example a query for, say, AA-PP would, after analysis of the query, search for AA-, AA-P, A-P, A-PP, -PP, which would all be present in a document containing a field with AA-PP-001-002, since the ngram analysis is done at both index and search time.
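If you want to verify this yourself, a quick standalone sketch using Lucene's NGramTokenizer (which the ngram tokenizer is built on) prints exactly those grams; the class name is made up for the example:

```java
// Prints the grams produced for "AA-PP" with min_gram=3 and max_gram=4:
// AA-, AA-P, A-P, A-PP, -PP
import java.io.StringReader;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PrintNGrams {
    public static void main(String[] args) throws Exception {
        try (NGramTokenizer tokenizer = new NGramTokenizer(3, 4)) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.setReader(new StringReader("AA-PP"));
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term.toString());
            }
            tokenizer.end();
        }
    }
}
```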
@colings86 - Thanks for your reply. OK, that's great. I was under the impression that the deprecation message indicated that the diff would eventually disappear and we would be forced into using the max diff of 1. As long as this is just a cautionary message I am fine.