Elasticsearch: Faster wildcard search

Created on 4 Nov 2019 · 20 comments · Source: elastic/elasticsearch

While wildcard queries are often a symptom that content hasn't been properly analyzed, there are also cases when users really need the flexibility of wildcard queries. Could we leverage ngrams to make these queries faster, similarly to how we made prefix queries faster on text fields by indexing edge ngrams?

I tried to think about how this could work and here is an idea:

  • Null bytes would be used as start/end markers in order to make suffix queries possible (*foobar) and prefix queries faster (foobar*).
  • Content would be indexed with an ngram tokenizer that has a fixed gram size, e.g. 5 (could be configurable). It would be used to return a good approximation of the matches of the wildcard query (a decomposition sketch follows this list).
  • Doc values would store the original value and could be used for a two-phase verification.
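
As a toy illustration of the decomposition idea (plain Java, not Elasticsearch code; the class and method names are made up):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the proposed decomposition: wrap the value
// in \0 start/end markers and emit every gram of the configured size.
public final class NgramDecomposition {
    static List<String> grams(String value, int gramSize) {
        String marked = "\0" + value + "\0";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + gramSize <= marked.length(); i++) {
            grams.add(marked.substring(i, i + gramSize));
        }
        return grams;
    }

    public static void main(String[] args) {
        // For "foobar" and gram size 5 this yields the grams
        // \0foob, fooba, oobar, obar\0 (null markers written as \0 here).
        System.out.println(grams("foobar", 5));
    }
}
```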

Here are some example queries and how they would run internally with this setup (a rough Lucene sketch of the approximation step follows the table):

| Query | Approximation | Two-phase verification | Note |
| --- | --- | --- | --- |
| *foobar* | fooba AND oobar | Check positions | Like a phrase query |
| foobar* | \0foob AND fooba AND oobar | Check positions | We could skip the middle term like Lucene's NGramPhraseQuery |
| *foobar | fooba AND oobar AND obar\0 | Check positions | We could skip the middle term like Lucene's NGramPhraseQuery |
| *foo* | *foo* | Always returns true | Need a wildcard query because the substring is shorter than the gram size |
| foo* | \0foo* | Always returns true | Need a wildcard query because the substring is shorter than the gram size |
| *foo | *foo\0 | Always returns true | Need a wildcard query because the substring is shorter than the gram size |
| foobar | \0foob AND fooba AND oobar AND obar\0 | Check positions | We could skip the middle terms like Lucene's NGramPhraseQuery |
| *foobar*quux* | fooba AND oobar | Check doc values | We don't want to run a wildcard query on *quux*; the approximation we have might be selective enough already |
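
For concreteness, the *foobar* row above could translate into a Lucene conjunction over its two grams, with the two-phase verification layered on top separately. This is only a sketch, not the actual implementation; the field name wildcard_field is made up.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Sketch of the approximation for *foobar* with gram size 5: every document
// matching the wildcard must contain both the "fooba" and "oobar" grams, so a
// conjunction over those terms is a superset of the true matches.
public final class WildcardApproximation {
    static Query approximate(String field, String... grams) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String gram : grams) {
            builder.add(new TermQuery(new Term(field, gram)), Occur.FILTER);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // The two-phase verification (positions or doc values) would then
        // discard false positives such as a value like "oobarfooba", which
        // contains both grams but not "foobar".
        System.out.println(approximate("wildcard_field", "fooba", "oobar"));
    }
}
```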

Notes:

  • This might be quite space-intensive on long fields.

Questions:

  • Are there other ways it could work?
  • What is a good default gram size?
  • How much more space does it require compared to keyword indexing?
  • Should it be an option on text/keyword fields or a new field? This way of indexing allows running exact queries too, so we wouldn't also need to index the value as a keyword in addition to these ngrams.
  • Should we index positions? Whenever we check positions above, we could also check doc values instead.
  • Should we use BINARY or SORTED doc values? BINARY feels more appropriate for the access pattern described above, but compression would be terrible.
Labels: :Search/Search, >feature, SIEM, Top Ask, v7.9.0, v8.0.0

Most helpful comment

Can we reuse the same logic as WildcardQuery? It would be a better experience if a wildcard query on a keyword or on this new type returned the exact same results.

All 20 comments

Pinging @elastic/es-search (:Search/Search)

Are there other ways it could work?

I think we can try to extend the optimization to handle leading wildcards efficiently. It seems, for example, that this solution would allow transforming any query of the form *foo* into a simple prefix query foo*? Simple leading wildcard queries like *foo are more problematic if the ending part is too short. Maybe we could make the ngram size a factor of this? A size of 3 seems like a good compromise: we'd need to check more positions, but we'd avoid the degraded case more often.

Oh I like this idea. Maybe instead of changing the gram size, we could index suffixes with multiple gram sizes. For instance, under my first proposal, foobar would be indexed as ["\0foob", "fooba", "oobar", "obar\0"], but maybe we could actually do ["\0foob", "fooba", "oobar", "obar\0", "bar\0", "ar\0", "r\0"]. Then we could always run *foo* as foo* internally? This would give the following updated table (a sketch of the suffix expansion follows it):

| Query | Approximation | Two-phase verification | Note |
| --- | --- | --- | --- |
| *foobar* | fooba AND oobar | Check positions | Like a phrase query |
| foobar* | \0foob AND fooba AND oobar | Check positions | We could skip the middle term like Lucene's NGramPhraseQuery |
| *foobar | fooba AND oobar AND obar\0 | Check positions | We could skip the middle term like Lucene's NGramPhraseQuery |
| *foo* | foo* | Always returns true | Need a trailing wildcard query because the substring is shorter than the gram size |
| foo* | \0foo* | Always returns true | Need a wildcard query because the substring is shorter than the gram size |
| *foo | foo\0 | Always returns true | Term query |
| foobar | \0foob AND fooba AND oobar AND obar\0 | Check positions | We could skip the middle terms like Lucene's NGramPhraseQuery |
| *foobar*quux* | fooba AND oobar, or maybe fooba AND oobar AND quux* | Check doc values, or maybe check positions a la MultiPhraseQuery? | The right approach probably depends on the number of expansions of substrings that are shorter than the gram size |
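
A minimal sketch of the suffix expansion described above (plain Java, hypothetical helper names), which produces the extra trailing grams that make *foo* rewritable as the prefix query foo*:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: in addition to the fixed-size grams, also index every
// shorter suffix ending at the \0 end marker.
public final class SuffixExpansion {
    static List<String> gramsWithSuffixes(String value, int gramSize) {
        String marked = "\0" + value + "\0";
        List<String> out = new ArrayList<>();
        for (int i = 0; i + gramSize <= marked.length(); i++) {
            out.add(marked.substring(i, i + gramSize));
        }
        // Trailing suffixes shorter than the gram size, e.g. bar\0, ar\0, r\0.
        for (int len = gramSize - 1; len >= 2; len--) {
            out.add(marked.substring(marked.length() - len));
        }
        return out;
    }

    public static void main(String[] args) {
        // For "foobar" and gram size 5 this yields
        // \0foob, fooba, oobar, obar\0, bar\0, ar\0, r\0.
        System.out.println(gramsWithSuffixes("foobar", 5));
    }
}
```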

What is a good default gram size?

Given this is intended to support EQL, can we do some analysis of existing rules out there?

+1 it would be useful to know the distribution of the number of chars between wildcards

I had a scan of the EQL rules; they fell into these groups:

| Pattern wildcard type | Number of patterns | Average pattern length |
| --- | --- | --- |
| left and right (*foo*) | 482 | 22 |
| left and middle (*fo*o) | 107 | 56 |
| middle and right (f*oo*) | 15 | 56 |
| middle and middle (f*o*o) | 11 | 68 |
| left only (*foo) | 296 | 31 |
| right only (foo*) | 68 | 18 |
| middle only (f*oo) | 38 | 54 |

similarly to how we made prefix queries faster on text fields

So would this be an index_wildcards addition to the index_prefixes and index_phrases options for text fields?

This is a good question. One difference between this way of indexing and indexing prefixes or shingles is that it can also help find exact matches. And I think that a number of users would prefer to save some space by only indexing ngrams, instead of indexing both the original keyword and its ngrams. So I'm leaning towards exposing a different field that only indexes ngrams?

So I'm leaning towards exposing a different field that only indexes ngrams?

Fair enough. Will there be a specialised new query type for this too? Would we reject traditional token-based queries like term, terms and match? Also, will we support highlighting?

I'm viewing this as a specialized keyword field for wildcard search, so I think it shouldn't have a specialized query type but should reuse the existing ones based on the approach outlined above. And like keywords, it wouldn't support highlighting.

Would interval queries be efficient here for checking the sequences of these expressions?
I've sketched out an approach that doesn't verify positions here, but if I simply swapped in an Interval query instead of the boolean query used in my example, would that have all the logic needed for efficient two-phase operation? It would avoid having big doc-value fields and string matching for the verification step.
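
For concreteness, an interval-based version of the *foobar approximation could look roughly like the sketch below. This assumes a recent Lucene where interval queries live in org.apache.lucene.queries.intervals, and whether maxgaps(0) over ordered intervals gives exactly the adjacency semantics we need would have to be verified; the field name wildcard_field is made up.

```java
import org.apache.lucene.queries.intervals.IntervalQuery;
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.search.Query;

// Sketch only: express the gram sequence for *foobar as ordered intervals with
// no gaps between them, so position checking happens inside the interval
// iterator rather than via doc values.
public final class IntervalApproximation {
    static Query leadingWildcardFoobar(String field) {
        return new IntervalQuery(field,
            Intervals.maxgaps(0,
                Intervals.ordered(
                    Intervals.term("fooba"),
                    Intervals.term("oobar"),
                    Intervals.term("obar\0"))));
    }
}
```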

I think we can't escape having to run wildcards against doc values in some cases. For instance, if the query looks like *a*b* then we need to find all ngrams that start with an a and are followed by an ngram that starts with a b, which creates combinatorial explosion issues if we want to verify matches using positions. For a query like that, I think the only realistic way to verify that there is a match is by running the wildcard against doc values.
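
A rough sketch of what that doc-values verification could look like, assuming the original value is stored in BINARY doc values and the wildcard pattern has been compiled to an equivalent regex (e.g. *a*b* to .*a.*b.*). The class below is illustrative, not the eventual implementation.

```java
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.TwoPhaseIterator;

// Illustrative only: verify candidate docs produced by the ngram approximation
// by matching the full wildcard pattern against the stored original value.
final class DocValuesVerification extends TwoPhaseIterator {
    private final BinaryDocValues original; // assumed BINARY doc values holding the raw keyword
    private final Pattern wildcardAsRegex;

    DocValuesVerification(DocIdSetIterator approximation, BinaryDocValues original,
                          Pattern wildcardAsRegex) {
        super(approximation);
        this.original = original;
        this.wildcardAsRegex = wildcardAsRegex;
    }

    @Override
    public boolean matches() throws IOException {
        if (original.advanceExact(approximation.docID()) == false) {
            return false; // no value for this doc, so it cannot match
        }
        return wildcardAsRegex.matcher(original.binaryValue().utf8ToString()).matches();
    }

    @Override
    public float matchCost() {
        return 100; // rough per-doc cost of loading the value and running the regex
    }
}
```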

OK - I'll code up the doc-values approach for all cases for now, and we can optimise later for cases with longer patterns if we think intervals would be a big improvement.

This sounds good to me.

I'd like us to try both approaches to see how they compare, but my gut feeling is that the storage overhead of positions is going to be _huge_, and probably a barrier to adoption for many users. So only indexing ngrams with IndexOptions.DOCS and then checking with doc values is likely going to be a better trade-off. I might be surprised by data though. :)

For the verification phase we may have to consider some of the flags we see in regex queries to do with "greedy" evaluation.
I tried to match a "food is good" document with a f*od query using Regex.simpleMatch and it failed, whereas a food*od query matched OK.
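
For illustration, here is one way such a mismatch can arise with a naive, non-backtracking matcher (plain Java with assumed semantics, not the actual Regex.simpleMatch implementation): a matcher that commits to the first occurrence of each literal chunk fails where a backtracking matcher succeeds.

```java
public final class GreedyPitfall {
    public static void main(String[] args) {
        String value = "food is good";

        // Naive matcher for f*od: commit to the first "od" found after the
        // leading 'f' and require it to end the string - this fails here,
        // because the first "od" is in the middle of "food".
        int od = value.indexOf("od", 1);
        boolean naive = value.startsWith("f") && od >= 0 && od + 2 == value.length();

        // A backtracking matcher (java.util.regex) tries later occurrences
        // of "od" and succeeds on the trailing "od" of "good".
        boolean backtracking = value.matches("f.*od");

        System.out.println(naive + " vs " + backtracking); // prints: false vs true
    }
}
```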

Can we reuse the same logic as WildcardQuery? It would be a better experience if a wildcard query on a keyword or on this new type returned the exact same results.

Is there mileage in thinking of this feature more broadly as a form of index-backed regex with its own query type?
There's extra functionality in Interval queries, like ORs or max gap, that we could expose if we move beyond wildcard syntax and include some things from regex, e.g. the | OR symbol, to give us patterns like [elastic search|elasticsearch] error.

I guess we could add the field type for now and dream up new query syntax later. It might make us want to reconsider the name of the field type in the first instance though.

Indeed, I was thinking that this field would be a good fit for regexp or fuzzy queries. Why would you create a custom query type? We could reuse the existing regexp and fuzzy queries. MappedFieldType has factory methods for both of these queries, so fields are free to implement them however they want.

Why would you create a custom query type

I was thinking that if we opt for the positions-based implementation, the regex functionality we could offer might be constrained by what Interval queries can support. This would create a divergence in functionality between "real" regex and what this new field type could support.

OK, let's focus on the positions vs. doc-values discussion first then. I'll write some thoughts about this on the PR.
