The ngram index used by the wildcard field indexes _every_ ngram, e.g. the document value `12345` is indexed as `123`, `234` and `345`. At query time, however, we thin out the ngrams used in queries, e.g. we'd drop the middle `234` to avoid having too many query terms.
Given that ngrams are already thinned at query time, the proposal is that we can shift much of this thinning to index time.
This may make some queries slower (less selective ngram queries on short query strings) but would save on disk space.
The question becomes: how many ngrams do we drop, and which ones?
We could drop ngrams containing what we consider to be punctuation, or perhaps those whose hash modulo N is 0.
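For concreteness, the two heuristics could look roughly like the sketch below (illustrative only; `dropByHash` and `dropByPunctuation` are hypothetical names, not existing code, and the key property is that each decision depends only on the ngram's own 3 characters):

```java
// Illustrative index-time drop predicates for a 3-character ngram.
class NgramDropHeuristics {

    // Drop roughly 1 in n ngrams, decided purely by hashing the ngram's content.
    static boolean dropByHash(String ngram, int n) {
        return Math.floorMod(ngram.hashCode(), n) == 0;
    }

    // Drop ngrams containing anything we'd consider punctuation
    // (approximated here as "not a letter or digit").
    static boolean dropByPunctuation(String ngram) {
        return ngram.chars().anyMatch(c -> !Character.isLetterOrDigit(c));
    }
}
```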
In this issue I will report on experiments using some sample data and various approaches.
Pinging @elastic/es-search (:Search/Search)
If we go ahead with the idea of indexing less than 100% of the ngrams in values (to save disk space), then I'd like to drop the existing support for fuzzy queries.
The current implementation relies on the fact that the max fuzzy query edit distance of 2 can be neatly translated into an ngram OR query with a minimum-should-match of numNgrams - 2. This drastically reduces the number of candidate matches we need to verify. This logic does not hold if we drop ngrams from the index, and we would have to revert to a match_all-style ngram query, meaning we would have to decompress and parse every binary document value in the index to verify matches.
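For reference, the approximation described above amounts to something like the following sketch (assuming Lucene's `BooleanQuery`; this is illustrative, not the actual wildcard field code):

```java
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

class FuzzyNgramApproximation {
    // OR together the search string's ngrams and require all but maxEdits of them to match.
    static Query approximate(String field, List<String> ngrams, int maxEdits) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String ngram : ngrams) {
            builder.add(new TermQuery(new Term(field, ngram)), Occur.SHOULD);
        }
        // Only a safe candidate filter if every ngram of every value is indexed;
        // with dropped ngrams this bound no longer holds and real matches could be missed.
        builder.setMinimumNumberShouldMatch(Math.max(0, ngrams.size() - maxEdits));
        return builder.build();
    }
}
```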
Wildcard field values are typically long, and the search string can differ from the value by at most 2 edits. This seems like an unlikely search requirement - people are normally looking for fuzzy matching on short strings.
This does start to break our guarantee that wildcard fields == keyword fields in terms of feature compatibility, though.
The problem with dropping terms at index-time (as opposed to the current search-time optimisations) is that you may make future searches hugely inefficient.
As an example, I experimented with dropping tokens where `token.hashCode() % 2 == 0` - roughly one in every 2 tokens.
The difficulty is that a query like `*.exe*` may be unlucky enough to throw away both the `.ex` and `exe` ngrams due to the way they hash. This is quite likely, given the probability is the same as tossing heads twice in a row. When the coin tosses don't come up in your favour, you have no tokens in the index with which to accelerate the search, and we then have to perform a linear scan of all the compressed binary doc values, which would be very expensive.
There's an unwelcome lack of predictability to searches then - why should `*.exe*` take hours to find 10 results but `*.foo*` take milliseconds? Having this distinction rest on the roll of the dice is not something with a precedent in Elasticsearch.
Unfortunately, a position-independent means of deciding what to index is required. For example, we can't just drop every other token in a stream, because the document value and the search string may not start with the same characters: the doc may have opted to index the tokens `123` and `34_` for the value `1234`, but the user searched for `*234*` and so requires a match on `234`, which is not indexed. Whether each token is indexed or not needs to be verifiable purely by examining the 3 characters it contains, not any surrounding content.
Another deterministic approach that was suggested was to drop tokens that contain any punctuation. This may be more sympathetic to the sorts of searches users generally run, but might not be as effective at reducing disk space.
Whatever algorithm we choose, there is a real danger that it won't be aligned with user searches, causing huge differences in response times compared to searches that are lucky enough to contain indexed terms. Maybe admins could take control of thinning choices and use a custom Analyzer?
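If we went the custom-Analyzer route, the wiring could in principle look something like the sketch below, assuming Lucene's `NGramTokenizer` and an admin-supplied trimming filter (all names here are hypothetical; the wildcard field's real analysis chain is internal and not currently pluggable):

```java
import java.util.function.UnaryOperator;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

// Hypothetical: 3-gram analyzer that applies an admin-chosen thinning filter at index time.
public class ThinnedNGramAnalyzer extends Analyzer {
    private final UnaryOperator<TokenStream> thinningFilter;

    public ThinnedNGramAnalyzer(UnaryOperator<TokenStream> thinningFilter) {
        this.thinningFilter = thinningFilter;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new NGramTokenizer(3, 3); // fixed 3-character ngrams
        return new TokenStreamComponents(source, thinningFilter.apply(source));
    }
}
```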
Either way, part of the appeal of the wildcard field compared to the text field is that users do not have to know what indexing strategies have been employed when writing queries.
Initial results suggest we can save ~35% of disk space by indexing only half the ngrams, but this benefit is paid for with the cost outlined above: searches can take massively longer to execute if they are unlucky enough to use the wrong terms.
This feels like a trade-off we shouldn't make by default.
@markharwood I agree that fuzzy support is unlikely to be important, but I would avoid dropping support for fuzzy queries entirely, as the compatibility between keyword and wildcard is an important feature of the wildcard field. I would rather make fuzzy queries operate purely on doc values.
The hash/modulo dropping approach looks interesting to me, but I wonder how it would compare with other approaches. I have the following ideas in mind, for instance:
Being so data/use-case dependent was why I floated the idea of having user-definable TokenFilters.
Maybe we limit choices to pre-defined token filters that we ship for this trimming purpose.
I got hold of some log data and related queries to measure the effects of index-time token dropping on queries.
Ironically, my tests failed for the query string `*ERROR*`. The hash modulos of this string's ngrams (`err`, `rro`, `ror`) all mark them as tokens to drop, and my code doesn't currently handle this properly. The fix isn't great either, because it would need to fall back to a match_all query that does a brute-force scan of all unique values in the index.
Agreed that this sounds bad. This is why I prefer skipping ngrams based on their actual content rather than based on the result of a hashing function.
I hope we could find a way to improve space efficiency with very minimal assumptions so that we wouldn't need to expose options to tailor the behavior of this field differently for every use-case.
I trialled dropping tokens with spaces in them with my test logs dataset.
This reduced the field size by 18% from 3.4GB to 2.8GB:
==== wildcard field on master
```
total disk: 4,540,272,889
num docs: 18,910,699
stored fields: 870,096,528
term vectors: 0
norms: 0
docvalues: 1,166,315,683
postings: 2,358,532,109
prox: 0
points: 38,816,955
terms: 106,510,273
field total terms dict postings proximity points docvalues % with dv features
===== ===== ========== ======== ========= ========= ========= ======== ========
line 3,487,256,622 306,098 2,358,532,031 0 0 1,128,418,493 100.0% docs binary
_id 106,204,253 106,204,175 78 0 0 0 0.0% docs
_seq_no 37,896,728 0 0 0 0 37,896,728 100.0% 8bytes/1D numeric
_primary_term 231 0 0 0 0 231 100.0% numeric
_version 231 0 0 0 0 231 100.0% numeric
_source 0 0 0 0 0 0 0.0%
```
====== wildcard field, master modified to trim tokens containing spaces
```
total disk: 3,845,563,101
num docs: 18,910,699
stored fields: 870,224,450
term vectors: 0
norms: 0
docvalues: 1,166,288,842
postings: 1,663,539,771
prox: 0
points: 38,811,842
terms: 106,696,855
field total terms dict postings proximity points docvalues % with dv features
===== ===== ========== ======== ========= ========= ========= ======== ========
line 2,792,248,234 275,929 1,663,539,693 0 0 1,128,432,612 100.0% docs binary
_id 106,421,004 106,420,926 78 0 0 0 0.0% docs
_seq_no 37,855,768 0 0 0 0 37,855,768 100.0% 8bytes/1D numeric
_primary_term 231 0 0 0 0 231 100.0% numeric
_version 231 0 0 0 0 231 100.0% numeric
_source 0 0 0 0 0 0 0.0%
```
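For reference, the space-dropping experiment amounts to a token filter along these lines (a sketch rather than the exact patch, assuming Lucene's `FilteringTokenFilter`):

```java
import java.io.IOException;
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Drops any ngram token that contains a space, so it is never indexed.
public class DropSpaceNgramsFilter extends FilteringTokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public DropSpaceNgramsFilter(TokenStream in) {
        super(in);
    }

    @Override
    protected boolean accept() throws IOException {
        for (int i = 0; i < termAtt.length(); i++) {
            if (termAtt.charAt(i) == ' ') {
                return false; // token contains a space -> skip it
            }
        }
        return true;
    }
}
```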
The example queries I had weren't noticeably slower as a result of dropping tokens with spaces.
While the strategy may be useful for my example dataset and queries, a worst-case scenario might be docs that record sequences of data separated by spaces, e.g.
`1, 3, 4, 10, 9, 3, 4, 6....`
In this scenario there would be no tokens indexed other than `10,`.
> … we could look into how much disk space we'd save and how much slower queries would be if we e.g. normalized all even digits to the previous digit
The bulk of the disk space costs look to be tied up in postings. By normalising terms in the way suggested above I expect we would just move the costs around rather than make a significant impact. If `bbc` and `abc` are both normalised to `abc` then that term will have the same number of postings - perhaps just denser and therefore compressing a bit better. The postings are denser than in a typical index already because they are only 3-character terms (the median number of postings is 13 in my test data).
Dropping terms will likely give a bigger disk space reduction than merging terms, but its side effect when misaligned with searches is also bigger: if all search terms are missing (like in my `1, 3, 4` example) the cost is a full scan, whereas if search terms are only ever merged (`bbc` == `abc`) we just have more false-positive matches to filter out.
> By normalising terms in the way suggested above I expect we just move the costs around rather than make a significant impact.
I tried normalisation on my test data set and as suspected, we halved the number of unique terms but maintained roughly the same posting size. Probably not worth pursuing this approach. FYI @jpountz
```
total disk: 4,312,685,895
num docs: 18,910,699
stored fields: 869,976,159
term vectors: 0
norms: 0
docvalues: 1,166,296,514
postings: 2,131,185,546
prox: 0
points: 38,814,908
terms: 106,411,427
field total terms dict postings proximity points docvalues % with dv features
===== ===== ========== ======== ========= ========= ========= ======== ========
line 3,259,723,192 105,632 2,131,185,468 0 0 1,128,432,092 100.0% docs binary
_id 106,305,873 106,305,795 78 0 0 0 0.0% docs
_seq_no 37,863,960 0 0 0 0 37,863,960 100.0% 8bytes/1D numeric
_primary_term 231 0 0 0 0 231 100.0% numeric
_version 231 0 0 0 0 231 100.0% numeric
_source 0 0 0 0 0 0 0.0%
```
It still saves more than 6%, which I don't consider negligible. I suspect we will want to look into combinations of (hopefully low-impact) tricks that help save space in the ngram index.
One thing I'm also considering is that Lucene uses FOR rather than PFOR for doc IDs in postings, so we might be wasting some disk there as well. For sparse postings lists, spending 20 bits per value rather than 18 is not a big deal, but when the choice is between 4 and 6 the trade-off is a bit different as this could be a 33% saving. (Postings for ngrams are typically denser than postings of text fields.)
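To make the 33% figure concrete (assuming Lucene's usual postings block size of 128 doc deltas): a block at 6 bits per value takes 128 × 6 / 8 = 96 bytes, while at 4 bits it takes 128 × 4 / 8 = 64 bytes.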
I trialled an even more aggressive normalizer based on this code:
```java
private int normalize(int codepoint) {
    // Normalize space ! " # $ % & ' ( ) * + , - . chars to /
    if (codepoint >= 32 && codepoint < 48) {
        return 47;
    }
    // Normalize [ \ ] ^ _ ` chars to /
    if (codepoint >= 91 && codepoint <= 96) {
        return 47;
    }
    // Normalize { | } ~ chars to /
    if (codepoint >= 123 && codepoint <= 126) {
        return 47;
    }
    // For the remaining ascii characters, fold even code points to the preceding odd one.
    if (codepoint >= 48 && codepoint <= 128 && codepoint % 2 == 0) {
        // Even ascii chars in the 0-9 a-z range.
        return codepoint - 1;
    } else {
        // Return odd ascii chars and non-ascii chars unchanged.
        return codepoint;
    }
}
```
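For illustration (assuming lowercased ASCII input, since the ngram index already lowercases), this folds e.g. `file_2.exe` to `eike/1/ewe` before ngramming, so similar values end up sharing the same ngrams and the postings lists become denser rather than more numerous.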
This led to some more reductions in disk space - 3.4GB master wildcard to 3GB with the modification.
Search times were not noticeably slower for the test queries provided.
====== wildcard field, master modified to further normalise ascii tokens (most normal punctuation)
```
total disk: 4,020,035,717
num docs: 18,910,699
stored fields: 870,274,144
term vectors: 0
norms: 0
docvalues: 1,166,316,164
postings: 1,838,346,697
prox: 0
points: 38,813,894
terms: 106,283,477
field total terms dict postings proximity points docvalues % with dv features
===== ===== ========== ======== ========= ========= ========= ======== ========
line 2,966,833,153 59,368 1,838,346,619 0 0 1,128,427,166 100.0% docs binary
_id 106,224,187 106,224,109 78 0 0 0 0.0% docs
_seq_no 37,888,536 0 0 0 0 37,888,536 100.0% 8bytes/1D numeric
_primary_term 231 0 0 0 0 231 100.0% numeric
_version 231 0 0 0 0 231 100.0% numeric
_source 0 0 0 0 0 0 0.0%
```
With this new approach of normalizing tokens (rather than dropping them) we can still accelerate fuzzy queries the way we used to.
However, range queries are a casualty: the scrambling of tokens in the index means we can no longer rely on querying a contiguous range of terms, which is how we used to accelerate them. @jpountz shall we drop range query support or revert to a match-all approach (if the allow-expensive-queries flag permits)?
> @jpountz shall we drop range query support or revert to a match-all approach?
Let's not drop it, as compatibility with the keyword field is very important. How are we handling lowercasing today, which already shuffles ordering? Could the current approach be generalized to handle these new normalizations? If not, I think it'd be nice to at least support the case where the lower term and the higher term share a common prefix of 2 chars or more, so that if you search for terms between `abcd` and `abxy` we could use documents that match `\0ab` as an approximation.
> How are we handling lowercasing today, which already shuffles ordering?
We have 2 strategies for accelerating range queries:
1) Long prefix, e.g. `c:/LongPrefix/a.txt` TO `c:/LongPrefix/z.txt`.
Here we make an AND of all the leading ngrams that the start and end have in common.
2) Short prefix, e.g. `a.txt` TO `z.txt`.
Here we use a TermRangeQuery to get all the values from `\0a*` to `\0z*`.
It's the second strategy that will no longer work, because the normalised values may not lie in a contiguous range.
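For reference, the second strategy amounts to something like the sketch below (illustrative only, assuming Lucene's `TermRangeQuery` and using `\0` for the start-of-value marker the field prepends). With aggressive normalisation the first ngrams of values are no longer ordered the way the raw values are, which is why this breaks:

```java
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermRangeQuery;

class ShortPrefixRangeApproximation {
    // Approximate e.g. a.txt TO z.txt with a term range over the anchored first ngrams.
    static Query approximate(String field, char lowerFirst, char upperFirst) {
        String lower = "\0" + lowerFirst;            // e.g. \0a
        String upper = "\0" + upperFirst + '\uffff'; // everything starting with \0z
        return TermRangeQuery.newStringRange(field, lower, upper, true, true);
    }
}
```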
I identified another block of characters that can be normalized (`: ; < = > ? @`), and adding these brought a small further decrease in index size:
```
total disk: 4,007,497,483
num docs: 18,910,699
stored fields: 870,386,123
term vectors: 0
norms: 0
docvalues: 1,166,321,896
postings: 1,825,648,928
prox: 0
points: 38,815,932
terms: 106,323,263
field total terms dict postings proximity points docvalues % with dv features
===== ===== ========== ======== ========= ========= ========= ======== ========
line 2,954,132,832 59,276 1,825,648,850 0 0 1,128,424,706 100.0% docs binary
_id 106,264,065 106,263,987 78 0 0 0 0.0% docs
_seq_no 37,896,728 0 0 0 0 37,896,728 100.0% 8bytes/1D numeric
_primary_term 231 0 0 0 0 231 100.0% numeric
_version 231 0 0 0 0 231 100.0% numeric
_source 0 0 0 0 0 0 0.0%
```
I think we have a bug due to this in our current implementation of rangeQuery? For instance, `A.txt` TO `a.txt` should match `B.txt`, but we use `\0a*` as an acceleration query, which is wrong?
Regardless of that, I'm +1 on only keeping the 1st strategy to accelerate range queries and removing the 2nd, which is harder to make correct with this change.
Aggressive normalization looks like it yields significant savings without hurting queries too much. Maybe we should move forward with it, and evaluate dropping ngrams based on heuristics on top of it in a separate issue, since dropping seems to have more implications for query times and search-time complexity (e.g. the point about fuzzy queries that you brought up)?
> One thing I'm also considering is that Lucene uses FOR rather than PFOR for doc IDs in postings, so we might be wasting some disk there as well. For sparse postings lists, spending 20 bits per value rather than 18 is not a big deal, but when the choice is between 4 and 6 the trade-off is a bit different as this could be a 33% saving. (Postings for ngrams are typically denser than postings of text fields.)
I wrote a quick and dirty patch to check the possible gain on the dataset we use for testing:
Encoding | Postings size
------------ | -------------
FOR | 2.0GB
PFOR | 1.5GB
FOR + Normalize | 1.7GB
PFOR + Normalize | 1.4GB
PFOR saves 25% on the raw ngram index and 23% on the normalized one (lowercase + character folding) so that seems like a great follow up if we want to optimize further.