The ngram index used by the wildcard field indexes _every_ ngram, e.g. the document value `12345` is indexed as `123`, `234` and `345`. At query time, however, we thin out the ngrams used in queries, e.g. we'd drop the middle `234` to avoid having too many query terms.
Given that ngrams are already thinned at query time, the proposal is that we can shift much of this thinning to index time.
This may make some queries slower (less selective ngram queries on short query strings) but would save on disk space.
The question becomes: how many ngrams do we drop, and which ones?
We could drop ngrams containing what we consider to be punctuation, or perhaps those whose hash modulo N is 0.
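For concreteness, the two heuristics could look roughly like the sketch below (illustrative only; `dropByHash` and `dropByPunctuation` are hypothetical names, not existing code, and the key property is that each decision depends only on the ngram's own 3 characters):

```java
// Illustrative index-time drop predicates for a 3-character ngram.
class NgramDropHeuristics {

    // Drop roughly 1 in n ngrams, decided purely by hashing the ngram's content.
    static boolean dropByHash(String ngram, int n) {
        return Math.floorMod(ngram.hashCode(), n) == 0;
    }

    // Drop ngrams containing anything we'd consider punctuation
    // (approximated here as "not a letter or digit").
    static boolean dropByPunctuation(String ngram) {
        return ngram.chars().anyMatch(c -> !Character.isLetterOrDigit(c));
    }
}
```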
In this issue I will report on experiments using some sample data and various approaches.
Pinging @elastic/es-search (:Search/Search)
If we go ahead with the idea of indexing less than 100% of the ngrams in values (to save disk space), then I'd like to drop the existing support for fuzzy queries.
The current implementation relies on the fact that the max fuzzy query edit distance of 2 can be neatly translated into an ngram OR query with a minimum-should-match of numNgrams - 2. This drastically reduces the number of candidate matches we need to verify. This logic does not hold if we drop ngrams from the index, and we would have to revert to a match_all-style ngram query, meaning we would have to decompress and parse every binary document value in the index to verify matches.
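For reference, the approximation described above amounts to something like the following sketch (assuming Lucene's `BooleanQuery`; this is illustrative, not the actual wildcard field code):

```java
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

class FuzzyNgramApproximation {
    // OR together the search string's ngrams and require all but maxEdits of them to match.
    static Query approximate(String field, List<String> ngrams, int maxEdits) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String ngram : ngrams) {
            builder.add(new TermQuery(new Term(field, ngram)), Occur.SHOULD);
        }
        // Only a safe candidate filter if every ngram of every value is indexed;
        // with dropped ngrams this bound no longer holds and real matches could be missed.
        builder.setMinimumNumberShouldMatch(Math.max(0, ngrams.size() - maxEdits));
        return builder.build();
    }
}
```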
Wildcard field values are typically long, and the search string can differ from the value by at most 2 edits. This seems like an unlikely search requirement - people are normally looking for fuzzy matching on short strings.
This does start to break our guarantee that wildcard fields == keyword fields in terms of feature compatibility, though.
The problem with dropping terms at index-time (as opposed to the current search-time optimisations) is that you may make future searches hugely inefficient.
As an example, I experimented with dropping tokens where `token.hashCode() % 2 == 0` - roughly one in every 2 tokens.
The difficulty is that a query like `*.exe*` may be unlucky enough to throw away both the `.ex` and `exe` ngrams due to the way they hash. This is quite likely, given the probability is the same as tossing heads twice in a row. When the coin tosses don't come up in your favour, you have no tokens in the index with which to accelerate the search, and we then have to perform a linear scan of all the compressed binary doc values, which would be very expensive.
There's an unwelcome lack of predictability to searches then - why should `*.exe*` take hours to find 10 results but `*.foo*` take milliseconds? Having this distinction rest on the roll of the dice is not something with a precedent in Elasticsearch.
Unfortunately, a position-independent means of deciding what to index is required. For example, we can't just drop every other token in a stream, because the document value and the search string may not start with the same characters: the doc may have opted to index the tokens `123` and `34_` for the value `1234`, but the user searched for `*234*` and so requires a match on `234`, which is not indexed. Whether each token is indexed or not needs to be verifiable purely by examining the 3 characters it contains, not any surrounding content.
Another deterministic approach that was suggested was to drop tokens that contain any punctuation. This may be more sympathetic to the sorts of searches users generally run, but might not be as effective at reducing disk space.
Whatever algorithm we choose, there is a real danger that it won't be aligned with user searches, causing huge differences in response times compared to searches that are lucky enough to contain indexed terms. Maybe admins could take control of thinning choices and use a custom Analyzer?
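If we went the custom-Analyzer route, the wiring could in principle look something like the sketch below, assuming Lucene's `NGramTokenizer` and an admin-supplied trimming filter (all names here are hypothetical; the wildcard field's real analysis chain is internal and not currently pluggable):

```java
import java.util.function.UnaryOperator;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

// Hypothetical: 3-gram analyzer that applies an admin-chosen thinning filter at index time.
public class ThinnedNGramAnalyzer extends Analyzer {
    private final UnaryOperator<TokenStream> thinningFilter;

    public ThinnedNGramAnalyzer(UnaryOperator<TokenStream> thinningFilter) {
        this.thinningFilter = thinningFilter;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new NGramTokenizer(3, 3); // fixed 3-character ngrams
        return new TokenStreamComponents(source, thinningFilter.apply(source));
    }
}
```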
Either way, part of the appeal of the wildcard field compared to the text field is that users do not have to know what indexing strategies have been employed when writing queries.
Initial results suggest we can save ~35% of disk space by indexing only half the ngrams, but this benefit is paid for with the cost outlined above: searches can take massively longer to execute if they are unlucky enough to use the wrong terms.
This feels like a trade-off we shouldn't make by default.
@markharwood I agree that fuzzy support is unlikely to be important, but I would avoid dropping support for fuzzy queries entirely, as the compatibility between keyword and wildcard is an important feature of the wildcard field. I would rather make fuzzy queries operate purely on doc values.
The hash/modulo dropping approach looks interesting to me, but I wonder how it would compare with other approaches. I have the following ideas in mind, for instance:
Being so data/use-case dependent was why I floated the idea of having user-definable TokenFilters.
Maybe we limit choices to pre-defined token filters that we ship for this trimming purpose.
I got hold of some log data and related queries to measure the effects of index-time token dropping on queries.
Ironically, my tests failed for the query string `*ERROR*`. The hash modulos of this string's ngrams (`err`, `rro`, `ror`) all mark them as tokens to drop, and my code doesn't currently handle this properly. The fix isn't great either, because it would need to fall back to a match_all query that does a brute-force scan of all unique values in the index.
Agreed that this sounds bad. This is why I prefer skipping ngrams based on their actual content rather than based on the result of a hashing function.
I hope we could find a way to improve space efficiency with very minimal assumptions so that we wouldn't need to expose options to tailor the behavior of this field differently for every use-case.
I trialled dropping tokens with spaces in them with my test logs dataset.
This reduced the field size by 18% from 3.4GB to 2.8GB:
==== wildcard field on master
```
total disk: 4,540,272,889
num docs: 18,910,699
stored fields: 870,096,528
term vectors: 0
norms: 0
docvalues: 1,166,315,683
postings: 2,358,532,109
prox: 0
points: 38,816,955
terms: 106,510,273
field total terms dict postings proximity points docvalues % with dv features
===== ===== ========== ======== ========= ========= ========= ======== ========
line 3,487,256,622 306,098 2,358,532,031 0 0 1,128,418,493 100.0% docs binary
_id 106,204,253 106,204,175 78 0 0 0 0.0% docs
_seq_no 37,896,728 0 0 0 0 37,896,728 100.0% 8bytes/1D numeric
_primary_term 231 0 0 0 0 231 100.0% numeric
_version 231 0 0 0 0 231 100.0% numeric
_source 0 0 0 0 0 0 0.0%
```
====== wildcard field, master modified to trim tokens containing spaces
```
total disk: 3,845,563,101
num docs: 18,910,699
stored fields: 870,224,450
term vectors: 0
norms: 0
docvalues: 1,166,288,842
postings: 1,663,539,771
prox: 0
points: 38,811,842
terms: 106,696,855
field total terms dict postings proximity points docvalues % with dv features
===== ===== ========== ======== ========= ========= ========= ======== ========
line 2,792,248,234 275,929 1,663,539,693 0 0 1,128,432,612 100.0% docs binary
_id 106,421,004 106,420,926 78 0 0 0 0.0% docs
_seq_no 37,855,768 0 0 0 0 37,855,768 100.0% 8bytes/1D numeric
_primary_term 231 0 0 0 0 231 100.0% numeric
_version 231 0 0 0 0 231 100.0% numeric
_source 0 0 0 0 0 0 0.0%
```
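For reference, the space-dropping experiment amounts to a token filter along these lines (a sketch rather than the exact patch, assuming Lucene's `FilteringTokenFilter`):

```java
import java.io.IOException;
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Drops any ngram token that contains a space, so it is never indexed.
public class DropSpaceNgramsFilter extends FilteringTokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public DropSpaceNgramsFilter(TokenStream in) {
        super(in);
    }

    @Override
    protected boolean accept() throws IOException {
        for (int i = 0; i < termAtt.length(); i++) {
            if (termAtt.charAt(i) == ' ') {
                return false; // token contains a space -> skip it
            }
        }
        return true;
    }
}
```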
The example queries I had weren't noticeably slower as a result of dropping tokens with spaces.
While the strategy may be useful for my example dataset and queries, a worst-case scenario might be docs that record sequences of data separated by spaces, e.g.
`1, 3, 4, 10, 9, 3, 4, 6....`
In this scenario there would be no tokens indexed other than `10,`.
> … we could look into how much disk space we'd save and how much slower queries would be if we e.g. normalized all even digits to the previous digit
The bulk of the disk space costs look to be tied up in postings. By normalising terms in the way suggested above I expect we would just move the costs around rather than make a significant impact. If `bbc` and `abc` are both normalised to `abc` then that term will have the same number of postings - perhaps just denser and therefore compressing a bit better. The postings are denser than in a typical index already because they are only 3-character terms (the median number of postings is 13 in my test data).
Dropping terms will likely give a bigger disk space reduction than merging terms, but its side effect when misaligned with searches is also bigger: if all search terms are missing (like in my `1, 3, 4` example) the cost is a full scan, whereas if search terms are only ever merged (`bbc` == `abc`) we just have more false-positive matches to filter out.
> By normalising terms in the way suggested above I expect we just move the costs around rather than make a significant impact.
I tried normalisation on my test data set and as suspected, we halved the number of unique terms but maintained roughly the same posting size. Probably not worth pursuing this approach. FYI @jpountz
```
total disk: 4,312,685,895
num docs: 18,910,699
stored fields: 869,976,159
term vectors: 0
norms: 0
docvalues: 1,166,296,514
postings: 2,131,185,546
prox: 0
points: 38,814,908
terms: 106,411,427
field total terms dict postings proximity points docvalues % with dv features
===== ===== ========== ======== ========= ========= ========= ======== ========
line 3,259,723,192 105,632 2,131,185,468 0 0 1,128,432,092 100.0% docs binary
_id 106,305,873 106,305,795 78 0 0 0 0.0% docs
_seq_no 37,863,960 0 0 0 0 37,863,960 100.0% 8bytes/1D numeric
_primary_term 231 0 0 0 0 231 100.0% numeric
_version 231 0 0 0 0 231 100.0% numeric
_source 0 0 0 0 0 0 0.0%
```
It still saves more than 6%, which I don't consider negligible. I suspect we will want to look into combinations of (hopefully low-impact) tricks that help save space in the ngram index.
One thing I'm also considering is that Lucene uses FOR rather than PFOR for doc IDs in postings, so we might be wasting some disk there as well. For sparse postings lists, spending 20 bits per value rather than 18 is not a big deal, but when the choice is between 4 and 6 the trade-off is a bit different as this could be a 33% saving. (Postings for ngrams are typically denser than postings of text fields.)
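To make the 33% figure concrete (assuming Lucene's usual postings block size of 128 doc deltas): a block at 6 bits per value takes 128 × 6 / 8 = 96 bytes, while at 4 bits it takes 128 × 4 / 8 = 64 bytes.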
I trialled an even more aggressive normalizer based on this code:
```java
private int normalize(int codepoint) {
    // Normalize space ! " # $ % & ' ( ) * + , - . chars to /
    if (codepoint >= 32 && codepoint < 48) {
        return 47;
    }
    // Normalize [ \ ] ^ _ ` chars to /
    if (codepoint >= 91 && codepoint <= 96) {
        return 47;
    }
    // Normalize { | } ~ chars to /
    if (codepoint >= 123 && codepoint <= 126) {
        return 47;
    }
    // For the remaining ascii characters, fold even code points to the preceding odd one.
    if (codepoint >= 48 && codepoint <= 128 && codepoint % 2 == 0) {
        // Even ascii chars in the 0-9 a-z range.
        return codepoint - 1;
    } else {
        // Return odd ascii chars and non-ascii chars unchanged.
        return codepoint;
    }
}
```
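For illustration (assuming lowercased ASCII input, since the ngram index already lowercases), this folds e.g. `file_2.exe` to `eike/1/ewe` before ngramming, so similar values end up sharing the same ngrams and the postings lists become denser rather than more numerous.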
This led to some more reductions in disk space - 3.4GB master wildcard to 3GB with the modification.
Search times were not noticeably slower for the test queries provided.
====== wildcard field, master modified to further normalise ascii tokens (most normal punctuation)
```
total disk: 4,020,035,717
num docs: 18,910,699
stored fields: 870,274,144
term vectors: 0
norms: 0
docvalues: 1,166,316,164
postings: 1,838,346,697
prox: 0
points: 38,813,894
terms: 106,283,477
field total terms dict postings proximity points docvalues % with dv features
===== ===== ========== ======== ========= ========= ========= ======== ========
line 2,966,833,153 59,368 1,838,346,619 0 0 1,128,427,166 100.0% docs binary
_id 106,224,187 106,224,109 78 0 0 0 0.0% docs
_seq_no 37,888,536 0 0 0 0 37,888,536 100.0% 8bytes/1D numeric
_primary_term 231 0 0 0 0 231 100.0% numeric
_version 231 0 0 0 0 231 100.0% numeric
_source 0 0 0 0 0 0 0.0%
```
With this new approach of normalizing tokens (rather than dropping them) we can still accelerate fuzzy queries the way we used to.
However, range queries are a casualty: the scrambling of tokens in the index means we can no longer rely on querying a contiguous range of terms, which is how we used to accelerate them. @jpountz shall we drop range query support or revert to a match-all approach (if the allow-expensive-queries flag permits)?
> @jpountz shall we drop range query support or revert to a match-all approach?
Let's not drop it, as compatibility with the keyword field is very important. How are we handling lowercasing today, which already shuffles ordering? Could the current approach be generalized to handle these new normalizations? If not, I think it'd be nice to at least support the case where the lower term and the higher term share a common prefix of 2 chars or more, so that if you search for terms between `abcd` and `abxy` we could use documents that match `\0ab` as an approximation.
> How are we handling lowercasing today, which already shuffles ordering?
We have 2 strategies for accelerating range queries:
1) Long prefix, e.g. `c:/LongPrefix/a.txt` TO `c:/LongPrefix/z.txt`.
Here we make an AND of all the leading ngrams that the start and end have in common.
2) Short prefix, e.g. `a.txt` TO `z.txt`.
Here we use a TermRangeQuery to get all the values from `\0a*` to `\0z*`.
It's the second strategy that will no longer work, because the normalised values may not lie in a contiguous range.
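For reference, the second strategy amounts to something like the sketch below (illustrative only, assuming Lucene's `TermRangeQuery` and using `\0` for the start-of-value marker the field prepends). With aggressive normalisation the first ngrams of values are no longer ordered the way the raw values are, which is why this breaks:

```java
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermRangeQuery;

class ShortPrefixRangeApproximation {
    // Approximate e.g. a.txt TO z.txt with a term range over the anchored first ngrams.
    static Query approximate(String field, char lowerFirst, char upperFirst) {
        String lower = "\0" + lowerFirst;            // e.g. \0a
        String upper = "\0" + upperFirst + '\uffff'; // everything starting with \0z
        return TermRangeQuery.newStringRange(field, lower, upper, true, true);
    }
}
```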
I identified another block of characters that can be normalized (`: ; < = > ? @`), and adding these brought a small further decrease in index size:
```
total disk: 4,007,497,483
num docs: 18,910,699
stored fields: 870,386,123
term vectors: 0
norms: 0
docvalues: 1,166,321,896
postings: 1,825,648,928
prox: 0
points: 38,815,932
terms: 106,323,263
field total terms dict postings proximity points docvalues % with dv features
===== ===== ========== ======== ========= ========= ========= ======== ========
line 2,954,132,832 59,276 1,825,648,850 0 0 1,128,424,706 100.0% docs binary
_id 106,264,065 106,263,987 78 0 0 0 0.0% docs
_seq_no 37,896,728 0 0 0 0 37,896,728 100.0% 8bytes/1D numeric
_primary_term 231 0 0 0 0 231 100.0% numeric
_version 231 0 0 0 0 231 100.0% numeric
_source 0 0 0 0 0 0 0.0%
```
I think we have a bug due to this in our current implementation of rangeQuery? For instance, `A.txt` TO `a.txt` should match `B.txt`, but we use `\0a*` as an acceleration query, which is wrong?
Regardless of that, I'm +1 on only keeping the 1st strategy to accelerate range queries and removing the 2nd, which is harder to make correct with this change.
Aggressive normalization looks like it yields significant savings without hurting queries too much. Maybe we should move forward with it, and evaluate dropping ngrams based on heuristics on top of it in a separate issue, since dropping seems to have more implications for query times and search-time complexity (e.g. the point about fuzzy queries that you brought up)?
> One thing I'm also considering is that Lucene uses FOR rather than PFOR for doc IDs in postings, so we might be wasting some disk there as well. For sparse postings lists, spending 20 bits per value rather than 18 is not a big deal, but when the choice is between 4 and 6 the trade-off is a bit different as this could be a 33% saving. (Postings for ngrams are typically denser than postings of text fields.)
I wrote a quick and dirty patch to check the possible gain on the dataset we use for testing:
Encoding | Postings size
------------ | -------------
FOR | 2.0GB
PFOR | 1.5GB
FOR + Normalize | 1.7GB
PFOR + Normalize | 1.4GB
PFOR saves 25% on the raw ngram index and 23% on the normalized one (lowercase + character folding) so that seems like a great follow up if we want to optimize further.