Elasticsearch: Shard request cache and script queries/aggregations

Created on 19 Nov 2019 · 8 Comments · Source: elastic/elasticsearch

Support for caching queries including scripts:

Although the documentation for the shard request cache currently says:

If your query uses a script whose result is not deterministic (e.g. it uses a random function or references the current time) you should set the request_cache flag to false to disable caching for that request

In practice, the cache is skipped whenever ScriptService is used.

This is intentional, per @jimczi:

this is intentional ... we cannot ensure that the result is deterministic.

An alternative (which, per the docs, seems consistent with how some other scenarios are handled) would be to skip the cache by default in such cases, but to let clients pass the existing request_cache=true param to declare that their script is deterministic and can be cached.

Note that scripted aggregations are often very expensive and therefore great candidates to be cached!
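
For illustration, the proposed opt-in would amount to a request like the one built below. This is only a sketch: the index name `sales`, the field `price`, the aggregation name, and the helper `build_cached_search` are hypothetical, and whether the flag is honoured for a scripted request depends on the Elasticsearch version. The request uses `size=0` because the shard request cache by default only caches requests that return no hits.

```python
import json
from urllib.parse import urlencode


def build_cached_search(index, body, request_cache=True):
    """Return (url_path, json_body) for a search with an explicit request_cache flag.

    Hypothetical helper: it only builds the request and does not send it.
    """
    # request_cache is the existing per-request query-string parameter named in the issue.
    query = urlencode({"size": 0, "request_cache": str(request_cache).lower()})
    return f"/{index}/_search?{query}", json.dumps(body)


# Hypothetical scripted aggregation. The script is deterministic:
# it reads a doc value and a constant param, with no random or time calls.
body = {
    "aggs": {
        "total_with_tax": {
            "sum": {
                "script": {
                    "lang": "painless",
                    "source": "doc['price'].value * params.rate",
                    "params": {"rate": 1.2},
                }
            }
        }
    }
}

path, payload = build_cached_search("sales", body)
print(path)  # /sales/_search?size=0&request_cache=true
```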

:Analytics/Aggregations :Search/Search >enhancement

AlexP-Elastic

👍1

Most helpful comment

Fixed by the following changes:

Scripting: Groundwork for caching script results https://github.com/elastic/elasticsearch/pull/49895 (backport)
Scripting: Cache script results if deterministic https://github.com/elastic/elasticsearch/pull/50106 (backport)
[TEST] Unknown scripting annotations raise error https://github.com/elastic/elasticsearch/pull/50343 (backport)
Scripting: ScriptFactory not required by compile https://github.com/elastic/elasticsearch/pull/50344 (backport)
[DOCS] Deterministic scripted queries are cached https://github.com/elastic/elasticsearch/pull/50408 (backport)

stu-elastic on 20 Dec 2019

❤2

All 8 comments

Pinging @elastic/es-search (:Search/Search)

elasticmachine on 19 Nov 2019

I see scripted queries/aggs as a way to trade performance for flexibility, as they let you do things that had not been planned at index time. Since these are already trading performance for something else, it doesn't feel right to me to now trade correctness for performance by enabling caching when the user declares it is safe.

Maybe tell us more about your usage of scripts? I wonder whether you might be using scripts as a workaround for a missing aggregation feature?

jpountz on 19 Nov 2019

👍1

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

elasticmachine on 19 Nov 2019

I'm personally a heavy script user and my usage patterns certainly shouldn't be taken as representative :) but for the purposes of discussion, my uses of scripts include:

- Formatting and transforming fields in Kibana using the script field functionality
  - _(unclear to what extent cache is needed for this scenario ... eg if I create a visualization and share the link, a cache is one way of handling the resulting spike? Is the shard request the right cache for that?)_
- Similarly, I use a spreadsheet connector (https://github.com/Alex-At-Home/elasticsearch-sheets) which lets (/encourages!) you create script fields and scripts for queries and aggregations (and build quite complex transforms between the source data and the spreadsheet's cell range using the scripted_metric aggregation)
  - _(obviously a random app I built isn't evidence of any requirement though! The case for caching would be similar to the Kibana one, ie sharing a link with lots of people)_
- An aggregation I use somewhat commonly involves having a fairly frequently changing (or user entered) table of weights, and then using that lookup table to weight the results of a terms aggregation
  - _(this is actually the thing whose performance I was experimenting with when I came upon the out-of-date documentation and started asking around)_

So it could be summarized as a mix of "missing aggregation features", (related) "trading off performance to provide (query-time) flexibility", and to a lesser extent "trading off performance to keep all logic in one place".

In all cases I'm not so much trading off "correctness for performance" with the cache; I'm trading off memory for performance (based on the knowledge/expectation that there will be a large number of queries with the same results in a given time period).

AlexP-Elastic on 19 Nov 2019

Thanks for the detailed reply!

eg if I create a visualization and share the link, a cache is one way of handling the resulting spike? Is the shard request the right cache for that?

This is exactly the reason why we have this cache. :)

a fairly frequently changing (or user entered) table of weights, and then using that lookup table to weight the results of a terms aggregation

That one sounds interesting to me. Do I understand correctly that instead of sorting terms by doc_count descending, you want to sort them by descending weight? Or maybe even descending weight*doc_count? Can you tell me more about the higher-level use-case, is it something like a rollup?

To be clear I'm not against caching scripted queries or aggs, but I'm worried about allowing users to cache data that is not cacheable. My preferred way of fixing this would be by enabling Painless to tell us when a script is deterministic or not, so that we could make caching decisions accordingly. @jdconrad @stu-elastic Do you think it'd be doable?

jpountz on 19 Nov 2019

An aggregation I use somewhat commonly involves having a fairly frequently changing (or user entered) table of weights, and then using that lookup table to weight the results of a terms aggregation

This caught my eye as well, would love to know more. We've talked about making bucket_sort scriptable, which would allow a lot more custom sorting of agg buckets. I realize that's still using a script, but being a post-processing step it'd also be a lot faster since it would only invoke the script once.

(although it would have different semantics since it's only sorting the final list of buckets, instead of all the buckets at runtime).

polyfractal on 19 Nov 2019
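
To make the "post-processing" semantics concrete, here is a minimal client-side sketch of re-sorting the final list of terms buckets by weight * doc_count, one of the orderings asked about above. It is not the scriptable bucket_sort being discussed: the function name `rerank_buckets`, the sample buckets, and the weights table are all hypothetical. Like any post-processing step, it only sees the buckets that were returned, so a term that missed the top-N cut cannot be promoted.

```python
def rerank_buckets(buckets, weights, default_weight=1.0):
    """Sort terms-aggregation buckets by weight * doc_count, descending.

    `buckets` is the final bucket list from a terms aggregation response;
    `weights` is a user-supplied lookup table keyed by bucket key.
    """
    def score(bucket):
        # Keys missing from the lookup table fall back to default_weight.
        return weights.get(bucket["key"], default_weight) * bucket["doc_count"]

    return sorted(buckets, key=score, reverse=True)


# Hypothetical response buckets and weights table.
buckets = [
    {"key": "a", "doc_count": 100},
    {"key": "b", "doc_count": 40},
    {"key": "c", "doc_count": 10},
]
weights = {"a": 0.1, "b": 1.0, "c": 5.0}

ranked = rerank_buckets(buckets, weights)
print([b["key"] for b in ranked])  # ['c', 'b', 'a']
```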

So, I think we could have Painless report which scripts are deterministic, but I don't think it would be all that useful unless we can safely assume that results depending on docvalues (or _source), or whatever else the user is accessing, would be flushed from the cache upon changes. And if anything is done through user-defined params (are the weights passed this way, or is a new script created every time with constants?) then it's also not deterministic, as we explicitly expect those to change throughout a script's life.

The other thing is right now Painless isn't really aware of something like doc and just views this input as a simple Map. We would need to specialize certain inputs to be known as deterministic.

Edit: After thinking about this I realized that all those values are deterministic, because otherwise the cache wouldn't work. (Oops.) I think Painless has only one non-deterministic method right now, randomUUID.
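
The idea of letting the scripting layer report determinism can be modelled with a toy sketch like the one below. It is not how Painless or Elasticsearch implements this: the `WHITELIST` table, the `CompiledScript` class, and `should_cache` are hypothetical, and only randomUUID is taken from the comment above as an example of a non-deterministic method. A script counts as deterministic when every method it calls is flagged deterministic, and a request is cacheable only if all of its scripts are.

```python
from dataclasses import dataclass

# Toy whitelist: method name -> is it deterministic?
# Only randomUUID comes from the thread; the other entries are illustrative.
WHITELIST = {
    "randomUUID": False,
    "Math.abs": True,
    "String.length": True,
}


@dataclass(frozen=True)
class CompiledScript:
    source: str
    calls: tuple  # names of whitelisted methods the script invokes

    @property
    def deterministic(self):
        # A method missing from the whitelist raises KeyError, i.e. it is rejected.
        return all(WHITELIST[name] for name in self.calls)


def should_cache(scripts):
    """A request is cacheable only if every script it uses is deterministic."""
    return all(script.deterministic for script in scripts)


weighted = CompiledScript("Math.abs(doc['price'].value) * params.rate", ("Math.abs",))
tagged = CompiledScript("UUID.randomUUID().toString()", ("randomUUID",))

print(should_cache([weighted]))          # True
print(should_cache([weighted, tagged]))  # False
```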