For a current project, we have an index containing products with a complex pricing model. Some price components are stored in the documents, while others are influenced by external parameters. Prices are calculated on the fly by a Groovy script that is passed these external parameters from our application. The calculation performed by this script is fairly CPU-intensive and may take several milliseconds per document.
The use case here is "_I have a budget between X and Y. Show me the 10 products that have a price within this range, sorted by price descending, and show me the price buildup (including all price components)_".
Since it doesn't seem to be possible to share information between scripts executed in different query phases, we currently need three Groovy scripts to achieve this: one in a script filter, one in a script-based sort, and one in a scripted field.
These scripts are almost exactly alike, only differing in the type of value they return. This creates a lot of overhead. Let's say we have 10000 documents in our index, 5000 of which are eligible for a given price range and parameter set. That means the scripts will be evaluated:
- 10,000 times by the filter script (script_filter)
- 5,000 times by the sort script
- 10 times by the field script (script_field)

For a grand total of 15,010 full calculations. About a third of these (those in sort and display) should not have been necessary, because the bulk of what they calculate has already been computed by the filter script.
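Roughly, the request looks something like the sketch below (index name, field names and the calculate_price / price_components helpers are simplified stand-ins for illustration, not our actual scripts; the real scripts receive the external pricing parameters via params):

GET products/_search
{
  "size": 10,
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "script": {
          "script": "price = calculate_price(doc, rate, discount); return price >= min_val && price <= max_val",
          "params": { "min_val": 100, "max_val": 200, "rate": 1.21, "discount": 0.95 }
        }
      }
    }
  },
  "sort": {
    "_script": {
      "type": "number",
      "order": "desc",
      "script": "calculate_price(doc, rate, discount)",
      "params": { "rate": 1.21, "discount": 0.95 }
    }
  },
  "script_fields": {
    "price_buildup": {
      "script": "price_components(doc, rate, discount)",
      "params": { "rate": 1.21, "discount": 0.95 }
    }
  }
}

Each of the three scripts ends up recalculating essentially the same price.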
Ideally, the scripted field would be evaluated first and remain available during the rest of the query scope, so that the filtering and sorting scripts could access its calculated results (or one of its properties) and not have to perform the entire calculation again. This would reduce our total script execution time greatly.
Is there, or will there be, any way to achieve this in elasticsearch?
Hi @bsander
Interesting question. It would be nice to be able to cache the output of a script, although we'd need to be able to mark it as cacheable (eg can't contain rand() etc).
However, the problem as you describe it above can be solved today using the function_score query. The script would run once for every document, and use the calculated price for the _score. Any prices outside the desired range (calculated as part of the script) could be excluded with min_score, and the resulting price would be returned in the score field, eg:
DELETE t
POST t/t/_bulk
{"index":{}}
{"num":1}
{"index":{}}
{"num":2}
{"index":{}}
{"num":3}
{"index":{}}
{"num":4}
{"index":{}}
{"num":5}
{"index":{}}
{"num":6}
{"index":{}}
{"num":7}
{"index":{}}
{"num":8}
{"index":{}}
{"num":9}
{"index":{}}
{"num":10}
GET t/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "min_score": 0,
      "boost_mode": "replace",
      "functions": [
        {
          "script_score": {
            "script": "num = doc['num'].value; if (num >= min_val && num <= max_val) { return num }; return -1",
            "params": {
              "min_val": 3,
              "max_val": 5
            }
          }
        }
      ]
    }
  }
}
returns:
"hits": {
"total": 3,
"max_score": 5,
"hits": [
{
"_index": "t",
"_type": "t",
"_id": "AUz2cefd6wt16ozAgRtq",
"_score": 5,
"_source": {
"num": 5
}
},
{
"_index": "t",
"_type": "t",
"_id": "AUz2cefd6wt16ozAgRtp",
"_score": 4,
"_source": {
"num": 4
}
},
{
"_index": "t",
"_type": "t",
"_id": "AUz2cefd6wt16ozAgRto",
"_score": 3,
"_source": {
"num": 3
}
}
]
}
Hi @clintongormley,
Thanks for your thorough response. We've been experimenting with your approach over the last week, and it does indeed reduce the time spent calculating prices by quite a margin. It does, however, complicate other interactions with the content, for instance when sorting by price _ascending_ or sorting by an entirely different field. Where we would normally just change the sort directive, we now need to change our query in several places to support sorting by price. Also, queries where price isn't necessarily relevant become much slower than they used to with this method, so we need to watch out for that too.
So, for the sake of maintainability, it would still be much better for us to be able to use cached values from previously executed scripts instead.
> sorting by price ascending
just add:
"sort": { "_score": {"order": "asc" }}
> or sorting by an entirely different field
Just do:
"sort": [
{ "some_field": { "order": "asc" }},
{ "_score": { "order": "asc" }}
]
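For concreteness, the two can be combined into a single request against the toy index above (a sketch reusing the same script and params as before, just with an explicit ascending sort added):

GET t/_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "min_score": 0,
      "boost_mode": "replace",
      "functions": [
        {
          "script_score": {
            "script": "num = doc['num'].value; if (num >= min_val && num <= max_val) { return num }; return -1",
            "params": { "min_val": 3, "max_val": 5 }
          }
        }
      ]
    }
  },
  "sort": [
    { "_score": { "order": "asc" }}
  ]
}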
> Also, queries where price isn't necessarily relevant become much slower than they used to with this method, so we need to watch out for that too.
Yes, you need to send the right query for the right purpose.
> So, for the sake of maintainability, it would still be much better for us to be able to use cached values from previously executed scripts instead.
Elasticsearch has no idea what is happening inside your script. Even if we were to add a _cache flag, we'd need to know how long to cache it for, how many values to cache, etc., all of which makes execution more complex and produces more garbage on the heap that then needs to be collected.
I'll leave this labelled as discuss, but I think it's better to send the right query for the right job.
Closing. As @clintongormley pointed out, this can be improved a lot by just formulating the query differently.
Is this advice still valid? I have opened a discussion topic: https://discuss.elastic.co/t/script-result-caching-reusing/60042
The discussion I opened would need to be updated now, as I dug deeper into it.
However, I just noticed that the document score is not accessible in the post_filter, as per #20131
I have the same case, and I don't know how to optimize it with _score. Also, as @gm42 noted in #20131, the _score is not accessible while post_filter-ing.
Another question: is there a way to first apply a boolean condition in a post_filter, and then apply a second, script-based filter only to the documents that were not already filtered out?
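In other words, something like the sketch below, where ideally the cheap term clause would be evaluated first and the expensive script clause only against the documents that survive it (field names, params and the price formula here are made up for illustration; syntax is the 5.x script format):

GET products/_search
{
  "post_filter": {
    "bool": {
      "filter": [
        { "term": { "in_stock": true }},
        {
          "script": {
            "script": {
              "inline": "double price = doc['base_price'].value * params.rate; return price >= params.min_val && price <= params.max_val",
              "lang": "painless",
              "params": { "rate": 1.21, "min_val": 100, "max_val": 200 }
            }
          }
        }
      ]
    }
  }
}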
As I understand it, there are already three ES users here with a very similar use case, and after a year there is still no solution.
My understanding:
- _cache (across multiple ES requests) is not allowed for parametrized or non-parametrized script functions, as they could be non-deterministic;
- the same script is executed several times within a single request (in aggs, then in post_filter, and afterwards in sort);
- _score is not accessible in a post_filter (#20131), and nobody knows when that might be solved;
- min_score is just a tiny optimization, which will not work in our case (what about max_score, why is it missing?).

After listing the possible optimization variants, I find there is currently no way to write a good ES query that returns filtered, aggregated and sorted results in a reasonable time frame.
@clintongormley @jpountz
Can you please at least suggest some additional optimization practices that we might be missing?
@dmitry the sad truth might be that we need a different technology when it's necessary to perform custom map/reduce at this level. It's a kind of compute-search-and-throw-away scenario, rather than the everything-indexed scenario that ES covers better.
I need to stick with ES for the time being, so I will keep trying to make it fit for my use case - but some tell-tale signs are evident.
My workaround is to simply get all results (without a post_filter on the score) and then filter them out on the application side. I know it's very inefficient and not scalable, but I couldn't find any other approach so far.
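In request terms, that boils down to something like the sketch below on the toy index from earlier (not necessarily the exact query used): the script puts the computed value in _score, nothing is excluded server-side, and the min/max check happens afterwards in the application.

GET t/_search
{
  "size": 10000,
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "boost_mode": "replace",
      "functions": [
        {
          "script_score": {
            "script": "doc['num'].value"
          }
        }
      ]
    }
  }
}

Every hit then comes back with its computed value as _score, and the application discards the ones outside the desired range.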
Just found this issue; it seems to be very similar to one I created a while ago: https://github.com/elastic/elasticsearch/issues/13469