For a current project, we have an index containing products with a complex pricing model. Some price components are stored in the documents, while others are influenced by external parameters. Prices are calculated on the fly by a Groovy script that is passed these external parameters from our application. The calculation performed by this script is fairly CPU-intensive and may take several milliseconds per document.
The use case here is "_I have a budget between X and Y. Show me the 10 products that have a price within this range, sorted by price descending, and show me the price buildup (including all price components)_".
Since it doesn't seem to be possible to share information between scripts executed in different query phases, we currently need three Groovy scripts to achieve this: one in a script filter, one in a script-based sort, and one in a scripted field.
These scripts are almost exactly alike, only differing in the type of value they return. This creates a lot of overhead. Let's say we have 10000 documents in our index, 5000 of which are eligible for a given price range and parameter set. That means the scripts will be evaluated:
- 10,000 times by the filter script (script_filter)
- 5,000 times by the sort script
- 10 times by the field script (script_field)

For a grand total of 15,010 full calculations. About a third of these (those in sort and display) should not have been necessary, because the bulk of what they calculate has already been computed by the filter script.
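Roughly, the request looks something like the sketch below (index name, field names and the calculate_price / price_components helpers are simplified stand-ins for illustration, not our actual scripts; the real scripts receive the external pricing parameters via params):

GET products/_search
{
  "size": 10,
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "script": {
          "script": "price = calculate_price(doc, rate, discount); return price >= min_val && price <= max_val",
          "params": { "min_val": 100, "max_val": 200, "rate": 1.21, "discount": 0.95 }
        }
      }
    }
  },
  "sort": {
    "_script": {
      "type": "number",
      "order": "desc",
      "script": "calculate_price(doc, rate, discount)",
      "params": { "rate": 1.21, "discount": 0.95 }
    }
  },
  "script_fields": {
    "price_buildup": {
      "script": "price_components(doc, rate, discount)",
      "params": { "rate": 1.21, "discount": 0.95 }
    }
  }
}

Each of the three scripts ends up recalculating essentially the same price.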
Ideally, the scripted field would be evaluated first and remain available during the rest of the query scope, so that the filtering and sorting scripts could access its calculated results (or one of its properties) and not have to perform the entire calculation again. This would reduce our total script execution time greatly.
Is there, or will there be, any way to achieve this in elasticsearch?
Hi @bsander
Interesting question. It would be nice to be able to cache the output of a script, although we'd need to be able to mark it as cacheable (eg can't contain rand() etc).
However, the problem as you describe it above can be solved today using the function_score query. The script would run once for every document, and use the calculated price for the _score. Any prices outside the desired range (calculated as part of the script) could be excluded with min_score, and the resulting price would be returned in the score field, eg:
DELETE t
POST t/t/_bulk
{"index":{}}
{"num":1}
{"index":{}}
{"num":2}
{"index":{}}
{"num":3}
{"index":{}}
{"num":4}
{"index":{}}
{"num":5}
{"index":{}}
{"num":6}
{"index":{}}
{"num":7}
{"index":{}}
{"num":8}
{"index":{}}
{"num":9}
{"index":{}}
{"num":10}
GET t/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "min_score": 0,
      "boost_mode": "replace",
      "functions": [
        {
          "script_score": {
            "script": "num = doc['num'].value; if (num >= min_val && num <= max_val) { return num }; return -1",
            "params": {
              "min_val": 3,
              "max_val": 5
            }
          }
        }
      ]
    }
  }
}
returns:
"hits": {
"total": 3,
"max_score": 5,
"hits": [
{
"_index": "t",
"_type": "t",
"_id": "AUz2cefd6wt16ozAgRtq",
"_score": 5,
"_source": {
"num": 5
}
},
{
"_index": "t",
"_type": "t",
"_id": "AUz2cefd6wt16ozAgRtp",
"_score": 4,
"_source": {
"num": 4
}
},
{
"_index": "t",
"_type": "t",
"_id": "AUz2cefd6wt16ozAgRto",
"_score": 3,
"_source": {
"num": 3
}
}
]
}
Hi @clintongormley,
Thanks for your thorough response. We've been experimenting with your approach over the last week, and it does indeed reduce the time spent calculating prices by quite a margin. It does, however, complicate other interactions with the content, for instance when sorting by price _ascending_ or sorting by an entirely different field. Where we would normally just change the sort directive, we now need to change our query in several places to support sorting by price. Also, queries where price isn't necessarily relevant become much slower than they used to with this method, so we need to watch out for that too.
So, for the sake of maintainability, it would still be much better for us to be able to use cached values from previously executed scripts instead.
> sorting by price ascending
just add:
"sort": { "_score": {"order": "asc" }}
> or sorting by an entirely different field
Just do:
"sort": [
{ "some_field": { "order": "asc" }},
{ "_score": { "order": "asc" }}
]
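For concreteness, the two can be combined into a single request against the toy index above (a sketch reusing the same script and params as before, just with an explicit ascending sort added):

GET t/_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "min_score": 0,
      "boost_mode": "replace",
      "functions": [
        {
          "script_score": {
            "script": "num = doc['num'].value; if (num >= min_val && num <= max_val) { return num }; return -1",
            "params": { "min_val": 3, "max_val": 5 }
          }
        }
      ]
    }
  },
  "sort": [
    { "_score": { "order": "asc" }}
  ]
}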
> Also, queries where price isn't necessarily relevant become much slower than they used to with this method, so we need to watch out for that too.
Yes, you need to send the right query for the right purpose.
> So, for the sake of maintainability, it would still be much better for us to be able to use cached values from previously executed scripts instead.
Elasticsearch has no idea what is happening inside your script. Even if we were to add a _cache flag, we'd need to know how long to cache it for, how many values to cache, etc., all of which makes execution more complex and produces more garbage on the heap that then needs to be collected.
I'll leave this labelled as discuss, but I think it's better to send the right query for the right job.
Closing. As @clintongormley pointed out, this can be improved a lot by just formulating the query differently.
Is this advice still valid? I have opened a discussion topic: https://discuss.elastic.co/t/script-result-caching-reusing/60042
The discussion I opened would need to be updated now, as I dug deeper into it.
However, I just noticed that the document score is not accessible in the post_filter, as per #20131
I have the same case, and I don't know how to optimize it with _score. Also, as @gm42 noted in #20131, the _score is not accessible while post_filter-ing.
Another question: is there a way to first apply a boolean condition in a post_filter, and then apply a second, script-based filter only to the documents that were not already filtered out?
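In other words, something like the sketch below, where ideally the cheap term clause would be evaluated first and the expensive script clause only against the documents that survive it (field names, params and the price formula here are made up for illustration; syntax is the 5.x script format):

GET products/_search
{
  "post_filter": {
    "bool": {
      "filter": [
        { "term": { "in_stock": true }},
        {
          "script": {
            "script": {
              "inline": "double price = doc['base_price'].value * params.rate; return price >= params.min_val && price <= params.max_val",
              "lang": "painless",
              "params": { "rate": 1.21, "min_val": 100, "max_val": 200 }
            }
          }
        }
      ]
    }
  }
}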
As I understand it, there are already three ES users here with a very similar use case, and after a year there is still no solution.
My understanding:
- _cache (across multiple ES requests) is not allowed for parametrized or non-parametrized script functions, as they could be non-deterministic;
- the same script is executed several times within a single request (in aggs, then in post_filter, and afterwards in sort);
- _score is not accessible in a post_filter (#20131), and nobody knows when that might be solved;
- min_score is just a tiny optimization, which will not work in our case (what about max_score, why is it missing?).

After listing the possible optimization variants, I find there is currently no way to write a good ES query that returns filtered, aggregated and sorted results in a reasonable time frame.
@clintongormley @jpountz
Can you please at least suggest some additional optimization practices that we might be missing?
@dmitry the sad truth might be that we need a different technology when it's necessary to perform custom map/reduce at this level. It's a kind of compute-search-and-throw-away scenario, rather than the everything-indexed scenario that ES covers better.
I need to stick with ES for the time being, so I will keep trying to make it fit for my use case - but some tell-tale signs are evident.
My workaround is to simply get all results (without a post_filter on the score) and then filter them out on the application side. I know it's very inefficient and not scalable, but I couldn't find any other approach so far.
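In request terms, that boils down to something like the sketch below on the toy index from earlier (not necessarily the exact query used): the script puts the computed value in _score, nothing is excluded server-side, and the min/max check happens afterwards in the application.

GET t/_search
{
  "size": 10000,
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "boost_mode": "replace",
      "functions": [
        {
          "script_score": {
            "script": "doc['num'].value"
          }
        }
      ]
    }
  }
}

Every hit then comes back with its computed value as _score, and the application discards the ones outside the desired range.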
Just found this issue; it seems to be very similar to one I created a while ago: https://github.com/elastic/elasticsearch/issues/13469