Elasticsearch: Caching of terms aggregation with stored script can cause stale results

Created on 23 Sep 2016  Â·  32Comments  Â·  Source: elastic/elasticsearch

Updating a stored script doesn't invalidate terms aggs that are using this script. As a result these aggs can produce stale data. To reproduce, start a single elasticsearch node (5.0.0-beta1 or above) with the following settings:

indices.queries.cache.all_segments: true
script.engine.painless.stored: true

then run the following script:

DELETE test

PUT _scripts/painless/test
{
  "script": "1"
}

PUT test/doc/1?refresh
{
  "foo": 10
}

GET test/doc/_search
{
  "size": 0,
  "aggs": {
    "test": {
      "terms": {
        "script": {
          "id": "test",
          "lang": "painless"
        }
      }
    }
  }
}

PUT _scripts/painless/test
{
  "script": "2"
}

GET test/doc/_search
{
  "size": 0,
  "aggs": {
    "test": {
      "terms": {
        "script": {
          "id": "test",
          "lang": "painless"
        }
      }
    }
  }
}

the expected result of the last command:

...
      "buckets": [
        {
          "key": "2",
          "doc_count": 1
        }
      ]
...

the actual result of the last command:

...
      "buckets": [
        {
          "key": "1",
          "doc_count": 1
        }
      ]
...

You can get the expected result if you run POST test/_cache/clear after before running the aggregation.

Considering that we cannot always guarantee that scripts don't depend on external factors such as time or external data sources, I am not sure if we should cache results of aggregations that are using scripts by default.

:AnalyticAggregations :CorInfrScripting >bug blocker v5.0.0-rc1

All 32 comments

Agreed, we shouldn't cache this at all. There are two caches we need to take into account here, the request cache (which cached the search request in this case) and the query cache (if the script query is used).

We could in IndicesService#canCache(...) introspect the aggregation and query builder and decide to not cache if we encounter a file (the file can still be updated after node startup) or stored script.

Alternatively we could also during query builder rewrite resolve file and stored scripts and replace them with inline scripts. I think this is nicer as it fixes this problem for both request and query cache. Also ScriptService can then become simpler as it no longer has to worry where the content of the script comes from.

Alternatively we could also during query builder rewrite resolve file and stored scripts and replace them with inline scripts.

Yes, but caching of inline scripts only works for scripts that are pure functions, which doesn't necessary true even for painless scripts that have access to java.time.Instant.now() for example.

Right, I didn't think of that. I think we have then no other alternative then completely opting out of caching if scripts or being used in a search request.

_technically_ we can detect that sort of thing with painless.... Should we
disable caching for any request with a script and then slowly drag things
back? Like expressions can be cached, painless without now(). Maybe we
can't ever cache groovy.

Or do we tell users to opt requests out of the cache manually if they have
a script?

On Sat, Sep 24, 2016 at 6:44 AM, Martijn van Groningen <
[email protected]> wrote:

Right, I didn't think of that. I think we have then no other alternative
then completely opting out of caching if scripts or being used in a search
request.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/elastic/elasticsearch/issues/20645#issuecomment-249358333,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AANLog-dx25jvTPVpje3WodSBENWbiNWks5qtP8MgaJpZM4KFPE_
.

The simplest solution might be to disable caching queries with scripts by default, but give users an ability to opt-in for caching if they know that it is acceptable for their use-case. A more sophisticated solution would be to ask script engine service to return some sort of a flag for a script that would indicate if the script is a pure function or not and then we can decide if we want to cache it or not based on that.

I think for now we should prevent caching if scripts are involved here. We even should not do it if users ask for it explicitly. Down the road we should require our scripts to be functions. There should not be access to APIs that are not idempotent. We can still allow access to NOW since we detect that or at least can detect it. Once we can ensure that we can allow caching for certain scripts (painless only?)

Once we can ensure that we can allow caching for certain scripts (painless only?)

I think allowing script engine to identify if a script is a pure function or not is not going to be much more complex than implementing painless-only solution, but it will also benefit plugins like PMML which always generates and executes pure function scripts.

I don't understand why PMML will benefit from a generic solution?
We need a solution for this for 5.0 and I am leaning towards just disabling cache if any script is involved. The only safe exception is expressions. I am curious if @jdconrad or @rjernst has ideas here.

I agree that the most sensible option for 5.0 is to disable cache if any scripts are involved. My comment was about your proposal to re-enable caching for some scripts in the future. My understanding was that you are proposing to enable this only for painless. My proposal was to allow any script engine to let the caching system know if scripts are pure functions or not. I envision it as a method of a script engine that in case of painless will return false if access to now is detected and true otherwise. In case of script engines like PMML it will also return true because all PMML scripts are pure functions by definition.

I think we have different understandings when it comes to PMML I don't see why it needs to be it's own script engine. Can't it just produce scripts using painless iff it really needs a script?

Yes, it doesn't have to be its own script engine (if we could add some ML-specific math functions to painless and if we are doing PMML at all). I am just proposing a clean generic solution that will work not only for painless but also for any script engine that can guarantee that its scripts are pure functions as in the case of the current implementation of PMML.

So, just to make sure I understand, there are two caching issues here -- one is related to queries being cached with scripts that are potentially updated in the background (stored/file), the other is related to non-idempotent whitelisted methods. For the first we need to check to make sure the stored/file script is the same as the one in the cached request, I don't know who the responsibility should belong to for invalidation, but I guess it would be a callback to a listener coming out of the ScriptService every time a new stored/file script is cached. For the second one, I need to go through the whitelist for Painless and make sure that it is only the now methods of Instant that aren't idempotent. Currently, 'now' is already broken since the script will be calling it at slightly different times for each document. @rjernst can probably better explain a possible solution at least for this last part.

Edit: For 5.0, I completely agree with turning off support for caching involving scripts. I think other solutions are longer term and need to be fully thought through.

Second Edit: @imotov pointed out that I don't mean idempotent, but rather a pure function.

your understanding is correct @jdconrad

A thought I had while reading along (for the easiest problems of those above) - should we remove support for now() from painless and only support it as a script parameter? seems like it will give people what they want and be consistent (across shards?)/cacheable?

@bleskes This is the correct way to do this. We can even leave the method in the whitelist and just specialize on that specific method as it seems more intuitive than pulling 'now' out of a params map.

So, like, for painless we detect that someone wants now and if they do then we stick it in the params map? That sounds good, though we'll need to do some work to make sure that this properly dodges the request cache.

I actually like the idea more of having a special variable, instead of calling now(), as it makes it clear the value will not change within the request (for example, using in two different clauses within a script).

The only caveat is this will require some distributed changes, as we need to create this timestamp on coordinating node, and send it along with shard request, and then allow scripting engines to pull it in when a script uses now.

One upside of a parameter is that it can be set on the coordinating node which means now is the same across shard. Not sure if we can do that with painless magic. Another upside is that it will be available in expressions as well.

On 3 Oct 2016, at 6:42 PM, Nik Everett [email protected] wrote:

So, like, for painless we detect that someone wants now and if they do then we stick it in the params map? That sounds good, though we'll need to do some work to make sure that this properly dodges the request cache.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@rjernst Fair enough about the special variable. Do you care if it's a variable of a method?

I think a variable implies more that the value will not change between uses, vs a method, especially one called now() might be confusing. Perhaps we could at least change to a variable to start, with painless setting it equal to a now() call during parameter setup (so it would not be per document, but once when binding the script for a shard), and then follow up to make this value in a future change consistent across shards.

Yeah, sounds good.

Another question I had is the following - it seems to me that we should resolve stored scripts on the coordinating node as well. If that's the case, and we took care of the now(), is there anything else that can render a script not cachable? I think we're good there?

it seems to me that we should resolve stored scripts on the coordinating node as well

Not sure about this; it would mean passing along the source of the stored script to shards, which effectively becomes an inline script. I think that would be hard to manage, give the separate security settings for disabling inline scripts vs stored scripts, and there would be no way for shards to understand the difference (that could not also be used by someone trying to circumvent the disabling of inline scripts).

is there anything else that can render a script not cachable?

Anything that is not idempotent, but I don't know of anything else at this time.

I think the security aspect is the same, regardless if it's done on shard or the coordinating node?

Re transfer of the whole script - I agree it's a shame if the script is large. On the other hand it also provides consistency. We have no guarantee that all shards will have the same script if it is concurrently changed. Seems like all in all it is the easiest to do?

On 3 Oct 2016, at 7:56 PM, Ryan Ernst [email protected] wrote:

it seems to me that we should resolve stored scripts on the coordinating node as well

Not sure about this; it would mean passing along the source of the stored script to shards, which effectively becomes an inline script. I think that would be hard to manage, give the separate security settings for disabling inline scripts vs stored scripts, and there would be no way for shards to understand the difference (that could not also be used by someone trying to circumvent the disabling of inline scripts).

is there anything else that can render a script not cachable?

Anything that is not idempotent, but I don't know of anything else at this time.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

I think the security aspect is the same, regardless if it's done on shard or the coordinating node?

I don't think so? Inline scripting could be disabled, with a block on putting stored scripts (for example, putting them into cluster state while "closed off" from the outside world, and then restarting with a block), so you effectively have read only stored scripts.

I spoke with @rjernst and it seems that the concern with enforcing the script setting on the coordinating node is that someone will be able to bypass it by making/faking a shard level transport request directly to the data nodes. While that will indeed work it falls into a much bigger list of evil things people could do if the use direct transport messages - these range from deleting data to replacing any stored script with another. I don't think it should influence our decision here.

Security absolutely should influence our decision here. Just because something else is already insecure doesn't mean we should feel free to add more insecurities. We should be working toward making the other things more secure.

Spoke with @bleskes just now. I better understand his arguments, but need some more time to process. We'll hop on zoom tomorrow to discuss a solution.

whatever we decide here, I think this must be fixed for 5.0 GA so if anybody is working on this please assign the issue. I will pick it up tomorrow if nobody did so

I think this must be fixed for 5.0 GA

++ . I think we are all in alignment on the fix for 5.0 (disable caching for scripts). The discussion is about future work.

disable caching for scripts

++

closing this since it's fixed by #20750

Was this page helpful?
0 / 5 - 0 ratings