Elasticsearch: NullPointerException in Scripted metric aggregation reduce script when used within terms and filter aggs

Created on 2 Jun 2017 · 2 Comments · Source: elastic/elasticsearch

This bug was first raised in this forum topic: https://discuss.elastic.co/t/elasticsearch-5-3-scripted-metric-reduce-script-running-twice/87784

To reproduce this, start a fresh cluster and run the following (note that this script will delete any existing index named test):



GET test/_search
{
  "size": 0,
  "aggs": {
    "a": {
      "terms": {
        "field": "id",
        "size": 10
      },
      "aggs": {
        "measure" : {
          "filter" : {
            "bool" : {
              "must" : [
                {
                  "terms" : {
                    "foo.bar" : [
                      "foo"
                    ],
                    "boost" : 1.0
                  }
                },
                {
                  "terms" : {
                    "foo.baz" : [
                      "bar"
                    ],
                    "boost" : 1.0
                  }
                }
              ]
            }
          },
          "aggregations" : {
            "profit": {
              "scripted_metric": {
                "init_script" : "params._agg.transactions = []",
                "map_script" : "params._agg.transactions.add(1)",
                "combine_script" : "double profit = 0; for (t in params._agg.transactions) { profit += t } return profit",
                "reduce_script" : "double profit = 0; for (a in params._aggs) { profit += a } return profit"
              }
            }
          }
        }
      }
    }
  }
}

The search request will fail with:

{
  "error": {
    "root_cause": [],
    "type": "search_phase_execution_exception",
    "reason": "",
    "phase": "fetch",
    "grouped": true,
    "failed_shards": [],
    "caused_by": {
      "type": "script_exception",
      "reason": "runtime error",
      "script_stack": [
        "profit += a } ",
        "^---- HERE"
      ],
      "script": "double profit = 0; for (a in params._aggs) { profit += a } return profit",
      "lang": "painless",
      "caused_by": {
        "type": "null_pointer_exception",
        "reason": null
      }
    }
  },
  "status": 503
}

This is odd, because profit cannot be null: it is a declared local variable in the script.

If the reduce script is replaced with:

"reduce_script" : "Debug.explain(params._aggs)"

The error becomes:

{
  "error": {
    "root_cause": [],
    "type": "search_phase_execution_exception",
    "reason": "",
    "phase": "fetch",
    "grouped": true,
    "failed_shards": [],
    "caused_by": {
      "type": "script_exception",
      "reason": "runtime error",
      "painless_class": "ArrayList",
      "to_string": "[2.0]",
      "java_class": "java.util.ArrayList",
      "script_stack": [
        "Debug.explain(params._aggs)",
        "                    ^---- HERE"
      ],
      "script": "Debug.explain(params._aggs)",
      "lang": "painless",
      "caused_by": {
        "type": "painless_explain_error",
        "reason": null
      }
    }
  },
  "status": 503
}

This is also strange, because it shows that params._aggs has a value.

I have tried to reproduce this with other combinations of the above documents, but so far I have only managed to reproduce it with these 10 documents, so I have yet to determine the exact criteria for reproducing this bug.

:Analytics/Aggregations :Core/Infra/Scripting >docs help wanted

Most helpful comment

After digging some more, I have realised that this is actually not a bug, but we should clarify the behaviour in the documentation.

First, the reduce script is run multiple times. Because the scripted_metric aggregation is a sub-aggregation of the terms aggregation, it is evaluated for each of the terms buckets, so the reduce script runs once per bucket. In my recreation script above, the reduce script is run five times because the terms aggregation produces five buckets (one for each of my five id terms).
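The per-bucket behaviour can be sketched in plain Java. This is a simulation, not Elasticsearch code: the class name, the bucket keys a to e, and the per-bucket values are all hypothetical. It only illustrates that one reduce call is made for each terms bucket.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation, not the Elasticsearch implementation.
class PerBucketReduceSketch {

    // Stand-in for the reduce script: sums the values combined on the shards.
    static double reduce(List<Double> aggs) {
        double profit = 0;
        for (Double a : aggs) {
            profit += a;
        }
        return profit;
    }

    // Calls reduce once for every bucket, as happens for a sub-aggregation.
    static Map<String, Double> reduceEachBucket(Map<String, List<Double>> aggsByBucket) {
        Map<String, Double> results = new LinkedHashMap<>();
        for (Map.Entry<String, List<Double>> bucket : aggsByBucket.entrySet()) {
            results.put(bucket.getKey(), reduce(bucket.getValue()));
        }
        return results;
    }

    public static void main(String[] args) {
        // Hypothetical shard results for five id terms.
        Map<String, List<Double>> aggsByBucket = new LinkedHashMap<>();
        for (String id : Arrays.asList("a", "b", "c", "d", "e")) {
            aggsByBucket.put(id, Arrays.asList(2.0));
        }
        Map<String, Double> results = reduceEachBucket(aggsByBucket);
        System.out.println(results.size() + " reduce calls: " + results);
    }
}
```

Each entry in the result map corresponds to one invocation of the reduce step, so five buckets mean five invocations.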

Now to why you get an NPE in the first reduce script:

 "double profit = 0; for (a in params._aggs) { profit += a } return profit"

The reason is that there are terms buckets in which none of the documents match the filter aggregation. In my example above, this happens because of the document:

PUT test/doc/10
{
  "id": "e",
  "foo.bar": "fooo",
  "foo.baz": "barr"
}

This document is the only one that contains the term e in the id field, and it does not match the filter aggregation. As a result, an empty aggregation response is created for the aggregation in that bucket when it is returned from the shards, and the reduce script neither expects nor handles it. If you replace the reduce script with the following, it executes correctly:

"double profit = 0; for (a in params._aggs) { if (a != null) { profit += a } } return profit"

At the moment, the documentation for the scripted_metric aggregation does not mention the empty aggregation response, so I think we should clarify it there.
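The difference between the two reduce scripts can be reproduced in plain Java, which the Painless syntax closely resembles. This is a sketch under one assumption: that the empty aggregation response appears to the script as a null entry in params._aggs, as the null check above suggests. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the two reduce scripts; not Elasticsearch code.
class ReduceSketch {

    // Mirrors: double profit = 0; for (a in params._aggs) { profit += a } return profit
    static double naiveReduce(List<Double> aggs) {
        double profit = 0;
        for (Double a : aggs) {
            profit += a; // unboxing a null Double throws NullPointerException
        }
        return profit;
    }

    // Mirrors the null-guarded script: null entries are skipped.
    static double nullSafeReduce(List<Double> aggs) {
        double profit = 0;
        for (Double a : aggs) {
            if (a != null) {
                profit += a;
            }
        }
        return profit;
    }

    public static void main(String[] args) {
        // Assumed contents of params._aggs for a bucket with no matching documents.
        List<Double> emptyBucket = Arrays.asList((Double) null);

        System.out.println(nullSafeReduce(emptyBucket));
        try {
            naiveReduce(emptyBucket);
        } catch (NullPointerException e) {
            System.out.println("naive reduce failed on the null entry");
        }
    }
}
```

In this sketch the guarded version returns 0.0 for a bucket whose only entry is null. Whether 0 is the right result for a bucket with no matching documents depends on what the metric is meant to report.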

Another slightly confusing thing is that the error output seemed to point to profit being null rather than a:

      "script_stack": [
        "profit += a } ",
        "^---- HERE"
      ],

I can sort of see why this happens: the error is not raised simply because a is null, but when the operation is applied to profit (i.e. when we try to increment profit by null). I'm not sure whether we should change this. /cc @jdconrad

For the second reduce script, which uses Debug.explain, the value is not shown as null because Debug.explain works by throwing an exception that reports the value of the variable. The exception is thrown for the first bucket evaluated, which does not contain the empty aggregation.
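A small plain-Java simulation shows why an explain helper that reports by throwing only ever reveals the first bucket. The names ExplainSketch, ExplainError and firstReported are hypothetical and are not part of Painless or Elasticsearch. The bucket order, and the null entry for the empty bucket, are assumptions.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of an explain helper that reports a value by throwing.
class ExplainSketch {

    // Hypothetical stand-in for the exception that carries the explained value.
    static class ExplainError extends RuntimeException {
        ExplainError(Object value) {
            super(String.valueOf(value));
        }
    }

    // Always throws, so nothing after the first call ever runs.
    static void explain(Object value) {
        throw new ExplainError(value);
    }

    // Runs the "reduce script" for each bucket in order and returns what the
    // caller would see: the value reported by the first invocation.
    static String firstReported(List<List<Double>> aggsPerBucket) {
        try {
            for (List<Double> aggs : aggsPerBucket) {
                explain(aggs);
            }
        } catch (ExplainError e) {
            return e.getMessage();
        }
        return "nothing explained";
    }

    public static void main(String[] args) {
        List<List<Double>> aggsPerBucket = Arrays.asList(
                Arrays.asList(2.0),            // bucket with matching documents
                Arrays.asList((Double) null)); // bucket with the empty response
        System.out.println(firstReported(aggsPerBucket));
    }
}
```

Because the first call aborts the loop, the bucket holding the null entry is never inspected, which matches the [2.0] seen in the error output above.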
