Elasticsearch version (bin/elasticsearch --version):
Version: 6.3.0, Build: default/tar/424e937/2018-06-11T23:38:03.357887Z, JVM: 1.8.0_102
Plugins installed: []
JVM version (java -version):
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
OS version (uname -a):
Darwin Kernel Version 17.6.0: Tue May 8 15:22:16 PDT 2018; root:xnu-4570.61.1~1/RELEASE_X86_64 x86_64
Description of the problem including expected versus actual behavior:
When a terms aggregation is run on a Painless script, empty strings returned by the script are ignored: the response contains no bucket for them.
Expected behavior: there should be a bucket for empty strings.
Steps to reproduce:
# delete the index
curl -XDELETE localhost:9200/test
# re-create the index
curl -XPUT localhost:9200/test -H "Content-Type: application/json" -d '{
"mappings": {
"test": {
"properties" : {
"string_with_empty_values": {
"type" : "keyword"
}
}
}
}
}'
# insert documents
curl -XPOST localhost:9200/test/test -H "Content-Type: application/json" -d '{
"string_with_empty_values": "Not empty"
}'
curl -XPOST localhost:9200/test/test -H "Content-Type: application/json" -d '{
"string_with_empty_values": ""
}'
curl -XPOST localhost:9200/test/test -H "Content-Type: application/json" -d '{
"string_with_empty_values": null
}'
# aggregate using a script
curl -XPOST localhost:9200/test/_search -H "Content-Type: application/json" -d '{
"size": 0,
"aggregations": {
"string_with_empty_values_terms": {
"terms": {
"script": {
"source": "doc['\''string_with_empty_values'\''].value",
"lang": "painless"
}
}
},
"string_with_empty_values_missing": {
"missing": {
"script": {
"source": "doc['\''string_with_empty_values'\''].value",
"lang": "painless"
}
}
}
}
}'
The response looks like the following (note the absence of a bucket for the empty term):
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"string_with_empty_values_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Not empty",
"doc_count": 1
}
]
},
"string_with_empty_values_missing": {
"doc_count": 1
}
}
}
The same response is returned for the following query:
curl -XPOST localhost:9200/test/_search -H "Content-Type: application/json" -d '{
"size": 0,
"aggregations": {
"string_with_empty_values_terms": {
"terms": {
"script": {
"source": "params._source.string_with_empty_values",
"lang": "painless"
}
}
},
"string_with_empty_values_missing": {
"missing": {
"script": {
"source": "params._source.string_with_empty_values",
"lang": "painless"
}
}
}
}
}'
By contrast, an aggregation on the field directly:
curl -XPOST localhost:9200/test/_search -H "Content-Type: application/json" -d '{
"size": 0,
"aggregations": {
"string_with_empty_values_terms": {
"terms": {
"field": "string_with_empty_values"
}
},
"string_with_empty_values_missing": {
"missing": {
"field": "string_with_empty_values"
}
}
}
}'
returns the correct result:
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"string_with_empty_values_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "",
"doc_count": 1
},
{
"key": "Not empty",
"doc_count": 1
}
]
},
"string_with_empty_values_missing": {
"doc_count": 1
}
}
}
Pinging @elastic/es-search-aggs
@jdconrad Is this intrinsic to how Painless works, e.g. empty strings are treated as null? Or maybe something else?
There's also a case to be made that the regular aggregation shouldn't be making a bucket for empty strings, and _that's_ the bug, not Painless :)
@polyfractal After a brief look at this, there's nothing Painless is doing directly to cause an empty string to be modified to null. For whatever reason, we seem to interpret doc['field'].value and params._source.field as null. @rjernst Do you know where empty strings might be translated to null for this issue?
I think this may have been inadvertently fixed recently by https://github.com/elastic/elasticsearch/pull/34457. The keyword field uses the ordinals-based terms aggregation, while the script-based aggregation uses the strings aggregator. Before that change, if the first doc accessed had an empty string, it would have been skipped.
So, this should be fixed in 6.5.0.
Accidental fixes are the best kind of fixes :) Thanks for looking into this @rjernst and @jdconrad
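To confirm on a given version whether the script-based terms aggregation now returns the empty-string bucket, the search response can be checked for a bucket whose key is "". The following is only a minimal sketch: has_empty_bucket is a hypothetical helper name, not part of Elasticsearch, and it assumes the response shape shown in this report.

```shell
# Hypothetical helper (not part of Elasticsearch): reads a search response
# on stdin and succeeds (exit status 0) if any bucket has an empty-string key.
# Assumes buckets are serialized as "key": "" or "key":"" as in the responses above.
has_empty_bucket() {
  grep -Eq '"key"[[:space:]]*:[[:space:]]*""'
}

# Usage sketch against the reproduction index from this report
# (requires a running node, so it is left commented out):
# curl -s -XPOST localhost:9200/test/_search -H "Content-Type: application/json" -d '{
#   "size": 0,
#   "aggregations": {
#     "string_with_empty_values_terms": {
#       "terms": {
#         "script": {
#           "source": "doc['\''string_with_empty_values'\''].value",
#           "lang": "painless"
#         }
#       }
#     }
#   }
# }' | has_empty_bucket && echo "empty-string bucket present" || echo "empty-string bucket missing"
```

On an affected version such as the 6.3.0 reported here, the commented-out check would be expected to report the bucket as missing for the script-based query, matching the first response in this report.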