Elasticsearch: Percolator combining must_not and nested objects returns false positives (6.4..6.8)

Created on 22 May 2019 · 5 Comments · Source: elastic/elasticsearch

Elasticsearch version (bin/elasticsearch --version): 6.4.2, 6.5.0, 6.6.0, 6.7.0, 6.8.0

Plugins installed: []

JVM version : java version "10.0.2" 2018-07-17

OS version : Windows 10

Description of the problem including expected versus actual behavior:
A specific percolator query returns false matches when a must_not clause is combined with nested objects in the percolated document, even though the percolator query itself does not reference the nested field. This happens only in versions 6.4 through 6.8; it does not happen in 6.3 or 7.0.
I am not sure which specific condition triggers the unexpected match, but it seems related to must_not and nested documents. A full reproduction follows.

Steps to reproduce:
The tests use an index named test1, which must not exist beforehand.


Create the mapping with the fields sometext and somenested, plus the query field of type percolator. The nested field contains a keyword field, somekeyword.

PUT /test1
{
  "mappings": {
    "percolatorquery": {
      "properties": {
        "sometext": {
          "type": "text"
        },
        "somenested": {
          "type": "nested",
          "include_in_parent": true,
          "properties": {
            "somekeyword": {
              "type": "keyword"
            }
          }
        },
        "query": {
          "type": "percolator"
        }
      }
    }
  }
} 


Create a percolator query stating that the field sometext must not exist.

PUT /test1/percolatorquery/1?refresh
{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "sometext",
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  }
} 


Run a percolate query for a document that does contain that field. The stored query should not match, but it is returned as a match. I think it is significant that

        "fields" : {
          "_percolator_document_slot" : [
            -1
          ]
        }
 ```
is returned even though the percolator query is reported as a match. The request follows:

GET test1/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "documents": [
        {
          "sometext": "test",
          "somenested": {
            "somekeyword": "test"
          }
        }
      ],
      "boost": 1
    }
  }
}

And the response: 

{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test1",
        "_type" : "percolatorquery",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "query" : {
            "bool" : {
              "must_not" : [
                {
                  "exists" : {
                    "field" : "sometext",
                    "boost" : 1
                  }
                }
              ],
              "adjust_pure_negative" : true,
              "boost" : 1
            }
          }
        },
        "fields" : {
          "_percolator_document_slot" : [
            -1
          ]
        }
      }
    ]
  }
}

------


Repeat the previous query, but this time pass a document that does not contain the nested field. In this case the percolate query works correctly and returns no matches. Note again that the stored percolator query does not reference the nested field at all.

GET test1/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "documents": [
        {
          "sometext": "test"
        }
      ],
      "boost": 1
    }
  }
}

------


Repeat the query for a document that does not contain the field sometext. The stored query should match, and it does. Note that in this case the result contains `"_percolator_document_slot" : [ -1, 0]`, even though there is only a single input document.

GET test1/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "documents": [
        {
          "otherfield": "test",
          "somenested": {
            "somekeyword": "test"
          }
        }
      ],
      "boost": 1
    }
  }
}

If I remove `somenested` from the input document, the result contains `"_percolator_document_slot" : [0]`, so the nested document seems to play a role.

------

The next tests replace `must_not` with `must` in the original percolator query. In this scenario, all of the queries return the expected results.

Change the percolator query to the new one: 

PUT /test1/percolatorquery/1?refresh
{
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "sometext",
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  }
}

------


Test a document containing the field; the query matches as expected: 

GET test1/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "documents": [
        {
          "sometext": "test",
          "somenested": {
            "somekeyword": "test"
          }
        }
      ],
      "boost": 1
    }
  }
}

------


Test a document not containing the field; it correctly does not match:

GET test1/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "documents": [
        {
          "differentfield": "test",
          "somenested": {
            "somekeyword": "test"
          }
        }
      ],
      "boost": 1
    }
  }
}
```
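
The percolate requests above differ only in the candidate document, so the reproduction is easy to script. The following is a minimal Python sketch using only the standard library. The helper names (`percolate_body`, `percolate`) and the `http://localhost:9200` address are assumptions for illustration and are not part of the original report.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local 6.x node is listening here


def percolate_body(document):
    """Build the percolate search body used throughout this report."""
    return {
        "query": {
            "percolate": {
                "field": "query",
                "documents": [document],
                "boost": 1,
            }
        }
    }


def percolate(document, index="test1"):
    """Send the percolate search and return the ids of the matching stored queries."""
    request = urllib.request.Request(
        f"{ES_URL}/{index}/_search",
        data=json.dumps(percolate_body(document)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        hits = json.load(response)["hits"]["hits"]
    return [hit["_id"] for hit in hits]
```

With the must_not query stored, calling `percolate({"sometext": "test", "somenested": {"somekeyword": "test"}})` would, according to this report, return the stored query on 6.4 through 6.8 even though it should return nothing, while the same call without `somenested` returns no matches.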


:Search/Percolator >bug v6.8.1

All 5 comments

Pinging @elastic/es-search

@ismael-hasan @jimczi has identified the problem and proposed a temporary workaround until we fix this issue in the next 6.x release.

The problem is that nested documents, which are separate Lucene documents, are not excluded at query time. A temporary workaround is to modify your percolator query document to add another filter that excludes nested documents. The filter is a must clause with an exists query on the _primary_term field, as only parent documents have a _primary_term field:

PUT test1/percolatorquery/1?refresh
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": {
              "bool": {
                "must": {
                  "exists": {
                    "field": "_primary_term"
                  }
                },
                "must_not": [
                  {
                    "exists": {
                      "field": "sometext",
                      "boost": 1.0
                    }
                  }
                ],
                "adjust_pure_negative": true,
                "boost": 1.0
              }
            },
            "boost": 100.0
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  }
}

Unfortunately the workaround doesn't work: the _primary_term field is not listed in the ES mapping, so the query doesn't return any document even when it should :(. Another workaround would be to replace _primary_term with any field that appears in all root documents and never in nested documents, but there is no such field in the example.
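
The root cause and the idea behind the filter can be illustrated with a toy model. The sketch below is plain Python, not Elasticsearch or Lucene code. `to_lucene_docs`, `matches`, and the `root_marker` field are hypothetical names standing in for "a field that exists only on root documents". It assumes, as described in the comments above, that each nested object is indexed as a separate document without the root-level fields, and it ignores `include_in_parent`.

```python
def to_lucene_docs(source, nested_fields):
    """Split a source document into child docs plus one root doc (toy model)."""
    docs = []
    root = {}
    for key, value in source.items():
        if key in nested_fields:
            children = value if isinstance(value, list) else [value]
            for child in children:
                # Each nested object becomes its own document with prefixed field names.
                docs.append({f"{key}.{name}": v for name, v in child.items()})
        else:
            root[key] = value
    root["root_marker"] = True  # hypothetical field present only on root documents
    docs.append(root)
    return docs


def must_not_exists(doc, field):
    """Pure negative query: matches every document that lacks the field."""
    return field not in doc


def matches(docs, field, require_root=False):
    """Return the docs matched by must_not exists(field), optionally root docs only."""
    return [
        doc for doc in docs
        if must_not_exists(doc, field)
        and (not require_root or "root_marker" in doc)
    ]


source = {"sometext": "test", "somenested": {"somekeyword": "test"}}
docs = to_lucene_docs(source, nested_fields={"somenested"})

unfiltered = matches(docs, "sometext")                   # the nested child doc matches
filtered = matches(docs, "sometext", require_root=True)  # no document matches
```

In this model the unfiltered query matches the nested child document because it has no sometext field, while requiring the root-only marker leaves no match. The model says nothing about which real fields are visible to the percolator, which is the practical obstacle noted above.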

Here's another simple workaround:
the percolator doesn't apply the right logic when the score of the query is needed, so for use cases where the score of documents is not needed it is possible to bypass the bug by providing a different sort order. The following query, for instance, will effectively filter nested documents even in 6.x:

{
  "query": {
    "percolate": {
      "field": "query",
      "documents": [
        {
          "fullName": "test",
          "document": {
            "number": "test"
          }
        }
      ],
      "boost": 1.0
    }
  },
  "sort": "_doc"
}

As Mayya said we'll work on a fix but we wanted to find workarounds for this bug first.
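
If the percolate requests are built programmatically, the sort-based workaround amounts to adding one key to the search body. A minimal Python sketch follows. `with_doc_sort` is a hypothetical helper name, and the example body uses the field names from this issue's reproduction rather than those in the comment above.

```python
import copy


def with_doc_sort(search_body):
    """Return a copy of a search body that sorts by index order instead of score."""
    body = copy.deepcopy(search_body)
    body["sort"] = "_doc"
    return body


original = {
    "query": {
        "percolate": {
            "field": "query",
            "documents": [
                {"sometext": "test", "somenested": {"somekeyword": "test"}}
            ],
        }
    }
}
workaround = with_doc_sort(original)
```

The helper copies the body rather than mutating it, so the same request can still be sent unsorted when scores are needed on an unaffected version.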

This is fixed by #42554, hence closing
