Elasticsearch: using default_operator "AND" causes query_strings actually containing AND to return wrong results

Created on 25 Aug 2015 · 7 Comments · Source: elastic/elasticsearch

If I have a query string like "one AND two OR three", I expect the results to be the same as one of "(one AND two) OR three" or "one AND (two OR three)" but it seems like they are actually the same as "one OR two OR three".

Also, if the default_operator is AND, I would expect "one two OR three" to work like either "(one AND two) OR three" or "one AND (two OR three)", but instead it works like "one OR two OR three".

Here's an example:

# create a test index
curl -XPUT 'localhost:9200/test'

# index a doc with a field containing the text "one"
curl -XPUT 'localhost:9200/test/mytype/1' -d '
{
    "text": "one"
}'

# query the index with no default_operator, for "one AND two OR three" (0 results as expected)
curl -XGET 'localhost:9200/test/mytype/_search?pretty' -d '
{
    "query": {
        "query_string": {
            "default_field": "_all", 
            "query": "text:(one AND two OR three)"
        }
    }
}'

# query with default operator, "(one AND two) OR three" (0 results as expected)
# "one AND (two OR three)" also gives the expected 0 results
curl -XGET 'localhost:9200/test/mytype/_search?pretty' -d '
{
    "query": {
        "query_string": {
            "default_field": "_all", 
            "default_operator": "AND", 
            "query": "text:((one AND two) OR three)"
        }
    }
}'

# query "one AND two OR three" now with default operator, returns one result but I expect 0
curl -XGET 'localhost:9200/test/mytype/_search?pretty' -d '
{
    "query": {
        "query_string": {
            "default_field": "_all", 
            "default_operator": "AND", 
            "query": "text:(one AND two OR three)"
        }
    }
}'

Using elasticsearch-1.7.1.

All 7 comments

This may help:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html?q=query_string#_boolean_operators

Especially, "While the + and - only affect the term to the right of the operator, AND and OR can affect the terms to the left and right."

By the way, there are several problems with your specific queries:

  1. Since you are prefixing your query with text: already, is there a need for default_field?
  2. Since you're using AND and OR in the query, what's the point of the default_operator?

You're probably better off using the +/- syntax or the match query DSL, though.

The problem is that the query string does not use pure boolean logic. It is intended to be used as a query, not as a filter. Queries have required matches (must) and optional matches (should) which are not required but improve the score if they are present.
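To make the must/should distinction concrete, here is a minimal Python sketch of how a flat list of required, optional, and prohibited clauses decides whether a document matches. It is a simplified model, not Elasticsearch code: the `matches` helper and the flag names are illustrative, scoring is ignored, and it assumes the usual rule that optional clauses only become mandatory when no required clause is present.

```python
def matches(doc_terms, clauses):
    """Return True if a document (a set of terms) satisfies the clauses.

    clauses is a list of (term, flag) pairs, where flag is "MUST",
    "SHOULD" or "MUST_NOT". Hypothetical helper, for illustration only.
    """
    must = [t for t, f in clauses if f == "MUST"]
    should = [t for t, f in clauses if f == "SHOULD"]
    must_not = [t for t, f in clauses if f == "MUST_NOT"]

    # A prohibited term that is present rules the document out.
    if any(t in doc_terms for t in must_not):
        return False
    # Every required term has to be present.
    if not all(t in doc_terms for t in must):
        return False
    # With no required clause, at least one optional clause has to match.
    if not must and should:
        return any(t in doc_terms for t in should)
    # Otherwise optional clauses would only affect the score.
    return True


doc = {"one"}  # the single document indexed in the issue
print(matches(doc, [("one", "MUST"), ("two", "MUST"), ("three", "SHOULD")]))
print(matches(doc, [("one", "MUST"), ("two", "SHOULD"), ("three", "SHOULD")]))
```

The first call models a query where both "one" and "two" are required, so the document is rejected; in the second only "one" is required, so the document matches even though "two" and "three" are absent.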

It's worth reading this blog post to understand more: https://lucidworks.com/blog/why-not-and-or-and-not/

As @sarwarbhuiyan said, you're better off using the query DSL if you want real boolean logic.
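As a sketch of what that could look like, the body below spells out "(one AND two) OR three" with explicit `bool` clauses, built as a Python dict and serialized to JSON. The field name `text` comes from the example in the issue; the choice of `match` clauses and the variable names are assumptions for illustration.

```python
import json

# Inner group: both terms are required, i.e. "one AND two".
one_and_two = {
    "bool": {
        "must": [
            {"match": {"text": "one"}},
            {"match": {"text": "two"}},
        ]
    }
}

# Outer group: either the inner group or "three" has to match.
query_body = {
    "query": {
        "bool": {
            "should": [one_and_two, {"match": {"text": "three"}}],
            "minimum_should_match": 1,
        }
    }
}

print(json.dumps(query_body, indent=2))
```

The printed JSON can be used as the request body of a `_search` call like the ones in the original report. Because the grouping is explicit, the result does not depend on any default operator.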

Thanks! You're probably right it's better to simply always use +/- but at this point that would require us to (further) translate user queries.

However, I still don't understand the default_operator behavior -- why for the query "x AND y OR z" is the behavior different when the default_operator is "AND" vs. when it's not specified (and from the docs I understand defaults to OR)?

To answer @sarwarbhuiyan -- initially I was trying to understand why "x y OR z" returned results with only x and not y and not z with the default_operator AND. And sorry about the default_field/specifying field in query string redundancy, unfortunately I constructed the test query from various sources. I don't think this is related to the behavior I see though. My point is actually about the default_operator behavior, not about the AND/OR operators.

Also according to that linked doc page (unless I understood incorrectly, again!) "a AND b OR c" should be equivalent to "(a AND b) OR c" because AND takes precedence -- and in case the default_operator is unspecified, it seems to be, but in case the default_operator is AND, it is not. This is the behavior I am trying to understand/work around.

The answer is explained in that blog post, to quote:

Things definitely get very confusing when these “boolean operators” are used in ways other than those described above. In some cases this is because the query parser is trying to be forgiving about “natural language” style usage of operators that many boolean logic systems would consider a parse error. In other cases, the behavior is bizarrely esoteric:

  • Queries are parsed left to right
  • NOT sets the Occurs flag of the clause to its right to MUST_NOT
  • AND will change the Occurs flag of the clause to its left to MUST unless it has already been set to MUST_NOT
  • AND sets the Occurs flag of the clause to its right to MUST
  • If the default operator of the query parser has been set to “And”: OR will change the Occurs flag of the clause to its left to SHOULD unless it has already been set to MUST_NOT
  • OR sets the Occurs flag of the clause to its right to SHOULD
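
The rules above can be traced with a small simulation. This is a rough Python sketch, not the actual parser: it only handles flat queries of bare terms with AND, OR and NOT (no parentheses, fields, or +/-), the `occur_flags` function is hypothetical, and it assumes that a term with no operator in front of it gets MUST when the default operator is AND and SHOULD otherwise.

```python
def occur_flags(query, default_operator="OR"):
    """Assign an Occur flag to each term of a flat query, left to right."""
    default_flag = "MUST" if default_operator == "AND" else "SHOULD"
    clauses = []    # [term, flag] pairs in parse order
    conj = None     # AND / OR seen since the previous term
    negate = False  # NOT seen since the previous term

    for token in query.split():
        if token in ("AND", "OR"):
            conj = token
            continue
        if token == "NOT":
            negate = True
            continue

        # Operators may rewrite the flag of the clause to their left,
        # unless that clause is already MUST_NOT.
        if clauses and clauses[-1][1] != "MUST_NOT":
            if conj == "AND":
                clauses[-1][1] = "MUST"
            elif conj == "OR" and default_operator == "AND":
                clauses[-1][1] = "SHOULD"

        # Flag for the clause to the right of the operator.
        if negate:
            flag = "MUST_NOT"
        elif conj == "AND":
            flag = "MUST"
        elif conj == "OR":
            flag = "SHOULD"
        else:
            flag = default_flag  # assumption: no operator means default
        clauses.append([token, flag])
        conj, negate = None, False

    return [tuple(c) for c in clauses]


for op in ("OR", "AND"):
    print(op, occur_flags("one AND two OR three", default_operator=op))
```

Under these assumptions, with the default operator OR both "one" and "two" end up as MUST, so a document containing only "one" is rejected. With the default operator AND, the OR demotes "two" to SHOULD, leaving "one" as the only required term, which is consistent with the single unexpected hit reported in the issue.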

Frankly, these rules are just too hard to remember. This is one of the many reasons I don't like using the query_string query at all. Here's another reason. Look at these two queries for example:

http://foo   # finds an empty regex in field `http` and `foo` in the `_all` field
http://foo/  # throws a malformed regex exception

If you want to understand how the query string syntax is being understood, then use the validate-query API:

GET _validate/query?explain
{
  "query": {
    "query_string": {
      "query": "x AND y OR z",
      "default_operator": "OR"
    }
  }
}
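
To compare how the same query string is interpreted under each default operator, the request above can be built once per operator. The following is a minimal Python sketch using only the standard library; the `build_validate_request` helper and the `http://localhost:9200` host are assumptions, and the line that actually sends the request is left commented out because it needs a running node.

```python
import json
import urllib.request


def build_validate_request(query, default_operator, host="http://localhost:9200"):
    """Build (but do not send) a validate-query request. Hypothetical helper."""
    body = {
        "query": {
            "query_string": {
                "query": query,
                "default_operator": default_operator,
            }
        }
    }
    return urllib.request.Request(
        url=host + "/_validate/query?explain",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="GET",
    )


for op in ("OR", "AND"):
    req = build_validate_request("x AND y OR z", op)
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # To send it against a running node, uncomment:
    # print(urllib.request.urlopen(req).read().decode("utf-8"))
```

Comparing the explanations returned for the two requests should show which clauses are treated as required and which as optional in each case.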

Thanks so much for explaining!
