Elasticsearch: Confusing documentation on Query String Query on the split by whitespace behaviour

Created on 19 Mar 2018 · 5 Comments · Source: elastic/elasticsearch

Elasticsearch version: 6.2.2 (but all 6.x is affected)

Hi everyone,

In the reference documentation for Query String Queries (https://www.elastic.co/guide/en/elasticsearch/reference/6.2/query-dsl-query-string-query.html), the following is stated:

Whitespaces are not considered operators, this means that new york city will be passed "as is" to the analyzer configured for the field. If the field is a keyword field the analyzer will create a single term new york city and the query builder will use this term in the query. If you want to query each term separately you need to add explicit operators around the terms (e.g. new AND york AND city).

And right after this, we can find the documentation of the default_operator parameter:

The default operator used if no explicit operator is specified. For example, with a default operator of OR, the query capital of Hungary is translated to capital OR of OR Hungary, and with default operator of AND, the same query is translated to capital AND of AND Hungary. The default value is OR.

Because splitting on whitespace is no longer directly available (see https://github.com/elastic/elasticsearch/issues/25574), only queries with explicit separations such as (new) (york) (city) would actually be parsed as new AND york AND city (with default_operator=AND), but not new york city, which would be sent as a single query to the target analyzer(s). The documentation is confusing in that regard.

I'm not sure of the exact implications here, but it broke one of our multi-field (multi-analyzer) unit tests when we tried to migrate our query string as is.

:Search/Search >docs v6.2.2

Most helpful comment

Because the split by whitespace is not available anymore directly (see #25574), only queries with explicit separations like (new) (york) (city) would be indeed parsed as new AND york AND city (with default_operator=AND) but not new york city which would be sent as one query to the target analyzer(s). The documentation is confusing on that regard.

That's not correct. The query_string query first splits on operators and then passes each split to the analyzer of the targeted fields. The fact that whitespace is no longer considered an operator does not change the behavior of default_operator. The query new york city, for instance, is considered one block of text and is passed "as is" to the analyzers, but the analysis then splits this text into tokens, and default_operator is applied to combine them. With the standard analyzer and default_operator set to OR, the generated query is: new OR york OR city.

What changed is the behavior when multiple fields are defined in the fields parameter. Suppose you want to apply this query to the fields title and text. By default, query_string first applies the analysis for the field title and builds the query title:(new OR york OR city), and then combines it with the query for the text field: title:(new OR york OR city) OR text:(new OR york OR city). If you change default_operator to AND, the query becomes: title:(new AND york AND city) OR text:(new AND york AND city). This means that a document matches only if it contains new AND york AND city in one of the targeted fields.
Before 6.x, the query was translated into (title:new OR text:new) AND (title:york OR text:york)...
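
The two translations described in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Elasticsearch's actual parser: analyze is a hypothetical stand-in for a real analyzer (it just lowercases and splits on whitespace), and all function names are made up for this example.

```python
def analyze(text):
    """Hypothetical toy analyzer: lowercase, then split on whitespace."""
    return text.lower().split()


def per_field_query(text, fields, operator="OR"):
    """6.x style: analyze the whole text per field, combine the tokens with
    the default operator, then OR the per-field queries together."""
    joiner = " {} ".format(operator)
    return " OR ".join(
        "{}:({})".format(field, joiner.join(analyze(text))) for field in fields
    )


def per_term_query(text, fields, operator="OR"):
    """Pre-6.x style: split on whitespace first, spread each term over all
    fields, then combine the per-term groups with the default operator."""
    joiner = " {} ".format(operator)
    groups = []
    for term in text.split():
        clauses = ["{}:{}".format(field, token)
                   for field in fields for token in analyze(term)]
        groups.append("(" + " OR ".join(clauses) + ")")
    return joiner.join(groups)


def matches_per_field(doc, text, fields, operator="OR"):
    """True if at least one field satisfies the combined token condition."""
    combine = all if operator == "AND" else any
    return any(
        combine(token in analyze(doc.get(field, "")) for token in analyze(text))
        for field in fields
    )


def matches_per_term(doc, text, fields, operator="OR"):
    """True if the tokens, each looked up in any field, satisfy the operator."""
    combine = all if operator == "AND" else any
    return combine(
        any(token in analyze(doc.get(field, "")) for field in fields)
        for token in analyze(text)
    )


if __name__ == "__main__":
    fields = ["title", "text"]
    print(per_field_query("new york city", fields, "AND"))
    print(per_term_query("new york city", fields, "AND"))
    doc = {"title": "New York", "text": "city guide"}
    print(matches_per_field(doc, "new york city", fields, "AND"))
    print(matches_per_term(doc, "new york city", fields, "AND"))
```

In this toy model, a document with title "New York" and text "city guide" matches the pre-6.x translation with AND, because every term is found in some field, but it does not match the 6.x translation, because no single field contains all three tokens.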
This is the gist of the change. The split on whitespace was artificial, though, since it was done outside of the analysis: if the title field, for instance, has a shingle filter, query_string before 6.x is not able to build these shingles and always builds single terms, because it splits on whitespace prior to the real analysis.
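
The shingle point can be illustrated with a small sketch. It assumes a hypothetical toy analyzer, shingle_analyze, that emits single words followed by two-word shingles; a real shingle filter orders and positions its tokens differently, so treat this only as a model of why splitting before analysis loses the shingles.

```python
def shingle_analyze(text, size=2):
    """Hypothetical toy analyzer: single words plus word shingles of `size`."""
    words = text.lower().split()
    shingles = [" ".join(words[i:i + size])
                for i in range(len(words) - size + 1)]
    return words + shingles


def analyze_whole(text):
    """6.x style: the analyzer sees the whole block of text."""
    return shingle_analyze(text)


def analyze_after_whitespace_split(text):
    """Pre-6.x style: the text is split on whitespace before analysis,
    so the analyzer only ever sees one word at a time."""
    tokens = []
    for term in text.split():
        tokens.extend(shingle_analyze(term))
    return tokens


if __name__ == "__main__":
    print(analyze_whole("new york city"))
    print(analyze_after_whitespace_split("new york city"))
```

When the analyzer receives the whole text it can produce "new york" and "york city"; when it receives one word at a time there is nothing to combine, so only single terms come out.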
If you want to restore the old behavior, you can add explicit operators (new AND york AND city), so that each term is treated as a separate split. Alternatively, if you use the same analyzer for each targeted field, you can use the type:cross_fields option, which applies the cross_fields rewrite to each token created by the analyzer:
https://www.elastic.co/guide/en/elasticsearch/reference/6.2/query-dsl-multi-match-query.html#multi-match-types
https://www.elastic.co/guide/en/elasticsearch/reference/6.2/query-dsl-multi-match-query.html#type-cross-fields
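
As a rough sketch, the two options could be expressed as request bodies like the ones built below. The field names title and text come from the example above; the helper names are hypothetical, and the naive whitespace join assumes the input contains no operators, quotes, or other query_string syntax.

```python
import json


def explicit_operator_body(text, fields):
    """Option 1: join whitespace-separated terms with an explicit AND.
    Assumes `text` is plain words without query_string syntax."""
    return {
        "query": {
            "query_string": {
                "query": " AND ".join(text.split()),
                "fields": fields,
            }
        }
    }


def cross_fields_body(text, fields):
    """Option 2: keep the text as is and request the cross_fields type.
    Only appropriate when the targeted fields share the same analyzer."""
    return {
        "query": {
            "query_string": {
                "query": text,
                "fields": fields,
                "type": "cross_fields",
                "default_operator": "AND",
            }
        }
    }


if __name__ == "__main__":
    fields = ["title", "text"]
    print(json.dumps(explicit_operator_body("new york city", fields), indent=2))
    print(json.dumps(cross_fields_body("new york city", fields), indent=2))
```

Either body would be sent to the index's _search endpoint; which option fits depends on whether the targeted fields really share an analyzer.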

All 5 comments

Pinging @elastic/es-search-aggs

Hey! I am using the query string query in Elasticsearch. If I run something like query="canada chest" with default operator = "and" and analyzer = "keyword", it returns documents in which canada and chest are both present. But I want canada chest to be treated as a single term when matching documents. How do I achieve this?

@ShwethaThakur please ask your use case questions in https://discuss.elastic.co/. GitHub is only for filing bugs or requests for new features.
But in short, to answer your question here: you either need 1) to have a keyword field, or 2) to make a phrase from your query using double quotes inside a query string query, or 3) to use a match phrase query.
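
The three options could look roughly like the sketch below. The field name title and the helper names are assumptions made for illustration, and the mapping fragment shows only the properties section; where it sits in a full mapping depends on the Elasticsearch version.

```python
import json


def keyword_mapping(field):
    """Option 1: map the field as keyword so its value is indexed as one term.
    Only the `properties` fragment is shown here."""
    return {"properties": {field: {"type": "keyword"}}}


def quoted_query_string_body(phrase, field):
    """Option 2: wrap the text in double quotes inside a query_string query."""
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "query": {
            "query_string": {
                "query": '"{}"'.format(escaped),
                "default_field": field,
            }
        }
    }


def match_phrase_body(phrase, field):
    """Option 3: use a match_phrase query on the field."""
    return {"query": {"match_phrase": {field: phrase}}}


if __name__ == "__main__":
    print(json.dumps(keyword_mapping("title"), indent=2))
    print(json.dumps(quoted_query_string_body("canada chest", "title"), indent=2))
    print(json.dumps(match_phrase_body("canada chest", "title"), indent=2))
```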

I will raise a query there as well.
1) Do I have to use a keyword field when declaring the mappings?
2) I have tried it, but it doesn't work. It splits the words and returns the result.
3) I cannot use AND and OR operations in a match phrase query.

Thank you.
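
On point 3, one possible approach (not something confirmed in this thread) is to wrap several match_phrase clauses in a bool query, using must for AND and should for OR. The sketch below only builds the request body; the helper name, the field name title, and the second phrase are hypothetical.

```python
import json


def phrases_bool_body(phrases, field, operator="AND"):
    """Combine match_phrase clauses with a bool query.
    `must` requires every phrase; `should` is used here for OR semantics."""
    clause = "must" if operator == "AND" else "should"
    return {
        "query": {
            "bool": {
                clause: [{"match_phrase": {field: phrase}} for phrase in phrases]
            }
        }
    }


if __name__ == "__main__":
    body = phrases_bool_body(["canada chest", "new york"], "title", "AND")
    print(json.dumps(body, indent=2))
```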
