Elasticsearch: Highlighters highlight filter parts

Created on 17 Feb 2016 · 10 Comments · Source: elastic/elasticsearch

As reported at https://discuss.elastic.co/t/2-0-question-about-whats-returningw-ithin-the-highlighter/34065, highlighters mistakenly highlight terms from query parts that are used only for filtering.

:Search/Highlighting >bug

All 10 comments

I replied on the mailing list. I am not sure it's a bug. Did we verify that the behaviour was different before? As far as I can remember, if you had a filtered query in the past, filter matches would be highlighted?

I stand corrected. I double-checked, and this has indeed changed as a side effect of the query/filter merging. The same query (using a filtered query) would highlight only the query part in 1.7 and previous versions, while in 2.0+ the filters get highlighted as well. Using a specific highlight_query without the filters is a valid workaround until we get this fixed.
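For illustration, the highlight_query workaround could look roughly like the request body below. This is a minimal sketch, not taken from the thread: the field names `body` and `doc_type` and the search terms are made up, and the main query is written as a 2.0-style `bool` query with a `filter` clause.

```python
import json

# Hypothetical field names: "body" holds the searchable text,
# "doc_type" is the field the filter restricts on.
user_query = {"match": {"body": "quick fox"}}

search_body = {
    "query": {
        "bool": {
            "must": [user_query],
            # Filter context: limits the result set only.
            "filter": [{"term": {"doc_type": "report"}}],
        }
    },
    "highlight": {
        "fields": {
            # Highlight only what the user searched for, not the filter.
            "body": {"highlight_query": user_query}
        }
    },
}

print(json.dumps(search_body, indent=2))
```

The point of the design is that the user-facing query is defined once and reused in both places, so the filter never reaches the highlighter.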

Copying @jpountz as well

I guess we could get the old behavior back by trying to figure out if the query is used in a filter context. I thought this side effect of the filter->query conversion was known/expected though. Stuff like #15793 snuck in but the rest of the filter terms being highlighted was expected?

The whole terms extraction process has always been a bit hacky and when I was managing a production installation I always relied on good old highlight_query to save me.

++ on fixing this

I have been thinking a bit more about it today and I could not come up with a rule about what should be highlighted:

  • should filters really not be highlighted?
  • if yes then do we still agree that what lives under a constant_score should not be highlighted? (it is technically a filter)
  • isn't it mostly due to the use of "require_field_match":false? in that case maybe the actual fix is for users to specify a highlight_query to be more specific about what needs to be highlighted.
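To make the questions above concrete, here is a client-side sketch of one possible rule: derive a highlight query by dropping everything in filter context. It is not how Elasticsearch extracts terms; `strip_filter_context` is a hypothetical helper, the field names are invented, and whether `constant_score` counts as a filter is left as a flag because that is exactly the open question.

```python
from typing import Any, Dict, Optional

Query = Dict[str, Any]


def strip_filter_context(query: Query, drop_constant_score: bool = True) -> Optional[Query]:
    """Return `query` without filter-context clauses, or None if nothing is left.

    Only `must` and `should` clauses of a bool query are kept; `filter`,
    `must_not` and other bool options (e.g. minimum_should_match) are dropped.
    """
    if "constant_score" in query:
        return None if drop_constant_score else query
    if "bool" not in query:
        return query  # leaf query such as match/term: keep as-is
    kept: Dict[str, Any] = {}
    for occur in ("must", "should"):
        clauses = query["bool"].get(occur, [])
        if isinstance(clauses, dict):  # the DSL allows a single clause without a list
            clauses = [clauses]
        stripped = [strip_filter_context(c, drop_constant_score) for c in clauses]
        stripped = [c for c in stripped if c is not None]
        if stripped:
            kept[occur] = stripped
    return {"bool": kept} if kept else None


main_query = {
    "bool": {
        "must": [{"match": {"body": "quick fox"}}],
        "filter": [{"term": {"doc_type": "report"}}],
        "should": [{"constant_score": {"filter": {"term": {"tag": "featured"}}}}],
    }
}

print(strip_filter_context(main_query))
# {'bool': {'must': [{'match': {'body': 'quick fox'}}]}}
```

Passing `drop_constant_score=False` keeps the `constant_score` clause, which shows how much the result depends on the rule chosen.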

@jpountz thanks for looking at it! My comments go inline:

> should filters really not be highlighted?

Maybe we should be able to specify whether the filters should be highlighted or not? Shouldn't this be a configuration option? Depending on the use case, people may or may not want filters highlighted.

> if yes then do we still agree that what lives under a constant_score should not be highlighted? (it is technically a filter)

This could be addressed with my statement in [1]

> isn't it mostly due to the use of "require_field_match":false? in that case maybe the actual fix is for users to specify a highlight_query to be more specific about what needs to be highlighted.

I think that when require_field_match is false, a highlight query should be required, otherwise what should be highlighted?

> Maybe we should be able to specify whether the filters should be highlighted or not?

This is something I'd like to avoid if possible as settings/options increase the complexity of our APIs.

> I think that when require_field_match is false, a highlight query should be required, otherwise what should be highlighted?

The current default is to highlight the main query and highlighters will try to highlight across fields. It is indeed quite a tricky option as weird things can happen if fields have different analyzers.
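As a rough illustration of the options discussed here, a highlight section can set `require_field_match` explicitly rather than relying on the default, and combine it with a per-field highlight_query. This is only a sketch: the fields `title` and `body` and the search terms are assumptions.

```python
import json

# Hypothetical fields "title" and "body".
user_query = {"multi_match": {"query": "quick fox", "fields": ["title", "body"]}}

highlight = {
    # Set explicitly so the request does not depend on the default value.
    "require_field_match": True,
    "fields": {
        "title": {},
        # Per-field override: "body" may be highlighted by terms that were
        # queried on other fields, restricted to the user's own query.
        "body": {"require_field_match": False, "highlight_query": user_query},
    },
}

print(json.dumps({"highlight": highlight}, indent=2))
```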

jpountz: Hello, I am the original poster of this issue on your Elastic forum (https://discuss.elastic.co/t/2-0-question-about-whats-returning-within-the-highlighter/34065) and I also spoke with you about this at Elasticon 2016 at your info booth. My suggestion, please, is to NOT include the filters in the highlighter, and here is my motivation:

What is the premise of the filter? To include, or exclude, given data that matches a given pattern.

What is the premise of the highlighter? To show the user where their search terms hit within the indexed documents, to provide context of the search result they are viewing.

This is my understanding of the distinction between the two.

As an engineer, I use the filter to control the return of data.
For a user (e.g. one using an aggregation selection to narrow the returned data), the aggregation/UI interaction might have spawned the filter's creation, but that does not mean they need to see a hit on their aggregation within the highlight. E.g. they selected an aggregation result of document type, so we apply a filter on it; it is a data limiter in the query, not part of why a document hit their search criteria, per se.

But, most importantly, if we ever wish to filter data OUT of a user's result set for security/authorization reasons, we NEVER want those filters as part of the highlight. (E.g. user A can only search documents of type X, so a filter is applied to the search.)

To this engineer, it seems like the highlighter should highlight search term matches, but not the filters that might control data that the search terms query against.

Thanks for listening.

Highlighting can override the main query, so I think it's more flexible to differentiate the main query from the highlight query rather than trying to find which part of the main query should be highlighted.
By default we try to extract all terms/phrases from the main query, but for more fine-grained scenarios the solution is to override the query in the highlight section. With this approach you control what the highlighter does explicitly, rather than relying on behavior hidden in the highlighting logic.
