Describe the feature:
Along with the search response, provide a list of document fields which caused each document to match the query.
This is a useful feature for machine learning (ML), for example for training models that rank search results.
Additional data that would be really useful in ML model training would be textual match scores, such as term frequency (TF), inverse document frequency (IDF), and the per-field BM25 score.
No concrete response format is required, as long as all of the above is present in the search response and it is possible to correlate the documents, their matching fields, and the match scores.
Pinging @elastic/es-search (:Search/Ranking)
@msufa thanks for sharing this request. I had a couple questions to help us better understand your use case and what an API might look like:
- What does your query structure look like? Perhaps you are using multi_match to query across multiple fields?
- How would you use this field match, term frequency, and score information? Would it be inside a custom rescorer, or outside of Elasticsearch in your application?

Depending on how your queries are structured, you may find the named queries feature to be helpful. It allows queries to be tagged with a name, and the search response will contain the name of each matched query. We've also considered returning the score for each named query (https://github.com/elastic/elasticsearch/issues/29606).
A last note that this may relate to our work in leveraging Lucene’s matches API (https://github.com/elastic/elasticsearch/issues/29631).
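To make the existing named queries feature concrete, here is a small sketch of tagging clauses with _name and reading matched_queries back from a hit. The request and hit shapes follow the Elasticsearch search API; the index contents, field names, and scores are invented for illustration.

```python
# A bool query whose clauses carry "_name" tags; each hit in the response
# then lists the names of the clauses that matched it in "matched_queries".
query_body = {
    "query": {
        "bool": {
            "should": [
                {"match": {"title": {"query": "espresso", "_name": "title_match"}}},
                {"match": {"body": {"query": "espresso", "_name": "body_match"}}},
            ]
        }
    }
}

# A trimmed example of a hit as returned today (values invented).
sample_hit = {
    "_id": "1",
    "_score": 2.3,
    "matched_queries": ["title_match"],  # names of the clauses that matched
}

def matched_clause_names(hit):
    """Return the named clauses that caused this hit to match."""
    return hit.get("matched_queries", [])

print(matched_clause_names(sample_hit))  # -> ['title_match']
```

Note that today this only tells you *which* named clauses matched, not which fields they matched on or the score each contributed, which is what this issue asks for.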
Thank you for moving this forward @jtibshirani. Answering your questions:
> What does your query structure look like? Perhaps you are using multi_match to query across multiple fields?
The query structure varies a lot. In some simple cases, or with new functionality, it's feasible to write the queries in a way that enables the use of named queries. This functionality is well known to us, and we have managed to get good results with it in some specific areas.
Some queries are quite complicated, matching across multiple fields. These usually represent established functionality, so beyond rewriting the queries in a simpler form, considerable additional effort would be required to verify that the results produced with named queries are on par with the existing implementation.
> How would you use this field match, term frequency, and score information? Would it be inside a custom rescorer, or outside of Elasticsearch in your application?
To start with, we would use the additional information in a custom ranking service outside of Elasticsearch (ES). As support for custom ML models inside of ES matures, we would hope to be able to have all of the retrieval and ranking logic inside ES.
I hope the additional information helps. If you need more details please let me know.
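The external-ranking flow described here could look roughly like the sketch below. It is purely hypothetical: it assumes the search response were extended with per-field match scores for each hit, and the feature names and model weights are invented for illustration.

```python
# Hypothetical downstream ranking service: assemble a feature vector per hit
# from (assumed) per-field match scores in the search response, then rescore
# with a simple linear model. Feature names and weights are illustrative only.
FEATURES = ["title_bm25", "body_bm25", "title_tf"]
WEIGHTS = {"title_bm25": 2.0, "body_bm25": 1.0, "title_tf": 0.1}

def rescore(hit_features):
    """Apply a simple linear model to per-field match features."""
    return sum(WEIGHTS[f] * hit_features.get(f, 0.0) for f in FEATURES)

# Invented hits carrying the per-field scores this issue requests.
hits = [
    {"id": "a", "features": {"title_bm25": 1.2, "body_bm25": 0.4, "title_tf": 3}},
    {"id": "b", "features": {"title_bm25": 0.1, "body_bm25": 2.5, "title_tf": 1}},
]
ranked = sorted(hits, key=lambda h: rescore(h["features"]), reverse=True)
print([h["id"] for h in ranked])  # -> ['a', 'b']
```

In practice the linear model would be replaced by a trained learning-to-rank model, but the feature-extraction step is the part this issue is about.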
That additional context is helpful. I wonder if we could approach this by extending the named queries feature. For each named query, we would return the list of fields it matched, along with scoring information such as the score it contributed to the overall score.
Even if you had a complex query that matched across fields, as long as it had a _name you would be able to retrieve which fields it matched and scoring information. I'm not sure how feasible this is, especially gathering the frequency data; I'll ask the team for feedback.
@jtibshirani or @msufa I'm wondering if you could add an example or two to help illustrate what the response might look like for a complex query DSL. I think this was one of the concerns we had about the feature: how best to organize the response given a complex query DSL.
This would be very useful. One can try to parse the explain output for a given query/document pair to obtain this information, but it would be more accurate and efficient for these scores to be provided in the way @msufa describes.
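The explain-parsing workaround mentioned here can be sketched as follows. The _explanation tree below is a trimmed, illustrative example of the real value/description/details structure returned with explain=true; exact description strings vary by version and similarity, so this kind of string matching is fragile, which is part of the argument for a first-class API.

```python
# Walk an Elasticsearch "_explanation" tree and collect per-field term scores
# from "weight(...)" nodes. The sample tree is a hand-trimmed illustration.
sample_explanation = {
    "value": 2.3,
    "description": "sum of:",
    "details": [
        {"value": 1.8,
         "description": "weight(title:espresso in 0) [PerFieldSimilarity], result of:",
         "details": []},
        {"value": 0.5,
         "description": "weight(body:espresso in 0) [PerFieldSimilarity], result of:",
         "details": []},
    ],
}

def field_scores(node, out=None):
    """Collect (field:term, score) pairs from 'weight(...)' explain nodes."""
    if out is None:
        out = []
    desc = node.get("description", "")
    if desc.startswith("weight("):
        inner = desc[len("weight("):desc.index(")")]
        out.append((inner.split(" in ")[0], node["value"]))
    for child in node.get("details", []):
        field_scores(child, out)
    return out

print(field_scores(sample_explanation))  # -> [('title:espresso', 1.8), ('body:espresso', 0.5)]
```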
@joshdevins A good example would be to use what was already proposed in https://github.com/elastic/elasticsearch/issues/29631#issue-316211321 for each hit and just extend the details of a field match with TF, IDF & BM25.
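For reference, the per-term BM25 score as computed by recent Lucene versions (the default similarity in Elasticsearch) makes the statistics requested in this thread explicit. The input values below are invented for illustration.

```python
import math

def bm25_term_score(tf, doc_freq, doc_count, field_len, avg_field_len,
                    k1=1.2, b=0.75):
    """Per-term BM25 score, following Lucene's BM25Similarity
    (idf = log(1 + (N - df + 0.5) / (df + 0.5)); no '+1' on the tf part,
    as in Lucene 8+). k1 and b default to Elasticsearch's defaults."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_norm = tf / (tf + k1 * (1 - b + b * field_len / avg_field_len))
    return idf * tf_norm

# Illustrative inputs: term appearing 3 times in a slightly long field.
score = bm25_term_score(tf=3, doc_freq=10, doc_count=1000,
                        field_len=120, avg_field_len=100)
print(round(score, 4))
```

Having TF, IDF, and this final per-field score in the response would save clients from recomputing (or approximating) exactly this from explain output.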
We discussed this as a team and agreed that it's a good idea to add (1) information about what fields matched, and (2) the query _score (#29606). It also seemed technically feasible to return term frequency statistics. However we wanted to put more thought into whether we were happy exposing term frequency information, and also if this was the best place to do it.
Some other points for context:
> I think this was one of the concerns we had about the feature: how best to organize the response given a complex query DSL.
@joshdevins the proposal is to add these options as part of named queries. So you would be able to specify for a certain query (within a potentially complex query structure) that you would like a list of what fields it matched, and the score it contributed to the overall search score. The example that @msufa linked also includes the positions + offsets for each term. As a first pass I think we would omit positions and offsets, as it doesn't seem high priority to the use case (?)
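A sketch of what such an extended named-queries hit might look like. To be clear, none of these response keys exist in Elasticsearch today (matched_queries is currently a flat list of names); the nested shape below is one hypothetical illustration of the proposal, with positions and offsets omitted as discussed.

```python
# Hypothetical extended hit: per named query, the score the clause contributed
# and the fields (and terms) it matched. Shape and values are invented.
sample_hit = {
    "_id": "1",
    "_score": 2.3,
    "matched_queries": {
        "title_match": {
            "score": 1.8,
            "fields": {"title": {"terms": ["espresso"]}},
        },
        "body_match": {
            "score": 0.5,
            "fields": {"body": {"terms": ["espresso"]}},
        },
    },
}

def fields_for(hit, name):
    """List the fields a given named clause matched, under the sketched shape."""
    return sorted(hit["matched_queries"].get(name, {}).get("fields", {}))

print(fields_for(sample_hit, "title_match"))  # -> ['title']
```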
> The example that @msufa linked also includes the positions + offsets for each term. As a first pass I think we would omit positions and offsets, as it doesn't seem high priority to the use case (?)
@jtibshirani It's fine to omit positions and offsets altogether from our perspective (although it would be nice to have the matched terms). I just wanted to reuse an existing example which has already been considered. We don't have strict requirements regarding the response structure as long as all / most of the additional data requested in this ticket is there.
> The example that @msufa linked also includes the positions + offsets for each term. As a first pass I think we would omit positions and offsets, as it doesn't seem high priority to the use case (?)
Information about positions and offsets would also resolve #34214