Describe the feature:
Along with the search response, provide a list of document fields which caused each document to match the query.
This is a useful feature for machine learning (ML), for example for training models that rank search results.
Additional data that would be really useful in ML model training would be textual match scores, such as term frequency (TF), inverse document frequency (IDF), and the per-field BM25 score.
No concrete response format is required, as long as all of the above is present in the search response and it is possible to correlate the documents, their matching fields, and the match scores.
Pinging @elastic/es-search (:Search/Ranking)
@msufa thanks for sharing this request. I had a couple questions to help us better understand your use case and what an API might look like:
- What does your query structure look like? Perhaps you are using multi_match to query across multiple fields?
- How would you use this field match, term frequency, and score information? Would it be inside a custom rescorer, or outside of Elasticsearch in your application?

Depending on how your queries are structured, you may find the named queries feature to be helpful. It allows queries to be tagged with a name, and the search response will contain the name of each matched query. We've also considered returning the score for each named query (https://github.com/elastic/elasticsearch/issues/29606).
A last note that this may relate to our work in leveraging Lucene’s matches API (https://github.com/elastic/elasticsearch/issues/29631).
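To make the existing named queries feature concrete, here is a small sketch of tagging clauses with _name and reading matched_queries back from a hit. The request and hit shapes follow the Elasticsearch search API; the index contents, field names, and scores are invented for illustration.

```python
# A bool query whose clauses carry "_name" tags; each hit in the response
# then lists the names of the clauses that matched it in "matched_queries".
query_body = {
    "query": {
        "bool": {
            "should": [
                {"match": {"title": {"query": "espresso", "_name": "title_match"}}},
                {"match": {"body": {"query": "espresso", "_name": "body_match"}}},
            ]
        }
    }
}

# A trimmed example of a hit as returned today (values invented).
sample_hit = {
    "_id": "1",
    "_score": 2.3,
    "matched_queries": ["title_match"],  # names of the clauses that matched
}

def matched_clause_names(hit):
    """Return the named clauses that caused this hit to match."""
    return hit.get("matched_queries", [])

print(matched_clause_names(sample_hit))  # -> ['title_match']
```

Note that today this only tells you *which* named clauses matched, not which fields they matched on or the score each contributed, which is what this issue asks for.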
Thank you for moving this forward @jtibshirani. Answering your questions:
> What does your query structure look like? Perhaps you are using multi_match to query across multiple fields?
The query structure varies a lot. In some simple cases, or with new functionality, it's feasible to write the queries in a way that enables the use of named queries. This functionality is well known to us, and we have managed to get good results with it in some specific areas.
Some queries are quite complicated, matching across multiple fields. These usually represent established functionality, so beyond rewriting the queries in a simpler form, considerable additional effort would be required to verify that the results produced with named queries are on par with the existing implementation.
> How would you use this field match, term frequency, and score information? Would it be inside a custom rescorer, or outside of Elasticsearch in your application?
To start with, we would use the additional information in a custom ranking service outside of Elasticsearch (ES). As support for custom ML models inside of ES matures, we would hope to be able to have all of the retrieval and ranking logic inside ES.
I hope the additional information helps. If you need more details please let me know.
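The external-ranking flow described here could look roughly like the sketch below. It is purely hypothetical: it assumes the search response were extended with per-field match scores for each hit, and the feature names and model weights are invented for illustration.

```python
# Hypothetical downstream ranking service: assemble a feature vector per hit
# from (assumed) per-field match scores in the search response, then rescore
# with a simple linear model. Feature names and weights are illustrative only.
FEATURES = ["title_bm25", "body_bm25", "title_tf"]
WEIGHTS = {"title_bm25": 2.0, "body_bm25": 1.0, "title_tf": 0.1}

def rescore(hit_features):
    """Apply a simple linear model to per-field match features."""
    return sum(WEIGHTS[f] * hit_features.get(f, 0.0) for f in FEATURES)

# Invented hits carrying the per-field scores this issue requests.
hits = [
    {"id": "a", "features": {"title_bm25": 1.2, "body_bm25": 0.4, "title_tf": 3}},
    {"id": "b", "features": {"title_bm25": 0.1, "body_bm25": 2.5, "title_tf": 1}},
]
ranked = sorted(hits, key=lambda h: rescore(h["features"]), reverse=True)
print([h["id"] for h in ranked])  # -> ['a', 'b']
```

In practice the linear model would be replaced by a trained learning-to-rank model, but the feature-extraction step is the part this issue is about.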
That additional context is helpful. I wonder if we could approach this by extending the named queries feature. For each named query, we would return the list of fields it matched, along with scoring information such as the score it contributed to the overall score.
Even if you had a complex query that matched across fields, as long as it had a _name you would be able to retrieve which fields it matched and scoring information. I'm not sure how feasible this is, especially gathering the frequency data; I'll ask the team for feedback.
@jtibshirani or @msufa I'm wondering if you could add an example or two to help illustrate what the response might look like for a complex query DSL. I think this was one of the concerns we had about the feature: how best to organize the response given a complex query DSL.
This would be very useful. One can try to parse the explain output for a given query/document pair to obtain this information, but it would be more accurate and efficient for these scores to be provided in the way @msufa describes.
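The explain-parsing workaround mentioned here can be sketched as follows. The _explanation tree below is a trimmed, illustrative example of the real value/description/details structure returned with explain=true; exact description strings vary by version and similarity, so this kind of string matching is fragile, which is part of the argument for a first-class API.

```python
# Walk an Elasticsearch "_explanation" tree and collect per-field term scores
# from "weight(...)" nodes. The sample tree is a hand-trimmed illustration.
sample_explanation = {
    "value": 2.3,
    "description": "sum of:",
    "details": [
        {"value": 1.8,
         "description": "weight(title:espresso in 0) [PerFieldSimilarity], result of:",
         "details": []},
        {"value": 0.5,
         "description": "weight(body:espresso in 0) [PerFieldSimilarity], result of:",
         "details": []},
    ],
}

def field_scores(node, out=None):
    """Collect (field:term, score) pairs from 'weight(...)' explain nodes."""
    if out is None:
        out = []
    desc = node.get("description", "")
    if desc.startswith("weight("):
        inner = desc[len("weight("):desc.index(")")]
        out.append((inner.split(" in ")[0], node["value"]))
    for child in node.get("details", []):
        field_scores(child, out)
    return out

print(field_scores(sample_explanation))  # -> [('title:espresso', 1.8), ('body:espresso', 0.5)]
```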
@joshdevins A good example would be to use what was already proposed in https://github.com/elastic/elasticsearch/issues/29631#issue-316211321 for each hit and just extend the details of a field match with TF, IDF & BM25.
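For reference, the per-term BM25 score as computed by recent Lucene versions (the default similarity in Elasticsearch) makes the statistics requested in this thread explicit. The input values below are invented for illustration.

```python
import math

def bm25_term_score(tf, doc_freq, doc_count, field_len, avg_field_len,
                    k1=1.2, b=0.75):
    """Per-term BM25 score, following Lucene's BM25Similarity
    (idf = log(1 + (N - df + 0.5) / (df + 0.5)); no '+1' on the tf part,
    as in Lucene 8+). k1 and b default to Elasticsearch's defaults."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_norm = tf / (tf + k1 * (1 - b + b * field_len / avg_field_len))
    return idf * tf_norm

# Illustrative inputs: term appearing 3 times in a slightly long field.
score = bm25_term_score(tf=3, doc_freq=10, doc_count=1000,
                        field_len=120, avg_field_len=100)
print(round(score, 4))
```

Having TF, IDF, and this final per-field score in the response would save clients from recomputing (or approximating) exactly this from explain output.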
We discussed this as a team and agreed that it's a good idea to add (1) information about what fields matched, and (2) the query _score (#29606). It also seemed technically feasible to return term frequency statistics. However we wanted to put more thought into whether we were happy exposing term frequency information, and also if this was the best place to do it.
Some other points for context:
> I think this was one of the concerns we had about the feature: how best to organize the response given a complex query DSL.
@joshdevins the proposal is to add these options as part of named queries. So you would be able to specify for a certain query (within a potentially complex query structure) that you would like a list of what fields it matched, and the score it contributed to the overall search score. The example that @msufa linked also includes the positions + offsets for each term. As a first pass I think we would omit positions and offsets, as it doesn't seem high priority to the use case (?)
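A sketch of what such an extended named-queries hit might look like. To be clear, none of these response keys exist in Elasticsearch today (matched_queries is currently a flat list of names); the nested shape below is one hypothetical illustration of the proposal, with positions and offsets omitted as discussed.

```python
# Hypothetical extended hit: per named query, the score the clause contributed
# and the fields (and terms) it matched. Shape and values are invented.
sample_hit = {
    "_id": "1",
    "_score": 2.3,
    "matched_queries": {
        "title_match": {
            "score": 1.8,
            "fields": {"title": {"terms": ["espresso"]}},
        },
        "body_match": {
            "score": 0.5,
            "fields": {"body": {"terms": ["espresso"]}},
        },
    },
}

def fields_for(hit, name):
    """List the fields a given named clause matched, under the sketched shape."""
    return sorted(hit["matched_queries"].get(name, {}).get("fields", {}))

print(fields_for(sample_hit, "title_match"))  # -> ['title']
```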
> The example that @msufa linked also includes the positions + offsets for each term. As a first pass I think we would omit positions and offsets, as it doesn't seem high priority to the use case (?)
@jtibshirani It's fine to omit positions and offsets altogether from our perspective (although it would be nice to have the matched terms). I just wanted to reuse an existing example which has already been considered. We don't have strict requirements regarding the response structure as long as all / most of the additional data requested in this ticket is there.
> The example that @msufa linked also includes the positions + offsets for each term. As a first pass I think we would omit positions and offsets, as it doesn't seem high priority to the use case (?)
Information about positions and offsets would also resolve #34214