Elasticsearch: TopDocs from first pass to be accessible by custom ScoreFunctions

Created on 14 Aug 2017 · 10 Comments · Source: elastic/elasticsearch

Continuing off of the other issue I submitted and the surrounding discussion thread,

Our score function computes a minimum score threshold by looking at the TopDocs from the first pass and decides whether to rescore a doc or not based on whether its first pass score is above this threshold. Now that TopDocs is harder to access, I'm hoping there can be a more supported and official way of accessing TopDocs for this purpose in the future.

:Search/Search >enhancement

ethteck

Most helpful comment

so today plugins can install a RescoreBuilder that can be used to customize the rescorer.

Almost. I'm going to take this issue and make sure that it works. Right now it'd work in the Transport client API but not in the REST API because we do not have an extension point for the parsing.

@ethteck, do you need anything to implement your rescores other than the Searcher and the TopDocs? I'm going to open a PR to make rescorers a "real" extension point. In doing that I'm going to tighten up what we pass to the extension so we don't accidentally break things in the future.

nik9000 on 21 Aug 2017

👍2

All 10 comments

Our score function computes a minimum score threshold by looking at the TopDocs from the first pass and decides whether to rescore a doc or not based on whether its first pass score is above this threshold

It is still unclear to me why this is not just a function score on the original query with a min score specified? I had thought you needed to use the previous score to calculate the rescore, not simply omit docs based on their original score.

rjernst on 14 Aug 2017

It is still unclear to me why this is not just a function score on the original query with a min score specified?

We need to calculate this min score dynamically - it's not a static value we can specify. Or am I misunderstanding you?

I had thought you needed to use the previous score to calculate the rescore, not simply omit docs based on their original score.

This is not the case. We only need the previous score to be able to compare it to our threshold.

There seems to be a lot of confusion around this issue and the surrounding requests I have. @jimczi requested that I open up another issue in this message, so maybe he'd be better at explaining my request from an internal standpoint.

ethteck on 14 Aug 2017

We need to calculate this min score dynamically

Thank you. That is what I was missing. Does finding the threshold require a complex computation, or is it that you are trying to have a fluid value that, if there are not enough docs with a sufficient score, you include some more, less relevant docs?

rjernst on 15 Aug 2017

There seems to be a lot of confusion around this issue and the surrounding requests I have. @jimczi requested that I open up another issue in this message, so maybe he'd be better at explaining my request from an internal standpoint.

The surrounding requests are all specific to how you wrote your plugin in the first place. The main thing you need to do at this point is to think about how your requirements could be achieved with the current plugin infrastructure without hacking access to the SearchContext.
ScoreFunctions are available for queries, and I don't think we can make a special case only for when they are used in a rescore context.
If you need to rescore only documents that are beyond a threshold computed from the original scores, then we could discuss how this feature could be added to rescorers. Can you explain how this threshold is computed, as @rjernst requested?

jimczi on 15 Aug 2017

Sure, I'm happy to explain a little more. We look at all the scores in the list and compute statistics through which we derive our threshold. We do this by iterating through the list, so we do need access to all of the individual scores.

In essence, we're trying to find a "tail" where scores trail off. We assume that the documents in this "tail" are not good matches, so we do not run our expensive rescore function on them. This has worked pretty well for us in Solr, Lucene, and (historically) ES as a way to improve performance with minimal accuracy loss.

Also, I think I should mention that when we find documents that are below our threshold, we don't outright ignore them. Instead, we rescore them as 0.0 so that they are guaranteed to fall below our actually rescored documents, which we always score between 0 and 1.

ethteck on 15 Aug 2017
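
For illustration, the tail cutoff described in the comment above might look roughly like the following sketch. The thread never says which statistics the plugin computes, so the "mean minus k standard deviations" rule, the class and method names, and the index-based scoring callback are all assumptions, not the plugin's actual code. The two behaviors taken from the comment are that documents below the threshold are scored 0.0 and that rescored documents land between 0 and 1.

```java
import java.util.function.IntToDoubleFunction;

/** Hypothetical sketch: skip an expensive rescore for hits in the low-scoring "tail". */
final class TailThresholdSketch {

    /**
     * Assumed statistic (not from the thread): mean minus k standard deviations
     * of the first-pass scores. An empty input yields negative infinity.
     */
    static double threshold(float[] firstPassScores, double k) {
        if (firstPassScores.length == 0) {
            return Double.NEGATIVE_INFINITY;
        }
        double sum = 0.0;
        for (float s : firstPassScores) {
            sum += s;
        }
        double mean = sum / firstPassScores.length;
        double squaredDiffs = 0.0;
        for (float s : firstPassScores) {
            squaredDiffs += (s - mean) * (s - mean);
        }
        double stdDev = Math.sqrt(squaredDiffs / firstPassScores.length);
        return mean - k * stdDev;
    }

    /**
     * Returns new scores: 0.0 for hits whose first-pass score is below the
     * threshold, otherwise the expensive score clamped to the range [0, 1].
     * The callback receives the hit's index in the first-pass array.
     */
    static float[] rescore(float[] firstPassScores, double k, IntToDoubleFunction expensive) {
        double cutoff = threshold(firstPassScores, k);
        float[] rescored = new float[firstPassScores.length];
        for (int i = 0; i < rescored.length; i++) {
            if (firstPassScores[i] < cutoff) {
                rescored[i] = 0.0f; // tail document: never run the expensive function
            } else {
                double value = expensive.applyAsDouble(i);
                rescored[i] = (float) Math.max(0.0, Math.min(1.0, value));
            }
        }
        return rescored;
    }
}
```

Because the cutoff depends on every first-pass score, a static min score on the original query cannot reproduce it, which is the point made earlier in the thread.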

Thanks @ethteck and sorry for the late reply.
I understand your requirements now. As it stands, the rescorer in es does not seem to be a good fit. I tried to come up with a solution that would require only small changes to the code, but your use case is more than just a rescoring query. The rescorer in es is just a query rescorer: it's simple and precise but cannot be customized. Some time ago we discussed opening the door to custom rescorers but decided not to allow it, for multiple reasons. The main one is that rescoring requires expert knowledge of how it works; otherwise it is very easy to create custom rescorers with unexpected behavior. Your use case is valid, because it creates consistent scores at the shard level, but it is also very specific. We discussed again last week whether we could allow custom rescorers, but the conclusion was the same as before. Custom rescorers are tricky, and we don't want to let people shoot themselves in the foot with them just because they seem to be powerful.
I'll leave this issue open to let others comment if they want, but IMO the solution for now is to lower the expectations for your plugin and rely solely on what is available in es rescorers (a custom query that rescores the top N).

jimczi on 21 Aug 2017
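
As a rough picture of what "a custom query that rescores the top N" amounts to, here is a simplified, library-free sketch. It is not Elasticsearch code: the `Hit` type, the method name, and the callback are hypothetical, and the weighted sum is only meant to mirror the idea of combining the original score with a second-pass score inside a fixed window.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntToDoubleFunction;

/** Hypothetical sketch of window-based rescoring; not the Elasticsearch implementation. */
final class WindowRescoreSketch {

    /** Minimal stand-in for a search hit: a document id and its score. */
    static final class Hit {
        final int doc;
        final double score;

        Hit(int doc, double score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /**
     * Rescores only the first windowSize hits (assumed sorted by descending
     * first-pass score), re-sorts that window, and appends the untouched rest.
     */
    static List<Hit> rescoreTopN(List<Hit> hits, int windowSize, double queryWeight,
                                 double rescoreWeight, IntToDoubleFunction secondPass) {
        int n = Math.max(0, Math.min(windowSize, hits.size()));
        List<Hit> window = new ArrayList<>();
        for (Hit h : hits.subList(0, n)) {
            double combined = queryWeight * h.score + rescoreWeight * secondPass.applyAsDouble(h.doc);
            window.add(new Hit(h.doc, combined));
        }
        window.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        List<Hit> result = new ArrayList<>(window);
        result.addAll(hits.subList(n, hits.size())); // outside the window: original scores kept
        return result;
    }
}
```

In this model the second pass sees one document at a time and has no view of the other first-pass scores, which is why a threshold derived from the whole list does not fit it.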

Not a problem - thank you for the reply.

That's a fair viewpoint. I guess the frustration from our point of view is that while this functionality may have never officially been supported, we were able to very easily solve our problem when we could access the SearchContext via SearchContext.current. Obviously, as a normal Elasticsearch user, the removal of SearchContext.current doesn't affect anything, but as a third-party plugin developer, this change totally broke our plugin. If there were some way to get this functionality back, it would probably be sufficient for our plugin.

We can rethink the features we support and the way we support them, but it's a little frustrating to have to do this when everything we needed was so easily accessible before. Still, I appreciate the thought you and your colleagues have put into this issue, and I hope that maybe at some point this can be revisited.

ethteck on 21 Aug 2017

so today plugins can install a RescoreBuilder that can be used to customize the rescorer. I wonder if that would be enough? From my perspective, we can pass down more information to the build method of the rescore builder, like a TopDocsStats object that contains relevant information. I would love to make this rescorer infrastructure more flexible to support other use cases, and we can extend these interfaces if more information is needed.

s1monw on 21 Aug 2017

👍1

Oh great! Thank you so much.

Ideally, we also would have access to the rescore window size if available.

ethteck on 21 Aug 2017
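
To make the TopDocsStats idea from the thread concrete, here is one possible shape for such an object. Everything in it is hypothetical: the thread only names the idea, so the class name, the fields, and the factory method are assumptions and do not correspond to any existing Elasticsearch or Lucene class. The window size field reflects the request in the comment above.

```java
/** Hypothetical summary of first-pass results that could be handed to a custom rescorer. */
final class TopDocsStatsSketch {
    final int hitCount;
    final float minScore;
    final float maxScore;
    final double meanScore;
    final int windowSize;

    private TopDocsStatsSketch(int hitCount, float minScore, float maxScore,
                               double meanScore, int windowSize) {
        this.hitCount = hitCount;
        this.minScore = minScore;
        this.maxScore = maxScore;
        this.meanScore = meanScore;
        this.windowSize = windowSize;
    }

    /** Builds the summary from first-pass scores; min, max and mean are NaN for empty input. */
    static TopDocsStatsSketch of(float[] scores, int windowSize) {
        if (scores.length == 0) {
            return new TopDocsStatsSketch(0, Float.NaN, Float.NaN, Double.NaN, windowSize);
        }
        float min = Float.POSITIVE_INFINITY;
        float max = Float.NEGATIVE_INFINITY;
        double sum = 0.0;
        for (float s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
            sum += s;
        }
        return new TopDocsStatsSketch(scores.length, min, max, sum / scores.length, windowSize);
    }
}
```

A fixed summary like this would only cover thresholds that can be derived from aggregates. The plugin described earlier iterates over every individual score, so it would presumably still need the full list.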
