Elasticsearch: [feature request] customized tf/idf when building index

Created on 21 Oct 2019 · 9 comments · Source: elastic/elasticsearch

Use case:
For small but still very distinct datasets, separate indices are usually built but searched together. In order to make their tf/idf-based scores comparable, the DFS Query Then Fetch search type is usually used, but that search type is time consuming.
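
For illustration, this is the kind of request the use case describes: a search across several indices with search_type=dfs_query_then_fetch, so term statistics are gathered globally before scoring (the index names and query are illustrative):

GET /catalog-a,catalog-b/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match": { "title": "wireless headphones" }
  }
}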

Feature request:
Allow tf/idf to be customized when building the index.

Feature description:
"settings" : {
  "index" : {
    "similarity" : {
      "my_similarity" : {
        "type" : "BM25", // or any similarity based on tf/idf
        "tf-idf-type" : "file", // file format and location should be pre-defined
        // tf-idf-type could be: default, file, db, redis, service
        // for db, the connection and table should be configured, but the table schema should be pre-defined
        // for redis, the connection should be configured
        // for service, the endpoint should be configured, but the response format should be pre-defined
        "missing-term-tf" : XX,
        "missing-term-idf" : XX
      }
    }
  }
}

Labels: :Search/Search, >enhancement


All 9 comments

Pinging @elastic/es-search (:Search/Search)

Hi @riverbuilding, just to confirm that I correctly understand the ask: do you suggest looking up the scoring information from some external source like a file/db/service etc., and that this would help speed up DFS Query Then Fetch queries? Isn't that just substituting one costly call across the network with another?

@cbuescher no, it's not. The feature is for indexing time. When building the index, Lucene calculates tf/idf based on the current document; what I am asking for is to let users control or calculate this tf/idf themselves. If so, DFS won't be needed at search time since the work is already done at indexing time.

@riverbuilding

the feature is for indexing time. When building the index, Lucene calculates tf/idf based on the current document

I don't see how this is going to work. The idf of a term depends on other documents and is always changing as new docs containing this term are added or old docs are deleted. Are you interested in supplying your own custom tf values?
Elasticsearch provides a scripted similarity where you can define your own tf and idf values.
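
For reference, a minimal scripted-similarity sketch along the lines of the TF/IDF example in the Elasticsearch docs; the index name, field name, and similarity name are illustrative:

PUT /my-index
{
  "settings": {
    "index": {
      "similarity": {
        "scripted_tfidf": {
          "type": "scripted",
          "script": {
            "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0; double norm = 1 / Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "similarity": "scripted_tfidf" }
    }
  }
}

The script has access to query-, term-, field-, and doc-level statistics (query.boost, term.docFreq, field.docCount, doc.freq, doc.length), so both tf and idf are under the script's control, although the script runs at query time rather than indexing time.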

@mayya-sharipova what I am interested in currently is providing my own tf during index time. What you mentioned is search time.
Since you mentioned,
"idf of a term depends on other documents and is always changing as new docs containing this term are added or old docs are deleted"
that's exactly how Lucene's calculation works.
What I am asking for is to make this calculation customizable.

@riverbuilding scripted similarity allows you to customize tf and idf and the score calculation based on them.

"A similarity (scoring / ranking model) defines how matching documents are scored."
this statement declares that similarity is used for searching time, if u need to read files to get tf when searching, can u image how poorly the performance could be?
what I am asking for is the ability to do it in indexing time.

in order to make their tf/idf-based scores comparable, the DFS Query Then Fetch search type is usually used, but that search type is time consuming.

Do you have some numbers to back up this assumption? While it certainly adds to the overall latency to run the dfs phase, the overhead should be minimal compared to the actual execution of the query. Maybe you ran into some edge cases, or you started with the assumption that it would be slow, but in both cases I'd be curious to hear more about your usage of this feature before discussing replacing it with a complex solution.

if you need to read files to get tf when searching, can you imagine how poor the performance would be?
what I am asking for is the ability to do it at indexing time.

If you consider that the solution would be slow at query time, why would an index-time solution be faster? The number of terms to look up would be much bigger, as would the throughput, so I don't get your point. I also agree with @mayya-sharipova: if you want to take control of the scoring function, you can use the scripted similarity that was added for this purpose.

@jimczi:
I am aware that this is a big change, and I'll send more data later.
