Introduce a new field of type vector on which vector calculations can be done during the rescoring phase.
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_feature": {
          "type": "vector"
        }
      }
    }
  }
}
Allow only a single value per document
Allow indexing of both dense and sparse vectors?
Dense form:
PUT my_index/_doc/1
{
"my_feature": [11.5, 10.4, 23.0]
}
Sparse form (represented as a list of dimension names and the values for the corresponding dimensions):
PUT my_index/_doc/1
{
"my_feature": {"1": 11.5, "5": 10.5, "101": 23.0}
}
Introduce a special vector query type:
"vector" : {
  "field" : "my_feature",
  "query_vector": {"1": 3, "5": 10.5, "101": 12}
}
This query can only be used in the rescoring context.
This query produces a score for every document in the rescoring context in the following way:
1) If a document doesn't have a vector value for the field, a score of 0 is returned.
2) If a document does have a vector value for the field, doc_vector, the cosine similarity between doc_vector and query_vector is calculated:
dotProduct(doc_vector, query_vector) / (sqrt(dotProduct(doc_vector, doc_vector)) * sqrt(dotProduct(query_vector, query_vector)))
POST /_search
{
  "query" : {"<user-query>"},
  "rescore" : {
    "window_size" : 50,
    "query" : {
      "rescore_query" : {
        "vector" : {
          "field" : "my_feature",
          "query_vector": {"1": 3, "5": 10.5, "101": 12}
        }
      }
    }
  }
}
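To make the two scoring rules above concrete, here is a minimal Java sketch for dense vectors. This is only an illustration of the formula, not the actual Elasticsearch implementation:

// Illustration only: the proposed rescore scoring for dense vectors.
public final class CosineSimilarityScore {

    static double dotProduct(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // dotProduct(doc, query) divided by the product of the vector norms;
    // a real implementation would also guard against all-zero vectors.
    static double score(float[] docVector, float[] queryVector) {
        if (docVector == null) {
            return 0; // rule 1: documents without a vector value score 0
        }
        double norms = Math.sqrt(dotProduct(docVector, docVector))
                * Math.sqrt(dotProduct(queryVector, queryVector));
        return dotProduct(docVector, queryVector) / norms; // rule 2
    }
}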
Encoding of vectors:
Internally, both dense and sparse vectors are encoded as a sorted hash?
Thus a dense array is transformed:
[4, 12] -> {0: 4, 1: 12}
Keys are sorted, so we can iterate over them instead of computing hashes.
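To make the sorted-key iteration concrete, here is a sketch (my own illustration, not the proposed implementation) of a sparse dot product that merge-walks two sorted dimension lists in lock-step, with no hash lookups:

// Sketch: dot product of two sparse vectors, each given as a sorted array of
// dimension ids plus a parallel array of values. Since both id arrays are
// sorted, a single merge-style pass suffices.
public final class SparseDotProduct {

    static double dotProduct(int[] dimsA, float[] valsA, int[] dimsB, float[] valsB) {
        double sum = 0;
        int i = 0, j = 0;
        while (i < dimsA.length && j < dimsB.length) {
            if (dimsA[i] == dimsB[j]) {
                sum += valsA[i++] * valsB[j++]; // shared dimension contributes
            } else if (dimsA[i] < dimsB[j]) {
                i++; // dimension present only in A contributes 0
            } else {
                j++; // dimension present only in B contributes 0
            }
        }
        return sum;
    }
}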
What should the values in vectors be?
Vectors are encoded as binaries.
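As a rough illustration of what "encoded as binaries" could mean (an assumption on my part; the actual Lucene-level encoding may differ in element width, byte order, and how sparse dimension ids are stored), a dense vector can be packed into a byte array like this:

import java.nio.ByteBuffer;

// Sketch: pack a dense float vector into bytes, e.g. for storage in a
// binary doc-values field, and unpack it again.
public final class VectorCodec {

    static byte[] encode(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate(Float.BYTES * vector.length);
        for (float v : vector) {
            buffer.putFloat(v);
        }
        return buffer.array();
    }

    static float[] decode(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = buffer.getFloat();
        }
        return vector;
    }
}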
Pinging @elastic/es-search-aggs
This query can only be used in the rescoring context.
If we want to enforce this, then it might be easier to have a rescorer rather than a query (today we only have one rescore implementation, QueryRescorer, but we can add more of them; see e.g. https://github.com/elastic/elasticsearch/tree/master/plugins/examples/rescore). We might also want to give it a more explicit name like cosine_similarity?
Hi, commenting here on @mayya-sharipova 's invitation. Our use case is that we'd want to use ES to search for sentences that have similar meaning to the sentence in the query, based on each sentence having an embedding. Vectors would be dense. Dimensionality would be 100-300 most of the time presumably. Cosine similarity would be my starting point for computing the similarity of embeddings.
Allow only a single value per document
Do you mean only one vector field per document, or only one value for each field? It would be useful to allow more than one vector field per document for testing different embeddings, dimensionalities, etc. Something like:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "GloVe": {
          "type": "vector"
        },
        "word2vec": {
          "type": "vector"
        }
      }
    }
  }
}
@james-daily Thanks for your feedback, James. Sorry, by "a single value per document" we meant a single value per field, so it would be possible to have several vector fields.
Have you considered Manhattan distance as a cheaper alternative in terms of processing? Though this will not deliver the same result, it can be comparable in terms of ranking vectors while delivering higher throughput than Euclidean/cosine.
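For reference, Manhattan (L1) distance sums absolute coordinate differences, so it needs no multiplications or square roots per element; a minimal sketch for dense vectors:

// Sketch: Manhattan (L1) distance between two dense vectors of equal length.
// Note it is a distance (lower = more similar), so using it as a score would
// require inverting or negating it.
public final class ManhattanDistance {

    static double distance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }
}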
In case it’s useful, here’s another datapoint from @gangeli, who also expressed interest in the feature:
@djp-search Thanks for the suggestion; we will study Manhattan distance.
@jtibshirani Thanks for another use case.
Are there plans to use this to control matching as well, such as filtering in/out based on proximity (maybe some kind of distance) to a point being queried? Then it would be applicable outside a rescoring context.
@softwaredoug We are still debating whether we should use this field for matching, as it may make queries slow. For now, the plan is to introduce two functions, cosineSimilarity and dotProduct, as part of the script_score query. The idea is that these functions will be used for scoring after the match is already done; see the sketch below.
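To sketch what that could look like (hypothetical syntax; the final API was still being designed at this point, and my_feature and the query vector are placeholders), a script_score query using cosineSimilarity might be written as:

POST /_search
{
  "query": {
    "script_score": {
      "query": {"match_all": {}},
      "script": {
        "source": "cosineSimilarity(params.query_vector, doc['my_feature']) + 1.0",
        "params": {
          "query_vector": [11.5, 10.4, 23.0]
        }
      }
    }
  }
}

Adding 1.0 keeps the result non-negative, since Elasticsearch does not allow negative scores.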
We've been discussing this a bit in the Relevant Search Slack. I'm hoping we can use this field for matching too.
Hey guys, awesome job. By the way, was this feature added in 7.0-alpha2? I'm testing dense-vector rescoring but couldn't find the right way to query. I've tried:
POST /_search
{
  "query" : {"<user-query>"},
  "rescore" : {
    "window_size" : 50,
    "query" : {
      "rescore_query" : {
        "vector" : {
          "field" : "my_feature",
          "query_vector": {"1": 3, "5": 10.5, "101": 12}
        }
      }
    }
  }
}
and I got:
"error":{"root_cause":[{"type":"parsing_exception","reason":"no [query] registered for [vector]","line":9,"col":24}],
@ailurus1991 Yes, you are right, currently there is no way to query vector fields.
We are working on introducing ways to do this through Painless script functions.
@mayya-sharipova wow I see, great work!
@mayya-sharipova
Hello Mayya, thank you for your work!
I need help. I just installed a fresh Elasticsearch, created an index, and tried the mapping from your example:
{
  "properties": {
    "my_vector": {
      "type": "dense_vector"
    },
    "my_text" : {
      "type" : "keyword"
    }
  }
}
and I get this error:
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "No handler for type [dense_vector] declared on field [my_vector]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "No handler for type [dense_vector] declared on field [my_vector]"
  },
  "status": 400
}
Thank you in advance for a reply!
@psyapathy What version of Elasticsearch have you installed?
Indexing of vectors is available from v7.0.0-beta1, but querying them will be available only from v7.1.
@mayya-sharipova Thank you for the reply!
That's happy and sad at the same time. Is there an alternative still under development?
@mayya-sharipova Hi Mayya, I've installed ES 7.1 and indexed documents with a dense_vector mapping successfully, but I couldn't find the right way to query them in the documentation. Could you give me a hint?
@ailurus1991 Sorry, this is a deficiency of our documentation. The scoring is available only from 7.2.
From 7.2, two functions, cosineSimilarity and dotProduct, will be available as part of script_score.
@mayya-sharipova I just set up version 7.2, but neither of the functions is there. I can see that the 7.x branch has these functions. Is there a way I can add them manually?
@prem6667 Sorry, we have decided to move these functions to 7.3.
Adding these functions manually involves a non-trivial amount of work: besides the Painless functions, we need to add classes to support doc and script values.
Also, please be aware that these features are still experimental and may change.
@mayya-sharipova Is this feature published in 7.3? I didn't find it.
I believe it's mentioned here:
https://www.elastic.co/blog/elasticsearch-7-3-0-released
See "Built-in vector similarity functions for document script scoring".
Thank you for this clear presentation, and thank you to the participants for the exchange.
Since we want to index several documents with several sentences each, which of the two data structures below is the most suitable, and what would be a good mapping?
"sentences": [
{"sentence_text" : "my first sentence",
"sentence_vector" : [0.0,0.0,0.0,0.0]}
{"sentence_text" : "my second sentence",
"sentence_vector" : [0.0,0.0,0.1,0.1]}
}>
or
"sentence_text": ["my first sentence", "my second sentence"],
"sentence_vector": [[0.0,0.0,0.0,0.0], [0.0,0.0,0.1,0.1]],
}>
@adouib this seems like a good question for our discuss forums, would you be able to create a discuss post and we can continue the conversation there? We usually try to keep GitHub focused on development efforts like bug reports and feature requests.
@mayya-sharipova I do not quite understand why we need to encode a vector as a sorted hash. Why do we have to do so? And what does it mean that "vectors are encoded as binaries"?
@dragon-warrior-nyc Please refer to our official documentation.
The details in this issue are potential implementations we have considered, which may no longer be relevant.
"Vectors are encoded as binaries" means that vectors are encoded as Lucene BinaryDocValues.
@mayya-sharipova got it and thanks for the explanation!
@adouib Hi, did we get a resolution for this, please?