Elasticsearch: Introduce vector field, vector query and rescoring based on them

Created on 27 Jun 2018 · 29 Comments · Source: elastic/elasticsearch

Introduce a new field of type vector on which vector calculations can be done during rescoring phase

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_feature": {
          "type": "vector"
        }
      }
    }
  }
}

Indexing

Allow only a single value per document
Allow indexing of both dense and sparse vectors?

Dense form:

PUT my_index/_doc/1
{
  "my_feature":   [11.5, 10.4, 23.0]
}

Sparse form (represented as list of dimension names and values for corresponding dimensions):

PUT my_index/_doc/1
{
  "my_feature": {"1": 11.5, "5": 10.5,  "101": 23.0}
}

Query and Rescoring

Introduce a special type of vector query:

"vector" : {
   "field" : "my_feature",
    "query_vector": {"1": 3, "5": 10.5,  "101": 12}
}

This query can only be used in the rescoring context.
This query produces a score for every document in the rescoring context in the following way:
1) If a document doesn't have a vector value for field, a score of 0 is returned.
2) If a document does have a vector value for field (doc_vector), the cosine similarity between doc_vector and query_vector is calculated:
dotProduct(doc_vector, query_vector) / (sqrt(dotProduct(doc_vector, doc_vector)) * sqrt(dotProduct(query_vector, query_vector)))
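
As an illustration only, the two scoring rules above can be sketched in Python for vectors given in the sparse form. The function name and the handling of zero-length vectors are assumptions of this sketch, not part of the proposal.

```python
import math

def cosine_similarity(doc_vector, query_vector):
    """Cosine similarity between two sparse vectors given as {dimension: value} dicts.

    Returns 0.0 when the document has no vector (None), mirroring rule 1,
    and also when either vector has zero length (an assumption of this sketch).
    """
    if doc_vector is None:
        return 0.0
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(value * query_vector.get(dim, 0.0) for dim, value in doc_vector.items())
    doc_norm = math.sqrt(sum(v * v for v in doc_vector.values()))
    query_norm = math.sqrt(sum(v * v for v in query_vector.values()))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0
    return dot / (doc_norm * query_norm)

doc = {"1": 11.5, "5": 10.5, "101": 23.0}
query = {"1": 3, "5": 10.5, "101": 12}
score = cosine_similarity(doc, query)
print(round(score, 4))
```

The denominator is the product of the two vector lengths, which is why each norm is the square root of a vector's dot product with itself.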

POST /_search
{
   "query" : {"<user-query>"},
   "rescore" : {
      "window_size" : 50,
      "query" : {
         "rescore_query" : {
            "vector" : {
               "field" : "my_feature",
               "query_vector": {"1": 3, "5": 10.5,  "101": 12}
            }
         }
      }
   }
}

Internal encoding

Encoding of vectors:
Internally, both dense and sparse vectors are encoded as a sorted hash?
Thus a dense array is transformed:
[4, 12] -> {0: 4, 1: 12}
Keys are sorted, so we can iterate over them instead of calculating hashes.
What should the values in vectors be?
- floats?
- smaller than floats? (loses some precision, but reduces index size)
Vectors are encoded as binaries.
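
A rough Python sketch of the sorted-key idea, assuming dimensions are non-negative integers: both forms are normalized to a list of (dimension, value) pairs sorted by dimension, and the dot product walks the two lists in step instead of doing hash lookups. The helper names are hypothetical, and this is not the encoding Elasticsearch ended up using.

```python
def to_sorted_pairs(vector):
    """Encode a dense list or a sparse {dimension: value} dict as
    (dimension, value) pairs sorted by numeric dimension."""
    if isinstance(vector, dict):
        items = ((int(dim), float(value)) for dim, value in vector.items())
    else:
        items = ((dim, float(value)) for dim, value in enumerate(vector))
    return sorted(items)

def dot_product(a, b):
    """Dot product of two sorted pair lists using a two-pointer merge."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        dim_a, val_a = a[i]
        dim_b, val_b = b[j]
        if dim_a == dim_b:
            total += val_a * val_b
            i += 1
            j += 1
        elif dim_a < dim_b:
            i += 1  # dimension missing from b contributes nothing
        else:
            j += 1  # dimension missing from a contributes nothing
    return total

dense = to_sorted_pairs([4, 12])
sparse = to_sorted_pairs({"1": 3, "5": 10.5, "101": 12})
print(dense, sparse, dot_product(dense, sparse))
```

Converting the string keys to integers before sorting matters here: sorted as strings, "101" would come before "5".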

:Search/Ranking

mayya-sharipova

👍10

Most helpful comment

Hi, commenting here on @mayya-sharipova 's invitation. Our use case is that we'd want to use ES to search for sentences that have similar meaning to the sentence in the query, based on each sentence having an embedding. Vectors would be dense. Dimensionality would be 100-300 most of the time presumably. Cosine similarity would be my starting point for computing the similarity of embeddings.

etienne1985 on 3 Jul 2018

👍5

All 29 comments

Pinging @elastic/es-search-aggs

elasticmachine on 27 Jun 2018

This query can only be used in the rescoring context.

If we want to enforce this, then it might be easier to have a rescorer rather than a query (today we only have one rescore implementation: QueryRescorer, but we can add more of them, see eg. https://github.com/elastic/elasticsearch/tree/master/plugins/examples/rescore). We might also want to give it a more explicit name like cosine_similarity?

jpountz on 27 Jun 2018

👍3

Allow only a single value per document

Do you mean only one vector field per document or only one value for each field? It would be useful to allow more than one vector field per document for testing different embeddings, dimensionalities, etc. Something like:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "GloVe": {
          "type": "vector"
        },
        "word2vec": {
          "type": "vector"
        }
      }
    }
  }
}

james-daily on 3 Jul 2018

👍2

@james-daily Thanks for your feedback, James. Sorry, for a single value per document, we meant a single value per field, so it would be possible to have several vector fields.

mayya-sharipova on 3 Jul 2018

Have you considered Manhattan distance as a cheaper alternative in terms of processing? Although it will not deliver the same results, it can be comparable in terms of ranking vectors while delivering higher throughput than Euclidean/cosine.

djptek on 27 Jul 2018
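
A small Python sketch of the caveat in the comment above: on unnormalized vectors, ranking by Manhattan distance and ranking by cosine similarity can disagree. The vectors are made up for illustration, and the sketch says nothing about relative throughput.

```python
import math

def manhattan_distance(a, b):
    """Sum of absolute per-dimension differences (smaller means closer)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero dense vectors (larger means closer)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

query = [1.0, 0.0]
docs = {"a": [10.0, 0.0], "b": [1.0, 1.0]}

by_manhattan = sorted(docs, key=lambda name: manhattan_distance(docs[name], query))
by_cosine = sorted(docs, key=lambda name: cosine_similarity(docs[name], query), reverse=True)
print(by_manhattan, by_cosine)
```

Document "a" points in exactly the same direction as the query but is much longer, so cosine ranks it first while Manhattan distance ranks it last.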

In case it’s useful, here’s another datapoint from @gangeli, who also expressed interest in the feature:

Their use case also involves retrieving sentences or short paragraphs. Both the query and documents would be modelled using a sentence embedding (based on an RNN).
Vectors are dense and can have from 50 - 1000 dimensions, but are concentrated in the 200 - 300 range.
Ideally, cosine similarity would be applied to all documents when scoring (as opposed to just during a rescoring phase). In their use case, sentence retrieval is a component of a fairly general NLP pipeline, and they rely strongly on these sentence embeddings to understand synonyms/ textual similarity.

jtibshirani on 27 Jul 2018

👍1

@djp-search thanks for the suggestion, we will look into Manhattan distance.

@jtibshirani thanks for another use-case

mayya-sharipova on 28 Jul 2018

Are there plans to use this to control matching as well? Such as filter in/out based on proximity (maybe some kind of distance) to a point being queried? Then it would be applicable outside a rescoring context

softwaredoug on 21 Dec 2018

@softwaredoug We are still debating whether we should use this field for matching, as it may make queries slow. For now the plan is to introduce two functions, cosineSimilarity and dotProduct, as part of the script score query. The idea is that these functions will be used for scoring after the match is already done.

mayya-sharipova on 27 Dec 2018

We've been discussing this a bit in Relevant Search slack. I'm hoping we can use this field for matching too.

Certainly matching with this field will be a little slower, but there aren't any real surprises here. For instance, normal search with posting lists, etc. executes in O(num_docs), this field will surely still be O(num_docs) right? And if it's slower, I bet it's not _that_ much slower is it? (Is it?)
The users of this field are likely to be the more sophisticated users who would more likely know the issues they are getting into.
Part of the nice value of using this field for matching is that _presumably_ you would also be able to use it with other normal fields. For instance, I could have an index of "users" and I could say, "find me all users that are in San Francisco (geo search), that are most similar to this sample user (vector similarity)".

JnBrymn-EB on 29 Dec 2018

❤2
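
The "filter first, then rank by vector similarity" idea from the comment above can be sketched in plain Python over an in-memory list. The sample users are invented, and the geo search is reduced to an exact match on a hypothetical city field.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

users = [
    {"name": "ann", "city": "San Francisco", "vector": [1.0, 0.0]},
    {"name": "bob", "city": "San Francisco", "vector": [0.6, 0.8]},
    {"name": "cat", "city": "New York", "vector": [1.0, 0.0]},
]

def most_similar(users, city, sample_vector):
    """Keep users in `city`, then rank them by cosine similarity to `sample_vector`.

    Every matching user is scored, so the cost is linear in the number of matches.
    """
    matches = [u for u in users if u["city"] == city]
    return sorted(
        matches,
        key=lambda u: cosine_similarity(u["vector"], sample_vector),
        reverse=True,
    )

ranked = [u["name"] for u in most_similar(users, "San Francisco", [0.8, 0.6])]
print(ranked)
```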

Hey guys, awesome job. By the way, has this feature been added in 7.0-alpha2? I'm testing dense vector rescoring but couldn't find the right way to query.
I've tried

POST /_search
{
   "query" : {"<user-query>"},
   "rescore" : {
      "window_size" : 50,
      "query" : {
         "rescore_query" : {
            "vector" : {
               "field" : "my_feature",
               "query_vector": {"1": 3, "5": 10.5,  "101": 12}
            }
         }
      }
   }
}

and I got:

"error":{"root_cause":[{"type":"parsing_exception","reason":"no [query] registered for [vector]","line":9,"col":24}],

cailurus on 28 Jan 2019

👀1

@ailurus1991 Yes, you are right, currently there is no way to query vector fields.
We are working on introducing ways to do so through Painless script functions.

mayya-sharipova on 29 Jan 2019

@mayya-sharipova wow I see, great work!

cailurus on 5 Feb 2019

@mayya-sharipova
Hello Mayya, thank you for your work!

I need help. I just installed the new Elasticsearch, created an index and tried the mapping from your example:

{
 "properties": {
   "my_vector": {
     "type": "dense_vector"
    },
    "my_text" : {
      "type" : "keyword"
    }
  }
}

and I get this error:

{
    "error": {
        "root_cause": [
            {
                "type": "mapper_parsing_exception",
                "reason": "No handler for type [dense_vector] declared on field [my_vector]"
            }
        ],
        "type": "mapper_parsing_exception",
        "reason": "No handler for type [dense_vector] declared on field [my_vector]"
    },
    "status": 400
}

Thank you in advance for your reply!

arpsyapathy on 6 Mar 2019

@psyapathy What version of elasticsearch have you installed?

Indexing of vectors is available from v7.0.0-beta1, but querying them will be available only from v7.1.

mayya-sharipova on 7 Mar 2019

👍1

@mayya-sharipova Thank you for your reply!
It's happy and sad at the same time.
Is there an alternative that is still under development?

arpsyapathy on 11 Mar 2019

@mayya-sharipova hi mayya, I've installed ES7.1 and indexed documents with dense vector mapping successfully, but I didn't find a right way to query in documentation. Could you give me a hint?

cailurus on 22 May 2019

@ailurus1991 Sorry, this is a deficiency of our documentation. Scoring is available only from 7.2.
From 7.2, two functions will be available as part of script_score: cosineSimilarity and dotProduct.

mayya-sharipova on 22 May 2019

@mayya-sharipova I just set up the version 7.2, but both the functions are not there. I can see that branch 7.x has these functions. Is there a way I can manually add these functions?

prem6667 on 26 Jun 2019

@prem6667 Sorry, we have decided to postpone these functions to 7.3.
Adding these functions manually involves a non-trivial amount of work as, besides the Painless functions, we need to add classes for supporting Doc and script values.
Also, please be aware that these features are still experimental and may change.

mayya-sharipova on 27 Jun 2019
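
For readers looking for the query shape, here is a hedged Python sketch that builds a script_score search body around cosineSimilarity. It assumes the 7.3-era Painless signature, cosineSimilarity(params.query_vector, doc['field']); the feature is experimental and later releases may expect a different second argument, so check the documentation for your version. The helper name and the field name my_vector are only illustrative.

```python
import json

def cosine_script_score(field, query_vector, base_query=None):
    """Build a search body that scores matching documents by cosine similarity.

    Assumption: the script signature matches the 7.3 documentation.
    """
    return {
        "query": {
            "script_score": {
                # Documents are matched by this query first, then scored by the script.
                "query": base_query or {"match_all": {}},
                "script": {
                    # Adding 1.0 keeps the score non-negative,
                    # since cosine similarity ranges over [-1, 1].
                    "source": f"cosineSimilarity(params.query_vector, doc['{field}']) + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        }
    }

body = cosine_script_score("my_vector", [11.5, 10.4, 23.0])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a POST to an index's _search endpoint.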

@mayya-sharipova was this feature published in 7.3? I couldn't find it.

LiuGangR on 1 Aug 2019

I believe it's mentioned here:
https://www.elastic.co/blog/elasticsearch-7-3-0-released
see "Built-in vector similarity functions for document script scoring"

lior-k on 1 Aug 2019

👍1

I thank you for this clear presentation, and thank you to the participants for the exchange.

Since we want to index several documents with several sentences each, which of the two data structures below is the most suitable? And what would be a good mapping?

{
  "sentences": [
    {"sentence_text": "my first sentence",
     "sentence_vector": [0.0, 0.0, 0.0, 0.0]},
    {"sentence_text": "my second sentence",
     "sentence_vector": [0.0, 0.0, 0.1, 0.1]}
  ]
}

or

{
  "sentence_text": ["my first sentence", "my second sentence"],
  "sentence_vector": [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.1, 0.1]]
}

adouib on 3 Oct 2019

@adouib this seems like a good question for our discuss forums, would you be able to create a discuss post and we can continue the conversation there? We usually try to keep GitHub focused on development efforts like bug reports and feature requests.

jtibshirani on 4 Oct 2019

👍1

@mayya-sharipova I do not quite understand why we need to encode a vector as a sorted hash. Why do we have to do so? And what does it mean that vectors are encoded as binaries?

dragon-warrior-nyc on 26 Jan 2020

@dragon-warrior-nyc Please refer to our official documentation.

The details on this PR are potential implementations we have considered that may not be relevant any more.

"Vectors are encoded as binaries" means that vectors are encoded as Lucene BinaryDocValues.

mayya-sharipova on 27 Jan 2020
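
To make "encoded as binaries" a little more concrete, here is a Python sketch that packs a dense vector into a byte string, the kind of opaque value a binary doc-values field could hold. The byte layout is an assumption for illustration and not Lucene's or Elasticsearch's actual format. It also shows the size versus precision trade-off raised in the issue description: 4-byte floats against 2-byte half floats.

```python
import struct

def encode_vector(values, fmt="f"):
    """Pack a dense vector into big-endian bytes.

    fmt "f" stores each value as a 4-byte float, "e" as a 2-byte half float.
    """
    return struct.pack(f">{len(values)}{fmt}", *values)

def decode_vector(blob, fmt="f"):
    """Unpack bytes produced by encode_vector with the same fmt."""
    size = struct.calcsize(f">{fmt}")
    return list(struct.unpack(f">{len(blob) // size}{fmt}", blob))

vector = [11.5, 10.4, 23.0]
as_float = encode_vector(vector, "f")
as_half = encode_vector(vector, "e")

print(len(as_float), decode_vector(as_float, "f"))
print(len(as_half), decode_vector(as_half, "e"))
```

Half floats halve the storage, but values such as 10.4 come back slightly changed after the round trip.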

@mayya-sharipova got it and thanks for the explanation!

dragon-warrior-nyc on 29 Jan 2020
