Elasticsearch: Please undo sparse vector deprecation

Created on 11 Sep 2020 · 6 comments · Source: elastic/elasticsearch

@mayya-sharipova - in this thread you mentioned there were no valid use cases in the community to justify keeping sparse vector fields alive. https://discuss.elastic.co/t/please-dont-deprecate-sparse-vector-fields/219063

I wanted to humbly suggest that you reconsider this feature and use-case, which seems broadly applicable to a number of e-commerce settings, and valuable overall.

Consider any problem that can be expressed as a matrix of user/product relationships, such as the output of ALS from a Spark job, a user row from LightFM, or really any matrix factorization job. Matrix factorization (MF) is still a de facto standard in recommendation systems, and interest in applying and implementing these systems among companies' data science teams is large and growing by the day.

The reality of most recommendation systems is that they ultimately compute a sparse vector at some point (typically dense at first, then converted to sparse to save space).

The value of supporting sparse vectors in ES is that, as is widely understood, once MF jobs are done and vectors are computed, the similarity of one product or user to any other product or user is largely computable via cosine distance.

This means that, by supporting sparse vectors, a user could dump the vectors from MF recommendation jobs into an ES instance and offer real-time recommendations, without storing the typical k=>v rows for every user/product combination and querying potentially millions of rows of data, which are static and do not extend effectively to real-time recommendations. This presents enormous value to a user and significantly simplifies their code and database architecture.

The challenge in an environment where products or users are increasing daily is that the size of the output vector also changes daily (or hourly, or by the minute). This makes dense vectors, whose size must be defined up front, untenable for such a task in ES. Sparse vectors, however, fit the bill perfectly. While sparse vectors are also limited to 1024 dimensions, that is effectively an order of magnitude more capacity when dealing with data as large and sparse as the matrices in many recommendation systems.

Finally, since ES now supports cosine distance natively, this entire effort of recommending products to users literally becomes a simple query over simple underlying data points.
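To make the "simple query" concrete, here is a rough sketch of the query shape using the native vector support (assuming Elasticsearch 7.6+, where the vector functions take the field name directly; the index name, field name, and values are made up, and I'm using a fixed-size dense_vector here only to illustrate the mechanics):

PUT products
{
  "mappings": {
    "properties": {
      "model_vector": { "type": "dense_vector", "dims": 4 }
    }
  }
}

GET products/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'model_vector') + 1.0",
        "params": {
          "query_vector": [0.12, -0.34, 0.56, 0.78]
        }
      }
    }
  }
}

The + 1.0 is there because script_score must not produce negative scores, while cosine similarity ranges over [-1, 1].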

This is distinctly different from the rank_feature field type in that we are comparing vectors pairwise by their values, not scoring by individual field values/importance. rank_feature also only supports positive values, whereas vectors used in recommendation can often be negative. We can of course rescale values into a positive range, but not everyone will have the knowledge/skill set to know to do that.

Please consider reimplementing this feature to significantly reduce the complexity of building modern product recommendation systems (and a whole host of other pairwise vector comparison search problems).

:Search/Ranking >enhancement Search team-discuss

All 6 comments

Pinging @elastic/es-search (:Search/Ranking)

I was considering using this functionality when I saw it was getting deprecated.

We are building sparse vectors for named entities in articles, where the value for a given named entity is its weight (some calculated relevance for that entity in the article), e.g.

{
  "Donald_Trump": 0.9,
  "Mike_pence": 0.4,
  "White_House": 0.3,
  ...
}

Then, given a document, I was planning to use the cosineSimilaritySparse function to find the closest articles in the index to that one (an easy way to find "nearest neighbours"). There would be other features, including dense vectors, but this was going to be one of the main ones.
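For reference, this is roughly what I had in mind with the removed feature, based on the 7.x docs (note that sparse_vector required dimension keys to be integers encoded as strings, so entities would first have to be mapped to integer IDs; the IDs and index name below are made up):

PUT articles
{
  "mappings": {
    "properties": {
      "entity_vector": { "type": "sparse_vector" }
    }
  }
}

PUT articles/_doc/1
{
  "entity_vector": { "1": 0.9, "2": 0.4, "3": 0.3 }
}

GET articles/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilaritySparse(params.query_vector, doc['entity_vector']) + 1.0",
        "params": {
          "query_vector": { "1": 0.9, "2": 0.4, "3": 0.3 }
        }
      }
    }
  }
}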

I don't think this can be achieved with rank_features but please correct me if I'm wrong.

@diegomansua Thank you for letting us know about your use case. I actually think rank_features can be useful in your case.
You can model your entities as rank_features:

PUT my_index
{
  "mappings": {
    "properties": {
      "topics": {
        "type": "rank_features"
      }
    }
  }
}

PUT my_index/_doc/1?refresh
{
  "topics": {
    "Donald_Trump": 0.9,
    "Mike_pence": 0.4,
    "White_House": 0.3
  }
}

PUT my_index/_doc/2?refresh
{
  "topics": {
    "Donald_Trump": 0.1,
    "Mike_pence": 0.9,
    "White_House": 0.2
  }
}

And use a bool query with a rank_feature query for each topic as a proxy to find nearest neighbours. Documents that have the highest entity weights on these topics will also be scored higher. As an additional bonus, this query should be very fast.

GET my_index/_search
{
  "query": {
    "bool" : {
      "should" : [
          {
            "rank_feature" : {
              "field": "topics.White_House"
            }
          },
          {
            "rank_feature" : {
              "field": "topics.Mike_pence"
            }
          }
        ]
    }
  }
}
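If you also want the query document's own weights to factor into the score (closer in spirit to a dot product), one possible refinement — just a sketch, not an official recommendation — is to boost each clause by the corresponding entity weight from the query document (here, document 1's weights):

GET my_index/_search
{
  "query": {
    "bool" : {
      "should" : [
          {
            "rank_feature" : {
              "field": "topics.White_House",
              "boost": 0.3
            }
          },
          {
            "rank_feature" : {
              "field": "topics.Mike_pence",
              "boost": 0.4
            }
          }
        ]
    }
  }
}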

Thank you @mayya-sharipova I will give it a go.

@mayya-sharipova I was running some tests yesterday. While it's true that rank_features can help, the problem I have is that I don't think it can replicate a characteristic of the sparse_vector datatype plus cosineSimilaritySparse function: the cosine similarity of a document with itself is 1.

This is important to us because, as I mentioned, we were planning to use this in combination with other features. More specifically, after experimentation we have found that the best way to measure the similarity of two documents in our case is a weighted average of:

  • Cosine similarity of BERT-based sentence (dense) vectors (weight 0.8).
  • Cosine similarity of named entity salience (sparse) vectors (weight 0.2).

As I say, this works well for us (and the similarity of a document with itself is 1), and it's what we're doing outside Elasticsearch with a clustering algorithm to group similar documents together. We were planning to use ES in the exact same way, but rather than grouping documents together we wanted to be able to find the nearest neighbours of a given document.

I see that the cosineSimilaritySparse function has been removed along with the sparse_vector datatype, so if I want to 100% replicate our current logic inside ES I don't think I have any other choice than to store the sparse vectors in a (non-dynamic) object and then recreate the cosineSimilaritySparse function in a Painless script. Ugly, I know!
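For anyone curious, here is a minimal sketch of that workaround — assuming each entity weight is stored as a float subfield of a non-dynamic topics object, and that I also index each document's precomputed L2 norm as topics_norm (all names here are made up; query dimensions missing from a document count as zero):

PUT articles
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "topics": {
        "properties": {
          "Donald_Trump": { "type": "float" },
          "Mike_pence":   { "type": "float" },
          "White_House":  { "type": "float" }
        }
      },
      "topics_norm": { "type": "float" }
    }
  }
}

GET articles/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "double dot = 0; double qNorm = 0; for (int i = 0; i < params.fields.size(); ++i) { String f = params.fields[i]; double q = params.weights[i]; qNorm += q * q; if (doc.containsKey(f) && doc[f].size() > 0) { dot += q * doc[f].value; } } double denom = Math.sqrt(qNorm) * doc['topics_norm'].value; return denom == 0 ? 0 : dot / denom + 1.0;",
        "params": {
          "fields": ["topics.Donald_Trump", "topics.Mike_pence", "topics.White_House"],
          "weights": [0.9, 0.4, 0.3]
        }
      }
    }
  }
}

The dense side could then be blended into the same script (0.8 times the dense cosine plus 0.2 times the sparse cosine, shifting once at the end to keep the score non-negative) to replicate our weighted average.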

We had a discussion within our search team and have decided not to reintroduce sparse vectors.
The main reason is the poor performance of sparse vector queries, which involve scanning across all documents; there are currently no ways, or plans, to optimize this, so they are not suited for production scale.

Some other alternatives we are looking into:

For this reason I am closing this issue.

@wmelton @diegomansua Thank you for providing your use cases. Please submit new feature requests if you have other suggestions besides sparse vectors.
