Elasticsearch: Make similarities dynamically updatable where possible

Created on 4 Jul 2014 · 16 Comments · Source: elastic/elasticsearch

The core similarities can be swapped in dynamically on an existing index, as long as discount_overlaps is the same. Currently we disallow updating similarities, because custom similarities may not be compatible.

The logic for deciding whether a similarity can be changed should be more fine-grained.

:Search/Mapping >enhancement Search help wanted high hanging fruit

All 16 comments

This would be a really nice addition. For example, we now have millions of documents and would like to experiment with the scoring to deliver that last 10%, but we need to reindex the whole lot each time we want to change the similarity. This means that we'll probably end up with as many clusters as there are scoring algorithms available, which costs time, money, effort and motivation.

Maybe you should add an interface to the Similarity implementations, with a method like:

/**
 * Returns the type names that are compatible metadata-wise with this Similarity.
 */
String[] getMetadataCompatibleTypeNames();

This would allow logic to be added to determine whether the change should be allowed or not, solving the custom similarity compatibility problem. (Obviously the custom implementations would have to implement it too.) The downside is that you'd need a wrapper implementation for the Lucene-provided Similarities, and it would break existing custom ones. (Though a default wrapper that declares compatibility with none would allow everything to work as it does now.)

I'm not familiar enough with the Lucene classes to know if there is already a built-in way to infer that knowledge, though. Why isn't this solved at the Lucene level, by the way?

Because it's not an issue with Lucene. You just call IndexWriterConfig.setSimilarity() and that is what IndexWriter will use to encode normalization factors.

Similarity impls can shove whatever they want in there: up to 64 bits of stuff, encoded in whatever form they want. So ES does the right thing to prevent you from changing this here (in general). It is the same as changing the index analyzer for a field: it's generally just an unsafe thing to do.

But the core similarities introduced in Lucene 4.0 have a special property: by default, they all encode the index-time information (the normalization factor) in the same backwards-compatible way that DefaultSimilarity historically did, as 1/sqrt(length) with a certain single-byte encoding.
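As a rough illustration of that single-byte encoding, here is a minimal sketch modeled on Lucene's SmallFloat.floatToByte315 from the 4.x era. The class name NormEncodingSketch and the helper encodeNorm are hypothetical, index-time boost is ignored, and the bit layout is reproduced from memory, so treat it as an assumption and check it against the Lucene source before relying on it.

```java
// Sketch only: a lossy one-byte float encoding modeled on Lucene 4.x
// SmallFloat.floatToByte315. Names and exact bit layout are assumptions.
final class NormEncodingSketch {

    // Squeeze a non-negative float into one byte by keeping only the top
    // bits of its IEEE-754 representation.
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            // Underflow: zero or negative maps to 0, tiny positives to 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // Overflow: clamp to the largest representable value.
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Inverse of floatToByte315 (up to the precision that was thrown away).
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // The norm stored per field and document: 1/sqrt(length), quantized.
    // Note that no ranking parameter (k1, b, ...) appears here.
    static byte encodeNorm(int fieldLength) {
        float norm = (float) (1.0 / Math.sqrt(fieldLength));
        return floatToByte315(norm);
    }
}
```

Calling encodeNorm for a few field lengths and decoding the result with byte315ToFloat shows how coarse the encoding is: nearby lengths collapse into the same byte, which is the "general purpose" trade-off mentioned later in the thread.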

This was done intentionally to make experimentation and "simple" testing of these ranking algorithms easier. It should not be enforced with any interface or anything like that, because subclasses and even setter methods can easily break it. It is just a way to quickly experiment with different ranking algorithms without reindexing.

I think it's nice to expose (safely) this optimization to users of ES, too, so they can play in the same way. But it does not need any additional APIs for experts or custom implementations; that would be misleading and dangerous.

If you are really trying to get the last 10%, then I don't think this issue is really relevant: it's just not going to hold for "tuning". If you are really tuning, you will likely break this property yourself anyway: the default encoding used here is very general-purpose and must support a crazy range of documents, large and small, and various values for index-time boost. If those assumptions don't hold, in many cases you can tweak normalization to be better by adjusting the encoding.

Hi, thanks for the response!

I meant that I'm trying to provide a better search experience for the end users by tuning the relevance ranking, which for me, as a user of ES, is the last 10%. (It does pretty well with the defaults, but there are cases where simple per-field/query boosting is not delivering. Hence the need to experiment with similarities.)

I'd love to have this exposed to ES users too if possible, though I understand that I'm asking here for permission to (possibly) shoot myself in the foot :)

The API thingy was just a proposal to formalize the currently unofficial contract about which similarities are interchangeable, but as I said, I don't know if it makes sense or not. (Well, now I do know that it does not.)

I don't think we should give users the ability to shoot themselves in the foot, ever. It's easily prevented.

A common use case for this issue would be to allow someone to switch from the default similarity to BM25 and then tweak k1 and b parameter values all without reindexing. This is totally safe, and expert enough!
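To see why tweaking k1 and b needs no reindexing, here is a minimal sketch of the textbook BM25 per-term score. The class name Bm25Sketch and the method termScore are hypothetical, and the formula is the standard one rather than a copy of Lucene's implementation. The only per-document input is the field length (which is what the stored norm encodes); k1 and b enter purely at search time.

```java
// Sketch only: textbook BM25 term score, not Lucene's exact code.
final class Bm25Sketch {

    // tf        : term frequency in the document
    // docLen    : field length, recovered from the stored norm
    // avgDocLen : average field length across the index
    // idf       : inverse document frequency of the term
    // k1, b     : tuning parameters, supplied at query time
    static double termScore(double tf, double docLen, double avgDocLen,
                            double idf, double k1, double b) {
        // Length normalization: b = 0 ignores length, b = 1 applies it fully.
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        // k1 controls how quickly repeated occurrences saturate.
        return idf * (tf * (k1 + 1.0)) / (tf + k1 * lengthNorm);
    }
}
```

Scoring the same (tf, docLen) pair with two different k1 values gives two different scores from identical index data, which is the experiment the issue asks to allow.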

Having a custom similarity (subclass) is a much more expert thing and we don't need to make things complicated for that. If you already know enough to make your own similarity class, then you already have an expert way to tune without reindexing: you can tune your parameters by changing some constant in your code and ES is none the wiser.

I don't think we should give users the ability to shoot themselves in the foot, ever. It's easily prevented.

:+1:

I agree with Rob here, and I don't really see a need to do much on this issue.

Did I understand it correctly that you wish to close this as won't fix?

@villeapvirtanen yeah that is what I propose

I think the simple case is nice to have for the core similarities from Lucene (see my BM25 example above). But I have no idea how tricky it is to implement this.

The similarity parameters are set outside of the mappings (they are in a parallel section called "similarity"). But glancing at the code, I cannot see how they could be updated (or how new ones could be added after index creation). I agree this should be fixed: like with mappings, you should be able to tweak the parameters of the similarity, but not change the type, for a given name.

like with mappings, you should be able to tweak the parameters of the similarity, but not change the type, for a given name.

It's not like mappings at all, though.

Changing DefaultSimilarity to BM25Similarity is ok: the on-disk encoding is the same.
Changing BM25Similarity k1/b parameters is ok: the on-disk encoding is the same.
Changing BM25Similarity.discountOverlaps is not ok, you need to reindex.
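The three rules above could be captured in a check along these lines. This is a minimal sketch, not Elasticsearch code: SimilaritySwapCheck, Spec and canUpdateWithoutReindex are hypothetical names, and the set of "core" type names is an assumption based on the similarity types Elasticsearch ships. Anything outside that set is conservatively rejected, since a custom similarity may store arbitrary data in the norms.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch only: decides whether a similarity change leaves on-disk norms valid.
final class SimilaritySwapCheck {

    // Assumed list of built-in types that share the default norm encoding.
    static final Set<String> CORE_TYPES = new HashSet<>(Arrays.asList(
            "default", "BM25", "DFR", "IB", "LMDirichlet", "LMJelinekMercer"));

    // The two properties that matter for compatibility in this sketch.
    static final class Spec {
        final String type;
        final boolean discountOverlaps;

        Spec(String type, boolean discountOverlaps) {
            this.type = type;
            this.discountOverlaps = discountOverlaps;
        }
    }

    static boolean canUpdateWithoutReindex(Spec current, Spec updated) {
        // Custom similarities may encode anything, so only core types qualify.
        boolean bothCore = CORE_TYPES.contains(current.type)
                && CORE_TYPES.contains(updated.type);
        // discountOverlaps changes the length that was counted at index time,
        // so it must match. Other parameters (k1, b, ...) apply at search time.
        return bothCore && current.discountOverlaps == updated.discountOverlaps;
    }
}
```

With this shape, switching "default" to "BM25" with the same discountOverlaps passes, flipping discountOverlaps fails, and any unknown type fails, which mirrors the list above without adding an API that custom implementations would have to honor.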

cc @elastic/es-search-aggs

Hello - is this still on the plate? Changing b and k1 on the fly for BM25 would be really nice. It seems a waste to reindex if no actual on-disk data would change.

I found my way here for the same reason: I would like to experiment with tweaking k1 for BM25, but it currently requires a full reindex.

To ease the pain, I found that if you define a custom similarity, then you can later change the parameters (after closing the index, using the API). So it is safest to add a custom similarity with the stock parameters before indexing.

Once indexed, you can easily change the parameters.
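The workaround amounts to three REST calls against an index that was created with a named similarity: close the index, update that similarity's settings, reopen it. The sketch below only builds the requests and does not send them. SimilarityUpdatePlan, Step and plan are hypothetical names, as are the index name and similarity name passed in; the _close, _settings and _open endpoints are the standard index APIs, but the exact settings body accepted may vary by Elasticsearch version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: describes the close / update settings / open sequence.
final class SimilarityUpdatePlan {

    // One HTTP request to issue against the cluster.
    static final class Step {
        final String method;
        final String path;
        final String body; // empty when the request has no body

        Step(String method, String path, String body) {
            this.method = method;
            this.path = path;
            this.body = body;
        }
    }

    // Assumes similarityName was already defined in the index settings and
    // referenced by the field mapping before any documents were indexed.
    static List<Step> plan(String index, String similarityName, double k1, double b) {
        String settings = String.format(Locale.ROOT,
                "{\"index\":{\"similarity\":{\"%s\":{\"type\":\"BM25\",\"k1\":%s,\"b\":%s}}}}",
                similarityName, k1, b);
        List<Step> steps = new ArrayList<>();
        steps.add(new Step("POST", "/" + index + "/_close", ""));
        steps.add(new Step("PUT", "/" + index + "/_settings", settings));
        steps.add(new Step("POST", "/" + index + "/_open", ""));
        return steps;
    }
}
```

For example, plan("docs", "my_bm25", 1.6, 0.5) yields the three steps in order; each can then be sent with whatever HTTP client is at hand. The index is unavailable for search between the first and last step.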

Excellent thank you! Sounds like that solves my issue.

I've added some examples of how to achieve this in this PR.
