Elasticsearch: Could we combine multiple indices together?

Created on 20 May 2020 · 7 comments · Source: elastic/elasticsearch

_shrink currently allows merging multiple shards of the same index together. Could we extend _shrink to also support merging multiple indices together? This obviously comes with additional requirements, such as requiring the same creation version, similar mappings, no overlap of _ids, etc.
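To make the preconditions concrete, the validation step could look roughly like the sketch below. This is a simplified illustration with hypothetical names and dict-based stand-ins for index metadata, not the actual Elasticsearch implementation:

```python
# Sketch: validate that two indices could be combined.
# Each "index" is a simplified dict with 'created_version',
# 'mappings', and 'ids' (the set of document _ids).

def can_combine(index_a, index_b):
    """Return (ok, reason) for the compatibility checks named above."""
    if index_a["created_version"] != index_b["created_version"]:
        return False, "indices were created on different versions"
    if index_a["mappings"] != index_b["mappings"]:
        return False, "mappings are not identical"
    if index_a["ids"] & index_b["ids"]:
        return False, "overlapping _ids would produce duplicates"
    return True, "ok"

a = {"created_version": "7.8.0", "mappings": {"ts": "date"}, "ids": {"1", "2"}}
b = {"created_version": "7.8.0", "mappings": {"ts": "date"}, "ids": {"3"}}
c = {"created_version": "7.8.0", "mappings": {"ts": "date"}, "ids": {"2"}}

ok, _ = can_combine(a, b)     # compatible
dup, why = can_combine(a, c)  # rejected: _id overlap
```

In reality the mapping check would need to be more lenient than strict equality; that question comes up again further down the thread.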

The use-case I have in mind is the new indexing strategy that uses separate indices per dataset. Rolling over purely based on size would likely cause issues for low-throughput datasets, since their data would need to stay in the hot tier much longer than data from higher-throughput datasets. But rolling over based on time means creating smaller indices. Would it help if we were able to shrink indices together once they start accumulating?

I'm using the index terminology to write this description but what I really have in mind is data streams. If this feature only worked on data streams, I think that it would be good enough.

:Core/Features/Data streams >enhancement


All 7 comments

Pinging @elastic/es-core-features (:Core/Features/Data streams)

I think the backing indices of a data stream are a perfect candidate for merging indices.

I think this does rely on #44794, because otherwise there could be overlap in ids due to shippers sending duplicates after a data stream has been rolled over.

We discussed this today and while we didn't come to a conclusion, we had a number of points that we should share:

  • This sounds like a useful idea, especially with the new indexing strategy where the number of indices is growing. This would allow us to take a data stream and merge its backing indices to reduce the shard overhead.
  • A concern is how to treat indices that are not compatible, for example if their mappings differ.
  • How should we treat the creation date of the newly created index: should it be as old as the oldest source index, brand new, or that of the newest source index?
  • Should this live in the shrink API, or in a separate API (merge?) where it can be differentiated a bit more from the existing shrink API, which only works on a single index?
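On the mapping-compatibility concern in particular, one possible (purely hypothetical) policy would be a field-by-field merge that succeeds only when no field is mapped to conflicting types. A minimal sketch, using flat field-to-type dicts rather than real nested Elasticsearch mappings:

```python
# Sketch: merge two flat field->type mappings, failing on conflicts.
# Real mappings are nested and carry many more attributes; this only
# illustrates the compatibility question raised above.

def merge_mappings(left, right):
    merged = dict(left)
    for field, field_type in right.items():
        if field in merged and merged[field] != field_type:
            raise ValueError(
                f"conflicting types for '{field}': "
                f"{merged[field]} vs {field_type}"
            )
        merged[field] = field_type
    return merged

m1 = {"@timestamp": "date", "message": "text"}
m2 = {"@timestamp": "date", "host.name": "keyword"}
combined = merge_mappings(m1, m2)  # compatible: superset of both
```

Under such a policy, two backing indices whose mappings only differ by added fields would be combinable, while a `text` vs `keyword` conflict on the same field would reject the merge.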

After that discussion we decided to move this to the core/features team for discussion.

I chatted with @jpountz about this issue, and merging indices together is also interesting from a mappings (and other metadata) perspective. It could significantly reduce the number of repetitive mappings (if the mappings are compatible with each other). We would also need less node-level heap memory for things like the IndexService and MapperService.

This is different from the intent of this issue, which is reducing the number of shards. Merging at the metadata level also shouldn't mean that shards are merged; it is a different kind of operation, which should probably be exposed via a different API. Merging compatible IndexMetaData instances should leave the shards as they are, and simply make the shards of index X part of index Y (when merging index X into index Y).
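The metadata-level merge described above can be pictured as re-parenting shards rather than rewriting them. The following is a simplified sketch with made-up dict structures, not actual Elasticsearch internals:

```python
# Sketch: combine index X into index Y at the metadata level.
# Shard data is untouched; shards are re-pointed at Y and renumbered.

def combine_metadata(index_y, index_x):
    """index_y and index_x are dicts with 'name' and 'shards'
    (a list of shard descriptors). Returns the combined index."""
    start = len(index_y["shards"])
    for i, shard in enumerate(index_x["shards"]):
        shard["index"] = index_y["name"]  # re-point, don't copy segments
        shard["id"] = start + i           # avoid shard-number collisions
    index_y["shards"].extend(index_x["shards"])
    return index_y

y = {"name": "logs-y", "shards": [{"id": 0, "index": "logs-y"}]}
x = {"name": "logs-x", "shards": [{"id": 0, "index": "logs-x"}]}
combined = combine_metadata(y, x)
# combined now owns both shards; a separate shrink step could then
# reduce the shard count if desired.
```

The key design point is that the expensive segment-level work is deferred: combining metadata is cheap, and the existing shrink machinery handles shard reduction afterwards.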

I like this idea of decoupling the merging of indices from the merging of shards, @martijnvg; this would be more flexible than my original proposal.

We discussed this topic during last week's core/features sync and agreed that a new API should be introduced that merges compatible indices together. After that, the regular shrink API can be used to reduce the number of shards. We should probably name it something other than shrink or merge, since those already refer to shards and segments respectively. Maybe a combine indices API? How the combining of indices should be integrated into the ILM workflow is also a big open question.

I updated the title to reflect the current thinking that we wouldn't actually shrink indices together but only combine index metadata.
