Elasticsearch: Could we combine multiple indices together?

Created on 20 May 2020 · 7 comments · Source: elastic/elasticsearch

_shrink currently allows merging multiple shards of the same index together. Could we extend _shrink to also support merging multiple indices together? This obviously comes with additional requirements, such as requiring the same creation version, similar mappings, no overlap of _ids, etc.
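To make the preconditions concrete, the validation step could look roughly like the sketch below. This is a simplified illustration with hypothetical names and dict-based stand-ins for index metadata, not the actual Elasticsearch implementation:

```python
# Sketch: validate that two indices could be combined.
# Each "index" is a simplified dict with 'created_version',
# 'mappings', and 'ids' (the set of document _ids).

def can_combine(index_a, index_b):
    """Return (ok, reason) for the compatibility checks named above."""
    if index_a["created_version"] != index_b["created_version"]:
        return False, "indices were created on different versions"
    if index_a["mappings"] != index_b["mappings"]:
        return False, "mappings are not identical"
    if index_a["ids"] & index_b["ids"]:
        return False, "overlapping _ids would produce duplicates"
    return True, "ok"

a = {"created_version": "7.8.0", "mappings": {"ts": "date"}, "ids": {"1", "2"}}
b = {"created_version": "7.8.0", "mappings": {"ts": "date"}, "ids": {"3"}}
c = {"created_version": "7.8.0", "mappings": {"ts": "date"}, "ids": {"2"}}

ok, _ = can_combine(a, b)     # compatible
dup, why = can_combine(a, c)  # rejected: _id overlap
```

In reality the mapping check would need to be more lenient than strict equality; that question comes up again further down the thread.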

The use-case I have in mind is the new indexing strategy that uses separate indices per dataset. Rolling over purely based on size would likely cause issues for low-throughput datasets, since their data would need to stay in the hot tier much longer than data from higher-throughput datasets. But rolling over based on time means creating smaller indices. Would it help if we were able to shrink indices together once they start accumulating?

I'm using the index terminology to write this description but what I really have in mind is data streams. If this feature only worked on data streams, I think that it would be good enough.

:Core/Features/Data streams >enhancement


All 7 comments

Pinging @elastic/es-core-features (:Core/Features/Data streams)

I think the backing indices of a data stream are a perfect candidate for merging indices.

I think this does rely on #44794, because otherwise there could be overlap in ids due to shippers sending duplicates after a data stream has been rolled over.

We discussed this today and while we didn't come to a conclusion, we had a number of points that we should share:

  • This sounds like a useful idea, especially with the new indexing strategy where the number of indices is growing. This would allow us to take a data stream and merge its backing indices to reduce the shard overhead.
  • A concern is how to treat indices that are not compatible, for example if their mappings differ.
  • How should we treat the creation date of the newly created index: should it be as old as the oldest source index, brand new, or that of the newest source index?
  • Should this live in the shrink API, or in a separate API (merge?) where it can be differentiated a bit more from the existing shrink API, which only works on a single index?
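On the mapping-compatibility concern in particular, one possible (purely hypothetical) policy would be a field-by-field merge that succeeds only when no field is mapped to conflicting types. A minimal sketch, using flat field-to-type dicts rather than real nested Elasticsearch mappings:

```python
# Sketch: merge two flat field->type mappings, failing on conflicts.
# Real mappings are nested and carry many more attributes; this only
# illustrates the compatibility question raised above.

def merge_mappings(left, right):
    merged = dict(left)
    for field, field_type in right.items():
        if field in merged and merged[field] != field_type:
            raise ValueError(
                f"conflicting types for '{field}': "
                f"{merged[field]} vs {field_type}"
            )
        merged[field] = field_type
    return merged

m1 = {"@timestamp": "date", "message": "text"}
m2 = {"@timestamp": "date", "host.name": "keyword"}
combined = merge_mappings(m1, m2)  # compatible: superset of both
```

Under such a policy, two backing indices whose mappings only differ by added fields would be combinable, while a `text` vs `keyword` conflict on the same field would reject the merge.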

After that discussion we decided to move this to the core/features team for discussion.

I chatted with @jpountz about this issue, and merging indices together is also interesting from a mappings (and other metadata) perspective. It could significantly reduce the number of repetitive mappings (if the mappings are compatible with each other). We would also need less node-level heap memory for things like the IndexService and MapperService.

This is different from the intent of this issue, which is reducing the number of shards. Merging at the metadata level also shouldn't mean that shards are merged; it is a different kind of operation, which should probably be exposed via a different API. Merging compatible IndexMetaData instances should leave the shards as they are, and simply make the shards of index X part of index Y (when merging index X into index Y).
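The metadata-level merge described above can be pictured as re-parenting shards rather than rewriting them. The following is a simplified sketch with made-up dict structures, not actual Elasticsearch internals:

```python
# Sketch: combine index X into index Y at the metadata level.
# Shard data is untouched; shards are re-pointed at Y and renumbered.

def combine_metadata(index_y, index_x):
    """index_y and index_x are dicts with 'name' and 'shards'
    (a list of shard descriptors). Returns the combined index."""
    start = len(index_y["shards"])
    for i, shard in enumerate(index_x["shards"]):
        shard["index"] = index_y["name"]  # re-point, don't copy segments
        shard["id"] = start + i           # avoid shard-number collisions
    index_y["shards"].extend(index_x["shards"])
    return index_y

y = {"name": "logs-y", "shards": [{"id": 0, "index": "logs-y"}]}
x = {"name": "logs-x", "shards": [{"id": 0, "index": "logs-x"}]}
combined = combine_metadata(y, x)
# combined now owns both shards; a separate shrink step could then
# reduce the shard count if desired.
```

The key design point is that the expensive segment-level work is deferred: combining metadata is cheap, and the existing shrink machinery handles shard reduction afterwards.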

I like this idea of decoupling the merging of indices from the merging of shards, @martijnvg; this would be more flexible than my original proposal.

We discussed this topic during last week's core/features sync and agreed that a new API should be introduced that merges compatible indices together. After that, the regular shrink API can be used to reduce the number of shards. We should probably name it something other than shrink or merge, since those already refer to shards and segments respectively. Maybe a combine indices API? How the combining of indices should be integrated into the ILM workflow is also a big open question.

I updated the title to reflect the current thinking that we wouldn't actually shrink indices together but only combine index metadata.
