I hit this issue on ES 2.2.0. All ES data nodes have multiple hard disks and no RAID scheme is used; they are just a bunch of disks (JBOD). The usable space of the disks on the same host differs because of some XFS implementation details: even when two XFS filesystems have the same capacity and hold directories and files of the same size, they don't always report the same free space, presumably as a result of file creations and deletions.
I pre-create many indices with replicas set to 0 and 50 shards for 50 data nodes. An initial empty shard is tiny, just a few KB of disk usage, yet ES surprisingly creates a lot of shards on the least-used disks: for example, if one of the ten disks on a data node has 10 KB more free space, it ends up with about 8 more shards than the other disks. Because I use an hourly index pattern, all shards for a contiguous range of hours go to the same disk, so the least-used disk runs out of space while the other disks on that host still have plenty of free space.
I looked at the code a bit: https://github.com/elastic/elasticsearch/blob/v2.2.0/core/src/main/java/org/elasticsearch/index/IndexService.java#L327. The IndexService instance is per index, not a singleton, so "dataPathToShardCount" is always empty when the index has zero replicas. As a result, at https://github.com/elastic/elasticsearch/blob/v2.2.0/core/src/main/java/org/elasticsearch/index/shard/ShardPath.java#L231, "count" is always null and "ShardPath" always selects the least-used disk to allocate the shard. Unluckily the empty shard is very small, so a lot of shards end up on the least-used disk.
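For what it's worth, here is a simplified, self-contained sketch of how I read that selection logic. This is not the actual Elasticsearch code; the class and field names are made up for illustration. The point is that the path with the most usable space always wins, and with an empty per-index count map nothing is ever deducted, so the same disk keeps winning for every tiny empty shard:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified illustration (not the real ShardPath.selectNewPathForShard) of
// how a data path is chosen in 2.2.0, assuming only free space and the
// per-index "dataPathToShardCount" map matter.
public class PathSelectionSketch {

    static final class DataPath {
        final String name;
        final long usableBytes;
        DataPath(String name, long usableBytes) {
            this.name = name;
            this.usableBytes = usableBytes;
        }
    }

    // Pick the path with the most usable space, after deducting an estimate
    // for shards of this index already assigned to that path. With zero
    // replicas the map stays empty, so nothing is ever deducted and the
    // "least used" disk wins for every new empty shard.
    static DataPath selectPath(DataPath[] paths, long estShardSizeInBytes,
                               Map<String, Integer> dataPathToShardCount) {
        DataPath best = null;
        long maxUsable = Long.MIN_VALUE;
        for (DataPath p : paths) {
            long usable = p.usableBytes;
            Integer count = dataPathToShardCount.get(p.name);
            if (count != null) {
                usable -= estShardSizeInBytes * count; // never reached when the map is empty
            }
            if (usable > maxUsable) {
                maxUsable = usable;
                best = p;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        DataPath[] disks = {
            new DataPath("/data0", 1_000_000_010L), // a few bytes more free space
            new DataPath("/data1", 1_000_000_000L),
        };
        Map<String, Integer> perIndexCounts = new HashMap<>(); // empty: replicas = 0
        for (int shard = 0; shard < 5; shard++) {
            // Every iteration picks /data0: the empty shard barely changes the
            // free space, and the count map never fills in.
            System.out.println("shard " + shard + " -> "
                    + selectPath(disks, 1024, perIndexCounts).name);
        }
    }
}
```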
This issue isn't limited to the zero-replica use case: the current ES logic always prefers the least-used disk, and because an empty shard is so small, the result is an uneven distribution.
I feel the "dataPathToShardCount" map should be global to all indices on a node, not local to a single index, but maybe there is a better solution.
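To illustrate what I mean, here is a hypothetical, self-contained sketch of counting shards per data path across all indices by scanning shard directories on disk. Everything in it (class and method names, and where the "indices" directory lives under each path.data) is an assumption for illustration, not how Elasticsearch actually builds the map:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical sketch: build "dataPathToShardCount" across ALL indices on the
// node by counting shard directories under each data path, instead of only
// counting shards of the index currently being allocated. The caller passes
// the "indices" directory of each data path; its exact location under
// path.data depends on the ES version's on-disk layout.
public class NodeWideShardCount {

    static Map<Path, Integer> countShardsPerDataPath(Map<Path, Path> indicesDirByDataPath) throws IOException {
        Map<Path, Integer> counts = new HashMap<>();
        for (Map.Entry<Path, Path> entry : indicesDirByDataPath.entrySet()) {
            int shards = 0;
            Path indicesDir = entry.getValue();
            if (Files.isDirectory(indicesDir)) {
                try (Stream<Path> indexDirs = Files.list(indicesDir)) {
                    for (Path indexDir : (Iterable<Path>) indexDirs::iterator) {
                        if (!Files.isDirectory(indexDir)) {
                            continue;
                        }
                        try (Stream<Path> children = Files.list(indexDir)) {
                            // Shard directories are numeric ("0", "1", ...); skip "_state" and friends.
                            shards += (int) children
                                    .filter(p -> p.getFileName().toString().matches("\\d+"))
                                    .count();
                        }
                    }
                }
            }
            counts.put(entry.getKey(), shards);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Map<Path, Path> layout = new HashMap<>();
        layout.put(Paths.get("/data0"), Paths.get("/data0/nodes/0/indices")); // layout is an assumption
        layout.put(Paths.get("/data1"), Paths.get("/data1/nodes/0/indices"));
        System.out.println(countShardsPerDataPath(layout));
    }
}
```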
My workaround is to build RAID 0 across the disks on each node, but RAID 0 is too risky when a host has about 10 disks; maybe I should choose RAID 6 instead.
I suspect that pre-allocating the indices doesn't interact well with the prediction of shard sizes added in https://github.com/elastic/elasticsearch/pull/11185. @mikemccand does that sound like a good culprit to you?
https://github.com/elastic/elasticsearch/pull/12947 may also be at work here.
@mikemccand does that sound like a good culprit to you?
Yes.
the IndexService instance is per index
I feel the "dataPathToShardCount" map should be global to all indices on a node, not local to a single index, but maybe there is a better solution.
Oh this is bad: I agree, the dataPathToShardCount should (and I intended it to be, originally, in #11185!) be across all indices, not just this one index. Hrmph.
However, I think another fix by @s1monw allowed ES shard balancing to "see" the individual path.data dirs on a single node, and move shards off one of a node's path.data dirs that was filling up even if the other path.data dirs on the node had plenty of free space?
@mikemccand, the ability to move shards among the different disks of a node is good to have; we can't always assume different indices have similar sizes, and in fact they usually vary a lot due to hourly or daily traffic patterns.
But an even initial distribution is still very desirable: it avoids as much shard relocation as possible, and I wouldn't like to see an ES node suddenly competing with itself for disk access between shard relocation, segment refreshes, and segment merges.
Oh, maybe I should just go to RAID :-)
@Dieken Yeah I understand ... if you have ideas on how to fix the dataPathToShardCount to be across all indices instead, that would be great too ;)
Any updates on this? I'm working with a write-heavy cluster, and I value distributing disk throughput over balancing by file size. New indices were just allocated with all shards on a single path.data out of the 4, due to imbalance in disk usage.
An API to relocate shards between path.data directories would be enough to deal with this. Right now it looks like I have to shut down nodes and move the folders manually.
Thanks
With #26654 and #27039, the distribution problem should be fixed now.
This doesn't add the API to relocate shards between path.data paths though, that would be a different thing.
@dakrone I'm tempted to close this as there hasn't been overwhelming interest in it and I don't think we are planning on working on sub-node allocation decisions in the foreseeable future. WDYT?
@DaveCTurner I agree, I think this can be closed for now as we've addressed a portion of it.
If anyone disagrees let us know and we can revisit!