When using path.data over multiple physical disks, the system should recover automatically when a disk is removed. Currently, search and indexing requests that hit the missing shards throw exceptions, and no allocation or recovery occurs. The only way to bring the data back online is to restart the node, or to reinsert the original disk with the existing data.
It would be great if Elasticsearch could detect the lost data path, fail the affected shards, and recover or reallocate them automatically.
Steps to Test / Reproduce:
1) Set up path.data over 2 disks and start 2 Elasticsearch nodes locally:
path.data: ["/Volumes/KINGSTON", "/Volumes/SDCARD"]
2) Index some data over 5 shards.
index shard prirep state docs store ip node
test1003 4 r STARTED 2 10.1kb 127.0.0.1 Jacqueline Falsworth
test1003 4 p STARTED 2 10.1kb 127.0.0.1 Vindicator
test1003 3 r STARTED 6 24.4kb 127.0.0.1 Jacqueline Falsworth
test1003 3 p STARTED 6 24.5kb 127.0.0.1 Vindicator
test1003 1 r STARTED 10 40.6kb 127.0.0.1 Jacqueline Falsworth
test1003 1 p STARTED 10 45.5kb 127.0.0.1 Vindicator
test1003 2 r STARTED 2 10.1kb 127.0.0.1 Jacqueline Falsworth
test1003 2 p STARTED 2 10.1kb 127.0.0.1 Vindicator
test1003 0 r STARTED 3 10.1kb 127.0.0.1 Jacqueline Falsworth
test1003 0 p STARTED 3 10.1kb 127.0.0.1 Vindicator
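(For reference, the layout above can be reproduced with requests along these lines; the index name matches the output, the document body is illustrative.)
```
# create a 5-shard, 1-replica index
curl -XPUT 'localhost:9200/test1003' -d '{
  "settings": { "number_of_shards": 5, "number_of_replicas": 1 }
}'
# index a few documents (auto-generated IDs)
curl -XPOST 'localhost:9200/test1003/doc' -d '{ "message": "hello" }'
# check shard allocation
curl 'localhost:9200/_cat/shards?v'
```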
3) Remove the disk that contains most/all of the data
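(One way to simulate this locally, assuming the macOS volumes from step 1; physically pulling the USB stick or SD card works too:)
```
diskutil unmount force /Volumes/KINGSTON
```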
Exceptions start to show up in the logs:
[2016-05-11 11:50:18,961][DEBUG][action.admin.indices.stats] [Vindicator] [indices:monitor/stats] failed to execute operation for shard [[[test1003/01ABN7pTQDCoTa80WMdAvg]][0], node[AMr_NWrVSFCuNV-YCOfsVg], [P], s[STARTED], a[id=IMwYwgWrTLCZYa08WJRNvg]]
ElasticsearchException[failed to refresh store stats]; nested: NoSuchFileException[/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index];
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1411)
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1396)
at org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:54)
at org.elasticsearch.index.store.Store.stats(Store.java:321)
at org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:632)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:137)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:166)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:414)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:393)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:380)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:65)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:468)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.NoSuchFileException: /Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:407)
at java.nio.file.Files.newDirectoryStream(Files.java:457)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:215)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:234)
at org.elasticsearch.index.store.FsDirectoryService$1.listAll(FsDirectoryService.java:135)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.elasticsearch.index.store.Store$StoreStatsCache.estimateSize(Store.java:1417)
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1409)
... 18 more
[2016-05-11 11:50:26,796][WARN ][monitor.fs ] [Vindicator] Failed to fetch fs stats - returning empty instance
but _cat/shards shows everything is OK:
index shard prirep state docs store ip node
test1003 4 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 4 p STARTED 127.0.0.1 Vindicator
test1003 3 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 3 p STARTED 127.0.0.1 Vindicator
test1003 1 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 1 p STARTED 127.0.0.1 Vindicator
test1003 2 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 2 p STARTED 127.0.0.1 Vindicator
test1003 0 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 0 p STARTED 127.0.0.1 Vindicator
4) Post a _refresh
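(Roughly:)
```
curl -XPOST 'localhost:9200/test1003/_refresh'
```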
No change
5) Index some data
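(Any write routed to a shard on the missing disk now fails with a 500. For example, a request like the following, with an illustrative document body:)
```
curl -XPOST 'localhost:9200/test1003/doc' -d '{ "message": "hello again" }'
```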
{
"error": {
"root_cause": [
{
"type": "index_failed_engine_exception",
"reason": "Index failed for [test1003#AVSghrSCuf6DFWq498vy]",
"index_uuid": "01ABN7pTQDCoTa80WMdAvg",
"shard": "1",
"index": "test1003"
}
],
"type": "index_failed_engine_exception",
"reason": "Index failed for [test1003#AVSghrSCuf6DFWq498vy]",
"index_uuid": "01ABN7pTQDCoTa80WMdAvg",
"shard": "1",
"index": "test1003",
"caused_by": {
"type": "i_o_exception",
"reason": "Input/output error: NIOFSIndexInput(path=\"/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/1/index/_a.cfs\") [slice=_a_Lucene50_0.tim]",
"caused_by": {
"type": "i_o_exception",
"reason": "Input/output error"
}
}
},
"status": 500
}
The logs show an exception:
[2016-05-11 11:52:26,911][DEBUG][action.admin.indices.stats] [Vindicator] [indices:monitor/stats] failed to execute operation for shard [[[test1003/01ABN7pTQDCoTa80WMdAvg]][0], node[AMr_NWrVSFCuNV-YCOfsVg], [P], s[STARTED], a[id=IMwYwgWrTLCZYa08WJRNvg]]
ElasticsearchException[failed to refresh store stats]; nested: NoSuchFileException[/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index];
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1411)
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1396)
at org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:54)
at org.elasticsearch.index.store.Store.stats(Store.java:321)
at org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:632)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:137)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:166)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:414)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:393)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:380)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:65)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:468)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.NoSuchFileException: /Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:407)
at java.nio.file.Files.newDirectoryStream(Files.java:457)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:215)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:234)
at org.elasticsearch.index.store.FsDirectoryService$1.listAll(FsDirectoryService.java:135)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.elasticsearch.index.store.Store$StoreStatsCache.estimateSize(Store.java:1417)
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1409)
... 18 more
_cat/shards still shows all shards STARTED:
index shard prirep state docs store ip node
test1003 4 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 4 p STARTED 127.0.0.1 Vindicator
test1003 3 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 3 p STARTED 127.0.0.1 Vindicator
test1003 1 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 1 p STARTED 127.0.0.1 Vindicator
test1003 2 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 2 p STARTED 127.0.0.1 Vindicator
test1003 0 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 0 p STARTED 127.0.0.1 Vindicator
6) Wait 5 minutes, then search some data:
No change
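(The search used was roughly the following; two of the five shards fail:)
```
curl -XGET 'localhost:9200/test1003/_search' -d '{ "query": { "match_all": {} } }'
```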
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 3,
"failed": 2,
"failures": [
{
"shard": 0,
"index": "test1003",
"node": "AMr_NWrVSFCuNV-YCOfsVg",
"reason": {
"type": "i_o_exception",
"reason": "Input/output error: NIOFSIndexInput(path=\"/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index/_0.cfs\") [slice=_0.fdt]",
"caused_by": {
"type": "i_o_exception",
"reason": "Input/output error"
}
}
},
{
"shard": 1,
"index": "test1003",
"node": "wK5mnEIaT82Wz3wdTAjv6Q",
"reason": {
"type": "i_o_exception",
"reason": "Input/output error: NIOFSIndexInput(path=\"/Volumes/KINGSTON/elasticsearch/nodes/1/indices/01ABN7pTQDCoTa80WMdAvg/1/index/_2.cfs\") [slice=_2.fdt]",
"caused_by": {
"type": "i_o_exception",
"reason": "Input/output error"
}
}
}
]
},
"hits": {
"total": 23,
"max_score": 1,
"hits": []
}
}
_cat/shards output remains unchanged:
index shard prirep state docs store ip node
test1003 4 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 4 p STARTED 127.0.0.1 Vindicator
test1003 3 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 3 p STARTED 127.0.0.1 Vindicator
test1003 1 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 1 p STARTED 127.0.0.1 Vindicator
test1003 2 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 2 p STARTED 127.0.0.1 Vindicator
test1003 0 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 0 p STARTED 127.0.0.1 Vindicator
Related to #18217
While I think there may be improvements that can be made when a disk dies, if you want hot swapping etc. I think you need a proper RAID setup or LVM.
I think we need to add some resiliency here: fail the shards on a data path that is no longer accessible, and check that a data path is writable before allocating shards to it.
I will take care of this
Yeah, I am torn on the hot-swapping. I think we can potentially take things out of the loop internally, but if you are plugging in a new disk and expecting us to auto-detect that a data path is good again, I think you should restart the node instead?
Definitely we don't want to _introduce_ any resiliency issues. Some manual intervention makes sense, but restarting a node can sometimes take a long time. Should there be something like delayed allocation on marking a path.data as failed? There is also the case of something like NFS, where a network problem might make the drive appear to come and go.
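(For context, the delayed-allocation mechanism alluded to here is a per-index setting that applies when a node leaves the cluster, not to individual data paths; shown only as the analogy being suggested:)
```
# existing setting: delay reallocation of shards from a departed node
curl -XPUT 'localhost:9200/test1003/_settings' -d '{
  "index.unassigned.node_left.delayed_timeout": "5m"
}'
```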
I think if you lose a disk you need to restart the node. We can certainly improve things along the lines of failing shards more quickly, but we shouldn't try to be fancy here. I think we should take the node out of the cluster somehow, but that's something that needs more thought.
Multiple disks on path.data offer some added benefit over RAID0: I/O is spread out over all disks, theoretically matching RAID0 performance, without a single disk loss causing a total volume failure. Restarting a node is much easier than rebuilding a logical volume, and much less data is lost, so either way we are ahead.
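(For comparison, a software RAID0 volume over the same two disks could be built on Linux with something like the following; device names and mount point are examples. The trade-off is exactly the one described above: a single disk failure takes out the whole volume.)
```
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.ext4 /dev/md0
mount /dev/md0 /var/lib/elasticsearch
```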
I think if you lose a disk you need to restart the node. We can certainly improve things along the lines of failing shards more quickly, but we shouldn't try to be fancy here. I think we should take the node out of the cluster somehow, but that's something that needs more thought.
In general this makes sense, but it would be nice if you could apply something like a transient setting to tell that node that a disk has died and to temporarily stop trying to perform I/O on it. That would still require manual intervention, but it would allow applying a temporary hotfix if a node restart is not immediately feasible.
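(Purely a sketch of the idea; no such setting exists today and the setting name below is made up:)
```
# hypothetical: tell the node to stop performing I/O on a dead data path
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "node.data_path.disabled": ["/Volumes/KINGSTON"] }
}'
```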
Had this issue come up again last night.
Our logging nodes have 4 SSDs. We've passed an array to path.data in elasticsearch.yml.
Over the weekend, one of the file systems on one of the disks on one of the ES servers became corrupt. Over the next 12 hours, ES spewed 500GB of errors like the following into the logs, filling up the root partition and eventually alerting us (because we alert on disk usage, but we didn't at the time have alerts on ES log file size/growth).
[2016-10-22 00:00:04,017][WARN ][cluster.action.shard ] [deliverability_master02-es02] [logstash-delivery-2016.10.14.09][0] received shard failed for target shard [[logstash-delivery-2016.10.14.09][0], node[J_Wws-cKQPKPJjIE7lEacw], relocating [IIKJ3BHGRlG0IYmZ3GLeNA], [R], v[8192], s[INITIALIZING], a[id=HwzksPLITruZz94vsNTMvg, rId=6DS2pI5FS3uih0a1yvRJFw], expected_shard_size[25697352067]], indexUUID [RL1zWoD6SN6_ZmpjPGM0Yw], message [failed to create shard], failure [ElasticsearchException[failed to create shard]; nested: NotSerializableExceptionWrapper[file_system_exception: /storage/sdd1/deliverability/nodes/0/indices/logstash-delivery-2016.10.14.09/0/_state: Input/output error]; ]
[logstash-delivery-2016.10.14.09][[logstash-delivery-2016.10.14.09][0]] ElasticsearchException[failed to create shard]; nested: NotSerializableExceptionWrapper[file_system_exception: /storage/sdd1/deliverability/nodes/0/indices/logstash-delivery-2016.10.14.09/0/_state: Input/output error];
at org.elasticsearch.index.IndexService.createShard(IndexService.java:389)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:620)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:520)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:177)
at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:610)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: NotSerializableExceptionWrapper[file_system_exception: /storage/sdd1/deliverability/nodes/0/indices/logstash-delivery-2016.10.14.09/0/_state: Input/output error]
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:427)
at java.nio.file.Files.newDirectoryStream(Files.java:457)
at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:257)
at org.elasticsearch.index.shard.ShardPath.loadShardPath(ShardPath.java:122)
at org.elasticsearch.index.IndexService.createShard(IndexService.java:310)
... 10 more
There are 12 data nodes in this cluster, each with 4 SSDs, plus 3 dedicated masters; we run a replication factor of 2 using hourly indices with 2 primary shards.
During the time that this happened, Elasticsearch continued to place primary shards on the failed storage/sdd1 drive. The writes to those primaries failed, and we were only alerted to the problem because the errors in the logs filled up the root disk (interesting to note as well that the cluster remained green the entire time, and none of our monitoring and alerting on /_cluster and _nodes stats caught it, which is our fault, but still important to note).
As a result of Elasticsearch continuing to place primary shards on the failed disk, we lost half of the log data for 9 of the 12 hours that this disk was unreachable (9 out of 12 times it attempted to place at least one of each hour's primary shards on the unreachable disk; the writes to that primary failed, and the primary was never moved elsewhere).
I suspect, although I did not dig into it or write a test case to prove it, that the process by which Elasticsearch determines which nodes are eligible for a write, and which disk to write to once it gets there, might also bias further writes towards the drive that failed. In our case, we had 9 data nodes that were eligible to accept writes, each having 4 eligible disks, none of which had exceeded any watermarks or were otherwise unwritable. Over 12 hours, 9 of the 24 primary shards created were allocated to the node with the disk failure, and it routed them to the unreachable disk. Because it had been unwritable for several hours, that disk was also less full than the other disks in the cluster. Again, I don't know that a disk failure like the one we had biases shard placement in favor of writing to the unreachable disk, but we did see an abnormally high number of shards placed on one machine, and on one disk on that machine... abnormal enough to make me wonder whether it was just a coincidence.
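(For reference, the watermarks mentioned above are the disk-based allocation thresholds below; they are evaluated against free space on the node and do not track the health of individual data paths, which is consistent with a dead-but-emptier path still looking attractive for placement.)
```
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}'
```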
All of which is to say: I think this issue is extremely important. I also think @s1monw is right to suggest that ensuring a file path is writable before placing a shard (especially a primary shard) will go a long way towards adding resiliency.
https://github.com/elastic/elasticsearch/issues/18279#issuecomment-218994716 describes two things that need to happen to resolve this issue. The first has been done in https://github.com/elastic/elasticsearch/pull/16745. The second (failing the shard) is very easy. I opened https://github.com/elastic/elasticsearch/issues/29008 to highlight it as an adoptme and low-hanging fruit. Closing this one as superseded by those two issues.
@bleskes would you consider reopening this ticket as a high-hanging fruit, as per https://github.com/elastic/elasticsearch/issues/29008#issuecomment-372852685? Or, if you feel it should remain closed, can you share a bit more of your thinking about why? I don't feel like https://github.com/elastic/elasticsearch/pull/16745 and https://github.com/elastic/elasticsearch/issues/18279#issuecomment-218994716 are talking about the same thing.
@evanv I agree it's not the same thing. As the discussion above indicates, we feel adding hot-swappability at the path level would come at too high a price. Elasticsearch currently works at the level of a node - shard copies are spread across nodes, and if a shard fails the master will try to assign it to another node. We can do better there and start tracking failures per node so we can stop allocating to it (we don't do that now), but adding another conceptual layer isn't worth it. LVM or RAID are much more mature solutions for that part. That said, a few things we can do came out of the discussion. One is done and the other is tracked by another issue, which is why I closed this one.
Thank you for explaining. I see what you're saying.
I feel like this ticket shouldn't be called "Hot swappable data paths" and should instead be a bug report along the lines of "ES shouldn't allocate shards to dead disks." I think the latter is still true, albeit far more complicated, to your point. I also feel like the docs recommending multiple file paths should be caveated that RAID0 might be a better option, depending on your needs (I'm happy to submit an update to the docs along these lines, if you'd be open to accepting it).
You're definitely right that ES shouldn't be responsible for replacing RAID or LVM. Focusing on the issues you did makes sense as a better solution than what currently exists. Not to beat a dead horse, but I do feel that ES should be capable of not trying to allocate shards to dead disks. That is how I viewed this original issue, and it sounds like we both agree that https://github.com/elastic/elasticsearch/issues/29008 doesn't quite cover that. Would you be open to adding an issue along the lines of "ES shouldn't allocate shards to dead disks" and/or renaming this one and orienting its scope around that, not hot swappable disks?
Pinging @elastic/es-core-infra
I also feel like the docs recommending multiple file paths should be caveated that RAID0 might be a better option, depending on your needs (I'm happy to submit an update to the docs along these lines, if you'd be open to accepting it).
Yes please, though I tried to find what you meant and couldn't.
Would you be open to adding an issue along the lines of "ES shouldn't allocate shards to dead disks" and/or renaming this one and orienting its scope around that, not hot swappable disks?
I think this one https://github.com/elastic/elasticsearch/issues/18417 covers it? If you agree, feel free to comment there.
Yes please, though I tried to find what you meant and couldn't.
I may be recalling incorrectly, or it may have been a blog post. In any event, I'll poke around and add a note to the docs on "things to watch out for" vis-à-vis multiple data paths.
https://github.com/elastic/elasticsearch/issues/18417 does cover my concern, yes. Thanks for taking the time to explain your reasoning on this one. I wasn't following you at first, but it's very clear now what you're thinking and how you're breaking down the work on this task. Much appreciated.