Elasticsearch version: 2.3.3
JVM version: 1.8.0_60
OS version: centos-7.0
Description of the problem including expected versus actual behavior:
My cluster consists of 10 data nodes. When one node got stuck on disk I/O (because of a hardware problem), writes across the whole cluster were stuck for several minutes (> 10 minutes), and the bad node was not removed from the cluster automatically.
Describe the feature:
There could be a thread that monitors disk I/O for timeouts; if a timeout occurs and exceeds a configured threshold, the bad node should be removed from the cluster automatically. A rough sketch of the idea follows.
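For illustration only, here is a minimal sketch of what such a watchdog might look like (this is not part of Elasticsearch; the threshold, interval, and callback are all assumptions):

```python
import os
import threading
import time

# Assumed, configurable values for this sketch.
IO_TIMEOUT_SECONDS = 5.0
CHECK_INTERVAL_SECONDS = 30.0

def probe_disk_latency(data_path):
    """Write and fsync a tiny file on the data path, returning the elapsed time."""
    probe_file = os.path.join(data_path, ".io_probe")
    start = time.monotonic()
    with open(probe_file, "wb") as f:
        f.write(b"x" * 4096)
        f.flush()
        os.fsync(f.fileno())
    os.remove(probe_file)
    return time.monotonic() - start

def io_watchdog(data_path, on_unhealthy):
    # Note: a truly stuck disk would block the probe itself, so a real
    # implementation would have to run the probe with its own timeout.
    while True:
        latency = probe_disk_latency(data_path)
        if latency > IO_TIMEOUT_SECONDS:
            # In the feature request this would remove the node from the
            # cluster; here we only invoke a callback.
            on_unhealthy(latency)
        time.sleep(CHECK_INTERVAL_SECONDS)

# Example (hypothetical data path):
# threading.Thread(target=io_watchdog,
#                  args=("/var/lib/elasticsearch", print),
#                  daemon=True).start()
```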
Hi @curu
We discussed this in FixItFriday. Removing a node because of slow disk I/O is quite an aggressive decision to make, and I would be hesitant to have Elasticsearch make it automatically: e.g. you remove one node, so the other nodes have to do shard recovery; now they're slow too, so we remove another node, and so on.
Instead, we could log warnings about things like slow disk I/O and a monitoring system could pick up on these warnings and alert the sysadmin.
Hi @clintongormley, I think we could expose the I/O wait % on the node stats API. It might be a good metric to watch for I/O problems, and the API can be easily read by external monitoring systems.
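For illustration, a rough sketch (assuming Linux and `/proc/stat`) of how an external monitor could compute such an iowait percentage today, i.e. the kind of value a node-stats field might expose:

```python
import time

def read_cpu_times():
    # First line of /proc/stat: "cpu user nice system idle iowait irq softirq steal ..."
    with open("/proc/stat") as f:
        return [int(v) for v in f.readline().split()[1:]]

def iowait_percent(interval=1.0):
    """Approximate the iowait share of total CPU time over `interval` seconds."""
    before = read_cpu_times()
    time.sleep(interval)
    after = read_cpu_times()
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0  # index 4 = iowait

if __name__ == "__main__":
    print(f"iowait: {iowait_percent():.1f}%")
```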
It is worth mentioning that I/O wait was considered in #15915 but we ultimately pulled it out.
@clintongormley , agreed.
How should we deal with this when we run into the issue? Help!
I don't think logging is the way to go here; monitoring is. We can consider enhancing the stats to include disk I/O wait, but that is a different enhancement altogether. Therefore, I am closing this one.