Elasticsearch: Add Timeout for Search Network Actions to Improve Cluster Resilience

Created on 22 Jul 2020 · 10 comments · Source: elastic/elasticsearch

Issue

Currently, the coordinating node sends the Query and Fetch network actions to remote data nodes without any timeout options:

    // TransportRequestOptions.EMPTY carries no timeout, so the coordinating node
    // waits indefinitely for the data node's response.
    public <T extends TransportResponse> void sendChildRequest(final Transport.Connection connection, final String action,
                                                               final TransportRequest request, final Task parentTask,
                                                               final TransportResponseHandler<T> handler) {
        sendChildRequest(connection, action, request, parentTask, TransportRequestOptions.EMPTY, handler);
    }

This has a very bad impact: when the machine behind one of the data nodes suffers a disk failure, it can no longer handle I/O operations such as reading or writing data from disk, but it is still connected to the other nodes. The node acts as a black hole in the cluster, stalling every shard search request sent from the coordinating node. The accumulated requests keep growing and consume a lot of memory on the coordinating node, which soon drives it into full GC.

We maintain a production environment of about 300 nodes, where disk failures are very common. We tried setting a timeout in the search request body, as described in https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html#global-search-timeout, but it does not take effect in our situation because that timeout mainly applies to Lucene, as discussed in https://github.com/elastic/elasticsearch/issues/9156.
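For concreteness, a minimal sketch of such a request-body timeout, assuming the 7.x high-level REST client (the `client` variable, index name, and field are placeholders, not our exact request):

    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.common.unit.TimeValue;
    import org.elasticsearch.search.aggregations.AggregationBuilders;
    import org.elasticsearch.search.builder.SearchSourceBuilder;

    SearchSourceBuilder source = new SearchSourceBuilder()
        .timeout(TimeValue.timeValueSeconds(5))                              // request-body timeout, checked during the Lucene query phase
        .aggregation(AggregationBuilders.terms("by_host").field("host"));
    SearchRequest request = new SearchRequest("logs-*").source(source);
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    boolean timedOut = response.isTimedOut();                                // true if any shard gave up at the timeout

Because this timeout is only checked while the data node is collecting documents, it does not bound how long the coordinating node waits for the shard response itself.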

So I have applied the request body's timeout to the query, fetch, and write network actions. It seems to greatly improve cluster resilience when a node is overloaded or its disk fails. I wonder whether this solution is good enough, or whether there is a better approach?
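Roughly, the change looks like the sketch below (not the actual patch): replace TransportRequestOptions.EMPTY with options that carry the timeout taken from the search request body, assuming the 7.x TransportRequestOptions builder; `searchRequestTimeout` is a hypothetical field holding that value.

    public <T extends TransportResponse> void sendChildRequest(final Transport.Connection connection, final String action,
                                                               final TransportRequest request, final Task parentTask,
                                                               final TransportResponseHandler<T> handler) {
        // Propagate the request-body timeout down to the transport layer so a stuck
        // data node can no longer hold the child request forever.
        final TransportRequestOptions options = TransportRequestOptions.builder()
            .withTimeout(searchRequestTimeout)   // hypothetical field, e.g. TimeValue.timeValueSeconds(30)
            .build();
        sendChildRequest(connection, action, request, parentTask, options, handler);
    }

With a transport-level timeout, the response handler receives a ReceiveTimeoutTransportException instead of blocking indefinitely, so the coordinating node can fail the shard request (or return partial results) rather than keeping it in memory.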

:Search/Search >bug Search feedback_needed

All 10 comments

The bad consequences of a disk failure are likely much more widespread than searches alone. However I think the problem described here will be improved in 7.9 thanks to https://github.com/elastic/elasticsearch/pull/52680 which will remove a node from the cluster fairly promptly if its disk fails. Would you confirm or otherwise that this improves your experience?

Pinging @elastic/es-search (:Search/Search)

@DaveCTurner Thanks for your reply. I think detecting read-only filesystems is not enough; we also need to handle the overload situation. Even when the disk is heavily loaded and its %util reaches 95%, it can still handle read and write operations, just very slowly. Will a heavily loaded disk be treated as a disk failure?

This is interesting. I opened https://github.com/elastic/elasticsearch/issues/59824; not sure if that helps with your case, @boicehuang.

In my experience, concurrent aggregation requests stuck on coordinating nodes can easily drive many nodes of the cluster into full GC and make them unresponsive when disks are heavily loaded or failing. That is why I brought up search requests above.

@DaveCTurner Thanks for your reply. Let me explain my experience in more detail. The process is as follows.

  1. A big aggregation request is sent to the coordinating node with a timeout and is split into 5 shard search requests. The shard search requests are sent to the data nodes without any timeout.

  2. 4 of the data nodes returned their shard search responses quickly, but one data node's machine became so heavily loaded that Elasticsearch handled the shard search request very slowly and only returned its response after 5 minutes. (In my experience, some types of disk failure just slow down I/O rather than failing it outright.)

  3. The coordinating node was still waiting for the response from shard 1. The client waited for 5 s but could not get even partial results, so it abandoned the request; the aggregation, however, was still stuck in Elasticsearch with 4 shard search responses, waiting for the one shard search response left.

  4. After 5 minutes the remaining shard finally returned its response and the aggregation finished.

My question:

1) Can we end the aggregation earlier and respond to the request with partial results, instead of waiting for the missing response or for the node health checker?

Can we end the aggregation earlier and respond to the request with partial results, instead of waiting for the missing response or for the node health checker?

In theory we _could_ but it doesn't really help much. We'd carry on sending traffic to the bad node, all of which will have to wait for the same timeout, so latency would be terrible and a backlog would still form. Much better to stop sending requests to the node ASAP, i.e. to remove it from the cluster.

In my experience, some types of disk failure just slow down I/O.

This is the bit I am not understanding. Faulty disks fail IO requests quickly IME (assuming they're properly configured and don't retry for ages first) and you can detect that they're failing and remove the node from service yourself much earlier and more reliably than Elasticsearch can (e.g. by looking at SMART metrics).

Given that this is not an Elasticsearch-specific problem I've looked at how other similar systems handle this (e.g. MongoDB, Cassandra, CockroachDB). As far as I can tell, none of them have any special handling or health checks of this nature. I've spoken with our syseng folks internally and we don't see our own clusters failing like this either. I haven't been able to find any other research or documentation that suggests this is something that needs addressing at the application level either. Please help us understand what we've missed here.

@DaveCTurner Thanks for your reply.

I haven't been able to find any other research or documentation that suggests this is something that needs addressing at the application level either. Please help us understand what we've missed here.

As far as I know, in HBase we limit how long a single RPC call can run by setting hbase.rpc.timeout. In MySQL, we can also use max_execution_time to set a session-wide timeout; long-running query execution is interrupted once the timeout elapses.
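For example, the HBase RPC timeout is just a client-side Configuration entry (a minimal sketch; the values are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.rpc.timeout", 60000);                // cap a single RPC at 60 s
    conf.setInt("hbase.client.operation.timeout", 120000);  // cap the whole operation, retries included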

Neither of those look relevant to the question of detecting and handling disk failures with a timeout. Repeating what I said above: timing out individual (sub-)requests does not prevent future (sub-)requests from going to the same faulty node. Much better to remove broken nodes from the cluster.

In terms of the more general question of timing out a search, you can already do this today in the client (#43332) and we're already discussing returning partial results on a timeout (#30897) too. You may also be looking for adaptive replica selection to steer searches away from nodes that are simply busy. I don't think these are useful in the case of a genuinely broken node, the subject of this thread, but maybe they are helpful to you.
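Adaptive replica selection is controlled by a single cluster setting, `cluster.routing.use_adaptive_replica_selection`, which is enabled by default on recent versions. A hedged sketch of setting it explicitly via the 7.x high-level REST client (`client` is a placeholder):

    import org.elasticsearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.common.settings.Settings;

    ClusterUpdateSettingsRequest settingsRequest = new ClusterUpdateSettingsRequest()
        .persistentSettings(Settings.builder()
            .put("cluster.routing.use_adaptive_replica_selection", true));   // default is already true
    client.cluster().putSettings(settingsRequest, RequestOptions.DEFAULT);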

No further feedback received. @boicehuang if you have any more information for us please add it in a comment and we can look at re-opening this issue.
