Elasticsearch: cross cluster search - entire response failure although only one cluster down

Created on 9 Aug 2017 · 7 comments · Source: elastic/elasticsearch

Elasticsearch version (bin/elasticsearch --version):
Version: 5.5.1, Build: 19c13d0/2017-07-18T20:44:24.823Z, JVM: 1.8.0_131

JVM version (java -version):
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux zbook-15-G3 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
There are 3 Elasticsearch clusters, and cluster_1 is configured for cross-cluster search against cluster_2 and cluster_3.
When cluster_3 is down (cluster_1 and cluster_2 are up), a wildcard cross-cluster search returns a failure:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "connect_transport_exception",
        "reason" : "[][127.0.0.1:9303] connect_timeout[30s]"
      }
    ],
    "type" : "transport_exception",
    "reason" : "unable to communicate with remote cluster [cluster_3]",
    "caused_by" : {
      "type" : "connect_transport_exception",
      "reason" : "[][127.0.0.1:9303] connect_timeout[30s]",
      "caused_by" : {
        "type" : "annotated_connect_exception",
        "reason" : "Connection refused: /127.0.0.1:9303",
        "caused_by" : {
          "type" : "connect_exception",
          "reason" : "Connection refused"
        }
      }
    }
  },
  "status" : 500
}

I would expect that results from all clusters that are up would be returned.

Steps to reproduce:

  1. Add data to cluster_2
curl -XPUT 'localhost:9202/test-2/data/1?pretty' -H 'Content-Type: application/json' -d'
{
    "user" : "user2",
    "message" : "test2"
}
'
  2. Add data to cluster_3
curl -XPUT 'localhost:9203/test-3/data/1?pretty' -H 'Content-Type: application/json' -d'
{
    "user" : "user3",
    "message" : "test3"
}
'
  3. Set up cross-cluster search on cluster_1
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "search": {
      "remote": {
        "cluster_2": {
          "seeds": [
            "127.0.0.1:9302"
          ]
        },
        "cluster_3": {
          "seeds": [
            "127.0.0.1:9303"
          ]
        }
      }
    }
  }
}
'
  4. Shut down cluster_3
  5. Search all test-* indices on cluster_2 and cluster_3
curl -XPOST 'localhost:9200/*:test-*/data/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}
'

Please also see the console output from the test case:
test.console.txt

:Search/Search >enhancement


All 7 comments

The same question was recently asked in our forums.

The problem you are facing is that none of the nodes are available in one of the remote clusters.

When doing a cross-cluster search, one node per remote cluster is initially contacted to find out where the shards for the request are located. If this step cannot be completed, the whole search request fails. If at least one node were available in each cluster, the search would go on even if some other nodes, potentially nodes holding relevant shards, were down. That would just cause partial results, not a total failure.

I don't see a way to work around this other than checking upfront which remote clusters are online and avoiding querying the ones that are completely down. In fact, cross-cluster search wasn't designed to query multiple clusters where some of them can be completely shut down. Not too sure whether we want to change this, @s1monw what do you think?

I don鈥檛 see a way to work around this other than checking upfront which remote clusters are online and avoid querying the ones that are completely down. In fact cross cluster search wasn't thought to query multiple clusters where some of them can just be shut down completely. Not too sure whether we want to change this, @s1monw what do you think?

I think the problem is not as simple as it's presented here. What happens today if a cluster is not connected is that we try to reconnect. If we _check_ up-front whether a cluster is connected, do we then need to skip this step? What should the behavior be? I think we can look into something like an additional setting that simply treats the cluster as not present if it can't be reconnected, but I want to make sure that such a setting doesn't complicate the code too much. I think it would be confusing if we didn't fail here; it's really a different issue than an unavailable shard.

Another, maybe more consistent, option would be to skip those clusters when we resolve wildcards. We would still need to figure out when to try to reconnect to them, since we usually do that as part of the search request.

A use case would be giving a user access to disjoint data sets in different data centers.
For this use case we would like to be able to query the clusters that are reachable and ignore the rest.

Also, from a DevOps standpoint, it would not be nice to have dependencies between data centers.
Automation can be done in many different ways: configuration-management tool runs (Chef, Ansible, etc.) and/or cloud- or container-based deployments (OpenStack Heat stacks, Docker Swarm, etc.). It should be possible to start all Elasticsearch nodes local to the data center, have those nodes accept writes and local reads, and hopefully serve cross-cluster searches for all reachable clusters.

@acristu sorry for coming back to this so late. I took a closer look at it, and my current plan is to add a setting search.remote.$clustername.ignore_disconnected that defaults to false and can be set per remote cluster you are connecting to. That way the default behavior is unchanged, but it allows you to skip optional clusters entirely. This also allows for mandatory clusters, which I personally see as a valid use case here as well. WDYT /cc @javanna
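A sketch of how such a per-cluster setting might be applied through the cluster settings API, using the name proposed above (ignore_disconnected is only a proposal at this point, so the final setting name may differ):

# Hypothetical: mark cluster_3 as optional so a failed reconnect does not fail the search
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "search": {
      "remote": {
        "cluster_3": {
          "ignore_disconnected": true
        }
      }
    }
  }
}
'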

@s1monw thank you, sounds good. This new setting will hopefully allow:

  • The Elasticsearch process to start even if a remote cluster is unreachable
  • Queries to return data from the local cluster and the reachable remote clusters
  • Have a way to check if the query results contain data from the "optional" remote clusters

    • Don't know yet what a straightforward way to do this would be. The _index field contains the cluster alias, but checking it for every hit of every query is not feasible. Perhaps just polling via a health-check API call is the best we can do (see the sketch after this list)

  • Related to the previous point: the ability to display the status of the remote clusters on a Kibana dashboard, to make it clear to the user when the data is incomplete
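A minimal polling sketch along the lines of the health-check idea above, assuming each cluster's HTTP port (9200, 9202 and 9203, as in the reproduction steps) is reachable from the monitoring host; the loop itself is hypothetical, while _cluster/health is a real API:

# Poll each cluster's health endpoint; report unreachable clusters instead of failing
for port in 9200 9202 9203; do
  curl -s -m 5 "localhost:$port/_cluster/health?pretty" \
    || echo "cluster on port $port is unreachable"
done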

@s1monw ++ I like the idea

> Have a way to check if the query results contain data from the "optional" remote clusters

I think we can do something about it. For instance, we can maybe even make this the default and add a _clusters section to the response, i.e.:

  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "_clusters" : {
    "total" : 2,
    "successful": 1,
    "skipped": 1
  }

This way we can simply let the user decide what they want to do, and it's consistent with _shards. @javanna WDYT?
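If a _clusters section like that were added, a client could inspect it directly. A hypothetical check using jq, assuming the field existed in the response (it does not yet, as of this discussion):

# Hypothetical: show how many clusters were searched vs. skipped for this request
curl -s -XPOST 'localhost:9200/*:test-*/data/_search' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}
' | jq '._clusters'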
