Elasticsearch: cross cluster search - entire response failure although only one cluster down

Created on 9 Aug 2017 · 7 comments · Source: elastic/elasticsearch

Elasticsearch version (bin/elasticsearch --version):
Version: 5.5.1, Build: 19c13d0/2017-07-18T20:44:24.823Z, JVM: 1.8.0_131

JVM version (java -version):
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux zbook-15-G3 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
There are 3 Elasticsearch clusters, and cluster_1 is configured for cross-cluster search against cluster_2 and cluster_3.
When cluster_3 is down (cluster_1 and cluster_2 are up), a wildcard cross-cluster search returns a failure:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "connect_transport_exception",
        "reason" : "[][127.0.0.1:9303] connect_timeout[30s]"
      }
    ],
    "type" : "transport_exception",
    "reason" : "unable to communicate with remote cluster [cluster_3]",
    "caused_by" : {
      "type" : "connect_transport_exception",
      "reason" : "[][127.0.0.1:9303] connect_timeout[30s]",
      "caused_by" : {
        "type" : "annotated_connect_exception",
        "reason" : "Connection refused: /127.0.0.1:9303",
        "caused_by" : {
          "type" : "connect_exception",
          "reason" : "Connection refused"
        }
      }
    }
  },
  "status" : 500
}

I would expect that results from all clusters that are up would be returned.

Steps to reproduce:

  1. Add data to cluster_2
curl -XPUT 'localhost:9202/test-2/data/1?pretty' -H 'Content-Type: application/json' -d'
{
    "user" : "user2",
    "message" : "test2"
}
'
  2. Add data to cluster_3
curl -XPUT 'localhost:9203/test-3/data/1?pretty' -H 'Content-Type: application/json' -d'
{
    "user" : "user3",
    "message" : "test3"
}
'
  3. Set up cross-cluster search on cluster_1
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "search": {
      "remote": {
        "cluster_2": {
          "seeds": [
            "127.0.0.1:9302"
          ]
        },
        "cluster_3": {
          "seeds": [
            "127.0.0.1:9303"
          ]
        }
      }
    }
  }
}
'
  4. Shut down cluster_3
  5. Search all test-* indices on cluster_2 and cluster_3
curl -XPOST 'localhost:9200/*:test-*/data/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}
'

Please also see the console output from the test case:
test.console.txt

:Search/Search >enhancement


All 7 comments

The same question was recently asked in our forums.

The problem you are facing is that none of the nodes are available in one of the remote clusters.

When doing a cross-cluster search, one node per remote cluster is initially contacted to find out where the shards for the request are located. If this step cannot be completed, the whole search request fails. If at least one node were available in each cluster, the search would go on even if some other nodes, potentially nodes holding relevant shards, were down. That would just cause partial results, not a total failure.

I don't see a way to work around this other than checking upfront which remote clusters are online and avoiding querying the ones that are completely down. In fact, cross-cluster search wasn't designed to query multiple clusters where some of them can be completely shut down. Not too sure whether we want to change this, @s1monw what do you think?

I don鈥檛 see a way to work around this other than checking upfront which remote clusters are online and avoid querying the ones that are completely down. In fact cross cluster search wasn't thought to query multiple clusters where some of them can just be shut down completely. Not too sure whether we want to change this, @s1monw what do you think?

I think the problem is not as simple as it's presented here. What happens today if a cluster is not connected is that we try to reconnect. If we _check_ up-front whether a cluster is connected, do we then need to skip this step? What should the behavior be? I think we can look into something like an additional setting that simply treats the cluster as not present if it can't be reconnected, but I want to make sure that such a setting doesn't complicate the code too much. I think it would be confusing if we didn't fail here; it's really a different issue than an unavailable shard.

Another, maybe more consistent, option would be to skip those clusters when we resolve wildcards. We would still need to figure out when to try to reconnect to them, since we usually do that as part of the search request.

A use case would be giving a user access to disjoint data sets in different data centers.
For this use case we would like to be able to query the clusters that are reachable and ignore the rest.

Also, from a DevOps standpoint, it would not be nice to have dependencies between data centers.
Automation can be done in many different ways: configuration-management tool runs (Chef, Ansible, etc.) and/or cloud- or container-based deployments (OpenStack Heat stacks, Docker Swarm, etc.). It should be possible to start all Elasticsearch nodes local to the data center, have those nodes accept writes and local reads, and hopefully serve cross-cluster searches for all reachable clusters.

@acristu sorry for coming back to this so late. I took a closer look at it, and my current plan is to add a setting search.remote.$clustername.ignore_disconnected that defaults to false and can be set per remote cluster you are connecting to. That way the default behavior is unchanged, but it allows you to skip optional clusters entirely. This also allows for mandatory clusters, which I personally see as a valid use case here as well. WDYT /cc @javanna
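A sketch of how such a per-cluster setting might be applied through the cluster settings API, using the name proposed above (ignore_disconnected is only a proposal at this point, so the final setting name may differ):

# Hypothetical: mark cluster_3 as optional so a failed reconnect does not fail the search
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "search": {
      "remote": {
        "cluster_3": {
          "ignore_disconnected": true
        }
      }
    }
  }
}
'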

@s1monw thank you, sounds good. This new setting will hopefully allow:

  • The Elasticsearch process to start even if a remote cluster is unreachable
  • Queries to return data from the local cluster and the reachable remote clusters
  • Have a way to check if the query results contain data from the "optional" remote clusters

    • Don't know yet what a straightforward way to do this would be. The _index field contains the cluster alias, but checking it for every hit of every query is not feasible. Perhaps just polling via a health-check API call is the best we can do (see the sketch after this list)

  • Related to the previous point: the ability to display the status of the remote clusters on a Kibana dashboard, to make it clear to the user when the data is incomplete
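A minimal polling sketch along the lines of the health-check idea above, assuming each cluster's HTTP port (9200, 9202 and 9203, as in the reproduction steps) is reachable from the monitoring host; the loop itself is hypothetical, while _cluster/health is a real API:

# Poll each cluster's health endpoint; report unreachable clusters instead of failing
for port in 9200 9202 9203; do
  curl -s -m 5 "localhost:$port/_cluster/health?pretty" \
    || echo "cluster on port $port is unreachable"
done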

@s1monw ++ I like the idea

> Have a way to check if the query results contain data from the "optional" remote clusters

I think we can do something about it. For instance, we can maybe even make this the default and add a _clusters section to the response, i.e.:

  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "_clusters" : {
    "total" : 2,
    "successful": 1,
    "skipped": 1
  }

This way we can simply let the user decide what they want to do, and it's consistent with _shards. @javanna WDYT?
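If a _clusters section like that were added, a client could inspect it directly. A hypothetical check using jq, assuming the field existed in the response (it does not yet, as of this discussion):

# Hypothetical: show how many clusters were searched vs. skipped for this request
curl -s -XPOST 'localhost:9200/*:test-*/data/_search' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}
' | jq '._clusters'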
