Cloud-on-k8s: TestMutationLessNodes is flaky

Created on 18 Sep 2019  路  4Comments  路  Source: elastic/cloud-on-k8s

12:24:01      --- FAIL: TestMutationLessNodes/ES_cluster_health_should_eventually_be_green#01 (300.00s)
12:24:01          require.go:794: 
12:24:01                Error Trace:    utils.go:80
12:24:01                Error:          Received unexpected error:
12:24:01                                health is red
12:24:01                Test:           TestMutationLessNodes/ES_cluster_health_should_eventually_be_green#01

I'm not sure what's going on here. This test is supposed to test a downscale from 3 to 1 master.
I think it happened once on my workstation:

data-integrity-check 1 p STARTED    4 3.5kb 10.233.65.17 test-mutation-less-nodes-6snf-es-masterdata-0
data-integrity-check 2 p UNASSIGNED
data-integrity-check 0 p STARTED    0  283b 10.233.65.17 test-mutation-less-nodes-6snf-es-masterdata-0
curl -k -u elastic:$PASSWORD https://localhost:9200/_cluster/allocation/explain\?pretty
{
  "index" : "data-integrity-check",
  "shard" : 2,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2019-09-18T08:59:21.568Z",
    "details" : "node_left [DUsKggcYQ9Gy-gyrfkIflA]",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "EzWbSghXTvObgJtlPRXu_Q",
      "node_name" : "test-mutation-less-nodes-6snf-es-masterdata-0",
      "transport_address" : "10.233.65.17:9300",
      "node_attributes" : {
        "ml.machine_memory" : "2147483648",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "kKGUEcpNTGucBURVH5gKYQ"
      }
    }
  ]
}

I have run this test a lot of time locally but I never managed to reproduce it more than one time.

>flaky_test >test v1.0.0-beta1

Most helpful comment

I think this what is happening:

  1. A new cluster of 3 nodes is created, an observer is started and eventually the cluster state becomes Green

  2. ~ 2 seconds after the green state is detected some data are injected into the clusters _(5 documents, 3 shards)_:

    --- PASS: TestMutationLessNodes/ES_cluster_health_should_eventually_be_green (30.08s)
    --- PASS: TestMutationLessNodes/ES_cluster_UUID_should_eventually_appear_in_the_ES_status (0.00s)
    --- PASS: TestMutationLessNodes/Elastic_password_should_be_available (0.00s)
    --- PASS: TestMutationLessNodes/Elasticsearch_data_volumes_should_be_of_the_specified_type (0.01s)
    --- PASS: TestMutationLessNodes/ES_cluster_health_endpoint_should_eventually_be_reachable (0.17s)
    --- PASS: TestMutationLessNodes/ES_version_should_be_the_expected_one#01 (0.12s)
    --- PASS: TestMutationLessNodes/ES_endpoint_should_eventually_be_reachable (0.13s)
    --- PASS: TestMutationLessNodes/ES_nodes_topology_should_eventually_be_the_expected_one (0.09s)
    --- PASS: TestMutationLessNodes/Add_some_data_to_the_cluster_before_starting_the_mutation (1.28s)
  1. Right after the documents have been injected, the scale down to 1 node is started:
    --- PASS: TestMutationLessNodes/Start_querying_Elasticsearch_cluster_health_while_mutation_is_going_on (0.02s)
    --- PASS: TestMutationLessNodes/Start_tracking_master_additions_and_removals (0.00s)
    --- PASS: TestMutationLessNodes/Retrieve_Elasticsearch_cluster_UUID_for_comparison_purpose (0.00s)
    --- PASS: TestMutationLessNodes/Applying_the_Elasticsearch_mutation_should_succeed (0.02s)
  1. The operator checks if some shards are migrating on the first node to be deleted. In order to do that it is using the cluster states as provided by the observer :
{
    "cluster_name": "test-mutation-less-nodes-mvnb",
    "cluster_uuid": "dsNI7sMCSzyFoMezaC7KpA",
    "version": 0,
    "master_node": "-_HtipvQSKeHtDbK55RWUQ",
    "nodes": {
        "-_HtipvQSKeHtDbK55RWUQ": {
            "name": "test-mutation-less-nodes-mvnb-es-masterdata-0",
            "ephemeral_id": "t8X4ahnqQhWDOXI1JrSZGg",
            "transport_address": "10.56.1.97:9300",
            "attributes": {
                "ml.machine_memory": "2147483648",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true",
                "ml.enabled": ""
            }
        },
        "3pFLU1gXSjOIvg6yffXhZA": {
            "name": "test-mutation-less-nodes-mvnb-es-masterdata-2",
            "ephemeral_id": "8a6X6754QQmbUDbAz8h-RA",
            "transport_address": "10.56.0.80:9300",
            "attributes": {
                "ml.machine_memory": "2147483648",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true",
                "ml.enabled": ""
            }
        },
        "qyJt8NwQS4uILehVvhNWnw": {
            "name": "test-mutation-less-nodes-mvnb-es-masterdata-1",
            "ephemeral_id": "2l1e1FoMTiGig_kxPUIJiw",
            "transport_address": "10.56.2.93:9300",
            "attributes": {
                "ml.machine_memory": "2147483648",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true",
                "ml.enabled": ""
            }
        }
    },
    "routing_table": {
        "indices": {}
    }
}

According to the cluster state there're no shards at all on the cluster, the first Pod is deleted 馃挜

  1. In the next reconciliation loop the operator does the same check. The observer, which is updating the cluster state every 10s, has updated the cluster state:

    {
    "cluster_name": "test-mutation-less-nodes-mvnb",
    "cluster_uuid": "dsNI7sMCSzyFoMezaC7KpA",
    "version": 0,
    "master_node": "-_HtipvQSKeHtDbK55RWUQ",
    "nodes": {
        "-_HtipvQSKeHtDbK55RWUQ": {
            "name": "test-mutation-less-nodes-mvnb-es-masterdata-0",
            "ephemeral_id": "t8X4ahnqQhWDOXI1JrSZGg",
            "transport_address": "10.56.1.97:9300",
            "attributes": {
                "ml.machine_memory": "2147483648",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true",
                "ml.enabled": ""
            }
        },
        "qyJt8NwQS4uILehVvhNWnw": {
            "name": "test-mutation-less-nodes-mvnb-es-masterdata-1",
            "ephemeral_id": "2l1e1FoMTiGig_kxPUIJiw",
            "transport_address": "10.56.2.93:9300",
            "attributes": {
                "ml.machine_memory": "2147483648",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true",
                "ml.enabled": ""
            }
        }
    },
    "routing_table": {
        "indices": {
            "data-integrity-check": {
                "shards": {
                    "0": [{
                        "index": "data-integrity-check",
                        "shard": 0,
                        "primary": true,
                        "state": "STARTED",
                        "node": "-_HtipvQSKeHtDbK55RWUQ"
                    }],
                    "1": [{
                        "index": "data-integrity-check",
                        "shard": 1,
                        "primary": true,
                        "state": "UNASSIGNED",
                        "node": ""
                    }],
                    "2": [{
                        "index": "data-integrity-check",
                        "shard": 2,
                        "primary": true,
                        "state": "STARTED",
                        "node": "-_HtipvQSKeHtDbK55RWUQ"
                    }]
                }
            }
        }
    }
    }
    

You can note here that it's too late, a primary shard is UNASSIGNED.

TLDR; the frequency of observation of the observer is too slow while the e2e test is scaling down the cluster right after the shards have been created.

I think we have to deal with a "may be stale state" observation, the question here is maybe how far do we want to be from the last observation before doing a risky action.

Using _(or reusing)_ an instance of ESState could solve this issue since it's providing a more up-to-date view of the cluster.

All 4 comments

I think this what is happening:

  1. A new cluster of 3 nodes is created, an observer is started and eventually the cluster state becomes Green

  2. ~ 2 seconds after the green state is detected some data are injected into the clusters _(5 documents, 3 shards)_:

    --- PASS: TestMutationLessNodes/ES_cluster_health_should_eventually_be_green (30.08s)
    --- PASS: TestMutationLessNodes/ES_cluster_UUID_should_eventually_appear_in_the_ES_status (0.00s)
    --- PASS: TestMutationLessNodes/Elastic_password_should_be_available (0.00s)
    --- PASS: TestMutationLessNodes/Elasticsearch_data_volumes_should_be_of_the_specified_type (0.01s)
    --- PASS: TestMutationLessNodes/ES_cluster_health_endpoint_should_eventually_be_reachable (0.17s)
    --- PASS: TestMutationLessNodes/ES_version_should_be_the_expected_one#01 (0.12s)
    --- PASS: TestMutationLessNodes/ES_endpoint_should_eventually_be_reachable (0.13s)
    --- PASS: TestMutationLessNodes/ES_nodes_topology_should_eventually_be_the_expected_one (0.09s)
    --- PASS: TestMutationLessNodes/Add_some_data_to_the_cluster_before_starting_the_mutation (1.28s)
  1. Right after the documents have been injected, the scale down to 1 node is started:
    --- PASS: TestMutationLessNodes/Start_querying_Elasticsearch_cluster_health_while_mutation_is_going_on (0.02s)
    --- PASS: TestMutationLessNodes/Start_tracking_master_additions_and_removals (0.00s)
    --- PASS: TestMutationLessNodes/Retrieve_Elasticsearch_cluster_UUID_for_comparison_purpose (0.00s)
    --- PASS: TestMutationLessNodes/Applying_the_Elasticsearch_mutation_should_succeed (0.02s)
  1. The operator checks if some shards are migrating on the first node to be deleted. In order to do that it is using the cluster states as provided by the observer :
{
    "cluster_name": "test-mutation-less-nodes-mvnb",
    "cluster_uuid": "dsNI7sMCSzyFoMezaC7KpA",
    "version": 0,
    "master_node": "-_HtipvQSKeHtDbK55RWUQ",
    "nodes": {
        "-_HtipvQSKeHtDbK55RWUQ": {
            "name": "test-mutation-less-nodes-mvnb-es-masterdata-0",
            "ephemeral_id": "t8X4ahnqQhWDOXI1JrSZGg",
            "transport_address": "10.56.1.97:9300",
            "attributes": {
                "ml.machine_memory": "2147483648",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true",
                "ml.enabled": ""
            }
        },
        "3pFLU1gXSjOIvg6yffXhZA": {
            "name": "test-mutation-less-nodes-mvnb-es-masterdata-2",
            "ephemeral_id": "8a6X6754QQmbUDbAz8h-RA",
            "transport_address": "10.56.0.80:9300",
            "attributes": {
                "ml.machine_memory": "2147483648",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true",
                "ml.enabled": ""
            }
        },
        "qyJt8NwQS4uILehVvhNWnw": {
            "name": "test-mutation-less-nodes-mvnb-es-masterdata-1",
            "ephemeral_id": "2l1e1FoMTiGig_kxPUIJiw",
            "transport_address": "10.56.2.93:9300",
            "attributes": {
                "ml.machine_memory": "2147483648",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true",
                "ml.enabled": ""
            }
        }
    },
    "routing_table": {
        "indices": {}
    }
}

According to the cluster state there're no shards at all on the cluster, the first Pod is deleted 馃挜

  1. In the next reconciliation loop the operator does the same check. The observer, which is updating the cluster state every 10s, has updated the cluster state:

    {
    "cluster_name": "test-mutation-less-nodes-mvnb",
    "cluster_uuid": "dsNI7sMCSzyFoMezaC7KpA",
    "version": 0,
    "master_node": "-_HtipvQSKeHtDbK55RWUQ",
    "nodes": {
        "-_HtipvQSKeHtDbK55RWUQ": {
            "name": "test-mutation-less-nodes-mvnb-es-masterdata-0",
            "ephemeral_id": "t8X4ahnqQhWDOXI1JrSZGg",
            "transport_address": "10.56.1.97:9300",
            "attributes": {
                "ml.machine_memory": "2147483648",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true",
                "ml.enabled": ""
            }
        },
        "qyJt8NwQS4uILehVvhNWnw": {
            "name": "test-mutation-less-nodes-mvnb-es-masterdata-1",
            "ephemeral_id": "2l1e1FoMTiGig_kxPUIJiw",
            "transport_address": "10.56.2.93:9300",
            "attributes": {
                "ml.machine_memory": "2147483648",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true",
                "ml.enabled": ""
            }
        }
    },
    "routing_table": {
        "indices": {
            "data-integrity-check": {
                "shards": {
                    "0": [{
                        "index": "data-integrity-check",
                        "shard": 0,
                        "primary": true,
                        "state": "STARTED",
                        "node": "-_HtipvQSKeHtDbK55RWUQ"
                    }],
                    "1": [{
                        "index": "data-integrity-check",
                        "shard": 1,
                        "primary": true,
                        "state": "UNASSIGNED",
                        "node": ""
                    }],
                    "2": [{
                        "index": "data-integrity-check",
                        "shard": 2,
                        "primary": true,
                        "state": "STARTED",
                        "node": "-_HtipvQSKeHtDbK55RWUQ"
                    }]
                }
            }
        }
    }
    }
    

You can note here that it's too late, a primary shard is UNASSIGNED.

TLDR; the frequency of observation of the observer is too slow while the e2e test is scaling down the cluster right after the shards have been created.

I think we have to deal with a "may be stale state" observation, the question here is maybe how far do we want to be from the last observation before doing a risky action.

Using _(or reusing)_ an instance of ESState could solve this issue since it's providing a more up-to-date view of the cluster.

Quick summary around a discussion we had around this issue:

One of the main concern is our ability to "drive" several Elasticsearch clusters in one instance of the operator: API calls are synchronous I/O, if an API request is performed in a reconciliation loop then it blocks any other reconciliation.

Two mains ideas to improve this situation and the scalability:

  1. Reconcile several clusters in parallel (one go routine per Elasticsearch cluster)
  2. Pro: if a cluster is not responding it is only blocking the reconciliation of that particular cluster
  3. Cons: if the user changes its mind or do a new update on the spec ... it may still block for a while

  4. Use a task API:

  5. Pro: It overcomes the situation where the user changes its mind or update the spec, the current request can be canceled.
  6. Cons: It sounds like an optimization over the "one goroutine per cluster" solution, does it worth it ?

A common drawback of any solution is that even if an API call is done synchronously it is still a view of the past, things may change right after the call is done.

Short term decision:

  • Remove the cluster state API call from the Observer
  • Do a synchronous call to get the shards
  • Do not rely on the cluster state API

Long term could be:

  • Progressively remove the Observer _(or at least reduce its usage, maybe we still have to observe in some scenarios)_
  • Switch to one of the 2 solutions outlined.
k get elasticsearch,pods --all-namespaces
NAMESPACE           NAME                                                                       HEALTH   NODES   VERSION   PHASE   AGE
e2e-x8bbc-mercury   elasticsearch.elasticsearch.k8s.elastic.co/test-mutation-less-nodes-4gcl   red      1       6.8.2     Ready   5m

NAMESPACE                  NAME                                                                 READY   STATUS    RESTARTS   AGE
e2e-x8bbc-elastic-system   pod/e2e-x8bbc-global-operator-0                                      1/1     Running   0          46m
e2e-x8bbc-mercury          pod/test-mutation-less-nodes-4gcl-es-masterdata-0                    1/1     Running   0          5m16s
e2e-x8bbc-ns-operators     pod/e2e-x8bbc-mercury-ns-operator-0                                  1/1     Running   0          46m
e2e-x8bbc-ns-operators     pod/e2e-x8bbc-venus-ns-operator-0                                    1/1     Running   0          46m
e2e-x8bbc                  pod/eck-e2e-x8bbc-nmt47                                              1/1     Running   0          46m

Fixed by #1840

Was this page helpful?
0 / 5 - 0 ratings

Related issues

deepaksinghcs14 picture deepaksinghcs14  路  4Comments

anyasabo picture anyasabo  路  3Comments

Sakib37 picture Sakib37  路  3Comments

pebrc picture pebrc  路  3Comments

sebgl picture sebgl  路  3Comments