12:24:01 --- FAIL: TestMutationLessNodes/ES_cluster_health_should_eventually_be_green#01 (300.00s)
12:24:01 require.go:794:
12:24:01 Error Trace: utils.go:80
12:24:01 Error: Received unexpected error:
12:24:01 health is red
12:24:01 Test: TestMutationLessNodes/ES_cluster_health_should_eventually_be_green#01
I'm not sure what's going on here. This test is supposed to test a downscale from 3 to 1 master.
I think it happened once on my workstation:
data-integrity-check 1 p STARTED 4 3.5kb 10.233.65.17 test-mutation-less-nodes-6snf-es-masterdata-0
data-integrity-check 2 p UNASSIGNED
data-integrity-check 0 p STARTED 0 283b 10.233.65.17 test-mutation-less-nodes-6snf-es-masterdata-0
curl -k -u elastic:$PASSWORD https://localhost:9200/_cluster/allocation/explain\?pretty
{
"index" : "data-integrity-check",
"shard" : 2,
"primary" : true,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "NODE_LEFT",
"at" : "2019-09-18T08:59:21.568Z",
"details" : "node_left [DUsKggcYQ9Gy-gyrfkIflA]",
"last_allocation_status" : "no_valid_shard_copy"
},
"can_allocate" : "no_valid_shard_copy",
"allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
"node_allocation_decisions" : [
{
"node_id" : "EzWbSghXTvObgJtlPRXu_Q",
"node_name" : "test-mutation-less-nodes-6snf-es-masterdata-0",
"transport_address" : "10.233.65.17:9300",
"node_attributes" : {
"ml.machine_memory" : "2147483648",
"xpack.installed" : "true",
"ml.max_open_jobs" : "20"
},
"node_decision" : "no",
"store" : {
"in_sync" : false,
"allocation_id" : "kKGUEcpNTGucBURVH5gKYQ"
}
}
]
}
I have run this test a lot of time locally but I never managed to reproduce it more than one time.
I think this what is happening:
A new cluster of 3 nodes is created, an observer is started and eventually the cluster state becomes Green
~ 2 seconds after the green state is detected some data are injected into the clusters _(5 documents, 3 shards)_:
--- PASS: TestMutationLessNodes/ES_cluster_health_should_eventually_be_green (30.08s)
--- PASS: TestMutationLessNodes/ES_cluster_UUID_should_eventually_appear_in_the_ES_status (0.00s)
--- PASS: TestMutationLessNodes/Elastic_password_should_be_available (0.00s)
--- PASS: TestMutationLessNodes/Elasticsearch_data_volumes_should_be_of_the_specified_type (0.01s)
--- PASS: TestMutationLessNodes/ES_cluster_health_endpoint_should_eventually_be_reachable (0.17s)
--- PASS: TestMutationLessNodes/ES_version_should_be_the_expected_one#01 (0.12s)
--- PASS: TestMutationLessNodes/ES_endpoint_should_eventually_be_reachable (0.13s)
--- PASS: TestMutationLessNodes/ES_nodes_topology_should_eventually_be_the_expected_one (0.09s)
--- PASS: TestMutationLessNodes/Add_some_data_to_the_cluster_before_starting_the_mutation (1.28s)
--- PASS: TestMutationLessNodes/Start_querying_Elasticsearch_cluster_health_while_mutation_is_going_on (0.02s)
--- PASS: TestMutationLessNodes/Start_tracking_master_additions_and_removals (0.00s)
--- PASS: TestMutationLessNodes/Retrieve_Elasticsearch_cluster_UUID_for_comparison_purpose (0.00s)
--- PASS: TestMutationLessNodes/Applying_the_Elasticsearch_mutation_should_succeed (0.02s)
observer :{
"cluster_name": "test-mutation-less-nodes-mvnb",
"cluster_uuid": "dsNI7sMCSzyFoMezaC7KpA",
"version": 0,
"master_node": "-_HtipvQSKeHtDbK55RWUQ",
"nodes": {
"-_HtipvQSKeHtDbK55RWUQ": {
"name": "test-mutation-less-nodes-mvnb-es-masterdata-0",
"ephemeral_id": "t8X4ahnqQhWDOXI1JrSZGg",
"transport_address": "10.56.1.97:9300",
"attributes": {
"ml.machine_memory": "2147483648",
"ml.max_open_jobs": "20",
"xpack.installed": "true",
"ml.enabled": ""
}
},
"3pFLU1gXSjOIvg6yffXhZA": {
"name": "test-mutation-less-nodes-mvnb-es-masterdata-2",
"ephemeral_id": "8a6X6754QQmbUDbAz8h-RA",
"transport_address": "10.56.0.80:9300",
"attributes": {
"ml.machine_memory": "2147483648",
"ml.max_open_jobs": "20",
"xpack.installed": "true",
"ml.enabled": ""
}
},
"qyJt8NwQS4uILehVvhNWnw": {
"name": "test-mutation-less-nodes-mvnb-es-masterdata-1",
"ephemeral_id": "2l1e1FoMTiGig_kxPUIJiw",
"transport_address": "10.56.2.93:9300",
"attributes": {
"ml.machine_memory": "2147483648",
"ml.max_open_jobs": "20",
"xpack.installed": "true",
"ml.enabled": ""
}
}
},
"routing_table": {
"indices": {}
}
}
According to the cluster state there're no shards at all on the cluster, the first Pod is deleted 馃挜
In the next reconciliation loop the operator does the same check. The observer, which is updating the cluster state every 10s, has updated the cluster state:
{
"cluster_name": "test-mutation-less-nodes-mvnb",
"cluster_uuid": "dsNI7sMCSzyFoMezaC7KpA",
"version": 0,
"master_node": "-_HtipvQSKeHtDbK55RWUQ",
"nodes": {
"-_HtipvQSKeHtDbK55RWUQ": {
"name": "test-mutation-less-nodes-mvnb-es-masterdata-0",
"ephemeral_id": "t8X4ahnqQhWDOXI1JrSZGg",
"transport_address": "10.56.1.97:9300",
"attributes": {
"ml.machine_memory": "2147483648",
"ml.max_open_jobs": "20",
"xpack.installed": "true",
"ml.enabled": ""
}
},
"qyJt8NwQS4uILehVvhNWnw": {
"name": "test-mutation-less-nodes-mvnb-es-masterdata-1",
"ephemeral_id": "2l1e1FoMTiGig_kxPUIJiw",
"transport_address": "10.56.2.93:9300",
"attributes": {
"ml.machine_memory": "2147483648",
"ml.max_open_jobs": "20",
"xpack.installed": "true",
"ml.enabled": ""
}
}
},
"routing_table": {
"indices": {
"data-integrity-check": {
"shards": {
"0": [{
"index": "data-integrity-check",
"shard": 0,
"primary": true,
"state": "STARTED",
"node": "-_HtipvQSKeHtDbK55RWUQ"
}],
"1": [{
"index": "data-integrity-check",
"shard": 1,
"primary": true,
"state": "UNASSIGNED",
"node": ""
}],
"2": [{
"index": "data-integrity-check",
"shard": 2,
"primary": true,
"state": "STARTED",
"node": "-_HtipvQSKeHtDbK55RWUQ"
}]
}
}
}
}
}
You can note here that it's too late, a primary shard is UNASSIGNED.
TLDR; the frequency of observation of the observer is too slow while the e2e test is scaling down the cluster right after the shards have been created.
I think we have to deal with a "may be stale state" observation, the question here is maybe how far do we want to be from the last observation before doing a risky action.
Using _(or reusing)_ an instance of ESState could solve this issue since it's providing a more up-to-date view of the cluster.
Quick summary around a discussion we had around this issue:
One of the main concern is our ability to "drive" several Elasticsearch clusters in one instance of the operator: API calls are synchronous I/O, if an API request is performed in a reconciliation loop then it blocks any other reconciliation.
Two mains ideas to improve this situation and the scalability:
Cons: if the user changes its mind or do a new update on the spec ... it may still block for a while
Use a task API:
A common drawback of any solution is that even if an API call is done synchronously it is still a view of the past, things may change right after the call is done.
Short term decision:
Long term could be:
k get elasticsearch,pods --all-namespaces
NAMESPACE NAME HEALTH NODES VERSION PHASE AGE
e2e-x8bbc-mercury elasticsearch.elasticsearch.k8s.elastic.co/test-mutation-less-nodes-4gcl red 1 6.8.2 Ready 5m
NAMESPACE NAME READY STATUS RESTARTS AGE
e2e-x8bbc-elastic-system pod/e2e-x8bbc-global-operator-0 1/1 Running 0 46m
e2e-x8bbc-mercury pod/test-mutation-less-nodes-4gcl-es-masterdata-0 1/1 Running 0 5m16s
e2e-x8bbc-ns-operators pod/e2e-x8bbc-mercury-ns-operator-0 1/1 Running 0 46m
e2e-x8bbc-ns-operators pod/e2e-x8bbc-venus-ns-operator-0 1/1 Running 0 46m
e2e-x8bbc pod/eck-e2e-x8bbc-nmt47 1/1 Running 0 46m
Fixed by #1840
Most helpful comment
I think this what is happening:
A new cluster of 3 nodes is created, an
observeris started and eventually the cluster state becomes Green~ 2 seconds after the green state is detected some data are injected into the clusters _(5 documents, 3 shards)_:
observer:According to the cluster state there're no shards at all on the cluster, the first Pod is deleted 馃挜
In the next reconciliation loop the operator does the same check. The
observer, which is updating the cluster state every 10s, has updated the cluster state:You can note here that it's too late, a primary shard is
UNASSIGNED.TLDR; the frequency of observation of the
observeris too slow while the e2e test is scaling down the cluster right after the shards have been created.I think we have to deal with a "may be stale state" observation, the question here is maybe how far do we want to be from the last observation before doing a risky action.
Using _(or reusing)_ an instance of
ESStatecould solve this issue since it's providing a more up-to-date view of the cluster.