Cloud-on-k8s: Do not cache Elasticsearch settings in annotations

Created on 11 Apr 2020 · 7Comments · Source: elastic/cloud-on-k8s

attemptDownscale may delete some nodes which are not excluded.

For example, let's consider the following situation where 2 StatefulSets with an initial replicas: 2 are downscaled to replicas: 1:

Initial state

sset1:
- sset1-0
- sset1-1
sset2:
- sset2-0
- sset2-1

Iteration 1

actualStatefulSets=[sset1,sset2]
sset1-1 is scheduled to be deleted:
- allocationSetter.ExcludeFromShardAllocation(ctx, ["sset1-1"]) is successful
- updateAllocationExcludeAnnotation(es, ["sset1-1"]) is successful
sset1-1 is not deleted yet because it is holding some data

Iteration 2

actualStatefulSets=[sset2,sset1] - note the inconsistent order, this is because there is no guarantee regarding the order of the StatefulSets retrieved from the K8S API
sset2-1 is scheduled to be deleted:
- allocationSetter.ExcludeFromShardAllocation(ctx, ["sset2-1"]) is successful
- updateAllocationExcludeAnnotation(es, ["sset2-1"]) fails _(e.g.: Operation cannot be fulfilled on elasticsearches.elasticsearch.k8s.elastic.co \"test-mutation-2nd-master-set-dtvx\": the object has been modified;...)_
_at this point it is possible that sset1-1 does not hold some data anymore_

At the end of this iteration there is an inconsistency between the excluded nodes in ES and the annotation set on the Elasticsearchobject: sset1-1 is not excluded any more even if the annotation on ES is saying the opposite 💣 💣💣

Iteration 3

actualStatefulSets=[sset1,sset2] - note that now the order is the same than the one at iteration 1
sset1-1 is scheduled to be deleted:
- allocationSetter.ExcludeFromShardAllocation(ctx, ["sset1-1"]) is NOT called because of the stale annotation:

    if exclusions == allocationExcludeFromAnnotation(es) {
        return nil
    }

calculatePerformableDownscale does not detect any shards on sset1-1 because it has been excluded in the past.
sset1-1 is deleted ... but since it is not excluded anymore there is no guarantee that we have not delete some data with it 💥

Could be the root cause of #2788 in which 2 StatefulSets are downscaled/deleted.

>bug v1.1.0

Source

barkbay

Most helpful comment

Also I think we should sort the list of StatefulSets from RetrieveActualStatefulSets, it would not solve the problem but help to have stable downscales and a better understanding of the global algorithm.

barkbay on 14 Apr 2020

👍2

All 7 comments

The pattern outlined above may actually happen as soon as an annotation is used to store a copy of the Elasticsearch state:

Iteration `n`

value1 should be use in an API call
Call Es API with value1
Save value1 in an annotation

Iteration `n+1`

A new value, value2, should be use in an API call
Call Es API with value2
Annotation can't be updated with the new value

Iteration `n+2`

value1 from previous iteration n should be used again in an API call
value1 is in the annotation: API is not called
value2 is actually still in used

There's also an other case where we should cleanup and not rely on these annotations: when the API call to Elasticsearch fails, in case of a timeout for instance _(maybe the call succeded ... maybe not)_

This may happen when ECK deals with:

Allocation settings
Voting config exclusions
Zen1 minimum master nodes
Remote clusters

I'm not sure there's an easy way to deal with these "inconsistencies". We could stop using annotations and cache these values in memory.
The drawback is that the state would be lost at restart, which would generate some API calls to initialize the cache during ECK startup. I think it might be acceptable now that we are using concurrent reconciliation.

Happy to implement a PR to not use annotations anymore, let me know your thoughts.

barkbay on 14 Apr 2020

👍2

We have decided to remove any kind of cache between ECK and the Elasticsearch API.
For most of the API calls it is just about doing an HTTP PUT which will replace the existing settings.
It should be noted that it is not the case for remote clusters: remote clusters are managed with a map, and deleting a remote cluster requires to explicitly set its map entry _(the seeds)_ to nil

This means that in the case of the remote clusters we have to first retrieve the list of the existing remote clusters to know if some of them have to be deleted. But a side effect is that ECK will delete any unknown remote cluster.

barkbay on 15 Apr 2020

But a side effect is that ECK will delete any unknown remote cluster.

That seems very undesirable especially given that for remote cluster setups that connect to a cluster outside of k8s manual configuration is the only option and we would destroy such setups.

A few things come to mind:

keeping the current annotation system for remote clusters (can we mitigate the bug maybe by setting the annotation first, and comparing always with the current settings)
using a naming convention for the remote cluster settings we introduce -eck or similar

pebrc on 15 Apr 2020

If the annotation is set first there is no guarantee that the clusters will be removed.
I would be 👍 to prefix the name with something like eck-. _(even if I don't like hiding some information in a name)_

barkbay on 15 Apr 2020

The disadvantage of the prefix/suffix is that the remote cluster name would not be the one defined by the user in the spec.

I was thinking of using the annotation as a history of remote clusters ECK has created. Something like:

| Spec | Annotation | ES |
| ------------- | ------------- |--|
| A | A | (none because we errored out)|
| A | A | A |
| B | A | A (error on annotation update) |
| B | A,B | A (error on ES request) |
| B | A, B| B (success on subsequent iteration)|
| B | B | B (clean up history on next iteration) |

On each reconciliation we would GET _cluster/settings compare and use the annotation only as a history mechanism to know which remote clusters were created by us at some point.

pebrc on 15 Apr 2020

If there's no error the workflow of the annotation is [A] -> [A,B] -> [B] Which means a last reconcile loop is needed to make the annotation consistent (this is the "clean up history" step in your proposal).

However it seems to solve our problem and sounds cleaner than the naming convention. I'll work on a draft.

barkbay on 15 Apr 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Could not update cluster license: failed to revert to basic

thbkrkr · 5Comments

Unsupported Distribution in Elastic Search 7.11 ECK 1.3.0

deepaksinghcs14 · 4Comments

Reconciliation is blocked when a Pod can't be created

barkbay · 5Comments

Custom elasticsearch configuration option should be available

Sakib37 · 3Comments

Kibana does not support rolling upgrades

pebrc · 3Comments

Cloud-on-k8s: Do not cache Elasticsearch settings in annotations

Initial state

Iteration 1

Iteration 2

Iteration 3

Most helpful comment

All 7 comments

Iteration n

Iteration n+1

Iteration n+2

Related issues

Iteration `n`

Iteration `n+1`

Iteration `n+2`