Cloud-on-k8s: Do not cache Elasticsearch settings in annotations

Created on 11 Apr 2020  路  7Comments  路  Source: elastic/cloud-on-k8s

attemptDownscale may delete some nodes which are not excluded.

For example, let's consider the following situation where 2 StatefulSets with an initial replicas: 2 are downscaled to replicas: 1:

Initial state

  • sset1:

    • sset1-0

    • sset1-1

  • sset2:

    • sset2-0

    • sset2-1

Iteration 1

  • actualStatefulSets=[sset1,sset2]
  • sset1-1 is scheduled to be deleted:

    • allocationSetter.ExcludeFromShardAllocation(ctx, ["sset1-1"]) is successful

    • updateAllocationExcludeAnnotation(es, ["sset1-1"]) is successful

  • sset1-1 is not deleted yet because it is holding some data

Iteration 2

  • actualStatefulSets=[sset2,sset1] - note the inconsistent order, this is because there is no guarantee regarding the order of the StatefulSets retrieved from the K8S API
  • sset2-1 is scheduled to be deleted:

    • allocationSetter.ExcludeFromShardAllocation(ctx, ["sset2-1"]) is successful

    • updateAllocationExcludeAnnotation(es, ["sset2-1"]) fails _(e.g.: Operation cannot be fulfilled on elasticsearches.elasticsearch.k8s.elastic.co \"test-mutation-2nd-master-set-dtvx\": the object has been modified;...)_

  • _at this point it is possible that sset1-1 does not hold some data anymore_

At the end of this iteration there is an inconsistency between the excluded nodes in ES and the annotation set on the Elasticsearchobject: sset1-1 is not excluded any more even if the annotation on ES is saying the opposite 馃挘 馃挘馃挘

Iteration 3

  • actualStatefulSets=[sset1,sset2] - note that now the order is the same than the one at iteration 1
  • sset1-1 is scheduled to be deleted:

    • allocationSetter.ExcludeFromShardAllocation(ctx, ["sset1-1"]) is NOT called because of the stale annotation:

    if exclusions == allocationExcludeFromAnnotation(es) {
        return nil
    }
  • calculatePerformableDownscale does not detect any shards on sset1-1 because it has been excluded in the past.
  • sset1-1 is deleted ... but since it is not excluded anymore there is no guarantee that we have not delete some data with it 馃挜

Could be the root cause of #2788 in which 2 StatefulSets are downscaled/deleted.

>bug v1.1.0

Most helpful comment

Also I think we should sort the list of StatefulSets from RetrieveActualStatefulSets, it would not solve the problem but help to have stable downscales and a better understanding of the global algorithm.

All 7 comments

The pattern outlined above may actually happen as soon as an annotation is used to store a copy of the Elasticsearch state:

Iteration n

  • value1 should be use in an API call
  • Call Es API with value1
  • Save value1 in an annotation

Iteration n+1

  • A new value, value2, should be use in an API call
  • Call Es API with value2
  • Annotation can't be updated with the new value

Iteration n+2

  • value1 from previous iteration n should be used again in an API call
  • value1 is in the annotation: API is not called
  • value2 is actually still in used

There's also an other case where we should cleanup and not rely on these annotations: when the API call to Elasticsearch fails, in case of a timeout for instance _(maybe the call succeded ... maybe not)_

This may happen when ECK deals with:

  • Allocation settings
  • Voting config exclusions
  • Zen1 minimum master nodes
  • Remote clusters

I'm not sure there's an easy way to deal with these "inconsistencies". We could stop using annotations and cache these values in memory.
The drawback is that the state would be lost at restart, which would generate some API calls to initialize the cache during ECK startup. I think it might be acceptable now that we are using concurrent reconciliation.

Happy to implement a PR to not use annotations anymore, let me know your thoughts.

Also I think we should sort the list of StatefulSets from RetrieveActualStatefulSets, it would not solve the problem but help to have stable downscales and a better understanding of the global algorithm.

We have decided to remove any kind of cache between ECK and the Elasticsearch API.
For most of the API calls it is just about doing an HTTP PUT which will replace the existing settings.
It should be noted that it is not the case for remote clusters: remote clusters are managed with a map, and deleting a remote cluster requires to explicitly set its map entry _(the seeds)_ to nil

This means that in the case of the remote clusters we have to first retrieve the list of the existing remote clusters to know if some of them have to be deleted. But a side effect is that ECK will delete any unknown remote cluster.

But a side effect is that ECK will delete any unknown remote cluster.

That seems very undesirable especially given that for remote cluster setups that connect to a cluster outside of k8s manual configuration is the only option and we would destroy such setups.

A few things come to mind:

  • keeping the current annotation system for remote clusters (can we mitigate the bug maybe by setting the annotation first, and comparing always with the current settings)
  • using a naming convention for the remote cluster settings we introduce -eck or similar

If the annotation is set first there is no guarantee that the clusters will be removed.
I would be 馃憤 to prefix the name with something like eck-. _(even if I don't like hiding some information in a name)_

The disadvantage of the prefix/suffix is that the remote cluster name would not be the one defined by the user in the spec.

I was thinking of using the annotation as a history of remote clusters ECK has created. Something like:

| Spec | Annotation | ES |
| ------------- | ------------- |--|
| A | A | (none because we errored out)|
| A | A | A |
| B | A | A (error on annotation update) |
| B | A,B | A (error on ES request) |
| B | A, B| B (success on subsequent iteration)|
| B | B | B (clean up history on next iteration) |

On each reconciliation we would GET _cluster/settings compare and use the annotation only as a history mechanism to know which remote clusters were created by us at some point.

If there's no error the workflow of the annotation is [A] -> [A,B] -> [B] Which means a last reconcile loop is needed to make the annotation consistent (this is the "clean up history" step in your proposal).

However it seems to solve our problem and sounds cleaner than the naming convention. I'll work on a draft.

Was this page helpful?
0 / 5 - 0 ratings