attemptDownscale may delete some nodes which are not excluded.
For example, let's consider the following situation where 2 StatefulSets with an initial replicas: 2 are downscaled to replicas: 1:
actualStatefulSets=[sset1,sset2]sset1-1 is scheduled to be deleted:allocationSetter.ExcludeFromShardAllocation(ctx, ["sset1-1"]) is successfulupdateAllocationExcludeAnnotation(es, ["sset1-1"]) is successfulsset1-1 is not deleted yet because it is holding some dataactualStatefulSets=[sset2,sset1] - note the inconsistent order, this is because there is no guarantee regarding the order of the StatefulSets retrieved from the K8S APIsset2-1 is scheduled to be deleted:allocationSetter.ExcludeFromShardAllocation(ctx, ["sset2-1"]) is successfulupdateAllocationExcludeAnnotation(es, ["sset2-1"]) fails _(e.g.: Operation cannot be fulfilled on elasticsearches.elasticsearch.k8s.elastic.co \"test-mutation-2nd-master-set-dtvx\": the object has been modified;...)_sset1-1 does not hold some data anymore_At the end of this iteration there is an inconsistency between the excluded nodes in ES and the annotation set on the Elasticsearchobject: sset1-1 is not excluded any more even if the annotation on ES is saying the opposite 馃挘 馃挘馃挘
actualStatefulSets=[sset1,sset2] - note that now the order is the same than the one at iteration 1sset1-1 is scheduled to be deleted:allocationSetter.ExcludeFromShardAllocation(ctx, ["sset1-1"]) is NOT called because of the stale annotation: if exclusions == allocationExcludeFromAnnotation(es) {
return nil
}
calculatePerformableDownscale does not detect any shards on sset1-1 because it has been excluded in the past.sset1-1 is deleted ... but since it is not excluded anymore there is no guarantee that we have not delete some data with it 馃挜 Could be the root cause of #2788 in which 2 StatefulSets are downscaled/deleted.
The pattern outlined above may actually happen as soon as an annotation is used to store a copy of the Elasticsearch state:
nvalue1 should be use in an API callvalue1value1 in an annotationn+1value2, should be use in an API callvalue2n+2value1 from previous iteration n should be used again in an API callvalue1 is in the annotation: API is not calledvalue2 is actually still in usedThere's also an other case where we should cleanup and not rely on these annotations: when the API call to Elasticsearch fails, in case of a timeout for instance _(maybe the call succeded ... maybe not)_
This may happen when ECK deals with:
I'm not sure there's an easy way to deal with these "inconsistencies". We could stop using annotations and cache these values in memory.
The drawback is that the state would be lost at restart, which would generate some API calls to initialize the cache during ECK startup. I think it might be acceptable now that we are using concurrent reconciliation.
Happy to implement a PR to not use annotations anymore, let me know your thoughts.
Also I think we should sort the list of StatefulSets from RetrieveActualStatefulSets, it would not solve the problem but help to have stable downscales and a better understanding of the global algorithm.
We have decided to remove any kind of cache between ECK and the Elasticsearch API.
For most of the API calls it is just about doing an HTTP PUT which will replace the existing settings.
It should be noted that it is not the case for remote clusters: remote clusters are managed with a map, and deleting a remote cluster requires to explicitly set its map entry _(the seeds)_ to nil
This means that in the case of the remote clusters we have to first retrieve the list of the existing remote clusters to know if some of them have to be deleted. But a side effect is that ECK will delete any unknown remote cluster.
But a side effect is that ECK will delete any unknown remote cluster.
That seems very undesirable especially given that for remote cluster setups that connect to a cluster outside of k8s manual configuration is the only option and we would destroy such setups.
A few things come to mind:
-eck or similarIf the annotation is set first there is no guarantee that the clusters will be removed.
I would be 馃憤 to prefix the name with something like eck-. _(even if I don't like hiding some information in a name)_
The disadvantage of the prefix/suffix is that the remote cluster name would not be the one defined by the user in the spec.
I was thinking of using the annotation as a history of remote clusters ECK has created. Something like:
| Spec | Annotation | ES |
| ------------- | ------------- |--|
| A | A | (none because we errored out)|
| A | A | A |
| B | A | A (error on annotation update) |
| B | A,B | A (error on ES request) |
| B | A, B| B (success on subsequent iteration)|
| B | B | B (clean up history on next iteration) |
On each reconciliation we would GET _cluster/settings compare and use the annotation only as a history mechanism to know which remote clusters were created by us at some point.
If there's no error the workflow of the annotation is [A] -> [A,B] -> [B] Which means a last reconcile loop is needed to make the annotation consistent (this is the "clean up history" step in your proposal).
However it seems to solve our problem and sounds cleaner than the naming convention. I'll work on a draft.
Most helpful comment
Also I think we should sort the list of
StatefulSetsfrom RetrieveActualStatefulSets, it would not solve the problem but help to have stable downscales and a better understanding of the global algorithm.