Forking the discussion from https://github.com/elastic/cloud-on-k8s/pull/2000#discussion_r336373978.
Finalizers are a great way to ensure we properly garbage-collect some things related a particular resource. For example, by setting our keystore finalizer to the Elasticsearch resource, we ensure the operator removes all dynamic watches on user-provided keystore secrets, before the apiserver can safely delete the Elasticsearch resource.
I think this is especially useful when finalizers trigger an operation on an external system. For example, when dealing with filesystem volumes or external databases.
The major drawback of the finalizer approach is that you cannot delete a resource when the operator is not running. For example, running kubectl delete elasticsearch es-sample will hang forever until the operator runs again. Which can be confusing for many users (especially if they are uninstalling ECK). An additional minor concern is that it requires updating the corresponding resource with a field that must be backward-compatible with next ECK versions (one more thing for which we need to maintain backward-compatibility).
In our situation, I think all finalizers we have so far are used for cleaning some in-memory state. For example:
I think (it's hard to be 100% sure?) the apiserver/client-go implementation we use does guarantee that we do not miss a deletion event (an interesting comment in the controller-runtime project). Of course, if the operator is not running, then the deletion event is missed.
So either:
I asked on #kubebuilder (Slack channel), it seems like we can indeed not worry about missing deletion events: https://kubernetes.slack.com/archives/CAR30FCJZ/p1571410760147500
Would be nice to look deeper into how the watches are handled and what kind of guarantee we have that we do not miss an event.
Worst case scenario if we miss the deletion event: the operator uses more memory than it should, which is not super bad.
We discussed this issue with the team: we should do it.
Finalizers are used for 2 main purposes in ECK:
Secrets used by the APM Server or Kibana to setup users in Elasticsearch.This last case is a little bit tricky to manage because the Secret is created in the Elasticsearch namespace and thus is owned by Elasticsearch, not by the APM Server or Kibana. The reason of that choice was the following limitation:
Namespace-scoped dependents can only specify owners in the same namespace (ref.)
As a consequence if we remove the Finalizer the Secrets will not be deleted anymore.
If we want to handle the case where a Delete event is received but is not correctly handled by the operator _(e.g. in case of a bug, or if the operator is stopped right after the event is received)_ the only solution I see is to have our own GC. This could just be a go routine, regularly listing the user secrets, watching for any orphaned secret.
Thoughts ?
Could we garbage collect those Secrets as part of the "normal" reconciliation done by the association controllers?
The Kibana-ES association controller responsible for creating the secret is triggered automatically on any change in Kibana/ES, and also when the operator (re)starts. As such, it could inspect and garbage collect any useless association secret without requiring a separate concurrent goroutine.
_Edit_: this would require watching all ES resources, not only the ones having an association set up.
_Edit_ (again): actually this cannot work since we're enqueuing reconciliations based on the Kibana NamespacedName, which we won't get from the unassociated Elasticsearch resource. Sorry for the noise, I'm still leaving this comment here just in case.
I agree with @barkbay's comments. We have enough metadata on the user secret created by the association controllers to do a fairly generic 'garbage collection' in the association controller itself
Just to illustrate a case where the association controller will not delete a remaining secret:
Inital state:
defaultns1User deletes Kibana:
ns1 are removed by the K8S api/controllersecret/ns1-kb-sample-kibana-user remains in the Elasticsearch namespaceOperator is restarted
secret/ns1-kb-sample-kibana-user still remains in the Elasticsearch namespaceI think we have 2 solutions here:
I think I'm +1 with the first one, @pebrc sorry, I think I didn't get your preference
Association controller watches all ES resources as suggested by @sebgl, it would trigger a reconciliation at step 3 and we would need to fetch all user secrets and check if they are orphaned.
The reconciliation is currently triggered on a Kibana NamespacedName: what would it be in this situation? I guess we'd need to trigger a different reconciliation with no particular NamespacedName that would go through all secrets.
If we consider we cannot miss a deletion event, maybe we could do the following:
I don鈥檛 think the first one (watching all es resources ) works. Because the association controller watches Kibanas. So you would have to write what amounts to a new controller (GC controller if you will) that watches Elasticsearch and checks for orphaned secrets, which is basically what I suggested the only difference being that the watch on secrets is probably less efficient than the watch on Elasticsearch so +1 on that.
I guess we'd need to trigger a different reconciliation with no particular NamespacedName that would go through all secrets.
That sounds a bit hacky to me. Could we not implement a separate controller/routine that just does just GC and is keyed on the type we are interested in?
Or we could consider doing it in the elasticsearch_controller? That would be conceptually wrong because we are maintaining some separation of concerns here, but on the other hand we could implement a GC that handles all present and future user secrets created by different association controllers at once (APM/Kibana/$FUTURE_APP_WE_WILL_SUPPORT) as long as the secrets follow a certain metadata scheme and if there is a way in the k8s API to do a generic existence check e.g. by UID (not sure there is actually)?
That sounds a bit hacky to me
Agreed, that's why I proposed something different with a simple GC function when the controller starts (and then, gc as part of regular reconciliations):
I think we achieve the same with your proposal of the separate gc goroutine/controller, but it may be slightly more complicated due to setting watches etc.?
I think we achieve the same with your proposal of the separate gc goroutine/controller, but it may be slightly more complicated due to setting watches etc.?
Yes, agreed, I am not even sure if it's possible to do with the k8s client in the way I had in mind. 馃憤 on doing it the way you suggested. My only worry would be that we are slowing down the startup of the controller, but I guess we could use a label selector to reduce the number of secrets we actually have to list. Also I assume when you wrote 'all secrets' you meant all secrets created by the association controller. Finally we have to build this in such a way that we can do at least code sharing between the different association controllers as each would have to do its own user secret GC.
My only worry would be that we are slowing down the startup of the controller
Yes. I guess we could run it in a separate short-lived goroutine (and still move on with the normal startup process)? It may conflict with gc done by the regular reconciliation but that would just be a "normal" conflict.
but I guess we could use a label selector to reduce the number of secrets we actually have to list
We currently set these labels:
"labels": {
"kibanaassociation.k8s.elastic.co/name": "kibana-sample",
"kibanaassociation.k8s.elastic.co/namespace": "default"
},
I was hoping we would have a more "generic" one, because the List API with label filtering requires key + value. We know the key we're looking for here but not the value. Should we add another label in order to do that?
Yes. I guess we could run it in a separate short-lived goroutine (and still move on with the normal startup process)?
I think we may have a race condition here:
I think we may have a race condition here
Yes, but it will be fixed at next reconciliation, right? Eventual consistency to the rescue!
Anyway, I think we don't have to solve slow startup issues we don't have. And may never have unless we manage 5k Kibanas, at which point we may also have other issues. I think that's fine doing things sequentially for now.
Yes, but it will be fixed at next reconciliation, right? Eventual consistency to the rescue!
I don't think we have any guarantee about that because we only watch secrets controlled by the association controller:
https://github.com/elastic/cloud-on-k8s/blob/master/pkg/controller/kibanaassociation/watch.go#L35-L41
We also need to improve the watches if we want to do that.