Cloud-on-k8s: Operator 1.0.0-beta1 stuck in error loop when es object's secureSettings is a map. All progress is stopped on all the other es objects.

Created on 22 Oct 2019 · 10 Comments · Source: elastic/cloud-on-k8s

Bug Report

What did you do?
Upgraded the operator from 0.9.0 to 1.0.0-beta1 with a soon-to-be-deleted Elasticsearch cluster (instantiated with 0.9.0) still running.
The idea was to leave the old cluster running, create a new one with the operator 1.0.0, reindex the data in the new one, and delete the old one.
When I tried creating the new one, with the updated yaml file, nothing happened.

What did you expect to see?
I expected the old elasticsearch object to be silently ignored, and the new one to be created.

What did you see instead? Under which circumstances?
The operator was stuck in an infinite error loop because the old Elasticsearch object used the old syntax for secureSettings (a map rather than a list). As a result, it couldn't do any work on the new cluster's object.

  • ECK version:
    1.0.0-beta1-84792e30

  • Kubernetes information:
    Google's GKE version: 1.14.6-gke.13

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:44:30Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.6-gke.13", GitCommit:"acdb9a03a6dc0f7f62d7acdda75c9a9faca50fee", GitTreeState:"clean", BuildDate:"2019-09-20T23:13:58Z", GoVersion:"go1.12.9b4", Compiler:"gc", Platform:"linux/amd64"}
  • Resource definition:
  secureSettings:
    secretName: backups-gcs-secrets-es
  • Logs:
    I don't have the original logs of my migration anymore, but it's very easy to reproduce on a new cluster by removing the dash (-) in front of secretName.
E1022 08:59:46.163171       1 reflector.go:126] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1beta1.Elasticsearch: v1beta1.ElasticsearchList.Items: []v1beta1.Elasticsearch: v1beta1.Elasticsearch.Spec: v1beta1.ElasticsearchSpec.SecureSettings: []v1beta1.SecretSource: decode slice: expect [ or n, but found {, error found in #10 byte of ...|ettings":{"secretNam|..., bigger context ...|},"storageClassName":"ssd"}}]}],"secureSettings":{"secretName":"backups-gcs-secrets-es"},"updateStra|...
E1022 08:59:47.166139       1 reflector.go:126] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1beta1.Elasticsearch: v1beta1.ElasticsearchList.Items: []v1beta1.Elasticsearch: v1beta1.Elasticsearch.Spec: v1beta1.ElasticsearchSpec.SecureSettings: []v1beta1.SecretSource: decode slice: expect [ or n, but found {, error found in #10 byte of ...|ettings":{"secretNam|..., bigger context ...|},"storageClassName":"ssd"}}]}],"secureSettings":{"secretName":"backups-gcs-secrets-es"},"updateStra|...
E1022 08:59:48.169029       1 reflector.go:126] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1beta1.Elasticsearch: v1beta1.ElasticsearchList.Items: []v1beta1.Elasticsearch: v1beta1.Elasticsearch.Spec: v1beta1.ElasticsearchSpec.SecureSettings: []v1beta1.SecretSource: decode slice: expect [ or n, but found {, error found in #10 byte of ...|ettings":{"secretNam|..., bigger context ...|},"storageClassName":"ssd"}}]}],"secureSettings":{"secretName":"backups-gcs-secrets-es"},"updateStra|...
E1022 08:59:49.172013       1 reflector.go:126] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1beta1.Elasticsearch: v1beta1.ElasticsearchList.Items: []v1beta1.Elasticsearch: v1beta1.Elasticsearch.Spec: v1beta1.ElasticsearchSpec.SecureSettings: []v1beta1.SecretSource: decode slice: expect [ or n, but found {, error found in #10 byte of ...|ettings":{"secretNam|..., bigger context ...|},"storageClassName":"ssd"}}]}],"secureSettings":{"secretName":"backups-gcs-secrets-es"},"updateStra|...
E1022 08:59:50.175490       1 reflector.go:126] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1beta1.Elasticsearch: v1beta1.ElasticsearchList.Items: []v1beta1.Elasticsearch: v1beta1.Elasticsearch.Spec: v1beta1.ElasticsearchSpec.SecureSettings: []v1beta1.SecretSource: decode slice: expect [ or n, but found {, error found in #10 byte of ...|ettings":{"secretNam|..., bigger context ...|},"storageClassName":"ssd"}}]}],"secureSettings":{"secretName":"backups-gcs-secrets-es"},"updateStra|...
E1022 08:59:51.178085       1 reflector.go:126] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1beta1.Elasticsearch: v1beta1.ElasticsearchList.Items: []v1beta1.Elasticsearch: v1beta1.Elasticsearch.Spec: v1beta1.ElasticsearchSpec.SecureSettings: []v1beta1.SecretSource: decode slice: expect [ or n, but found {, error found in #10 byte of ...|ettings":{"secretNam|..., bigger context ...|},"storageClassName":"ssd"}}]}],"secureSettings":{"secretName":"backups-gcs-secrets-es"},"updateStra|...
E1022 08:59:52.181030       1 reflector.go:126] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1beta1.Elasticsearch: v1beta1.ElasticsearchList.Items: []v1beta1.Elasticsearch: v1beta1.Elasticsearch.Spec: v1beta1.ElasticsearchSpec.SecureSettings: []v1beta1.SecretSource: decode slice: expect [ or n, but found {, error found in #10 byte of ...|ettings":{"secretNam|..., bigger context ...|},"storageClassName":"ssd"}}]}],"secureSettings":{"secretName":"backups-gcs-secrets-es"},"updateStra|...
E1022 08:59:53.184773       1 reflector.go:126] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1beta1.Elasticsearch: v1beta1.ElasticsearchList.Items: []v1beta1.Elasticsearch: v1beta1.Elasticsearch.Spec: v1beta1.ElasticsearchSpec.SecureSettings: []v1beta1.SecretSource: decode slice: expect [ or n, but found {, error found in #10 byte of ...|ettings":{"secretNam|..., bigger context ...|},"storageClassName":"ssd"}}]}],"secureSettings":{"secretName":"backups-gcs-secrets-es"},"updateStra|...
{"level":"debug","@timestamp":"2019-10-22T08:59:53.285Z","logger":"observer","message":"Retrieving cluster state","ver":"1.0.0-beta1-84792e30","es_name":"elastic-test-r2","namespace":"dbre"}
E1022 08:59:54.187693       1 reflector.go:126] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1beta1.Elasticsearch: v1beta1.ElasticsearchList.Items: []v1beta1.Elasticsearch: v1beta1.Elasticsearch.Spec: v1beta1.ElasticsearchSpec.SecureSettings: []v1beta1.SecretSource: decode slice: expect [ or n, but found {, error found in #10 byte of ...|ettings":{"secretNam|..., bigger context ...|},"storageClassName":"ssd"}}]}],"secureSettings":{"secretName":"backups-gcs-secrets-es"},"updateStra|...
E1022 08:59:55.191085       1 reflector.go:126] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1beta1.Elasticsearch: v1beta1.ElasticsearchList.Items: []v1beta1.Elasticsearch: v1beta1.Elasticsearch.Spec: v1beta1.ElasticsearchSpec.SecureSettings: []v1beta1.SecretSource: decode slice: expect [ or n, but found {, error found in #10 byte of ...|ettings":{"secretNam|..., bigger context ...|},"storageClassName":"ssd"}}]}],"secureSettings":{"secretName":"backups-gcs-secrets-es"},"updateStra|...
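
The decode failure in the log above can be reproduced in isolation. The following is a minimal sketch using only Go's standard encoding/json package, not the json-iterator decoder that produced the log message, so the error text differs. The types are simplified stand-ins for the v1beta1 types named in the log and keep only the field under discussion.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SecretSource and ElasticsearchSpec are simplified, hypothetical stand-ins
// for the v1beta1 types named in the log; only secureSettings is modelled.
type SecretSource struct {
	SecretName string `json:"secretName"`
}

type ElasticsearchSpec struct {
	SecureSettings []SecretSource `json:"secureSettings,omitempty"`
}

// decodeSpec parses a JSON-encoded spec into the list-based type.
func decodeSpec(raw string) (ElasticsearchSpec, error) {
	var spec ElasticsearchSpec
	err := json.Unmarshal([]byte(raw), &spec)
	return spec, err
}

func main() {
	// Old syntax: secureSettings is a single object (a map).
	_, err := decodeSpec(`{"secureSettings":{"secretName":"backups-gcs-secrets-es"}}`)
	fmt.Println("map form: ", err)

	// New syntax: secureSettings is a list of objects.
	spec, err := decodeSpec(`{"secureSettings":[{"secretName":"backups-gcs-secrets-es"}]}`)
	fmt.Println("list form:", spec.SecureSettings, err)
}
```

The map form returns a type error because a JSON object cannot be decoded into a Go slice, while the list form decodes cleanly.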

All 10 comments

This is unfortunate but it is expected. In order to upgrade the operator from 0.9.0 to 1.0.0-beta1, you must delete old CRDs from the Kubernetes cluster first. Deleting the CRDs will not delete existing Elasticsearch, Kibana or APM servers. See https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-upgrading-eck.html for more information.

Hi @charith-elastic and thanks for your time,
Sorry, I didn't express myself correctly: when I said CRD, I meant the CRD instances, such as Elasticsearch objects.
I followed the documentation you linked to.
I updated my wording.
Could you please reread the issue and reopen the bug?
Thanks

Hi @PaulGrandperrin, I should have been clearer with my previous answer: there are several breaking changes between 0.9.0 and 1.0.0-beta1 such as the data type change you are referring to in this issue. Unfortunately, due to various technical reasons, v1beta1 resources and v1alpha1 resources cannot co-exist in the same cluster. This is why our recommended approach is to start with a fresh Kubernetes cluster or to completely uninstall the old operator and associated CRDs and custom resources (CR) first. Unfortunately, the latter involves some downtime because you cannot delete a CRD without first deleting the associated CRs. This is why we suggest using snapshots for data migration if you are attempting an in-place upgrade.

I hope that helps.

The error seems to happen at the resource parsing level (before our reconciliation kicks in), so it's pretty hard to fix in ECK; we'd probably have to look into the reflector used by client-go.
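
To illustrate why a single malformed object can block every other one, here is a small sketch with Go's standard encoding/json. It is not client-go's reflector code. The item names "old" and "new", the simplified types, and both helper functions are assumptions made for the example. decodeAll mimics a typed list decode, where one bad item fails the whole call; decodeEach shows, purely for contrast, that decoding item by item would isolate the bad one.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified, hypothetical stand-ins for the Elasticsearch resource types.
type secretSource struct {
	SecretName string `json:"secretName"`
}

type spec struct {
	SecureSettings []secretSource `json:"secureSettings,omitempty"`
}

type item struct {
	Name string `json:"name"`
	Spec spec   `json:"spec"`
}

// decodeAll decodes the whole list into typed items in one call.
// A single malformed item makes the entire call fail.
func decodeAll(raw []byte) ([]item, error) {
	var list struct {
		Items []item `json:"items"`
	}
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, err
	}
	return list.Items, nil
}

// decodeEach keeps items as raw JSON first, then decodes them one by one,
// returning the usable items and the indexes of the ones that failed.
func decodeEach(raw []byte) (good []item, bad []int, err error) {
	var list struct {
		Items []json.RawMessage `json:"items"`
	}
	if err = json.Unmarshal(raw, &list); err != nil {
		return nil, nil, err
	}
	for i, r := range list.Items {
		var it item
		if e := json.Unmarshal(r, &it); e != nil {
			bad = append(bad, i)
			continue
		}
		good = append(good, it)
	}
	return good, bad, nil
}

func main() {
	raw := []byte(`{"items":[
		{"name":"old","spec":{"secureSettings":{"secretName":"backups-gcs-secrets-es"}}},
		{"name":"new","spec":{"secureSettings":[{"secretName":"backups-gcs-secrets-es"}]}}
	]}`)

	_, err := decodeAll(raw)
	fmt.Println("typed list decode:", err)

	good, bad, err := decodeEach(raw)
	fmt.Println("per-item decode:", len(good), "usable, bad indexes:", bad, err)
}
```

In the typed list decode, the valid "new" item is never delivered because the "old" item fails first, which matches the behaviour reported in this issue.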

For future releases (especially starting GA), we do plan to correctly handle backward compatibility with conversion hooks, to make sure this does not happen.
The other thing we want to invest in in the future is structural schemas, so that any resource with the wrong schema gets rejected at creation/update time.

Sorry for the inconvenience @PaulGrandperrin :/

Thanks for the detailed explanation, it helps :-)
I have no problem with this state of affairs but then I think the documentation should be updated to remove those paragraphs:

 If you wish to install ECK into an existing Kubernetes cluster that has a previous version of the operator installed, it is important to consider the following:

The old operator will be replaced by the new operator during the installation process.
Existing Elasticsearch, Kibana and APM Server resources created by an old version of the operator will continue to work but they will not be managed by the new operator. This means that the orchestration benefits provided by the operator such as rolling upgrades will no longer be available to those resources.
If the old operator is replaced without removing old resources first, you will have to manually disable finalizers to delete them later.

and

The 1.0.0-beta version of the operator does not delete resources created by older versions of the operator, but it also does not manage them.

or at least document that old CRs might freeze the operator and prevent any kind of operations.

I think those paragraphs make it look like temporarily keeping old CRs from the 0.9 operator is supported.

This error loop also means that if someone using a k8s cluster pushes a CR with invalid syntax, it will break the control-plane (the elastic-operator functions) of all the other elastic clusters in all the other namespaces.

I created https://github.com/elastic/cloud-on-k8s/pull/2043 to at least add a note in the relevant doc section.

This error loop also means that if someone using a k8s cluster pushes a CR with invalid syntax, it will break the control-plane (the elastic-operator functions) of all the other elastic clusters in all the other namespaces.

Indeed, great point. It's very unfortunate, but at least it's fairly explicit in the operator logs. More subtle problems are not exposed in the logs at all: for example, if you specify secureSettings at the wrong level in the spec (e.g. within a NodeSet), it is silently ignored.
Structural schemas and resources validation should solve both cases.
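
The silent-ignore behaviour is easy to picture with Go's standard encoding/json, which drops unknown fields by default. The sketch below is an illustration only, not ECK's validation mechanism. The types are simplified assumptions and `decode` is a hypothetical helper. It shows that a misplaced secureSettings decodes without error unless the decoder is told to reject unknown fields.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Simplified, hypothetical stand-ins for the resource types.
type secretSource struct {
	SecretName string `json:"secretName"`
}

type nodeSet struct {
	Name  string `json:"name"`
	Count int    `json:"count"`
}

type spec struct {
	NodeSets       []nodeSet      `json:"nodeSets,omitempty"`
	SecureSettings []secretSource `json:"secureSettings,omitempty"`
}

// decode parses a spec; when strict is true, unknown fields are an error.
func decode(raw string, strict bool) (spec, error) {
	dec := json.NewDecoder(strings.NewReader(raw))
	if strict {
		dec.DisallowUnknownFields()
	}
	var s spec
	err := dec.Decode(&s)
	return s, err
}

func main() {
	// secureSettings is placed inside a node set instead of at the spec level.
	misplaced := `{"nodeSets":[{"name":"default","count":1,
		"secureSettings":[{"secretName":"backups-gcs-secrets-es"}]}]}`

	s, err := decode(misplaced, false)
	fmt.Println("lenient:", len(s.SecureSettings), "secure settings, err:", err)

	_, err = decode(misplaced, true)
	fmt.Println("strict: ", err)
}
```

The lenient decode succeeds with no secure settings at all, which is the silent failure described above; the strict decode surfaces the misplaced field as an error.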

Thanks @sebgl and @charith-elastic for the answers and actions!
You are providing a high-quality operator and high-quality support :-)

@PaulGrandperrin, while investigating further, we spotted a problem in our CRDs' openAPIV3Schema validation. See https://github.com/elastic/cloud-on-k8s/issues/2044.
Thanks for reporting :)
