Operator-sdk: Helm operator keeps creating secrets when performing helm upgrade

Created on 12 Oct 2020  路  9Comments  路  Source: operator-framework/operator-sdk

Bug Report

Helm operator keeps creating secrets when performing helm upgrade if the upgrade job fails due to ImagePullBackOff

What did you do?

I have a helm operator, it works well for install/uninstall. I have installed the operator using OLM
Then I added a job to helm chart to support upgrade scenario:

  1. create a upgrade job with below annotation, I use this job to do some pre-steps before helm upgrade
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  1. re-create helm operator using operator-sdk
  2. update csv to update operator image to latest (we use csv to configure all images), the operator pod is upgraded successfully to use latest image
  3. update csv to update the image which the upgrade job is using to an invalid value,
    Observed result:
    the upgrade job is created, but failed with ImagePullBackOff due to invalid image, this is as expected.
    But the helm operator keeps creating secrets of type 'kubernetes.io/dockercfg' and 'kubernetes.io/service-account-token' for the service accounts that helm chart created.
    It seems like helm operator is trying to recreate the upgrade job, meanwhile always try to create new secrets for the service accounts.

What did you expect to see?

Expected result: even if upgrade job fails due to ImagePullBackOff error, the helm operator should not keep creating new secrets for the service accounts.

What did you see instead? Under which circumstances?

Environment

Operator type:



/language helm

Kubernetes cluster type:
OpenShift

$ operator-sdk version
v0.17.2

$ kubectl version
v1.18.3

Possible Solution

Additional context

If using correct image in step 4, then the upgrade is performed successfully.

languaghelm triagneeds-information triagsupport

All 9 comments

Hi @grantliu3,

Thank you for raise this issue. First, at all, I'd like to recommend you migrate your project to the new layout (1.0+). See; https://sdk.operatorframework.io/docs/building-operators/helm/migration/. Note that the version 0.19.x can receive only bug fixes and the previous versions are no longer supported.

It seems like helm operator is trying to recreate the upgrading job, meanwhile always try to create new secrets for the service accounts.

I understand that helm will try to create the service-account and the secret to installing/upgrading the chart for all times that this action is trigged. So, my first thought is that it would be expected. @joelanford could you give a hand and please let me know if I misunderstood something here?

However, see that the version v0.17.2 uses helm.sh/helm/v3 v3.1.2 and the version 0.19.4 and 1.0.1 uses helm.sh/helm/v3 v3.2.4. Could you also please upgrade your project to use the latest ones and let us know if you see any change in its behaviour (I do not think that it would change, however, would be very helpful ensure it and work with the latest ones instead of the previous versions )?

@grantliu3 I agree with @camilamacedo86 that this sounds like expected behavior. The helm-operator is constantly trying to reconcile your CR, so during every reconciliation it compares the currently deployed release manifest with a candidate release manifest (using a dry-run upgrade). If there is a difference it runs the upgrade. If the upgrade fails, it performs a rollback (which is effectively another upgrade) and attempts reconciliation again after some backoff period, starting the loop over.

The helm operator keeps creating secrets of type 'kubernetes.io/dockercfg' and 'kubernetes.io/service-account-token' for the service accounts that helm chart created.

Are these secrets part of your chart? If not, the helm-operator wouldn't be creating them. It could be that the helm-operator creates the service accounts and then something else (probably the main kube-controller-manager) is creating the secrets based on the existence of the service accounts.

When the service accounts are deleted, I would expect these secrets to also be deleted. So what may be happening is that the constant upgrade attempts are causing the service accounts to be deleted and re-created over and over again, which is causing those secrets to be created and deleted over and over again as well.

@camilamacedo86, I didn't get a chance to try out 0.19.4 and 1.0.1, I'll have a try with latest version when available.
@joelanford, I understand the reconcile logic tries to rollback and reconciliation again, but in my case, the helm operator failed to rollback, (refer to below error logs in operator pod), thus the secrets have never been deleted (rollback) and helm operator keeps creating new ones, if this problem was not noticed and resolved in time, there will be a huge number of secrets created which might be affect the cluster. Our testing team has ever encountered 40000 secrets created in 3 days, and this caused some services of the cluster to be unhealthy.

There are many similar error logs in operator pod like this:
{"level":"error","ts":1602183709.727247,"logger":"helm.controller","msg":"Release failed","namespace":"kube-system","name":"my-release","apiVersion":"sretooling.management.xxx.com/v1alpha1","kind":"Bastion","release":"my-release","error":"failed update (pre-upgrade hooks failed: failed to deploy my-release-bastion-upgrade-job) and failed rollback: failed to replace object: Service \"my-release-bastion-backend-svc\" is invalid: spec.clusterIP: Invalid value: \"\": field is immutable && failed to replace object: Job.batch \"my-release-bastion-job\" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string{\"app.kubernetes.io/instance\":\"my-release\", \"app.kubernetes.io/managed-by\":\"Helm\", \"app.kubernetes.io/name\":\"my-release-bastion-job\", \"helm.sh/chart\":\"bastion-backend\"}:selectordoes not match templatelabels, spec.selector: Invalid value: \"null\": field is immutable, spec.template: Invalid value: core.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{\"app.kubernetes.io/instance\":\"my-release\", \"app.kubernetes.io/managed-by\":\"Helm\", \"app.kubernetes.io/name\":\"my-release-bastion-job\", \"helm.sh/chart\":\"bastion-backend\"}, Annotations:map[string]string{\"productMetric\":\"MANAGED_VIRTUAL_SERVER\", \"productName\":\"Product Name\", \"productVersion\":\"2.0\"}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:core.PodSpec{Volumes:[]core.Volume(nil), InitContainers:[]core.Container(nil), Containers:[]core.Container{core.Container{Name:\"my-release-bastion-init-job\", Image:\"cp.icr.io/cp/cp4mcm/bastion@sha256:7851a94c62317997106baa46ad92ea374113d3cdb2b994a1667934819a1b5c4c\", Command:[]string{\"/etc/bastion/script/run_jobs.sh\"}, Args:[]string(nil), WorkingDir:\"\", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar{core.EnvVar{Name:\"RELEASE_NAME\", Value:\"my-release\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"COMMON_SERVICE_NAMESPACE\", Value:\"common-services\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"SERVICE_ADDRESS\", Value:\":8080\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"SESSION_TIMEOUT\", Value:\"600\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"INVENTORY_RELEASE_NAME\", Value:\"sre-inventory\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"INVENTORY_SERVICE_PORT\", Value:\"4010\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"INVENTORY_NAMESPACE\", Value:\"kube-system\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"PATH_PREFIX\", Value:\"/bastion\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"OAUTH2_CLIENT_REGISTRATION_SECRET\", Value:\"\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"UI_STATIC_DIR\", Value:\"/etc/bastion/static\", ValueFrom:(*core.EnvVarSource)(nil)}}, Resources:core.ResourceRequirements{Limits:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:300, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"300m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:536870912, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"\", Format:\"BinarySI\"}}, Requests:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:100, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"100m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:268435456, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"\", Format:\"BinarySI\"}}}, VolumeMounts:[]core.VolumeMount(nil), VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), StartupProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:\"/dev/termination-log\", TerminationMessagePolicy:\"FallbackToLogsOnError\", ImagePullPolicy:\"IfNotPresent\", SecurityContext:(*core.SecurityContext)(0xc056533360), Stdin:false, StdinOnce:false, TTY:false}}, EphemeralContainers:[]core.EphemeralContainer(nil), RestartPolicy:\"Never\", TerminationGracePeriodSeconds:(*int64)(0xc0615453f0), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:\"ClusterFirst\", NodeSelector:map[string]string(nil), ServiceAccountName:\"my-release-bastion-backend-svcacc\", AutomountServiceAccountToken:(*bool)(nil), NodeName:\"\", SecurityContext:(*core.PodSecurityContext)(0xc05e9223f0), ImagePullSecrets:[]core.LocalObjectReference(nil), Hostname:\"\", Subdomain:\"\", Affinity:(*core.Affinity)(nil), SchedulerName:\"default-scheduler\", Tolerations:[]core.Toleration(nil), HostAliases:[]core.HostAlias(nil), PriorityClassName:\"\", Priority:(*int32)(nil), PreemptionPolicy:(*core.PreemptionPolicy)(nil), DNSConfig:(*core.PodDNSConfig)(nil), ReadinessGates:[]core.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), Overhead:core.ResourceList(nil), EnableServiceLinks:(*bool)(nil), TopologySpreadConstraints:[]core.TopologySpreadConstraint(nil)}}: field is immutable] && failed to replace object: Job.batch \"my-release-postgresql-init-job\" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string{\"app.kubernetes.io/instance\":\"my-release\", \"app.kubernetes.io/managed-by\":\"Helm\", \"app.kubernetes.io/name\":\"my-release-postgresql-init-job\", \"helm.sh/chart\":\"postgresql\"}:selectordoes not match templatelabels, spec.selector: Invalid value: \"null\": field is immutable, spec.template: Invalid value: core.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{\"app.kubernetes.io/instance\":\"my-release\", \"app.kubernetes.io/managed-by\":\"Helm\", \"app.kubernetes.io/name\":\"my-release-postgresql-init-job\", \"helm.sh/chart\":\"postgresql\"}, Annotations:map[string]string{\"productMetric\":\"MANAGED_VIRTUAL_SERVER\", \"productName\":\"Product Name\", \"productVersion\":\"2.0\"}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:core.PodSpec{Volumes:[]core.Volume{core.Volume{Name:\"config\", VolumeSource:core.VolumeSource{HostPath:(*core.HostPathVolumeSource)(nil), EmptyDir:(*core.EmptyDirVolumeSource)(nil), GCEPersistentDisk:(*core.GCEPersistentDiskVolumeSource)(nil), AWSElasticBlockStore:(*core.AWSElasticBlockStoreVolumeSource)(nil), GitRepo:(*core.GitRepoVolumeSource)(nil), Secret:(*core.SecretVolumeSource)(nil), NFS:(*core.NFSVolumeSource)(nil), ISCSI:(*core.ISCSIVolumeSource)(nil), Glusterfs:(*core.GlusterfsVolumeSource)(nil), PersistentVolumeClaim:(*core.PersistentVolumeClaimVolumeSource)(nil), RBD:(*core.RBDVolumeSource)(nil), Quobyte:(*core.QuobyteVolumeSource)(nil), FlexVolume:(*core.FlexVolumeSource)(nil), Cinder:(*core.CinderVolumeSource)(nil), CephFS:(*core.CephFSVolumeSource)(nil), Flocker:(*core.FlockerVolumeSource)(nil), DownwardAPI:(*core.DownwardAPIVolumeSource)(nil), FC:(*core.FCVolumeSource)(nil), AzureFile:(*core.AzureFileVolumeSource)(nil), ConfigMap:(*core.ConfigMapVolumeSource)(0xc0347b8800), VsphereVolume:(*core.VsphereVirtualDiskVolumeSource)(nil), AzureDisk:(*core.AzureDiskVolumeSource)(nil), PhotonPersistentDisk:(*core.PhotonPersistentDiskVolumeSource)(nil), Projected:(*core.ProjectedVolumeSource)(nil), PortworxVolume:(*core.PortworxVolumeSource)(nil), ScaleIO:(*core.ScaleIOVolumeSource)(nil), StorageOS:(*core.StorageOSVolumeSource)(nil), CSI:(*core.CSIVolumeSource)(nil)}}}, InitContainers:[]core.Container{core.Container{Name:\"check-pqsql-online\", Image:\"cp.icr.io/cp/cp4mcm/postgresql@sha256:2b52f8cb564920205e5e646cd8ff5ab774759fb330ae90de5a5d525e1864fbf8\", Command:[]string{\"sh\", \"-c\", \"until pg_isready --host my-release-postgresql-svc --port 5432; do echo waiting for database; sleep 2; done;\"}, Args:[]string(nil), WorkingDir:\"\", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar(nil), Resources:core.ResourceRequirements{Limits:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:500, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"500m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:536870912, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"\", Format:\"BinarySI\"}}, Requests:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:250, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"250m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:268435456, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"\", Format:\"BinarySI\"}}}, VolumeMounts:[]core.VolumeMount(nil), VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), StartupProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:\"/dev/termination-log\", TerminationMessagePolicy:\"File\", ImagePullPolicy:\"IfNotPresent\", SecurityContext:(*core.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, Containers:[]core.Container{core.Container{Name:\"my-release-init-db\", Image:\"cp.icr.io/cp/cp4mcm/postgresql@sha256:2b52f8cb564920205e5e646cd8ff5ab774759fb330ae90de5a5d525e1864fbf8\", Command:[]string{\"psql\", \"--host\", \"my-release-postgresql-svc\", \"--port\", \"5432\", \"-U\", \"postgres\", \"-f\", \"/tmp/init-db/db-init.sql\"}, Args:[]string(nil), WorkingDir:\"\", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar{core.EnvVar{Name:\"PGPASSWORD\", Value:\"postgres\", ValueFrom:(*core.EnvVarSource)(nil)}}, Resources:core.ResourceRequirements{Limits:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:500, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"500m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:536870912, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"\", Format:\"BinarySI\"}}, Requests:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:250, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"250m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:268435456, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"\", Format:\"BinarySI\"}}}, VolumeMounts:[]core.VolumeMount{core.VolumeMount{Name:\"config\", ReadOnly:true, MountPath:\"/tmp/init-db\", SubPath:\"\", MountPropagation:(*core.MountPropagationMode)(nil), SubPathExpr:\"\"}}, VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), StartupProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:\"/dev/termination-log\", TerminationMessagePolicy:\"FallbackToLogsOnError\", ImagePullPolicy:\"IfNotPresent\", SecurityContext:(*core.SecurityContext)(0xc05624e2d0), Stdin:false, StdinOnce:false, TTY:false}}, EphemeralContainers:[]core.EphemeralContainer(nil), RestartPolicy:\"Never\", TerminationGracePeriodSeconds:(*int64)(0xc066093070), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:\"ClusterFirst\", NodeSelector:map[string]string(nil), ServiceAccountName:\"my-release-postgresql-svcacc\", AutomountServiceAccountToken:(*bool)(nil), NodeName:\"\", SecurityContext:(*core.PodSecurityContext)(0xc05e3ed030), ImagePullSecrets:[]core.LocalObjectReference(nil), Hostname:\"\", Subdomain:\"\", Affinity:(*core.Affinity)(nil), SchedulerName:\"default-scheduler\", Tolerations:[]core.Toleration(nil), HostAliases:[]core.HostAlias(nil), PriorityClassName:\"\", Priority:(*int32)(nil), PreemptionPolicy:(*core.PreemptionPolicy)(nil), DNSConfig:(*core.PodDNSConfig)(nil), ReadinessGates:[]core.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), Overhead:core.ResourceList(nil), EnableServiceLinks:(*bool)(nil), TopologySpreadConstraints:[]core.TopologySpreadConstraint(nil)}}: field is immutable]","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tpkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/operator-framework/operator-sdk/pkg/helm/controller.HelmOperatorReconciler.Reconcile\n\tsrc/github.com/operator-framework/operator-sdk/pkg/helm/controller/reconcile.go:247\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tpkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tpkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\tpkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\tpkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tpkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\tpkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88"}

@grantliu3 I'm having trouble understanding what the root cause of this error is. Can you share your chart? Does your chart or pre-upgrade job create these secrets?

+1 for the @joelanford questions and also I'd like to raise a few ones:

@grantliu3 I'm having trouble understanding what the root cause of this error is. Can you share your chart? Does your chart or pre-upgrade job create these secrets?

@joelanford sorry I can't share my charts, but I think it's easy to reproduce, I'm going to have a try to use memcached-operator to reproduce it, if reproduced I'll send you the updated operator code for reproduce/investigation.

+1 for the @joelanford questions and also I'd like to raise a few ones:

* If you use only helm (helm --upgrade) to upgrade the chart in this scenario, would not you face the same?

* would not the option --force to upgrade the chart solve your scenario? The --force option would give up on trying to patch if it fails, destroy and re-create the resource. See for 1.0+ https://sdk.operatorframework.io/docs/building-operators/helm/reference/advanced_features/annotations/ and for the 0.19 the annotation would be `helm.operator-sdk/upgrade-force`. See: https://v0-19-x.sdk.operatorframework.io/docs/helm/reference/advanced_features/

@camilamacedo86

  1. if using helm --upgrade to upgrade directly, the helm upgrade will be hung, and exits when timeout.
  2. I tried to add the annotation to my CR like below and re-execute the steps in the beginning of this issue, and still encounter the problem in https://github.com/operator-framework/operator-sdk/issues/4012#issuecomment-712159193
metadata:
  annotations:
    helm.operator-sdk/upgrade-force: "True"

@grantliu3 I'm having trouble understanding what the root cause of this error is. Can you share your chart? Does your chart or pre-upgrade job create these secrets?

@joelanford sorry I can't share my charts, but I think it's easy to reproduce, I'm going to have a try to use memcached-operator to reproduce it, if reproduced I'll send you the updated operator code for reproduce/investigation.

I'm not able to reproduce this problem when using memcached-operator, still investigating the key steps that are required to reproduce it, will update this issue if any progress.

Was this page helpful?
0 / 5 - 0 ratings