Helm operator keeps creating secrets when performing helm upgrade if the upgrade job fails due to ImagePullBackOff
I have a helm operator, it works well for install/uninstall. I have installed the operator using OLM
Then I added a job to helm chart to support upgrade scenario:
annotations:
"helm.sh/hook": pre-upgrade
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
Expected result: even if upgrade job fails due to ImagePullBackOff error, the helm operator should not keep creating new secrets for the service accounts.
Operator type:
/language helm
Kubernetes cluster type:
OpenShift
$ operator-sdk version
v0.17.2
$ kubectl version
v1.18.3
If using correct image in step 4, then the upgrade is performed successfully.
Hi @grantliu3,
Thank you for raise this issue. First, at all, I'd like to recommend you migrate your project to the new layout (1.0+). See; https://sdk.operatorframework.io/docs/building-operators/helm/migration/. Note that the version 0.19.x can receive only bug fixes and the previous versions are no longer supported.
It seems like helm operator is trying to recreate the upgrading job, meanwhile always try to create new secrets for the service accounts.
I understand that helm will try to create the service-account and the secret to installing/upgrading the chart for all times that this action is trigged. So, my first thought is that it would be expected. @joelanford could you give a hand and please let me know if I misunderstood something here?
However, see that the version v0.17.2 uses helm.sh/helm/v3 v3.1.2 and the version 0.19.4 and 1.0.1 uses helm.sh/helm/v3 v3.2.4. Could you also please upgrade your project to use the latest ones and let us know if you see any change in its behaviour (I do not think that it would change, however, would be very helpful ensure it and work with the latest ones instead of the previous versions )?
@grantliu3 I agree with @camilamacedo86 that this sounds like expected behavior. The helm-operator is constantly trying to reconcile your CR, so during every reconciliation it compares the currently deployed release manifest with a candidate release manifest (using a dry-run upgrade). If there is a difference it runs the upgrade. If the upgrade fails, it performs a rollback (which is effectively another upgrade) and attempts reconciliation again after some backoff period, starting the loop over.
The helm operator keeps creating secrets of type 'kubernetes.io/dockercfg' and 'kubernetes.io/service-account-token' for the service accounts that helm chart created.
Are these secrets part of your chart? If not, the helm-operator wouldn't be creating them. It could be that the helm-operator creates the service accounts and then something else (probably the main kube-controller-manager) is creating the secrets based on the existence of the service accounts.
When the service accounts are deleted, I would expect these secrets to also be deleted. So what may be happening is that the constant upgrade attempts are causing the service accounts to be deleted and re-created over and over again, which is causing those secrets to be created and deleted over and over again as well.
@camilamacedo86, I didn't get a chance to try out 0.19.4 and 1.0.1, I'll have a try with latest version when available.
@joelanford, I understand the reconcile logic tries to rollback and reconciliation again, but in my case, the helm operator failed to rollback, (refer to below error logs in operator pod), thus the secrets have never been deleted (rollback) and helm operator keeps creating new ones, if this problem was not noticed and resolved in time, there will be a huge number of secrets created which might be affect the cluster. Our testing team has ever encountered 40000 secrets created in 3 days, and this caused some services of the cluster to be unhealthy.
There are many similar error logs in operator pod like this:
{"level":"error","ts":1602183709.727247,"logger":"helm.controller","msg":"Release failed","namespace":"kube-system","name":"my-release","apiVersion":"sretooling.management.xxx.com/v1alpha1","kind":"Bastion","release":"my-release","error":"failed update (pre-upgrade hooks failed: failed to deploy my-release-bastion-upgrade-job) and failed rollback: failed to replace object: Service \"my-release-bastion-backend-svc\" is invalid: spec.clusterIP: Invalid value: \"\": field is immutable && failed to replace object: Job.batch \"my-release-bastion-job\" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string{\"app.kubernetes.io/instance\":\"my-release\", \"app.kubernetes.io/managed-by\":\"Helm\", \"app.kubernetes.io/name\":\"my-release-bastion-job\", \"helm.sh/chart\":\"bastion-backend\"}:selectordoes not match templatelabels, spec.selector: Invalid value: \"null\": field is immutable, spec.template: Invalid value: core.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{\"app.kubernetes.io/instance\":\"my-release\", \"app.kubernetes.io/managed-by\":\"Helm\", \"app.kubernetes.io/name\":\"my-release-bastion-job\", \"helm.sh/chart\":\"bastion-backend\"}, Annotations:map[string]string{\"productMetric\":\"MANAGED_VIRTUAL_SERVER\", \"productName\":\"Product Name\", \"productVersion\":\"2.0\"}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:core.PodSpec{Volumes:[]core.Volume(nil), InitContainers:[]core.Container(nil), Containers:[]core.Container{core.Container{Name:\"my-release-bastion-init-job\", Image:\"cp.icr.io/cp/cp4mcm/bastion@sha256:7851a94c62317997106baa46ad92ea374113d3cdb2b994a1667934819a1b5c4c\", Command:[]string{\"/etc/bastion/script/run_jobs.sh\"}, Args:[]string(nil), WorkingDir:\"\", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar{core.EnvVar{Name:\"RELEASE_NAME\", Value:\"my-release\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"COMMON_SERVICE_NAMESPACE\", Value:\"common-services\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"SERVICE_ADDRESS\", Value:\":8080\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"SESSION_TIMEOUT\", Value:\"600\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"INVENTORY_RELEASE_NAME\", Value:\"sre-inventory\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"INVENTORY_SERVICE_PORT\", Value:\"4010\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"INVENTORY_NAMESPACE\", Value:\"kube-system\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"PATH_PREFIX\", Value:\"/bastion\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"OAUTH2_CLIENT_REGISTRATION_SECRET\", Value:\"\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\"UI_STATIC_DIR\", Value:\"/etc/bastion/static\", ValueFrom:(*core.EnvVarSource)(nil)}}, Resources:core.ResourceRequirements{Limits:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:300, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"300m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:536870912, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"\", Format:\"BinarySI\"}}, Requests:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:100, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"100m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:268435456, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"\", Format:\"BinarySI\"}}}, VolumeMounts:[]core.VolumeMount(nil), VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), StartupProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:\"/dev/termination-log\", TerminationMessagePolicy:\"FallbackToLogsOnError\", ImagePullPolicy:\"IfNotPresent\", SecurityContext:(*core.SecurityContext)(0xc056533360), Stdin:false, StdinOnce:false, TTY:false}}, EphemeralContainers:[]core.EphemeralContainer(nil), RestartPolicy:\"Never\", TerminationGracePeriodSeconds:(*int64)(0xc0615453f0), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:\"ClusterFirst\", NodeSelector:map[string]string(nil), ServiceAccountName:\"my-release-bastion-backend-svcacc\", AutomountServiceAccountToken:(*bool)(nil), NodeName:\"\", SecurityContext:(*core.PodSecurityContext)(0xc05e9223f0), ImagePullSecrets:[]core.LocalObjectReference(nil), Hostname:\"\", Subdomain:\"\", Affinity:(*core.Affinity)(nil), SchedulerName:\"default-scheduler\", Tolerations:[]core.Toleration(nil), HostAliases:[]core.HostAlias(nil), PriorityClassName:\"\", Priority:(*int32)(nil), PreemptionPolicy:(*core.PreemptionPolicy)(nil), DNSConfig:(*core.PodDNSConfig)(nil), ReadinessGates:[]core.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), Overhead:core.ResourceList(nil), EnableServiceLinks:(*bool)(nil), TopologySpreadConstraints:[]core.TopologySpreadConstraint(nil)}}: field is immutable] && failed to replace object: Job.batch \"my-release-postgresql-init-job\" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string{\"app.kubernetes.io/instance\":\"my-release\", \"app.kubernetes.io/managed-by\":\"Helm\", \"app.kubernetes.io/name\":\"my-release-postgresql-init-job\", \"helm.sh/chart\":\"postgresql\"}:selectordoes not match templatelabels, spec.selector: Invalid value: \"null\": field is immutable, spec.template: Invalid value: core.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{\"app.kubernetes.io/instance\":\"my-release\", \"app.kubernetes.io/managed-by\":\"Helm\", \"app.kubernetes.io/name\":\"my-release-postgresql-init-job\", \"helm.sh/chart\":\"postgresql\"}, Annotations:map[string]string{\"productMetric\":\"MANAGED_VIRTUAL_SERVER\", \"productName\":\"Product Name\", \"productVersion\":\"2.0\"}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:core.PodSpec{Volumes:[]core.Volume{core.Volume{Name:\"config\", VolumeSource:core.VolumeSource{HostPath:(*core.HostPathVolumeSource)(nil), EmptyDir:(*core.EmptyDirVolumeSource)(nil), GCEPersistentDisk:(*core.GCEPersistentDiskVolumeSource)(nil), AWSElasticBlockStore:(*core.AWSElasticBlockStoreVolumeSource)(nil), GitRepo:(*core.GitRepoVolumeSource)(nil), Secret:(*core.SecretVolumeSource)(nil), NFS:(*core.NFSVolumeSource)(nil), ISCSI:(*core.ISCSIVolumeSource)(nil), Glusterfs:(*core.GlusterfsVolumeSource)(nil), PersistentVolumeClaim:(*core.PersistentVolumeClaimVolumeSource)(nil), RBD:(*core.RBDVolumeSource)(nil), Quobyte:(*core.QuobyteVolumeSource)(nil), FlexVolume:(*core.FlexVolumeSource)(nil), Cinder:(*core.CinderVolumeSource)(nil), CephFS:(*core.CephFSVolumeSource)(nil), Flocker:(*core.FlockerVolumeSource)(nil), DownwardAPI:(*core.DownwardAPIVolumeSource)(nil), FC:(*core.FCVolumeSource)(nil), AzureFile:(*core.AzureFileVolumeSource)(nil), ConfigMap:(*core.ConfigMapVolumeSource)(0xc0347b8800), VsphereVolume:(*core.VsphereVirtualDiskVolumeSource)(nil), AzureDisk:(*core.AzureDiskVolumeSource)(nil), PhotonPersistentDisk:(*core.PhotonPersistentDiskVolumeSource)(nil), Projected:(*core.ProjectedVolumeSource)(nil), PortworxVolume:(*core.PortworxVolumeSource)(nil), ScaleIO:(*core.ScaleIOVolumeSource)(nil), StorageOS:(*core.StorageOSVolumeSource)(nil), CSI:(*core.CSIVolumeSource)(nil)}}}, InitContainers:[]core.Container{core.Container{Name:\"check-pqsql-online\", Image:\"cp.icr.io/cp/cp4mcm/postgresql@sha256:2b52f8cb564920205e5e646cd8ff5ab774759fb330ae90de5a5d525e1864fbf8\", Command:[]string{\"sh\", \"-c\", \"until pg_isready --host my-release-postgresql-svc --port 5432; do echo waiting for database; sleep 2; done;\"}, Args:[]string(nil), WorkingDir:\"\", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar(nil), Resources:core.ResourceRequirements{Limits:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:500, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"500m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:536870912, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"\", Format:\"BinarySI\"}}, Requests:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:250, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"250m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:268435456, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"\", Format:\"BinarySI\"}}}, VolumeMounts:[]core.VolumeMount(nil), VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), StartupProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:\"/dev/termination-log\", TerminationMessagePolicy:\"File\", ImagePullPolicy:\"IfNotPresent\", SecurityContext:(*core.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, Containers:[]core.Container{core.Container{Name:\"my-release-init-db\", Image:\"cp.icr.io/cp/cp4mcm/postgresql@sha256:2b52f8cb564920205e5e646cd8ff5ab774759fb330ae90de5a5d525e1864fbf8\", Command:[]string{\"psql\", \"--host\", \"my-release-postgresql-svc\", \"--port\", \"5432\", \"-U\", \"postgres\", \"-f\", \"/tmp/init-db/db-init.sql\"}, Args:[]string(nil), WorkingDir:\"\", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar{core.EnvVar{Name:\"PGPASSWORD\", Value:\"postgres\", ValueFrom:(*core.EnvVarSource)(nil)}}, Resources:core.ResourceRequirements{Limits:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:500, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"500m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:536870912, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"\", Format:\"BinarySI\"}}, Requests:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:250, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"250m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:268435456, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"\", Format:\"BinarySI\"}}}, VolumeMounts:[]core.VolumeMount{core.VolumeMount{Name:\"config\", ReadOnly:true, MountPath:\"/tmp/init-db\", SubPath:\"\", MountPropagation:(*core.MountPropagationMode)(nil), SubPathExpr:\"\"}}, VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), StartupProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:\"/dev/termination-log\", TerminationMessagePolicy:\"FallbackToLogsOnError\", ImagePullPolicy:\"IfNotPresent\", SecurityContext:(*core.SecurityContext)(0xc05624e2d0), Stdin:false, StdinOnce:false, TTY:false}}, EphemeralContainers:[]core.EphemeralContainer(nil), RestartPolicy:\"Never\", TerminationGracePeriodSeconds:(*int64)(0xc066093070), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:\"ClusterFirst\", NodeSelector:map[string]string(nil), ServiceAccountName:\"my-release-postgresql-svcacc\", AutomountServiceAccountToken:(*bool)(nil), NodeName:\"\", SecurityContext:(*core.PodSecurityContext)(0xc05e3ed030), ImagePullSecrets:[]core.LocalObjectReference(nil), Hostname:\"\", Subdomain:\"\", Affinity:(*core.Affinity)(nil), SchedulerName:\"default-scheduler\", Tolerations:[]core.Toleration(nil), HostAliases:[]core.HostAlias(nil), PriorityClassName:\"\", Priority:(*int32)(nil), PreemptionPolicy:(*core.PreemptionPolicy)(nil), DNSConfig:(*core.PodDNSConfig)(nil), ReadinessGates:[]core.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), Overhead:core.ResourceList(nil), EnableServiceLinks:(*bool)(nil), TopologySpreadConstraints:[]core.TopologySpreadConstraint(nil)}}: field is immutable]","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tpkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/operator-framework/operator-sdk/pkg/helm/controller.HelmOperatorReconciler.Reconcile\n\tsrc/github.com/operator-framework/operator-sdk/pkg/helm/controller/reconcile.go:247\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tpkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tpkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\tpkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\tpkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tpkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\tpkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88"}
@grantliu3 I'm having trouble understanding what the root cause of this error is. Can you share your chart? Does your chart or pre-upgrade job create these secrets?
+1 for the @joelanford questions and also I'd like to raise a few ones:
helm.operator-sdk/upgrade-force. See: https://v0-19-x.sdk.operatorframework.io/docs/helm/reference/advanced_features/@grantliu3 I'm having trouble understanding what the root cause of this error is. Can you share your chart? Does your chart or pre-upgrade job create these secrets?
@joelanford sorry I can't share my charts, but I think it's easy to reproduce, I'm going to have a try to use memcached-operator to reproduce it, if reproduced I'll send you the updated operator code for reproduce/investigation.
+1 for the @joelanford questions and also I'd like to raise a few ones:
* If you use only helm (helm --upgrade) to upgrade the chart in this scenario, would not you face the same? * would not the option --force to upgrade the chart solve your scenario? The --force option would give up on trying to patch if it fails, destroy and re-create the resource. See for 1.0+ https://sdk.operatorframework.io/docs/building-operators/helm/reference/advanced_features/annotations/ and for the 0.19 the annotation would be `helm.operator-sdk/upgrade-force`. See: https://v0-19-x.sdk.operatorframework.io/docs/helm/reference/advanced_features/
@camilamacedo86
helm --upgrade to upgrade directly, the helm upgrade will be hung, and exits when timeout.metadata:
annotations:
helm.operator-sdk/upgrade-force: "True"
@grantliu3 I'm having trouble understanding what the root cause of this error is. Can you share your chart? Does your chart or pre-upgrade job create these secrets?
@joelanford sorry I can't share my charts, but I think it's easy to reproduce, I'm going to have a try to use memcached-operator to reproduce it, if reproduced I'll send you the updated operator code for reproduce/investigation.
I'm not able to reproduce this problem when using memcached-operator, still investigating the key steps that are required to reproduce it, will update this issue if any progress.