Applying an update (typically a new container image tag from our CI/CD pipeline) to a ScaledObject with scaleType: job terminates all running jobs.
This does not fit the run-to-completion nature of jobs. We need to make sure that deploying new code does not interrupt our long-running simulations, which are the main reason we chose jobs over deployments.
Expected behavior: jobs that have already started run to completion with the configuration they were started with.
New jobs triggered afterwards (e.g. by new incoming queue messages) should run with the new configuration.
Actual behavior: already running jobs and their associated pods are terminated and deleted.
apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
  name: my-long-running-scaled-job
  namespace: default
spec:
  scaleType: job
  pollingInterval: 10 # Optional. Default: 30 seconds
  maxReplicaCount: 15 # Optional. Default: 100
  minReplicaCount: 0 # Optional. Default: 0
  cooldownPeriod: 30 # Optional. Default: 300 seconds
  jobTargetRef:
    parallelism: 1 # [max number of desired pods](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#controlling-parallelism)
    completions: 1 # [desired number of successfully finished pods](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#controlling-parallelism)
    activeDeadlineSeconds: 900 # Specifies the duration in seconds relative to the startTime that the job may be active before the system tries to terminate it; value must be positive integer
    backoffLimit: 6 # Specifies the number of retries before marking this job failed. Defaults to 6
    template:
      # describes the [job template](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/)
      metadata:
        labels:
          jobgroup: somejobgroupthing
      spec:
        containers:
        - name: busybox-looping
          image: busybox
          command: ['sh', '-c', 'x=1;while [ $x -le 100 ]; do let y=x*2; let z=x*3; let a=x*4; echo $x $y $z $a ; sleep 1; let x=x+1;done']
          env:
          - name: THE_QUEUE
            value: mytestqueuethatijustaddamessageto
          - name: STORAGE_ACCOUNT_CONNECTION_STRING
            valueFrom:
              secretKeyRef:
                name: my-secrets
                key: STORAGE_ACCOUNT_CONNECTION_STRING
        restartPolicy: Never
  triggers:
  - type: azure-queue
    metadata:
      queueName: mytestqueuethatijustaddamessageto
      queueLength: '20' # Optional. Queue length target for HPA. Default: 5 messages
      connection: STORAGE_ACCOUNT_CONNECTION_STRING
kubectl apply -f my-busybox-job-test.yaml
spec.jobTargetRef.template.spec.containers.image, or command,

We are seeing this as well with our long running jobs and it does not play nice with the continuous delivery nature of our code bases that are using containers being scaled by KEDA.
The other alternative, of course, is to ensure that all batch jobs run via KEDA use some kind of saga pattern: when a job is interrupted, and it is driven off a queue with a visibility window, the job is kicked off again and can resume close to where it left off. However, this depends on the nature of the work being done and is not always possible.
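
For illustration, the resume-after-interruption idea could look roughly like the sketch below. It is not taken from anyone's actual job code: `run_steps` and the checkpoint file are hypothetical, and the checkpoint is assumed to live on storage that outlives the pod (a mounted volume, for example). The worker records each finished step, so a restarted job picks up at the next one.

```shell
#!/bin/sh
# Hypothetical checkpointed worker (a sketch, not KEDA functionality).
# Usage: run_steps <checkpoint_file> <total_steps>
run_steps() {
  checkpoint_file=$1   # assumed to be on storage that survives pod deletion
  total_steps=$2       # number of steps in the whole job

  # Resume after the last recorded step, or start at 1 if nothing is recorded.
  if [ -s "$checkpoint_file" ]; then
    step=$(( $(cat "$checkpoint_file") + 1 ))
  else
    step=1
  fi

  while [ "$step" -le "$total_steps" ]; do
    echo "processing step $step"   # placeholder for the real work
    # Record the step only after its work is done; write to a temp file and
    # rename it so an interruption never leaves a half-written checkpoint.
    echo "$step" > "$checkpoint_file.tmp" && mv "$checkpoint_file.tmp" "$checkpoint_file"
    step=$(( step + 1 ))
  done
}
```

Called as `run_steps /data/checkpoint 100` (the path is an assumption), a job killed after step 40 would continue at step 41 on its next run. This only works if each step is safe to repeat, since a step interrupted before its checkpoint is written will run again.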
@TsuyoshiUshio Is this behavior the same with 2.0?
I have upgraded to keda-2.0.0-beta on our test cluster now, and as far as I can see, this issue seems to be fixed there. Thanks!
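
For anyone migrating: KEDA 2.0 replaces `scaleType: job` on a ScaledObject with a separate `ScaledJob` resource under the `keda.sh/v1alpha1` API group. The sketch below writes what the manifest from this issue might look like after conversion. It is an untested approximation, `write_scaledjob_manifest` is a hypothetical helper, and the trigger field `connectionFromEnv` (in place of `connection`) is my understanding of the 2.0 rename, so please check it against the 2.0 docs before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: writes an assumed KEDA 2.0 ScaledJob manifest to the given path.
write_scaledjob_manifest() {
  cat > "$1" <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: my-long-running-scaled-job
  namespace: default
spec:
  pollingInterval: 10
  maxReplicaCount: 15
  jobTargetRef:
    parallelism: 1
    completions: 1
    activeDeadlineSeconds: 900
    backoffLimit: 6
    template:
      spec:
        containers:
        - name: busybox-looping
          image: busybox
          command: ['sh', '-c', 'sleep 100']
          env:
          - name: STORAGE_ACCOUNT_CONNECTION_STRING
            valueFrom:
              secretKeyRef:
                name: my-secrets
                key: STORAGE_ACCOUNT_CONNECTION_STRING
        restartPolicy: Never
  triggers:
  - type: azure-queue
    metadata:
      queueName: mytestqueuethatijustaddamessageto
      queueLength: '20'
      # Assumed 2.0 name for the env var reference (was "connection" in 1.x)
      connectionFromEnv: STORAGE_ACCOUNT_CONNECTION_STRING
EOF
}
```

After `write_scaledjob_manifest my-busybox-scaledjob.yaml` (the file name is arbitrary), the result can be applied with `kubectl apply -f` the same way as the original manifest.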
I am happy to close this issue then, unless you would like to address this somehow for 1.x as well (behavior and/or its docs or something).
Let's close this then indeed, we don't have concrete plans to ship a new 1.x version.