Kubernetes: There's no way to stop a job from restarting forever on failure, documentation is wrong

Created on 8 Jan 2017 · 6 Comments · Source: kubernetes/kubernetes

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): jobs, activeDeadlineSeconds, restartPolicy

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.6", GitCommit:"e569a27d02001e343cb68086bc06d47804f62af6", GitTreeState:"clean", BuildDate:"2016-11-12T05:22:15Z", GoVersion:"go1.7.1", Compiler:"gc", Platform:"darwin/amd64"}

Environment:

Cloud provider or hardware configuration: GCE
OS (e.g. from /etc/os-release): macos
Kernel (e.g. uname -a): darwin
Install tools: N/A
Others: N/A

What happened:
Created a job, specified activeDeadlineSeconds as per documentation which explicitly says:

However, if you prefer not to retry forever, you can set a deadline on the job. Do this by setting the spec.activeDeadlineSeconds field of the job to a number of seconds. The job will have status with reason: DeadlineExceeded. No more pods will be created, and existing pods will be deleted.

What happened instead:

[Screenshot: 2017-01-07 at 9 12 pm]

What you expected to happen:

The job should stop restarting once the deadline is exceeded (DeadlineExceeded), as per the documentation.

How to reproduce it (as minimally and precisely as possible):

Create a job whose container exits with a non-zero exit code.
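
A minimal sketch of such a job, for illustration only. The job name, container name, and busybox image are assumptions and are not from my actual setup; any image with a shell should behave the same way.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: always-fails          # hypothetical name
spec:
  template:
    spec:
      containers:
      - name: fail            # hypothetical container name
        image: busybox        # assumed image, not the one from this report
        command: ["sh", "-c", "exit 1"]   # always exits with a non-zero code
      restartPolicy: Never
```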

Anything else do we need to know:


btipling

👍2

Most helpful comment

Yeah, my thing is to just run a pod that uses cron and avoid jobs and Kubernetes crons altogether. I don't understand the Kubernetes cron and job use case, but I love Kubernetes and it's made by smart people, so I'm sure a lot of thought went into it, snark aside.

btipling on 25 Jan 2017

👍2

All 6 comments

My Yaml was:

apiVersion: batch/v1
kind: Job
metadata:
  name: sphela-letsencrypt
spec:
  template:
    metadata:
      labels:
        run: sphela-letsencrypt
        app: sphela
    spec:
      activeDeadlineSeconds: 1
      imagePullSecrets:
      - name: docker-registry-secret
      containers:
      - name: letsencrypt
        image: gcr.io/sphela-153202/sphela-letsencrypt:v6
        ports:
        - containerPort: 80
        imagePullPolicy: Always
        command: [
          "/usr/local/bin/encrypt-script.sh"
        ]
      restartPolicy: Never

I set it to 1 just to see what would happen. I would have expected it to fail and never retry based on the documentation.

/usr/local/bin/encrypt-script.sh is a script that exits with an exit code greater than 0.

The error was my fault, but it resulted in me exhausting my letsencrypt cert creation attempts for a week.

btipling on 8 Jan 2017

nm I noticed this:

Note that both the Job Spec and the Pod Template Spec within the Job have a field with the same name. Set the one on the Job.

And I had it on the pod template spec instead of the Job spec. Why are there two of them?
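
For reference, a rough sketch of the same manifest with the field moved up to the Job spec, as that note suggests. The deadline value of 60 is purely illustrative, and the image pull and port settings are left out for brevity; this is not a tested configuration.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sphela-letsencrypt
spec:
  # Job-level deadline: covers the whole Job, including every pod it creates.
  activeDeadlineSeconds: 60   # illustrative value, not from the original manifest
  template:
    metadata:
      labels:
        run: sphela-letsencrypt
        app: sphela
    spec:
      # No activeDeadlineSeconds here: at this level it would only bound a single pod.
      containers:
      - name: letsencrypt
        image: gcr.io/sphela-153202/sphela-letsencrypt:v6
        command: ["/usr/local/bin/encrypt-script.sh"]
      restartPolicy: Never
```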

btipling on 8 Jan 2017

I am still seeing this issue in minikube version: v0.15.0. I am trying to create a job for functional/perf tests, but if it returns non-zero exit code, it just keeps restarting.

Shubham-Sakhuja-Bose on 25 Jan 2017

Yes @Shubham-Sakhuja-Bose this is by design. I think it's a bad design, but it's intentional. I've been told that if I don't want something to "run to completion" to not use a job. Jobs can only be perfectly written bug free code that always succeeds. ;P

btipling on 25 Jan 2017

😕10 🎉1

@btipling thanks for the quick response! That's unfortunate, but pods it is!

Shubham-Sakhuja-Bose on 25 Jan 2017

👎1

btipling on 25 Jan 2017

👍2
