Test-infra: Prowjob should have a first-class timeout

Created on 6 Mar 2018 · 29 comments · Source: kubernetes/test-infra

A pod can be stuck in the running state and we'd never know; sinker does not clean up running prowjobs. It might be necessary to give the prowjob a timeout field/annotation/label.

/area prow

area/prow help wanted lifecycle/rotten

Most helpful comment

I would like to take this issue up.
A way that I can think of is to also check the prowjob's age when checking the condition !prowJob.Complete() for presubmits/postsubmits & periodics in sinker.
If the age is more than MaxPodRunningAge, mark it for deletion.

All 29 comments

cc @cjwagner @BenTheElder @kargakis @stevekuznetsov @dims

The thought was that we could add this to the entrypoint wrapper.

/assign

Sounds reasonable. Is this obsoleting pod_pending_timeout in plank?

Is this obsoleting pod_pending_timeout in plank?

No, we will still want both config options. pod_pending_timeout limits the time pods spend in the pending state, which is distinct from what the proposed field would limit: the actual duration of the entire job.

We'd also like to set this per-pod. Some jobs actually do take 12+ hours, but most should terminate after, say, 2.
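
For illustration only, a rough Go sketch of how the two knobs could sit side by side; `Plank.PodPendingTimeout` mirrors the existing `pod_pending_timeout` option, while `JobBase.Timeout` is the hypothetical per-job field this issue proposes (names and placement are assumptions, not the current config API):

```go
// Sketch only: PodPendingTimeout mirrors plank's existing pod_pending_timeout
// option; Timeout on JobBase is the hypothetical per-job field proposed here.
package config

import "time"

// Plank holds controller-wide options.
type Plank struct {
	// PodPendingTimeout bounds how long a pod may sit in Pending
	// before the controller gives up on it.
	PodPendingTimeout time.Duration
}

// JobBase would carry the per-job override, so a 12h e2e job and a
// 2h unit-test job can declare different overall limits.
type JobBase struct {
	// Timeout bounds the total runtime of the job, from pod creation
	// to completion, regardless of which phase the pod is in.
	Timeout *time.Duration
}
```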

/open

this isn't exactly fixed yet

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale
we've seen more pods stuck in a running state

Edit: ideally sinker/plank would be able to terminate pods that are stuck in running for an excessively long period, which should be more reliable than relying on timeouts within the container (though we should also leverage those for a clean exit if possible).

I would like to take this issue up.
A way that I can think of is to also check the prowjob's age when checking the condition !prowJob.Complete() for presubmits/postsubmits & periodics in sinker.
If the age is more than MaxPodRunningAge, mark it for deletion.
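
A minimal sketch of what that check could look like, assuming a `maxProwJobRunningAge` constant and a small helper in sinker's cleanup path (both names are illustrative, not the actual sinker code):

```go
// Sketch only: maxProwJobRunningAge and shouldClean are illustrative names,
// not part of the actual sinker implementation.
package sinker

import (
	"time"

	prowapi "k8s.io/test-infra/prow/apis/prowjobs/v1"
)

const maxProwJobRunningAge = 24 * time.Hour // assumed default, for illustration

// shouldClean reports whether a still-running ProwJob is old enough
// for sinker to mark it (and its pod) for deletion.
func shouldClean(pj prowapi.ProwJob, now time.Time) bool {
	// Completed jobs are already handled by the existing cleanup path.
	if pj.Complete() {
		return false
	}
	// For presubmits, postsubmits and periodics that never complete,
	// fall back to the job's age since it started.
	if pj.Status.StartTime.IsZero() {
		return false
	}
	return now.Sub(pj.Status.StartTime.Time) > maxProwJobRunningAge
}
```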

Sounds reasonable to me @sipian

@sipian any update on this?

@spiffxp
Sorry. I got busy with some other things.
If it is alright, I'll try to finish this up in the coming few weeks.

we already have this with podutils?

@krzyzacy That timeout is implemented at the wrapper process level whereas the timeout suggested by this issue would occur at the ProwJob level. A PJ level timeout could help us abort jobs that fail to schedule or start the wrapper process for whatever reason. The pod utilities' timeout would not catch these issues.

I think the original issue was that a pod can be stuck forever. Does the podutil timeout wrap entrypoint? Maybe we should utilize that for aborting jobs that fail to start, since I feel like adding another timeout would be confusing.

I thought that we had intended to have multiple levels of timeouts, but I can't find the issue where that was discussed. (Maybe @stevekuznetsov knows or remembers discussing?)
The pod utils could share a timeout with the ProwJob so long as the pod utils retain the configuration for the grace period, but if we did that we would want to move the timeout field to become a first class ProwJob field rather than a decoration config field. The grace period needs to be a decoration config field, because the ProwJob level timeout won't have a concept of a graceful termination signal.
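
A sketch of the split described above, assuming the timeout is promoted to a first-class ProwJob spec field while the grace period stays in the decoration config; field names and placement are illustrative, not the current API:

```go
// Sketch only: this shows the proposed placement, not the existing types.
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ProwJobSpec with a first-class, job-wide timeout.
type ProwJobSpec struct {
	// Timeout bounds the whole job: scheduling, wrapper startup, and test
	// execution. The controller (plank/sinker) enforces it, and on expiry
	// can only delete the pod.
	Timeout *metav1.Duration
	// ... other existing fields elided ...
}

// DecorationConfig keeps the grace period, because only the in-pod
// entrypoint wrapper can deliver a graceful termination signal to the test.
type DecorationConfig struct {
	// GracePeriod is how long entrypoint waits between sending SIGTERM
	// and SIGKILL to the test process.
	GracePeriod *metav1.Duration
	// ... other existing fields elided ...
}
```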

because the ProwJob level timeout won't have a concept of a graceful termination signal.

If plank or sinker or whatever does a DELETE on the Pod, the default behavior of the kubelet is to send a graceful deletion if the client doesn't ask for a force delete, so in a way it will work anyway. But I agree that we need to have one at all levels -- one for ginkgo, one for the entrypoint, one for the PJ -- all orchestrated. I can't remember where we had that conversation; it may have been in a meeting.
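
For reference, a plain delete with a grace period from a controller could look like this with a recent client-go API; the namespace, pod name, and 30s grace period are example values:

```go
// Sketch only: a non-forced Delete with a grace period results in graceful
// termination by the kubelet (SIGTERM, then SIGKILL after the grace period).
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func deleteStuckPod(namespace, name string) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	grace := int64(30) // seconds; example value, not a recommendation
	return client.CoreV1().Pods(namespace).Delete(context.TODO(), name, metav1.DeleteOptions{
		GracePeriodSeconds: &grace,
	})
}

func main() {
	// "test-pods" and "stuck-prowjob-pod" are hypothetical names.
	if err := deleteStuckPod("test-pods", "stuck-prowjob-pod"); err != nil {
		log.Fatal(err)
	}
}
```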

Agreed that each part will need its own timeout. Maybe prow should by default populate these timeouts based off the podutil timeouts in some manner, rather than letting the user set four different timeouts in the config?
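
Purely as an illustration of that idea, the outer ProwJob-level timeout could be derived from the podutil settings plus some slack; the padding value here is made up:

```go
// Sketch only: derive the outer ProwJob-level timeout from the podutils
// settings so users configure only the inner timeouts.
package sketch

import "time"

func prowJobTimeout(podutilTimeout, gracePeriod time.Duration) time.Duration {
	// Slack for scheduling, image pulls, and artifact upload; made-up value.
	const schedulingPadding = 15 * time.Minute
	return podutilTimeout + gracePeriod + schedulingPadding
}
```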

Sorry, I can't seem to find time for this.
I am releasing this issue. Anyone else interested can take this up.

Also would like to see controllers enforce a timeout on the prowjob

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

/remove-lifecycle rotten

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
