Test-infra: Prowjob should have a first-class timeout

Created on 6 Mar 2018 · 29 comments · Source: kubernetes/test-infra

A pod can be stuck in the running state and we'd never know; sinker does not clean up running prowjobs. It might be necessary to give the prowjob a timeout field/annotation/label.

/area prow

area/prow help wanted lifecycle/rotten

Most helpful comment

I would like to take this issue up.
A way that I can think of is to also check the prowjob's age when checking the condition !prowJob.Complete() for presubmits/postsubmits & periodics in sinker.
If the age is more than MaxPodRunningAge, mark it for deletion.

All 29 comments

cc @cjwagner @BenTheElder @kargakis @stevekuznetsov @dims

The thought was that we could add this to the entrypoint wrapper.

/assign

Sounds reasonable. Is this obsoleting pod_pending_timeout in plank?

Is this obsoleting pod_pending_timeout in plank?

No, we will still want both config options. pod_pending_timeout limits the time pods spend in the pending state, which is distinct from what the proposed field would limit: the actual duration of the entire job.

We'd also like to set this per-pod. Some jobs actually do take 12+ hours, but most should terminate after, say, 2.
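
For illustration only, a rough Go sketch of how the two knobs could sit side by side; `Plank.PodPendingTimeout` mirrors the existing `pod_pending_timeout` option, while `JobBase.Timeout` is the hypothetical per-job field this issue proposes (names and placement are assumptions, not the current config API):

```go
// Sketch only: PodPendingTimeout mirrors plank's existing pod_pending_timeout
// option; Timeout on JobBase is the hypothetical per-job field proposed here.
package config

import "time"

// Plank holds controller-wide options.
type Plank struct {
	// PodPendingTimeout bounds how long a pod may sit in Pending
	// before the controller gives up on it.
	PodPendingTimeout time.Duration
}

// JobBase would carry the per-job override, so a 12h e2e job and a
// 2h unit-test job can declare different overall limits.
type JobBase struct {
	// Timeout bounds the total runtime of the job, from pod creation
	// to completion, regardless of which phase the pod is in.
	Timeout *time.Duration
}
```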

/open

this isn't exactly fixed yet

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale
we've seen more pods stuck in a running state

Edit: ideally sinker/plank would be able to terminate pods that are stuck in running for an excessively long period, which should be more reliable than relying on timeouts within the container (though we should also leverage those for a clean exit if possible).

I would like to take this issue up.
A way that I can think of is to also check the prowjob's age when checking the condition !prowJob.Complete() for presubmits/postsubmits & periodics in sinker.
If the age is more than MaxPodRunningAge, mark it for deletion.
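
A minimal sketch of what that check could look like, assuming a `maxProwJobRunningAge` constant and a small helper in sinker's cleanup path (both names are illustrative, not the actual sinker code):

```go
// Sketch only: maxProwJobRunningAge and shouldClean are illustrative names,
// not part of the actual sinker implementation.
package sinker

import (
	"time"

	prowapi "k8s.io/test-infra/prow/apis/prowjobs/v1"
)

const maxProwJobRunningAge = 24 * time.Hour // assumed default, for illustration

// shouldClean reports whether a still-running ProwJob is old enough
// for sinker to mark it (and its pod) for deletion.
func shouldClean(pj prowapi.ProwJob, now time.Time) bool {
	// Completed jobs are already handled by the existing cleanup path.
	if pj.Complete() {
		return false
	}
	// For presubmits, postsubmits and periodics that never complete,
	// fall back to the job's age since it started.
	if pj.Status.StartTime.IsZero() {
		return false
	}
	return now.Sub(pj.Status.StartTime.Time) > maxProwJobRunningAge
}
```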

Sounds reasonable to me @sipian

@sipian any update on this?

@spiffxp
Sorry. I got busy with some other things.
If it is alright, I'll try to finish this up in the coming few weeks.

we already have this with podutils?

@krzyzacy That timeout is implemented at the wrapper process level whereas the timeout suggested by this issue would occur at the ProwJob level. A PJ level timeout could help us abort jobs that fail to schedule or start the wrapper process for whatever reason. The pod utilities' timeout would not catch these issues.

I think the original issue was that a pod can be stuck forever. Does the podutil timeout wrap entrypoint? Maybe we should utilize that for aborting jobs that fail to start, since I feel like adding another timeout would be confusing.

I thought that we had intended to have multiple levels of timeouts, but I can't find the issue where that was discussed. (Maybe @stevekuznetsov knows or remembers discussing?)
The pod utils could share a timeout with the ProwJob so long as the pod utils retain the configuration for the grace period, but if we did that we would want to move the timeout field to become a first class ProwJob field rather than a decoration config field. The grace period needs to be a decoration config field, because the ProwJob level timeout won't have a concept of a graceful termination signal.
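
A sketch of the split described above, assuming the timeout is promoted to a first-class ProwJob spec field while the grace period stays in the decoration config; field names and placement are illustrative, not the current API:

```go
// Sketch only: this shows the proposed placement, not the existing types.
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ProwJobSpec with a first-class, job-wide timeout.
type ProwJobSpec struct {
	// Timeout bounds the whole job: scheduling, wrapper startup, and test
	// execution. The controller (plank/sinker) enforces it, and on expiry
	// can only delete the pod.
	Timeout *metav1.Duration
	// ... other existing fields elided ...
}

// DecorationConfig keeps the grace period, because only the in-pod
// entrypoint wrapper can deliver a graceful termination signal to the test.
type DecorationConfig struct {
	// GracePeriod is how long entrypoint waits between sending SIGTERM
	// and SIGKILL to the test process.
	GracePeriod *metav1.Duration
	// ... other existing fields elided ...
}
```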

because the ProwJob level timeout won't have a concept of a graceful termination signal.

If plank or sinker or whatever does a DELETE on the Pod, the default behavior of the kubelet is to send a graceful deletion if the client doesn't ask for a force delete, so in a way it will work anyway. But I agree that we need to have one at all levels -- one for ginkgo, one for the entrypoint, one for the PJ -- all orchestrated. I can't remember where we had that conversation; it may have been in a meeting.
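
For reference, a plain delete with a grace period from a controller could look like this with a recent client-go API; the namespace, pod name, and 30s grace period are example values:

```go
// Sketch only: a non-forced Delete with a grace period results in graceful
// termination by the kubelet (SIGTERM, then SIGKILL after the grace period).
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func deleteStuckPod(namespace, name string) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	grace := int64(30) // seconds; example value, not a recommendation
	return client.CoreV1().Pods(namespace).Delete(context.TODO(), name, metav1.DeleteOptions{
		GracePeriodSeconds: &grace,
	})
}

func main() {
	// "test-pods" and "stuck-prowjob-pod" are hypothetical names.
	if err := deleteStuckPod("test-pods", "stuck-prowjob-pod"); err != nil {
		log.Fatal(err)
	}
}
```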

Agreed that each part will need its own timeout. Maybe prow should by default populate these timeouts based off the podutil timeouts in some manner, rather than letting the user set four different timeouts in the config?
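
Purely as an illustration of that idea, the outer ProwJob-level timeout could be derived from the podutil settings plus some slack; the padding value here is made up:

```go
// Sketch only: derive the outer ProwJob-level timeout from the podutils
// settings so users configure only the inner timeouts.
package sketch

import "time"

func prowJobTimeout(podutilTimeout, gracePeriod time.Duration) time.Duration {
	// Slack for scheduling, image pulls, and artifact upload; made-up value.
	const schedulingPadding = 15 * time.Minute
	return podutilTimeout + gracePeriod + schedulingPadding
}
```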

Sorry, I can't seem to find time for this.
I am releasing this issue. Anyone else interested can take this up.

Also would like to see controllers enforce a timeout on the prowjob

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

/remove-lifecycle rotten

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
