Test-infra: Prow should abort jobs stuck in 'ImagePullBackoff' after a configurable amount of time.

Created on 4 Oct 2018 · 15Comments · Source: kubernetes/test-infra

Alternatively (or additionally) we could alert on this as it is likely more of an infra problem than a PR related problem.

We really just need to expose some signal that a job is stuck in ImagePullBackoff. Currently the failure is very opaque and persists indefinitely if the image never becomes available.

/cc @amwat @BenTheElder @stevekuznetsov @kargakis
/area prow
/area prow/plank
/kind bug

areprow areproplank kinbug

Source

cjwagner

Most helpful comment

ContainerCreating is a normal state and you can't really tell if it is
stuck or not. Image pull errors can be considered transient in some
environments too. I think a ttl in the prowjob or a ttl-until-running seems
appropriate.

On Thu, Oct 4, 2018, 03:20 Cole Wagner notifications@github.com wrote:

Dont' we have an existing issue for this?

I also thought we did, but couldn't find one when I searched.

We should also extend this to "ContainerCreating".

👍 Yes. If init containers fail or volumes cannot be provisioned we
should abort/fail the ProwJob and expose the error message somehow.

Also, IIUC in point 3 you mean edit the prowjob in cluster i.e. simply
updating the job configmap doesn't help.

Correct. "Delete the ProwJob." also refers to the ProwJob cluster resource
not the YAML job configuration.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/kubernetes/test-infra/issues/9694#issuecomment-426854588,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ADuFfzvfNa9q0uhswdm7ImDLR_Lxx1Oiks5uhWJqgaJpZM4XHGkE
.

kargakis on 4 Oct 2018

👍3

All 15 comments

Currently there are three ways to work around this problem if you encounter it:

Publish an image to the appropriate location so that the ImagePull can succeed.
Delete the ProwJob. (note that just deleting the stuck pod is insufficient)
Update the image reference in the prowjob to point to an image that can be pulled and delete the stuck pod. A new pod will be created with the new image ref.

cjwagner on 4 Oct 2018

/cc

now we need to parse the actual state/event, and set the prowjob as aborted & clean up the pod?
another thought: how about have a presubmit that makes sure your prow can pull the image?

krzyzacy on 4 Oct 2018

Dont' we have an existing issue for this?

I definitely think we need to handle this. It would be nice if kubernetes did, but I don't think that's likely to change anytime soon as it would be a behavioral change and this issue is going to happen a lot for prow users.

Plank or sinker should handle this by making the prowjob failed or aborted in some fashion.

how about have a presubmit that makes sure your prow can pull the image?

no, because it can still fail when you schedule the job. this is skipping around the actual problem.

BenTheElder on 4 Oct 2018

We should also extend this to "ContainerCreating".
Also, IIUC in point 3 you mean edit the prowjob in cluster i.e. simply updating the job configmap doesn't help.

amwat on 4 Oct 2018

Dont' we have an existing issue for this?

I also thought we did, but couldn't find one when I searched.

We should also extend this to "ContainerCreating".

👍 Yes. If init containers fail or volumes cannot be provisioned we should abort/fail the ProwJob and expose the error message somehow.

Also, IIUC in point 3 you mean edit the prowjob in cluster i.e. simply updating the job configmap doesn't help.

Correct. "Delete the ProwJob." also refers to the ProwJob cluster resource not the YAML job configuration.

cjwagner on 4 Oct 2018

👍1

InitContainer failure will fail the Pod, but mount failures due to missing Secrets or ConfigMaps is valid -- but why delete the ProwJob? Would it not be better to mark it as failed so it can be reported out? Otherwise users will just see missing tests and contexts stuck in pending?

stevekuznetsov on 4 Oct 2018

It should be marked failed, talk of deleting it is about mitigations that
are manually possible today but undesireable vs prow marking it failed for
us.

On Wed, Oct 3, 2018, 23:36 Steve Kuznetsov notifications@github.com wrote:

InitContainer failure will fail the Pod, but mount failures due to
missing Secrets or ConfigMaps is valid -- but why delete the ProwJob?
Would it not be better to mark it as failed so it can be reported out?
Otherwise users will just see missing tests and contexts stuck in pending?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/kubernetes/test-infra/issues/9694#issuecomment-426901710,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AA4Bq2FeiT1z3-mHwdkdHQSZY6SFOj8Rks5uhaxugaJpZM4XHGkE
.

BenTheElder on 4 Oct 2018

👍1

On Thu, Oct 4, 2018, 03:20 Cole Wagner notifications@github.com wrote:

Dont' we have an existing issue for this?

I also thought we did, but couldn't find one when I searched.

We should also extend this to "ContainerCreating".

👍 Yes. If init containers fail or volumes cannot be provisioned we
should abort/fail the ProwJob and expose the error message somehow.

Also, IIUC in point 3 you mean edit the prowjob in cluster i.e. simply
updating the job configmap doesn't help.

Correct. "Delete the ProwJob." also refers to the ProwJob cluster resource
not the YAML job configuration.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/kubernetes/test-infra/issues/9694#issuecomment-426854588,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ADuFfzvfNa9q0uhswdm7ImDLR_Lxx1Oiks5uhWJqgaJpZM4XHGkE
.

kargakis on 4 Oct 2018

👍3

Any TTL should be generous so as to ensure the kubelet has time to pull all the images -- every part of pod utilities, the test image, etc.

stevekuznetsov on 4 Oct 2018

+1. We see pull flakes often enough, it needs to be some ttl-until-running.
If a prowjob takes an three hours to start (ok realistically less than
that) we should go ahead and consider it failed / aborted.

Any opinions on plank / sinker / ...?

On Wed, Oct 3, 2018, 23:51 Michalis Kargakis notifications@github.com
wrote:

ContainerCreating is a normal state and you can't really tell if it is
stuck or not. Image pull errors can be considered transient in some
environments too. I think a ttl in the prowjob or a ttl-until-running seems
appropriate.

On Thu, Oct 4, 2018, 03:20 Cole Wagner notifications@github.com wrote:

Dont' we have an existing issue for this?

I also thought we did, but couldn't find one when I searched.

We should also extend this to "ContainerCreating".

👍 Yes. If init containers fail or volumes cannot be provisioned we
should abort/fail the ProwJob and expose the error message somehow.

Also, IIUC in point 3 you mean edit the prowjob in cluster i.e. simply
updating the job configmap doesn't help.

Correct. "Delete the ProwJob." also refers to the ProwJob cluster
resource
not the YAML job configuration.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<
https://github.com/kubernetes/test-infra/issues/9694#issuecomment-426854588
,
or mute the thread
<
https://github.com/notifications/unsubscribe-auth/ADuFfzvfNa9q0uhswdm7ImDLR_Lxx1Oiks5uhWJqgaJpZM4XHGkE

.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/kubernetes/test-infra/issues/9694#issuecomment-426904964,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AA4Bqyj6YikRPOML4Sk4tVT2RpsoRuy_ks5uha_lgaJpZM4XHGkE
.

BenTheElder on 4 Oct 2018

It seems appropriate to place into plank to ensure we have just one controller managing ProwJob.Status

stevekuznetsov on 4 Oct 2018

👍2

Looks like we're already doing this thanks to @krzyzacy in https://github.com/kubernetes/test-infra/commit/b5e0a45af555249f15279df075f67198870f6fed ...

https://github.com/kubernetes/test-infra/blob/0c8689c04169d754ff42570561cd60a0cc9cabf2/prow/plank/controller.go#L393-L405

However, by default the timeout is a full day:

https://github.com/kubernetes/test-infra/blob/0c8689c04169d754ff42570561cd60a0cc9cabf2/prow/config/config.go#L752-L760

So do we just want to tune the timeout down?

stevekuznetsov on 15 Oct 2018

:joy: I didn't even remember that change - https://github.com/kubernetes/test-infra/blob/master/prow/config.yaml#L5 we do set a pod pending timeout as 1h though, so this should be... WAI?

krzyzacy on 15 Oct 2018

/close

Sounds like we've addressed this based on the above, please /reopen if I'm incorrect

spiffxp on 3 Jan 2019

@spiffxp: Closing this issue.

In response to this:

/close

Sounds like we've addressed this based on the above, please /reopen if I'm incorrect

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.