Test-infra: Prow should abort jobs stuck in 'ImagePullBackoff' after a configurable amount of time.

Created on 4 Oct 2018  ยท  15Comments  ยท  Source: kubernetes/test-infra

Alternatively (or additionally) we could alert on this as it is likely more of an infra problem than a PR related problem.

We really just need to expose some signal that a job is stuck in ImagePullBackoff. Currently the failure is very opaque and persists indefinitely if the image never becomes available.

/cc @amwat @BenTheElder @stevekuznetsov @kargakis
/area prow
/area prow/plank
/kind bug

areprow areproplank kinbug

Most helpful comment

ContainerCreating is a normal state and you can't really tell if it is
stuck or not. Image pull errors can be considered transient in some
environments too. I think a ttl in the prowjob or a ttl-until-running seems
appropriate.

On Thu, Oct 4, 2018, 03:20 Cole Wagner notifications@github.com wrote:

Dont' we have an existing issue for this?

I also thought we did, but couldn't find one when I searched.

We should also extend this to "ContainerCreating".

๐Ÿ‘ Yes. If init containers fail or volumes cannot be provisioned we
should abort/fail the ProwJob and expose the error message somehow.

Also, IIUC in point 3 you mean edit the prowjob in cluster i.e. simply
updating the job configmap doesn't help.

Correct. "Delete the ProwJob." also refers to the ProwJob cluster resource
not the YAML job configuration.

โ€”
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/kubernetes/test-infra/issues/9694#issuecomment-426854588,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ADuFfzvfNa9q0uhswdm7ImDLR_Lxx1Oiks5uhWJqgaJpZM4XHGkE
.

All 15 comments

Currently there are three ways to work around this problem if you encounter it:

  • Publish an image to the appropriate location so that the ImagePull can succeed.
  • Delete the ProwJob. (note that just deleting the stuck pod is insufficient)
  • Update the image reference in the prowjob to point to an image that can be pulled and delete the stuck pod. A new pod will be created with the new image ref.

/cc

now we need to parse the actual state/event, and set the prowjob as aborted & clean up the pod?
another thought: how about have a presubmit that makes sure your prow can pull the image?

Dont' we have an existing issue for this?

I definitely think we need to handle this. It would be nice if kubernetes did, but I don't think that's likely to change anytime soon as it would be a behavioral change and this issue is going to happen a lot for prow users.

Plank or sinker should handle this by making the prowjob failed or aborted in some fashion.

how about have a presubmit that makes sure your prow can pull the image?

no, because it can still fail when you schedule the job. this is skipping around the actual problem.

We should also extend this to "ContainerCreating".
Also, IIUC in point 3 you mean edit the prowjob in cluster i.e. simply updating the job configmap doesn't help.

Dont' we have an existing issue for this?

I also thought we did, but couldn't find one when I searched.

We should also extend this to "ContainerCreating".

๐Ÿ‘ Yes. If init containers fail or volumes cannot be provisioned we should abort/fail the ProwJob and expose the error message somehow.

Also, IIUC in point 3 you mean edit the prowjob in cluster i.e. simply updating the job configmap doesn't help.

Correct. "Delete the ProwJob." also refers to the ProwJob cluster resource not the YAML job configuration.

InitContainer failure will fail the Pod, but mount failures due to missing Secrets or ConfigMaps is valid -- but why delete the ProwJob? Would it not be better to mark it as failed so it can be reported out? Otherwise users will just see missing tests and contexts stuck in pending?

It should be marked failed, talk of deleting it is about mitigations that
are manually possible today but undesireable vs prow marking it failed for
us.

On Wed, Oct 3, 2018, 23:36 Steve Kuznetsov notifications@github.com wrote:

InitContainer failure will fail the Pod, but mount failures due to
missing Secrets or ConfigMaps is valid -- but why delete the ProwJob?
Would it not be better to mark it as failed so it can be reported out?
Otherwise users will just see missing tests and contexts stuck in pending?

โ€”
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/kubernetes/test-infra/issues/9694#issuecomment-426901710,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AA4Bq2FeiT1z3-mHwdkdHQSZY6SFOj8Rks5uhaxugaJpZM4XHGkE
.

ContainerCreating is a normal state and you can't really tell if it is
stuck or not. Image pull errors can be considered transient in some
environments too. I think a ttl in the prowjob or a ttl-until-running seems
appropriate.

On Thu, Oct 4, 2018, 03:20 Cole Wagner notifications@github.com wrote:

Dont' we have an existing issue for this?

I also thought we did, but couldn't find one when I searched.

We should also extend this to "ContainerCreating".

๐Ÿ‘ Yes. If init containers fail or volumes cannot be provisioned we
should abort/fail the ProwJob and expose the error message somehow.

Also, IIUC in point 3 you mean edit the prowjob in cluster i.e. simply
updating the job configmap doesn't help.

Correct. "Delete the ProwJob." also refers to the ProwJob cluster resource
not the YAML job configuration.

โ€”
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/kubernetes/test-infra/issues/9694#issuecomment-426854588,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ADuFfzvfNa9q0uhswdm7ImDLR_Lxx1Oiks5uhWJqgaJpZM4XHGkE
.

Any TTL should be generous so as to ensure the kubelet has time to pull all the images -- every part of pod utilities, the test image, etc.

+1. We see pull flakes often enough, it needs to be some ttl-until-running.
If a prowjob takes an three hours to start (ok realistically less than
that) we should go ahead and consider it failed / aborted.

Any opinions on plank / sinker / ...?

On Wed, Oct 3, 2018, 23:51 Michalis Kargakis notifications@github.com
wrote:

ContainerCreating is a normal state and you can't really tell if it is
stuck or not. Image pull errors can be considered transient in some
environments too. I think a ttl in the prowjob or a ttl-until-running seems
appropriate.

On Thu, Oct 4, 2018, 03:20 Cole Wagner notifications@github.com wrote:

Dont' we have an existing issue for this?

I also thought we did, but couldn't find one when I searched.

We should also extend this to "ContainerCreating".

๐Ÿ‘ Yes. If init containers fail or volumes cannot be provisioned we
should abort/fail the ProwJob and expose the error message somehow.

Also, IIUC in point 3 you mean edit the prowjob in cluster i.e. simply
updating the job configmap doesn't help.

Correct. "Delete the ProwJob." also refers to the ProwJob cluster
resource
not the YAML job configuration.

โ€”
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<
https://github.com/kubernetes/test-infra/issues/9694#issuecomment-426854588
,
or mute the thread
<
https://github.com/notifications/unsubscribe-auth/ADuFfzvfNa9q0uhswdm7ImDLR_Lxx1Oiks5uhWJqgaJpZM4XHGkE

.

โ€”
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/kubernetes/test-infra/issues/9694#issuecomment-426904964,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AA4Bqyj6YikRPOML4Sk4tVT2RpsoRuy_ks5uha_lgaJpZM4XHGkE
.

It seems appropriate to place into plank to ensure we have just one controller managing ProwJob.Status

:joy: I didn't even remember that change - https://github.com/kubernetes/test-infra/blob/master/prow/config.yaml#L5 we do set a pod pending timeout as 1h though, so this should be... WAI?

/close

Sounds like we've addressed this based on the above, please /reopen if I'm incorrect

@spiffxp: Closing this issue.

In response to this:

/close

Sounds like we've addressed this based on the above, please /reopen if I'm incorrect

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

BenTheElder picture BenTheElder  ยท  4Comments

xiangpengzhao picture xiangpengzhao  ยท  3Comments

zacharysarah picture zacharysarah  ยท  3Comments

MrHohn picture MrHohn  ยท  4Comments

sjenning picture sjenning  ยท  4Comments