Alternatively (or additionally) we could alert on this, as it is likely more of an infra problem than a PR-related problem.
We really just need to expose some signal that a job is stuck in ImagePullBackOff. Currently the failure is very opaque and persists indefinitely if the image never becomes available.
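For what it's worth, the signal already exists on the pod object; it just isn't surfaced anywhere. A minimal sketch of where it lives, assuming client-go and a placeholder namespace name (`test-pods` here is only an example):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The pod phase just stays Pending; the ImagePullBackOff reason lives on
	// the (init) container statuses, so that is what has to be inspected.
	pods, err := client.CoreV1().Pods("test-pods").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		statuses := append([]corev1.ContainerStatus{}, pod.Status.InitContainerStatuses...)
		statuses = append(statuses, pod.Status.ContainerStatuses...)
		for _, cs := range statuses {
			if w := cs.State.Waiting; w != nil && (w.Reason == "ImagePullBackOff" || w.Reason == "ErrImagePull") {
				fmt.Printf("pod %s, container %s is stuck: %s: %s\n", pod.Name, cs.Name, w.Reason, w.Message)
			}
		}
	}
}
```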
/cc @amwat @BenTheElder @stevekuznetsov @kargakis
/area prow
/area prow/plank
/kind bug
Currently there are three ways to work around this problem if you encounter it:
/cc
Now we need to parse the actual state/event, set the prowjob as aborted, and clean up the pod?
Another thought: how about having a presubmit that makes sure your prow can pull the image?
Don't we have an existing issue for this?
I definitely think we need to handle this. It would be nice if kubernetes did, but I don't think that's likely to change anytime soon as it would be a behavioral change and this issue is going to happen a lot for prow users.
Plank or sinker should handle this by making the prowjob failed or aborted in some fashion.
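A rough sketch of what that could look like, assuming today's ProwJob types in `k8s.io/test-infra/prow/apis/prowjobs/v1`; the helper name and the choice of ErrorState vs. AbortedState are illustrative, not an existing API:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	prowapi "k8s.io/test-infra/prow/apis/prowjobs/v1"
)

// markStuck is a hypothetical helper: record why the pod never started and
// complete the ProwJob so it can be reported out instead of leaving the
// context stuck in pending forever.
func markStuck(pj *prowapi.ProwJob, reason, message string) {
	now := metav1.Now()
	pj.Status.State = prowapi.ErrorState // or AbortedState, depending on the semantics we pick
	pj.Status.Description = fmt.Sprintf("Pod could not start: %s: %s", reason, message)
	pj.Status.CompletionTime = &now
}
```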
how about having a presubmit that makes sure your prow can pull the image?
No, because it can still fail when you schedule the job; this is sidestepping the actual problem.
We should also extend this to "ContainerCreating".
Also, IIUC in point 3 you mean edit the prowjob in cluster i.e. simply updating the job configmap doesn't help.
Don't we have an existing issue for this?
I also thought we did, but couldn't find one when I searched.
We should also extend this to "ContainerCreating".
Yes. If init containers fail or volumes cannot be provisioned we should abort/fail the ProwJob and expose the error message somehow.
Also, IIUC in point 3 you mean edit the prowjob in cluster i.e. simply updating the job configmap doesn't help.
Correct. "Delete the ProwJob." also refers to the ProwJob cluster resource not the YAML job configuration.
InitContainer failure will fail the Pod, but mount failures due to missing Secrets or ConfigMaps are a valid case -- but why delete the ProwJob? Would it not be better to mark it as failed so it can be reported out? Otherwise users will just see missing tests and contexts stuck in pending.
It should be marked failed; the talk of deleting it is about mitigations that are manually possible today but undesirable, versus Prow marking it failed for us.
ContainerCreating is a normal state and you can't really tell if it is stuck or not. Image pull errors can be considered transient in some environments too. I think a ttl in the prowjob or a ttl-until-running seems appropriate.
Any TTL should be generous so as to ensure the kubelet has time to pull all the images -- every part of pod utilities, the test image, etc.
+1. We see pull flakes often enough that it needs to be some ttl-until-running.
If a prowjob takes three hours to start (OK, realistically less than that) we should go ahead and consider it failed / aborted.
Any opinions on plank / sinker / ...?
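To make the ttl-until-running idea concrete, a minimal sketch of the pod-level check, wherever it ends up living; the timeout itself would come from config and, per the comment above, should be generous:

```go
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// pendingTooLong reports whether a ProwJob pod has sat in Pending (image pull
// backoff, stuck init containers, unprovisionable volumes, ...) longer than
// the allowed window, at which point the job should be failed/aborted.
func pendingTooLong(pod *corev1.Pod, timeout time.Duration) bool {
	if pod.Status.Phase != corev1.PodPending {
		return false
	}
	return time.Since(pod.CreationTimestamp.Time) > timeout
}
```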
It seems appropriate to place this in plank to ensure we have just one controller managing ProwJob.Status.
Looks like we're already doing this thanks to @krzyzacy in https://github.com/kubernetes/test-infra/commit/b5e0a45af555249f15279df075f67198870f6fed ...
However, by default the timeout is a full day.
So do we just want to tune the timeout down?
:joy: I didn't even remember that change. Per https://github.com/kubernetes/test-infra/blob/master/prow/config.yaml#L5 we do set a pod pending timeout of 1h though, so this should be... WAI?
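For reference, a hedged sketch of the kind of check that pending timeout implies; the field and constant names here are assumptions for illustration, not a quote of the actual plank code:

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	prowapi "k8s.io/test-infra/prow/apis/prowjobs/v1"
)

// syncPending errors out a ProwJob whose pod has been pending longer than the
// configured pod pending timeout (24h by default, lowered to 1h in prow/config.yaml).
func syncPending(pj *prowapi.ProwJob, podPendingTimeout time.Duration) {
	if pj.Status.State != prowapi.PendingState {
		return
	}
	if pj.Status.StartTime.IsZero() || time.Since(pj.Status.StartTime.Time) < podPendingTimeout {
		return
	}
	now := metav1.Now()
	pj.Status.State = prowapi.ErrorState
	pj.Status.Description = "Pod pending timeout reached."
	pj.Status.CompletionTime = &now
}
```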
/close
Sounds like we've addressed this based on the above, please /reopen if I'm incorrect
@spiffxp: Closing this issue.
In response to this:
/close
Sounds like we've addressed this based on the above, please /reopen if I'm incorrect
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.