https://github.com/kubernetes/test-infra/pull/10882
{
"component":"sidecar",
"error":"failed to upload to GCS: failed to upload to GCS: encountered errors during upload:
[
[Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout]
[Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout]
[Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout]
]",
"level":"error",
"msg":"Failed to report job status",
"time":"2019-01-22T22:05:04Z"
}
EDIT(cjwagner):
Failing to upload to GCS may be caused by CPU limitations, but it is worsened by a lack of retries, which we should really have anyway. Adding retries to the upload logic seems like the first step to addressing this.
This is the upload logic that should be retried: https://github.com/kubernetes/test-infra/blob/69cc5b2962487101421afee427c87eb1a8069114/prow/pod-utils/gcs/upload.go#L65-L72
Note that we'll need to create a new writer for each retry and that the group shouldn't be marked done until a retry has succeeded or all retries have been exhausted.
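For illustration, a minimal sketch of what that retry loop could look like is below. This is not the actual pod-utils API: uploadFunc, the attempt count, the backoff, and the target names are placeholders, and the real UploadFunc takes a *storage.ObjectHandle with a fresh writer created inside each attempt.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// uploadFunc stands in for the pod-utils UploadFunc; one call is one upload attempt.
type uploadFunc func() error

// uploadWithRetries re-runs f until it succeeds or the attempts are exhausted.
// In the real code a new GCS writer must be created inside f on every attempt,
// since a writer cannot be reused after a failed upload.
func uploadWithRetries(f uploadFunc, attempts int, backoff time.Duration) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = f(); lastErr == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return fmt.Errorf("upload failed after %d attempts: %v", attempts, lastErr)
}

func main() {
	// Hypothetical upload targets; the second one always fails to show the error path.
	targets := map[string]uploadFunc{
		"started.json":  func() error { return nil },
		"finished.json": func() error { return fmt.Errorf("TLS handshake timeout") },
	}

	errCh := make(chan error, len(targets))
	var group sync.WaitGroup
	group.Add(len(targets))
	for dest, upload := range targets {
		go func(dest string, upload uploadFunc) {
			// The group is only marked done once a retry has succeeded or all
			// retries have been exhausted, as noted above.
			defer group.Done()
			if err := uploadWithRetries(upload, 3, 100*time.Millisecond); err != nil {
				errCh <- fmt.Errorf("%s: %v", dest, err)
			}
		}(dest, upload)
	}
	group.Wait()
	close(errCh)
	for err := range errCh {
		fmt.Println("upload error:", err)
	}
}
```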
/area pod-utils
/area prow
/assign
/cc @stevekuznetsov
On the other hand, it seems like some PRs pass? https://prow.k8s.io/?repo=kubernetes%2Ftest-infra&type=presubmit&job=pull-test-infra-bazel
/help
@stevekuznetsov:
This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Uploads happen here, and a retry loop at that level should be able to solve this issue, or at the very least alleviate it greatly.
/assign @jluhrsen
@jluhrsen: GitHub didn't allow me to assign the following users: jluhrsen.
Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide
In response to this:
/assign @jluhrsen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@fejta and I also thought about this; it's most likely the uploader running out of CPU, so if we set CPU requests for the sidecar we can probably work around this.
@stevekuznetsov I submitted a PR to do the retries, although I did see your comment about the uploader running out of CPU. I can try to address that as well (or instead of the PR I just submitted) if you can point me to where that might be tweaked.
To tune the CPU requests you'd need a Prow cluster to inspect so you can look at average usage; I have that data, and this week I aim to fix that. Retrying might not be reasonable for the CPU-starvation case but would be useful in other cases, so let's do it regardless.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
still an issue
net/http: TLS handshake timeout makes me think that you are running out of CPU on the container trying to do the upload; it can't compute fast enough to complete the handshake. I would suggest using Prometheus to gather the average CPU use for the init container and adding that as a request.
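As a rough illustration of what such a request looks like, here is a sketch using the Kubernetes API types. The image name and the 100m/64Mi values are placeholders, and the numbers would come from the observed averages mentioned above; where Prow actually wires resource requests into the injected sidecar is a separate question not shown here.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Hypothetical sidecar container spec; image and request values are placeholders.
	sidecar := corev1.Container{
		Name:  "sidecar",
		Image: "gcr.io/k8s-prow/sidecar:latest",
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				// Derive these from observed average usage (e.g. via Prometheus),
				// not from this example.
				corev1.ResourceCPU:    resource.MustParse("100m"),
				corev1.ResourceMemory: resource.MustParse("64Mi"),
			},
		},
	}
	fmt.Printf("sidecar CPU request: %s\n", sidecar.Resources.Requests.Cpu())
}
```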
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Is this still happening?
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Yes, this is still happening and should be addressed with a retry loop and/or CPU limits on the sidecar container.
/open
/remove-lifecycle rotten
/help
/good-first-issue
/reopen
@cjwagner: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/help
/good-first-issue
@cjwagner:
This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.
In response to this:
/help
/good-first-issue
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
If you want this to be a good first issue, can you please update the description to include some links to the code that should be changed and how?
Right now the description is too ambiguous.
@alvaroaleman can you shed any light on this?
No, I don't think I've seen this before. Adding requests and retrying sounds like a good idea.
I thought we did add retries :|
@cjwagner do you have a plot of this over time? I'm surprised I don't see it at all on my build clusters. I'm still suspicious of the message, too, which often indicates some sort of CPU throttle
I do not. Even if CPU throttling is the root cause of the issues we are seeing now, I think we should still add retries since we should really be retrying anyways and that will catch more problems than just CPU throttling. Networks are unreliable and the upload could fail for a lot of reasons.
Network operations are guaranteed to occasionally fail at scale, and we should be robust to this reality.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
+1
Could it be the same issue as https://github.com/knative/test-infra/issues/2081?
The error messages are slightly different, but they both have oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token
@chizhg This issue here really is a simple "we have a lot of jobs that use the podutils and sometimes things fail for random reasons, so we should retry".
Crier already has retrying, so it's a different story there.
Hey @alvaroaleman, is this issue addressed? If not, I would like to look into it. From the discussion I can see that we need to introduce retry logic for the podutils, right?
@kushthedude sorry, I completely missed your comment. This issue hasn't been addressed, and yes, this is about adding retry logic in the podutils.
@alvaroaleman - One more clarification needed...
The last commit on this topic was https://github.com/kubernetes/test-infra/issues/10884#ref-commit-b79bc0a (https://github.com/jluhrsen/test-infra/commit/b79bc0aa7cd10221d3bc609cf16bf111a62d7371)
Can you please have a look at that and see if it was sufficient? I didn't find any pull request with comments on that change.
If it looks sufficient, it could be a good first-time contributor's attempt at learning the contribution process; I could try it :wink:.
Edit: 01-Nov-2020
After a bit more digging I found the PR for this: https://github.com/kubernetes/test-infra/pull/11133
It appears almost done. @fejta, can you suggest what needs to be done to close this one, since you were involved in that PR?
Hi @ajaygm.
That commit looks good; the spontaneous comments I'd have are:
- It uses a lock that serializes all uploads, which we don't want (we already added something to limit parallelization after that PR was created, but that was in order to limit memory usage).
- The Info("Retrying upload") log at the end will be shown even when it won't retry anymore because all retries failed.
Fixed in #19884
/close
@alvaroaleman: Closing this issue.
In response to this:
Fixed in #19884
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.