https://github.com/kubernetes/test-infra/pull/10882
{
"component":"sidecar",
"error":"failed to upload to GCS: failed to upload to GCS: encountered errors during upload:
[
[Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout]
[Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout]
[Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout]
]",
"level":"error",
"msg":"Failed to report job status",
"time":"2019-01-22T22:05:04Z"
}
EDIT(cjwagner):
Failing to upload to GCS may be caused by CPU limitations, but it is worsened by a lack of retries, which we should really have anyway. Adding retries to the upload logic seems like the first step to addressing this.
This is the upload logic that should be retried: https://github.com/kubernetes/test-infra/blob/69cc5b2962487101421afee427c87eb1a8069114/prow/pod-utils/gcs/upload.go#L65-L72
Note that we'll need to create a new writer for each retry and that the group shouldn't be marked done until a retry has succeeded or all retries have been exhausted.
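For illustration, a minimal sketch of what that retry loop could look like is below. This is not the actual pod-utils API: uploadFunc, the attempt count, the backoff, and the target names are placeholders, and the real UploadFunc takes a *storage.ObjectHandle with a fresh writer created inside each attempt.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// uploadFunc stands in for the pod-utils UploadFunc; one call is one upload attempt.
type uploadFunc func() error

// uploadWithRetries re-runs f until it succeeds or the attempts are exhausted.
// In the real code a new GCS writer must be created inside f on every attempt,
// since a writer cannot be reused after a failed upload.
func uploadWithRetries(f uploadFunc, attempts int, backoff time.Duration) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = f(); lastErr == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return fmt.Errorf("upload failed after %d attempts: %v", attempts, lastErr)
}

func main() {
	// Hypothetical upload targets; the second one always fails to show the error path.
	targets := map[string]uploadFunc{
		"started.json":  func() error { return nil },
		"finished.json": func() error { return fmt.Errorf("TLS handshake timeout") },
	}

	errCh := make(chan error, len(targets))
	var group sync.WaitGroup
	group.Add(len(targets))
	for dest, upload := range targets {
		go func(dest string, upload uploadFunc) {
			// The group is only marked done once a retry has succeeded or all
			// retries have been exhausted, as noted above.
			defer group.Done()
			if err := uploadWithRetries(upload, 3, 100*time.Millisecond); err != nil {
				errCh <- fmt.Errorf("%s: %v", dest, err)
			}
		}(dest, upload)
	}
	group.Wait()
	close(errCh)
	for err := range errCh {
		fmt.Println("upload error:", err)
	}
}
```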
/area pod-utils
/area prow
/assign
/cc @stevekuznetsov
On the other hand, it seems like some PRs pass? https://prow.k8s.io/?repo=kubernetes%2Ftest-infra&type=presubmit&job=pull-test-infra-bazel
/help
@stevekuznetsov:
This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Uploads happen here, and a retry loop at that level should be able to solve this issue, or at the very least alleviate it greatly.
/assign @jluhrsen
@jluhrsen: GitHub didn't allow me to assign the following users: jluhrsen.
Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide
In response to this:
/assign @jluhrsen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@fejta and I also thought about this; it's most likely the uploader running out of CPU, so if we set CPU requests for the sidecar we can probably work around this.
@stevekuznetsov I submitted a PR to do the retries, although I did see your comment about the uploader running out of CPU. I can try to address that as well (or instead of the PR I just submitted) if you can point me to where that might be tweaked.
To tune the CPU requests you'd need a Prow cluster to inspect so you can look at average usage; I have that data, and this week I aim to fix that. Retrying might not be reasonable for the CPU-starvation case but would be useful in other cases, so let's do it regardless.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
still an issue
net/http: TLS handshake timeout makes me think that you are running out of CPU on the container trying to do the upload; it can't compute fast enough to complete the handshake. I would suggest using Prometheus to gather the average CPU use for the init container and adding that as a request.
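As a rough illustration of what such a request looks like, here is a sketch using the Kubernetes API types. The image name and the 100m/64Mi values are placeholders, and the numbers would come from the observed averages mentioned above; where Prow actually wires resource requests into the injected sidecar is a separate question not shown here.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Hypothetical sidecar container spec; image and request values are placeholders.
	sidecar := corev1.Container{
		Name:  "sidecar",
		Image: "gcr.io/k8s-prow/sidecar:latest",
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				// Derive these from observed average usage (e.g. via Prometheus),
				// not from this example.
				corev1.ResourceCPU:    resource.MustParse("100m"),
				corev1.ResourceMemory: resource.MustParse("64Mi"),
			},
		},
	}
	fmt.Printf("sidecar CPU request: %s\n", sidecar.Resources.Requests.Cpu())
}
```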
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Is this still happening?
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Yes, this is still happening and should be addressed with a retry loop and/or CPU limits on the sidecar container.
/open
/remove-lifecycle rotten
/help
/good-first-issue
/reopen
@cjwagner: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/help
/good-first-issue
@cjwagner:
This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.
In response to this:
/help
/good-first-issue
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
If you want this to be a good first issue, can you please update the description to include some links to the code that should be changed and how?
Right now the description is too ambiguous.
@alvaroaleman can you shed any light on this?
No, I don't think I've seen this before. Adding requests and retrying sounds like a good idea.
I thought we did add retries :|
@cjwagner do you have a plot of this over time? I'm surprised I don't see it at all on my build clusters. I'm still suspicious of the message, too, which often indicates some sort of CPU throttle
I do not. Even if CPU throttling is the root cause of the issues we are seeing now, I think we should still add retries since we should really be retrying anyways and that will catch more problems than just CPU throttling. Networks are unreliable and the upload could fail for a lot of reasons.
Network operations are guaranteed to occasionally fail at scale, and we should be robust to this reality.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
+1
Could it be the same issue as https://github.com/knative/test-infra/issues/2081?
The error messages are slightly different, but they both have oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token
@chizhg This issue here really is a simple "we have a lot of jobs that use the podutils and sometimes things fail for random reasons, so we should retry".
Crier already has retrying, so it's a different story there.
Hey @alvaroaleman, is this issue addressed? If not, I would like to look into it. From the discussion I can see that we need to introduce retry logic for the podutils, right?
@kushthedude sorry, I completely missed your comment. This issue hasn't been addressed, and yes, this is about adding retry logic in the podutils.
@alvaroaleman - One more clarification needed...
The last commit on this topic was https://github.com/kubernetes/test-infra/issues/10884#ref-commit-b79bc0a (https://github.com/jluhrsen/test-infra/commit/b79bc0aa7cd10221d3bc609cf16bf111a62d7371)
Can you please have a look at that and see if it was sufficient? I didn't find any pull request with comments on that change.
If it looks sufficient, it could be a good first-time contributor's attempt at learning the contribution process; I could try it :wink:.
Edit: 01-Nov-2020
After a bit more digging I found the PR for this: https://github.com/kubernetes/test-infra/pull/11133
It appears almost done. @fejta, can you suggest what needs to be done to close this one, since you were involved in that PR?
Hi @ajaygm.
That commit looks good; the spontaneous comments I'd have are:
- It uses a lock that serializes all uploads, which we don't want (we already added something to limit parallelization after that PR was created, but that was in order to limit memory usage).
- The Info("Retrying upload") log at the end will be shown even when it won't retry anymore because all retries failed.
Fixed in #19884
/close
@alvaroaleman: Closing this issue.
In response to this:
Fixed in #19884
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.