Test-infra: flake data excludes pod-utils jobs

Created on 7 Oct 2019 · 23 comments · Source: kubernetes/test-infra

See http://storage.googleapis.com/k8s-metrics/flakes-latest.json etc. (metrics/ produced files)
and http://velodrome.k8s.io/dashboard/db/bigquery-metrics?orgId=1

The flake data is very misleading; for example, pull-kubernetes-verify has "no flakes", which is definitely wrong.

What seems to be happening is that we only include data from bootstrap.py results, not pod-utils (I think), possibly due to handling of the repos data (per @cjwagner).

We should fix this; not having flake data is a pretty big regression for managing kubernetes presubmits. I didn't realize that jobs I'd migrated were losing this.

area/metrics area/prow kind/bug

All 23 comments

What data/files is the flake analysis using? Are the pod-utils not uploading something they should?
/assign

I'm not sure I fully understand the pipeline just yet; from a quick look it seems to be something like:

  • https://github.com/kubernetes/test-infra/tree/master/kettle -> fetch finished / started.json from each job, move into some schema in big query
  • https://github.com/kubernetes/test-infra/tree/master/metrics -> query the table, select flake data (which appears to be reading repos in started?)

I think the issue is that repos is now a JSON blob, so @fejta suggested something like making it its own field in the database and then reading that instead of this: https://github.com/kubernetes/test-infra/blob/5deb5b970e73cdd55b3068b9c50962e8657bdb23/metrics/configs/flakes-config.yaml#L64
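
For illustration only, a minimal sketch (not the actual kettle code) of what "making repos its own field" could look like when flattening started.json/finished.json into a row; the column name and row shape here are assumptions:

```python
import json

# Hedged sketch: a hypothetical kettle-style row builder that lifts `repos`
# out of started.json into its own column, so the flakes query could read it
# directly instead of digging through the metadata key/value pairs.
def build_row(started, finished):
    return {
        "started": started.get("timestamp"),
        "finished": finished.get("timestamp"),
        "result": finished.get("result"),
        "version": finished.get("version"),
        # hypothetical new column; stored as a JSON string for BigQuery
        "repos": json.dumps(started.get("repos", {})),
    }
```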

I'm not actually sure which component is at fault here or exactly why this doesn't work, but looking at the data we produce, jobs that are fully on pod-utils are missing, and jobs that migrated to pod-utils on newer branches show "0 flakes" when they definitely have non-zero flakes.

I would tend to suggest that the pipeline is a bit hairy and probably at fault, but the results are generally very useful for identifying sources of flakiness.

The data looks (?) present in pod-utils to me but I'm not fully familiar with that format or the big query pipeline...

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-gce/1181359333854679040/started.json
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-kind/1181359333888233473/started.json

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-gce/1181359333854679040/finished.json
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-kind/1181359333888233473/finished.json

... maybe it's reading repos from finished.json instead of started?
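
One quick way to check (hedged sketch, using the public GCS HTTP endpoint and the kind build linked above) is to fetch both files and see which one actually carries repos:

```python
import json
from urllib.request import urlopen

# One of the pull-kubernetes-e2e-kind builds linked above.
BASE = ("https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/"
        "83519/pull-kubernetes-e2e-kind/1181359333888233473")

for name in ("started.json", "finished.json"):
    data = json.load(urlopen(f"{BASE}/{name}"))
    print(name, "keys:", sorted(data), "| has repos:", "repos" in data)
```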

Hmm -- not sure, I've never looked at that pipeline myself. Happy to help if we can identify what the utils should be doing to be compliant

Thanks. I intend to take another look at this pipeline tomorrow to try to understand what is different.


This came up again today wrt pull-kubernetes-integration flakes but I don't really understand this pipeline and I'm pretty far over capacity.

It seems like we do bigquery quer{y,ies} and then pipe through jq? ... these are fairly gnarly.
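
For anyone else poking at this, the general shape seems to be "run a configured BigQuery query, then post-process the results with a jq filter". A hedged Python sketch of that pattern (not the actual metrics code; the project name is an assumption, and the filter step here stands in for jq):

```python
from google.cloud import bigquery

def run_flakes_query(sql, project="k8s-gubernator"):  # project name is an assumption
    client = bigquery.Client(project=project)
    rows = [dict(row) for row in client.query(sql).result()]
    # stand-in for the jq step: keep only jobs that actually flaked
    return [r for r in rows if (r.get("flakes") or 0) > 0]
```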

Who's an expert on that pipeline?

Cole is the only person I can remember touching it in the past year or so.

/assign
I'll take a look

Work that may overlap https://github.com/kubernetes/test-infra/issues/15469

The flakes query looks for version != 'unknown' for CI jobs and metadata.key == 'repos' for PR jobs https://github.com/kubernetes/test-infra/blob/5deb5b970e73cdd55b3068b9c50962e8657bdb23/metrics/configs/flakes-config.yaml
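
In other words (a hedged Python paraphrase of that filter, not the actual SQL), a build only counts toward flake stats if:

```python
def counts_toward_flakes(row):
    """Hedged paraphrase of the flakes-config filter, not the actual query."""
    if row["job"].startswith("pr:"):
        # PR jobs: require a metadata key/value pair with key == 'repos'
        return any(kv.get("key") == "repos" for kv in (row.get("metadata") or []))
    # CI jobs: require a known version
    return (row.get("version") or "unknown") != "unknown"
```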

Looking at the fields in the builds table for pr:pull-kubernetes-e2e-kind vs. pr:pull-kubernetes-e2e-gce: kind has a null version and an empty metadata field, while gce's are populated.

metadata comes from either finished.json or started.json
https://github.com/kubernetes/test-infra/blob/cdeb7eb64c641a48182c509b1a902a17292e4eec/kettle/make_json.py#L158-L163

version comes from finished.json
https://github.com/kubernetes/test-infra/blob/cdeb7eb64c641a48182c509b1a902a17292e4eec/kettle/make_json.py#L155-L156
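
Roughly (hedged paraphrase; see the linked make_json.py lines for the real logic), version only ever comes from finished.json, and metadata comes from finished.json with a fallback to started.json, so a pod-utils finished.json like the one below leaves both empty:

```python
def extract_version_and_metadata(started, finished):
    # Hedged paraphrase of the linked make_json.py logic, not a copy of it.
    version = (finished or {}).get("version")  # pod-utils finished.json has no version
    metadata = (finished or {}).get("metadata") or (started or {}).get("metadata")
    return version, metadata
```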

example kind job: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/85282/pull-kubernetes-e2e-kind/1215213961905967104

finished.json has no version field and no metadata field

{"timestamp":1578565960,"passed":false,"result":"FAILURE","revision":"05c8dce8bcb1874ad57bcdeb391c11fcccff2a58"}

started.json has repos in it, but no metadata field

{"timestamp":1578564680,"pull":"85282","repo-version":"49162743c0055b4395dd40bdf910f2c0472973b5","repos":{"kubernetes/kubernetes":"master:ef69bc910f0e47bbe3cf396d4bebf4f678cf6f3a,85282:05c8dce8bcb1874ad57bcdeb391c11fcccff2a58"}}

example gce job: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/86450/pull-kubernetes-e2e-gce/1215395413390004227

finished.json has a version field and metadata with repos populated:

{
  "timestamp": 1578610945, 
  "version": "v1.18.0-alpha.1.550+64e0fc900b5b3f", 
  "result": "FAILURE", 
  "passed": false, 
  "job-version": "v1.18.0-alpha.1.550+64e0fc900b5b3f", 
  "metadata": {
    "repo-commit": "64e0fc900b5b3fcd5e5a16cb76ed40b1b900df15", 
    "node_os_image": "cos-77-12371-89-0", 
    "repos": {
      "k8s.io/kubernetes": "master:aef336d71253d9897f83425e80a231763d1385e8,86450:91a6050b58898d14f48ef893733cff070b17c0db", 
      "k8s.io/release": "master"
    }, 
    "infra-commit": "dd307d2a7", 
    "repo": "k8s.io/kubernetes", 
    "master_os_image": "cos-77-12371-89-0", 
    "job-version": "v1.18.0-alpha.1.550+64e0fc900b5b3f", 
    "pod": "c130ee54-332c-11ea-9e6e-4a9fb1cbefb2", 
    "revision": "v1.18.0-alpha.1.550+64e0fc900b5b3f"
  }
}

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale
There is a lot of organic data munging in the existing pipeline, and assumptions about the use of bootstrap and/or scripts in k/k's hack directory.

I got as far as writing a Google Doc proposing the addition of repo and repo_commit fields, which would more closely match what testgrid is going to support going forward. Unfortunately, it looks like this would require plumbing through the job -> pod-utils -> GCS -> kettle -> BigQuery -> metrics-queries pipeline and touching nearly every part along the way.

I was left with the impression that if we need to touch every part of the pipeline, maybe we want to consider rewriting parts of it piecemeal.
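
Purely as an illustration of the proposed fields (the names come from the proposal idea above; how to pick the "primary" repo and ref is an open question, not a settled schema), they could be derived from a pod-utils started.json like so:

```python
def repo_fields(started):
    """Hypothetical derivation of repo / repo_commit from pod-utils started.json."""
    repos = started.get("repos") or {}
    if not repos:
        return None, None
    repo = sorted(repos)[0]                    # e.g. "kubernetes/kubernetes"
    base_ref = repos[repo].split(",")[0]       # e.g. "master:ef69bc91..."
    repo_commit = base_ref.split(":", 1)[-1]   # commit SHA portion, if present
    return repo, repo_commit
```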

@spiffxp lacking this data seems problematic. Can we at least add some snippet we run in the wrapper script to dump this to e.g. metadata.json, or update the pipeline to consume prowjob.json, or ..?
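
For example (a hedged sketch of the "dump it from the wrapper" idea; the keys and file location are illustrative, and whether the rest of the pipeline would pick this up is exactly the open question):

```python
import json
import os

def dump_metadata(artifacts_dir, repos):
    # Write a bootstrap-style metadata blob so the existing flakes query
    # could keep finding a 'repos' key for pod-utils jobs.
    with open(os.path.join(artifacts_dir, "metadata.json"), "w") as f:
        json.dump({"repos": repos}, f)
```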

/assign
Going to need to build this into Flake efforts

Current status:

We haven't decided whether to swap out the old for the new:

  • There are more jobs in the new set of results: this is expected, and good!
  • Most jobs have flakiest: null - is this expected?
  • For jobs that appear in both sets of results, the new results show more flakes and lower consistency. Do we know why? Do we care?

When we decide to swap out old for new, we should also look at updating other queries before calling this done (ref: https://github.com/kubernetes/test-infra/issues/20013)
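
To make the old/new comparison repeatable, a small hedged helper (it assumes the result files are JSON lists of row objects with "job" and "flakes" keys; adjust if the real layout differs):

```python
import json

def diff_flake_counts(old_path, new_path):
    """Report jobs whose flake counts differ between two result files."""
    load = lambda path: {row["job"]: row for row in json.load(open(path))}
    old, new = load(old_path), load(new_path)
    for job in sorted(set(old) | set(new)):
        o = old.get(job, {}).get("flakes")
        n = new.get(job, {}).get("flakes")
        if o != n:
            print(f"{job}: old={o} new={n}")
```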

/milestone v1.21

Thanks @spiffxp, I have only done one comparison of job results and had not seen that discrepancy in job data. I will try to look into this soon.

Here are the results I am seeing:
OLD QUERY TOP 10

|job | build_consistency | commit_consistency | flakes | runs | commits|
|--- | --- | --- | --- | --- | ---|
|pr:pull-kubernetes-e2e-gce-ubuntu-containerd | 0.943 | 0.934 | 26 | 491 | 395|
|ci-kubernetes-e2e-gce-multizone | 0.83 | 0.701 | 20 | 194 | 67|
|ci-kubernetes-e2e-gci-gce-ipvs | 0.628 | 0.746 | 16 | 137 | 63|
|ci-kubernetes-e2e-gci-gce-flaky | 0.591 | 0.761 | 16 | 242 | 67|
|ci-kubernetes-e2e-gci-gce-ip-alias | 0.925 | 0.761 | 16 | 254 | 67|
|ci-kubernetes-e2e-gci-gce | 0.937 | 0.783 | 15 | 255 | 69|
|ci-kubernetes-e2e-gci-gce-proto | 0.925 | 0.791 | 14 | 240 | 67|
|ci-kubernetes-e2e-gci-gce-kube-dns-nodecache | 0.799 | 0.794 | 13 | 134 | 63|
|pr:pull-kubernetes-node-e2e | 0.974 | 0.969 | 13 | 494 | 415|
|ci-kubernetes-e2e-gci-gce-coredns | 0.87 | 0.813 | 12 | 138 | 64|

NEW QUERY TOP 10

|job | build_consistency | commit_consistency | flakes | runs | commits|
|--- | --- | --- | --- | --- | ---|
|ci-kubernetes-cached-make-test | 0.526 | 0.057 | 66 | 603 | 70|
|pr:pull-kubernetes-e2e-gce-ubuntu-containerd | 0.945 | 0.938 | 29 | 564 | 465 |  
|ci-kubernetes-generate-make-test-cache | 0.753 | 0.687 | 21 | 158 | 67|
|ci-kubernetes-coverage-unit | 0.771 | 0.692 | 20 | 157 | 65|
|ci-kubernetes-e2e-gce-multizone | 0.83 | 0.701 | 20 | 194 | 67|
|ci-kubernetes-e2e-gci-gce-ipvs | 0.628 | 0.746 | 16 | 137 | 63|
|ci-kubernetes-e2e-gci-gce-flaky | 0.591 | 0.761 | 16 | 242 | 67|
|ci-kubernetes-e2e-gci-gce-ip-alias | 0.925 | 0.761 | 16 | 254 | 67|
|pr:pull-kubernetes-node-e2e | 0.972 | 0.967 | 16 | 562 | 480|
|ci-kubernetes-e2e-gci-gce | 0.937 | 0.783 | 15 | 255 | 69|

/close

@MushuEE: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

