Test-infra: flake data excludes pod-utils jobs

Created on 7 Oct 2019 · 23 comments · Source: kubernetes/test-infra

See http://storage.googleapis.com/k8s-metrics/flakes-latest.json etc. (metrics/ produced files)
and http://velodrome.k8s.io/dashboard/db/bigquery-metrics?orgId=1

The flake data is very misleading; for example, pull-kubernetes-verify has "no flakes", which is definitely wrong.

What seems to be happening is that we only include data from bootstrap.py results, not pod-utils (I think), possibly due to handling of the repos data (per @cjwagner).

We should fix this; not having flake data is a pretty big regression for managing kubernetes presubmits. I didn't realize that jobs I'd migrated were losing this.

area/metrics area/prow kind/bug

All 23 comments

What data/files is the flake analysis using? Are the pod-utils not uploading something they should?
/assign

I'm not sure I fully understand the pipeline just yet; from a quick look it seems to be something like:

  • https://github.com/kubernetes/test-infra/tree/master/kettle -> fetch finished / started.json from each job, move into some schema in big query
  • https://github.com/kubernetes/test-infra/tree/master/metrics -> query the table, select flake data (which appears to be reading repos in started?)

I think the issue is that repos is now a JSON blob, so @fejta suggested something like making it its own field in the database and then reading that instead of this: https://github.com/kubernetes/test-infra/blob/5deb5b970e73cdd55b3068b9c50962e8657bdb23/metrics/configs/flakes-config.yaml#L64
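
For illustration only, a minimal sketch (not the actual kettle code) of what "making repos its own field" could look like when flattening started.json/finished.json into a row; the column name and row shape here are assumptions:

```python
import json

# Hedged sketch: a hypothetical kettle-style row builder that lifts `repos`
# out of started.json into its own column, so the flakes query could read it
# directly instead of digging through the metadata key/value pairs.
def build_row(started, finished):
    return {
        "started": started.get("timestamp"),
        "finished": finished.get("timestamp"),
        "result": finished.get("result"),
        "version": finished.get("version"),
        # hypothetical new column; stored as a JSON string for BigQuery
        "repos": json.dumps(started.get("repos", {})),
    }
```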

I'm not actually sure which component is at fault here or exactly why this doesn't work, but looking at the data we produce, jobs that are fully on pod-utils are missing, and jobs that migrated to pod-utils on newer branches show "0 flakes" when they definitely have non-zero flakes.

I would tend to suggest that the pipeline is a bit hairy and probably at fault, but the results are generally very useful for identifying sources of flakiness.

The data looks (?) present in pod-utils to me but I'm not fully familiar with that format or the big query pipeline...

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-gce/1181359333854679040/started.json
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-kind/1181359333888233473/started.json

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-gce/1181359333854679040/finished.json
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-kind/1181359333888233473/finished.json

... maybe it's reading repos from finished.json instead of started?
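
One quick way to check (hedged sketch, using the public GCS HTTP endpoint and the kind build linked above) is to fetch both files and see which one actually carries repos:

```python
import json
from urllib.request import urlopen

# One of the pull-kubernetes-e2e-kind builds linked above.
BASE = ("https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/"
        "83519/pull-kubernetes-e2e-kind/1181359333888233473")

for name in ("started.json", "finished.json"):
    data = json.load(urlopen(f"{BASE}/{name}"))
    print(name, "keys:", sorted(data), "| has repos:", "repos" in data)
```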

Hmm -- not sure, I've never looked at that pipeline myself. Happy to help if we can identify what the utils should be doing to be compliant

Thanks. I intend to take another look at this pipeline tomorrow to try to understand what is different.


This came up again today wrt pull-kubernetes-integration flakes but I don't really understand this pipeline and I'm pretty far over capacity.

It seems like we do bigquery quer{y,ies} and then pipe through jq? ... these are fairly gnarly.
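
For anyone else poking at this, the general shape seems to be "run a configured BigQuery query, then post-process the results with a jq filter". A hedged Python sketch of that pattern (not the actual metrics code; the project name is an assumption, and the filter step here stands in for jq):

```python
from google.cloud import bigquery

def run_flakes_query(sql, project="k8s-gubernator"):  # project name is an assumption
    client = bigquery.Client(project=project)
    rows = [dict(row) for row in client.query(sql).result()]
    # stand-in for the jq step: keep only jobs that actually flaked
    return [r for r in rows if (r.get("flakes") or 0) > 0]
```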

Who's an expert on that pipeline?

Cole is the only person I can remember touching it in the past year or so.

/assign
I'll take a look

Work that may overlap https://github.com/kubernetes/test-infra/issues/15469

The flakes query looks for version != 'unknown' for CI jobs and metadata.key == 'repos' for PR jobs https://github.com/kubernetes/test-infra/blob/5deb5b970e73cdd55b3068b9c50962e8657bdb23/metrics/configs/flakes-config.yaml
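
In other words (a hedged Python paraphrase of that filter, not the actual SQL), a build only counts toward flake stats if:

```python
def counts_toward_flakes(row):
    """Hedged paraphrase of the flakes-config filter, not the actual query."""
    if row["job"].startswith("pr:"):
        # PR jobs: require a metadata key/value pair with key == 'repos'
        return any(kv.get("key") == "repos" for kv in (row.get("metadata") or []))
    # CI jobs: require a known version
    return (row.get("version") or "unknown") != "unknown"
```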

Looking at the fields in the builds table for pr:pull-kubernetes-e2e-kind vs. pr:pull-kubernetes-e2e-gce: kind has a null version and an empty metadata field, while gce's are populated.

metadata comes from either finished.json or started.json
https://github.com/kubernetes/test-infra/blob/cdeb7eb64c641a48182c509b1a902a17292e4eec/kettle/make_json.py#L158-L163

version comes from finished.json
https://github.com/kubernetes/test-infra/blob/cdeb7eb64c641a48182c509b1a902a17292e4eec/kettle/make_json.py#L155-L156
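
Roughly (hedged paraphrase; see the linked make_json.py lines for the real logic), version only ever comes from finished.json, and metadata comes from finished.json with a fallback to started.json, so a pod-utils finished.json like the one below leaves both empty:

```python
def extract_version_and_metadata(started, finished):
    # Hedged paraphrase of the linked make_json.py logic, not a copy of it.
    version = (finished or {}).get("version")  # pod-utils finished.json has no version
    metadata = (finished or {}).get("metadata") or (started or {}).get("metadata")
    return version, metadata
```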

example kind job: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/85282/pull-kubernetes-e2e-kind/1215213961905967104

finished.json has no version field and no metadata field

{"timestamp":1578565960,"passed":false,"result":"FAILURE","revision":"05c8dce8bcb1874ad57bcdeb391c11fcccff2a58"}

started.json has repos in it, but no metadata field

{"timestamp":1578564680,"pull":"85282","repo-version":"49162743c0055b4395dd40bdf910f2c0472973b5","repos":{"kubernetes/kubernetes":"master:ef69bc910f0e47bbe3cf396d4bebf4f678cf6f3a,85282:05c8dce8bcb1874ad57bcdeb391c11fcccff2a58"}}

example gce job: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/86450/pull-kubernetes-e2e-gce/1215395413390004227

finished.json has a version field and metadata with repos populated:

{
  "timestamp": 1578610945, 
  "version": "v1.18.0-alpha.1.550+64e0fc900b5b3f", 
  "result": "FAILURE", 
  "passed": false, 
  "job-version": "v1.18.0-alpha.1.550+64e0fc900b5b3f", 
  "metadata": {
    "repo-commit": "64e0fc900b5b3fcd5e5a16cb76ed40b1b900df15", 
    "node_os_image": "cos-77-12371-89-0", 
    "repos": {
      "k8s.io/kubernetes": "master:aef336d71253d9897f83425e80a231763d1385e8,86450:91a6050b58898d14f48ef893733cff070b17c0db", 
      "k8s.io/release": "master"
    }, 
    "infra-commit": "dd307d2a7", 
    "repo": "k8s.io/kubernetes", 
    "master_os_image": "cos-77-12371-89-0", 
    "job-version": "v1.18.0-alpha.1.550+64e0fc900b5b3f", 
    "pod": "c130ee54-332c-11ea-9e6e-4a9fb1cbefb2", 
    "revision": "v1.18.0-alpha.1.550+64e0fc900b5b3f"
  }
}

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale
There is a lot of organic data munging in the existing pipeline, and assumptions about the use of bootstrap and/or scripts in k/k's hack directory.

I got as far as writing a Google Doc proposing the addition of repo and repo_commit fields, which would more closely match what testgrid is going to support going forward. Unfortunately, it looks like this would require plumbing through the job -> pod-utils -> GCS -> kettle -> BigQuery -> metrics-queries pipeline and touching nearly every part along the way.

I was left with the impression that if we need to touch every part of the pipeline, maybe we want to consider rewriting parts of it piecemeal.
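
Purely as an illustration of the proposed fields (the names come from the proposal idea above; how to pick the "primary" repo and ref is an open question, not a settled schema), they could be derived from a pod-utils started.json like so:

```python
def repo_fields(started):
    """Hypothetical derivation of repo / repo_commit from pod-utils started.json."""
    repos = started.get("repos") or {}
    if not repos:
        return None, None
    repo = sorted(repos)[0]                    # e.g. "kubernetes/kubernetes"
    base_ref = repos[repo].split(",")[0]       # e.g. "master:ef69bc91..."
    repo_commit = base_ref.split(":", 1)[-1]   # commit SHA portion, if present
    return repo, repo_commit
```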

@spiffxp lacking this data seems problematic. Can we at least add some snippet we run in the wrapper script to dump this to e.g. metadata.json, or update the pipeline to consume prowjob.json, or ..?
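
For example (a hedged sketch of the "dump it from the wrapper" idea; the keys and file location are illustrative, and whether the rest of the pipeline would pick this up is exactly the open question):

```python
import json
import os

def dump_metadata(artifacts_dir, repos):
    # Write a bootstrap-style metadata blob so the existing flakes query
    # could keep finding a 'repos' key for pod-utils jobs.
    with open(os.path.join(artifacts_dir, "metadata.json"), "w") as f:
        json.dump({"repos": repos}, f)
```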

/assign
Going to need to build this into Flake efforts

Current status:

We haven't decided whether to swap out the old for the new:

  • There are more jobs in the new set of results: this is expected, and good!
  • Most jobs have flakiest: null - is this expected?
  • For jobs that appear in both sets of results, the new results show more flakes and lower consistency. Do we know why? Do we care?

When we decide to swap out old for new, we should also look at updating other queries before calling this done (ref: https://github.com/kubernetes/test-infra/issues/20013)
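
To make the old/new comparison repeatable, a small hedged helper (it assumes the result files are JSON lists of row objects with "job" and "flakes" keys; adjust if the real layout differs):

```python
import json

def diff_flake_counts(old_path, new_path):
    """Report jobs whose flake counts differ between two result files."""
    load = lambda path: {row["job"]: row for row in json.load(open(path))}
    old, new = load(old_path), load(new_path)
    for job in sorted(set(old) | set(new)):
        o = old.get(job, {}).get("flakes")
        n = new.get(job, {}).get("flakes")
        if o != n:
            print(f"{job}: old={o} new={n}")
```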

/milestone v1.21

Thanks @spiffxp, I have only done one comparison of job results and had not seen that discrepancy in job data. I will try to look into this soon.

Here are the results I am seeing:
OLD QUERY TOP 10

|job | build_consistency | commit_consistency | flakes | runs | commits|
|--- | --- | --- | --- | --- | ---|
|pr:pull-kubernetes-e2e-gce-ubuntu-containerd | 0.943 | 0.934 | 26 | 491 | 395|
|ci-kubernetes-e2e-gce-multizone | 0.83 | 0.701 | 20 | 194 | 67|
|ci-kubernetes-e2e-gci-gce-ipvs | 0.628 | 0.746 | 16 | 137 | 63|
|ci-kubernetes-e2e-gci-gce-flaky | 0.591 | 0.761 | 16 | 242 | 67|
|ci-kubernetes-e2e-gci-gce-ip-alias | 0.925 | 0.761 | 16 | 254 | 67|
|ci-kubernetes-e2e-gci-gce | 0.937 | 0.783 | 15 | 255 | 69|
|ci-kubernetes-e2e-gci-gce-proto | 0.925 | 0.791 | 14 | 240 | 67|
|ci-kubernetes-e2e-gci-gce-kube-dns-nodecache | 0.799 | 0.794 | 13 | 134 | 63|
|pr:pull-kubernetes-node-e2e | 0.974 | 0.969 | 13 | 494 | 415|
|ci-kubernetes-e2e-gci-gce-coredns | 0.87 | 0.813 | 12 | 138 | 64|

NEW QUERY TOP 10

|job | build_consistency | commit_consistency | flakes | runs | commits|
|--- | --- | --- | --- | --- | ---|
|ci-kubernetes-cached-make-test | 0.526 | 0.057 | 66 | 603 | 70|
|pr:pull-kubernetes-e2e-gce-ubuntu-containerd | 0.945 | 0.938 | 29 | 564 | 465 |  
|ci-kubernetes-generate-make-test-cache | 0.753 | 0.687 | 21 | 158 | 67|
|ci-kubernetes-coverage-unit | 0.771 | 0.692 | 20 | 157 | 65|
|ci-kubernetes-e2e-gce-multizone | 0.83 | 0.701 | 20 | 194 | 67|
|ci-kubernetes-e2e-gci-gce-ipvs | 0.628 | 0.746 | 16 | 137 | 63|
|ci-kubernetes-e2e-gci-gce-flaky | 0.591 | 0.761 | 16 | 242 | 67|
|ci-kubernetes-e2e-gci-gce-ip-alias | 0.925 | 0.761 | 16 | 254 | 67|
|pr:pull-kubernetes-node-e2e | 0.972 | 0.967 | 16 | 562 | 480|
|ci-kubernetes-e2e-gci-gce | 0.937 | 0.783 | 15 | 255 | 69|

/close

@MushuEE: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

