Test-infra: Create Test Flakiness Improvement Dashboard

Created on 6 Dec 2018 · 22 comments · Source: kubernetes/test-infra

Per the 1.13 release retro, our release test jobs are flaky across the board. Some are less flaky than others, but overall the sig-release testgrid is 90% flaky, which makes CI Signal really painful and makes troubleshooting frustrating for contributors.

In order to make real progress on this, per @spiffxp, we have to be able to measure it. This means having some kind of UI that will display flakiness across testgrid. I don't know exactly what this should look like, but here are some requirements:

  • it must both display overall flakiness and allow drill-down to the flakiness of specific jobs and tests
  • it should show stats for:

    • Overall jobs

    • individual tests in a job

    • individual tests across multiple jobs

    • flakiness increase/decrease across arbitrary time ranges/commit ranges

  • it should integrate both with the current presubmit flakiness tool and with triage
  • it should be searchable/groupable by SIG

@spiffxp, other requirements?


/kind feature

area/deflake help wanted kind/feature lifecycle/rotten sig/release


All 22 comments

attn:
@AishSundar @mariantalla @mortent

/milestone v1.14
/sig release

/area deflake

Prior art that exists today that you can use or build on:

  • we have a public bigquery dataset that is populated by kettle
  • one example of running against that dataset is the triage dashboard's update script
  • we also run metrics queries daily and store the results as json files and in influxdb
  • a number of these queries compute flakes
  • we have velodrome, a grafana instance, and the BigQuery Metrics dashboard shows results of those queries
  • testgrid provides a json summary file (eg: https://testgrid.k8s.io/release-master-blocking/summary) that is machine readable (with the exception of the summary, which is a string computed by an internal tool)
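For concreteness, here is a minimal sketch of consuming that summary endpoint from Python. The overall_status field name and its values are assumptions about the JSON shape rather than a documented contract, so verify against the live endpoint before relying on it:

```python
# Sketch: pull the machine-readable testgrid summary for a dashboard and
# count tabs by status. The "overall_status" field name is an assumption.
import json
import urllib.request

DASHBOARD = "release-master-blocking"  # any testgrid dashboard name
URL = f"https://testgrid.k8s.io/{DASHBOARD}/summary"

with urllib.request.urlopen(URL) as resp:
    summary = json.load(resp)

counts = {}
for tab, info in summary.items():
    status = info.get("overall_status", "UNKNOWN")  # e.g. PASSING / FLAKY / FAILING
    counts[status] = counts.get(status, 0) + 1
    print(f"{tab}: {status}")

print("totals:", counts)
```

The same loop could be repeated across every dashboard under a sig-* prefix to get the "searchable/groupable by SIG" view described above.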

Things that I wanted to try if I had time to do this (I don't, so please, go for it)

  • figure out a way to make "is a job release blocking or not" something that could easily be queried in bigquery
  • or ignore that, and copy-paste a metric query, with a hardcoded list for "release blocking" jobs
  • copy-paste a metric query, and don't filter on just PR issues, but on all issues
  • figure out a way to better quantify and display job and test ownership, so we could get leaderboards of sigs with best/worst pass rates or flake rates

Caveats:

  • bigquery is free to query up to the first TB; if you're playing around, I would highly recommend using the build.day and build.week tables before attempting to hit build.all
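Putting the build.week caveat together with the metric-query idea, a rough sketch of a per-job failure-rate query might look like the following. The k8s-gubernator project/dataset path and the job/result column names are assumptions about the kettle schema, not something confirmed in this thread:

```python
# Sketch: per-job pass/fail counts from the kettle dataset, using the small
# build.week table to stay well under the free query tier. The dataset path
# (k8s-gubernator.build.week) and columns (job, result) are assumptions.
from google.cloud import bigquery

QUERY = """
SELECT
  job,
  COUNT(*) AS runs,
  COUNTIF(result = 'SUCCESS') AS passes,
  COUNTIF(result != 'SUCCESS') AS failures
FROM `k8s-gubernator.build.week`
GROUP BY job
ORDER BY failures DESC
LIMIT 20
"""

client = bigquery.Client()  # uses application-default credentials
for row in client.query(QUERY).result():
    rate = row.failures / row.runs if row.runs else 0.0
    print(f"{row.job}: {row.failures}/{row.runs} failed ({rate:.1%})")
```

A hardcoded list of release-blocking jobs, as suggested above, would just be an extra WHERE job IN (...) clause on top of this.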

Not exactly related to flakiness, but another example of trying to tie these things together: Use testgrid + a metric output to generate a list of all test failures with links to all of the relevant testgrid dashboards https://gist.github.com/spiffxp/1e3ff608a92e8bfc0091a0b2918a11c6

It's bash and jq, so I was too ashamed to share or push it further in the test-infra repo, but hey, maybe it can be a starting point for someone

/help

@spiffxp:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

We have a job that periodically auto-files issues if there are flakes or failures. I think it's limited to kubernetes/kubernetes and I'm not exactly sure how far back it looks.

is:issue org:kubernetes author:fejta-bot sort:created-desc will show you how that's being consumed or paid attention to

is:issue org:kubernetes label:kind/flake sort:created-desc if you'd like to see how these issues are dealt with overall

FYI, I've been working on BigQuery code for this. The definition of "what is a flake" is complex in query terms. What I have as an algorithm so far is:

Flaky Jobs

  1. The job fails, then passes, then fails, then passes (blocks of failures would be assumed to be actual failures). This would be expressed as ((#status_changes / runs) - threshold). The initial threshold will be 3 or 4; I'm checking what that looks like now.
  2. The job fails multiple times in a row, but on a different test each time. ( #failure_test_differences / runs - threshold )

Flaky Tests

  1. Fail/pass/fail/pass works here too, but only within the same job, exempting tests on flaky jobs in category (j2)
  2. The test fails multiple times in a row, but with a different reason each time, exempting tests on flaky jobs in category (j2)
  3. NOT included: the test failing on some jobs but not others. There's no good way to tell whether those differences are due to the test being flaky or due to job setup.

As you can see, I'm trying to avoid false positives more than I'm trying to find every single flake. My reason for that is that the above finds enough flaky jobs & tests already to keep everyone busy.
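For concreteness, a minimal sketch of heuristic (1) for jobs, assuming a run history is available as an ordered list of pass/fail booleans; the threshold value below is only a placeholder, not the value being tuned in the actual query:

```python
# Sketch of job heuristic (1): a job is "flaky" if its result flips between
# pass and fail often relative to how many times it ran. A solid block of
# failures counts as a single change, so sustained breakage is not flagged.
from typing import Sequence


def status_changes(results: Sequence[bool]) -> int:
    """Count pass<->fail transitions in an ordered run history."""
    return sum(1 for prev, cur in zip(results, results[1:]) if prev != cur)


def is_flaky_job(results: Sequence[bool], threshold: float = 0.3) -> bool:
    """Flag a job whose status flips more often than `threshold` per run."""
    if len(results) < 2:
        return False
    return status_changes(results) / len(results) > threshold


# Alternating pass/fail is flagged, a solid block of failures is not.
print(is_flaky_job([True, False, True, False, True, False]))   # True
print(is_flaky_job([True, True, False, False, False, False]))  # False
```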

Yep, it's tricky to identify... (in my head) the only way something is a flake with certainty is if it passes & fails for the same commit/combination of commits; don't know if that gives you anything different from what you already have.

Yep, it's tricky to identify... (in my head) the only way something is a flake with certainty is if it passes & fails for the same commit/combination of commits; don't know if that gives you anything different from what you already have.

This is what the current (limited) tooling relies on. We have a fair bit of data to this effect. I'm not sure you can do anything other than this without sacrificing accuracy.
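A minimal sketch of that "same commit, both outcomes" definition, assuming runs are available as (commit, passed) pairs; this is illustrative only and not how the existing tooling is implemented:

```python
# Sketch: a test/job is a definite flake for a commit if that exact commit
# has both passing and failing runs on record. The (commit, passed) input
# shape is an assumption for illustration, not the real data model.
from collections import defaultdict
from typing import Iterable, Set, Tuple


def flaky_commits(runs: Iterable[Tuple[str, bool]]) -> Set[str]:
    """Return commits that have both a pass and a fail on record."""
    outcomes = defaultdict(set)
    for commit, passed in runs:
        outcomes[commit].add(passed)
    return {commit for commit, seen in outcomes.items() if len(seen) == 2}


runs = [
    ("abc123", True),
    ("abc123", False),  # same commit, different outcome -> flake
    ("def456", False),
    ("def456", False),  # consistent failure -> not a flake by this definition
]
print(flaky_commits(runs))  # {'abc123'}
```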

The problem with "same commit" for periodic tests is that most of them don't run frequently enough to run on the same commit. We could implement that, but it would result in only noticing something like 10% of flakes, and would never notice flakes for some of the tests at all.

There's also the whole "retry" thing, that I haven't even touched on ...

don't run frequently enough to run on the same commit

The release branches can be useful for this, at least for identifying flakes that persist across versions

(👋 picking this up again)

@jberkus -- it sounded like you had something in the works. Is it at a place where you'd be up for sharing?

An alternative I was thinking of is to create a test dashboard with the purpose of estimating flakiness per test/job. An example implementation could be: check out a commit, run e2e tests x times, calculate flakiness based on the results (e.g. 0 fails or x fails => not flaky, 0 < fails < x => flaky). Repeat for more commits to get a trend on job flakiness.

Similar to what is talked about here: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/flaky-tests.md

That approach has the added benefit (or drawback, depending on how you see it) of separating the 2 purposes of running a test job, i.e. the "flakiness-measurement" purpose from the "CI signal" purpose.
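A rough sketch of that estimation loop, assuming a placeholder test command and run count x; the 0-or-x classification rule is taken directly from the comment above:

```python
# Sketch: run the same test command x times at a fixed commit and classify it.
# 0 or x failures => not flaky (consistently passing or consistently broken);
# anything in between => flaky. The command below is a placeholder.
import subprocess

TEST_CMD = ["go", "test", "-run", "TestSomething", "./..."]  # placeholder command
RUNS = 10  # "x" in the comment above

failures = 0
for _ in range(RUNS):
    result = subprocess.run(TEST_CMD, capture_output=True)
    if result.returncode != 0:
        failures += 1

if failures in (0, RUNS):
    print(f"not flaky ({failures}/{RUNS} failures)")
else:
    print(f"flaky: failed {failures}/{RUNS} runs ({failures / RUNS:.0%})")
```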

@mariantalla

Huh, I hadn't thought of that approach. I think we can gather enough information for the tests that have already been run. But running your battery, maybe once per quarter, as a check on that would be awesome.

/milestone v1.15
I would like to try and get to this in the v1.15 timeframe, but will be thrilled if someone beats me to it

/milestone clear

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

I put together the following dashboards for some of the tests the community uses for kubernetes/kubernetes (ref: https://github.com/kubernetes/test-infra/issues/13879)

It doesn't focus exclusively on flakes, but does show flake rate over time, and top two flakes per job for the week

If someone would like to focus more on flakes, please /reopen
