Test-infra: Create Test Flakiness Improvement Dashboard

Created on 6 Dec 2018 · 22 comments · Source: kubernetes/test-infra

Per the 1.13 release retro, our release test jobs are flaky across the board. Some are less flaky than others, but overall the sig-release testgrid is 90% flaky, which makes CI Signal really painful and makes troubleshooting frustrating for contributors.

In order to make real progress on this, per @spiffxp, we have to be able to measure it. This means having some kind of UI that will display flakiness across testgrid. I don't know exactly what this should look like, but here are some requirements:

  • it must both display overall flakiness and allow drill-down to the flakiness of specific jobs and tests
  • it should show stats for:

    • Overall jobs

    • individual tests in a job

    • individual tests across multiple jobs

    • flakiness increase/decrease across arbitrary time ranges/commit ranges

  • it should integrate both with the current presubmit flakiness tool and with triage
  • it should be searchable/groupable by SIG

@spiffxp, other requirements?


/kind feature

area/deflake help wanted kind/feature lifecycle/rotten sig/release


All 22 comments

attn:
@AishSundar @mariantalla @mortent

/milestone v1.14
/sig release

/area deflake

Prior art that exists today that you can use or build on:

  • we have a public bigquery dataset that is populated by kettle
  • one example of running against that dataset is the triage dashboard's update script
  • we also run metrics queries daily and store the results as json files and in influxdb
  • a number of these queries compute flakes
  • we have velodrome, a grafana instance, and the BigQuery Metrics dashboard shows results of those queries
  • testgrid provides a json summary file (eg: https://testgrid.k8s.io/release-master-blocking/summary) that is machine readable (with the exception of the summary, which is a string computed by an internal tool)
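For concreteness, here is a minimal sketch of consuming that summary endpoint from Python. The overall_status field name and its values are assumptions about the JSON shape rather than a documented contract, so verify against the live endpoint before relying on it:

```python
# Sketch: pull the machine-readable testgrid summary for a dashboard and
# count tabs by status. The "overall_status" field name is an assumption.
import json
import urllib.request

DASHBOARD = "release-master-blocking"  # any testgrid dashboard name
URL = f"https://testgrid.k8s.io/{DASHBOARD}/summary"

with urllib.request.urlopen(URL) as resp:
    summary = json.load(resp)

counts = {}
for tab, info in summary.items():
    status = info.get("overall_status", "UNKNOWN")  # e.g. PASSING / FLAKY / FAILING
    counts[status] = counts.get(status, 0) + 1
    print(f"{tab}: {status}")

print("totals:", counts)
```

The same loop could be repeated across every dashboard under a sig-* prefix to get the "searchable/groupable by SIG" view described above.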

Things that I wanted to try if I had time to do this (I don't, so please, go for it)

  • figure out a way to make "is a job release blocking or not" something that could easily be queried in bigquery
  • or ignore that, and copy-paste a metric query, with a hardcoded list for "release blocking" jobs
  • copy-paste a metric query, and don't filter on just PR issues, but on all issues
  • figure out a way to better quantify and display job and test ownership, so we could get leaderboards of sigs with best/worst pass rates or flake rates

Caveats:

  • bigquery is free to query up to the first TB; if you're playing around, I would highly recommend using the build.day and build.week tables before attempting to hit build.all
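Putting the build.week caveat together with the metric-query idea, a rough sketch of a per-job failure-rate query might look like the following. The k8s-gubernator project/dataset path and the job/result column names are assumptions about the kettle schema, not something confirmed in this thread:

```python
# Sketch: per-job pass/fail counts from the kettle dataset, using the small
# build.week table to stay well under the free query tier. The dataset path
# (k8s-gubernator.build.week) and columns (job, result) are assumptions.
from google.cloud import bigquery

QUERY = """
SELECT
  job,
  COUNT(*) AS runs,
  COUNTIF(result = 'SUCCESS') AS passes,
  COUNTIF(result != 'SUCCESS') AS failures
FROM `k8s-gubernator.build.week`
GROUP BY job
ORDER BY failures DESC
LIMIT 20
"""

client = bigquery.Client()  # uses application-default credentials
for row in client.query(QUERY).result():
    rate = row.failures / row.runs if row.runs else 0.0
    print(f"{row.job}: {row.failures}/{row.runs} failed ({rate:.1%})")
```

A hardcoded list of release-blocking jobs, as suggested above, would just be an extra WHERE job IN (...) clause on top of this.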

Not exactly related to flakiness, but another example of trying to tie these things together: Use testgrid + a metric output to generate a list of all test failures with links to all of the relevant testgrid dashboards https://gist.github.com/spiffxp/1e3ff608a92e8bfc0091a0b2918a11c6

It's bash and jq, so I was too ashamed to share or push it further in the test-infra repo, but hey, maybe it can be a starting point for someone

/help

@spiffxp:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

We have a job that periodically auto-files issues if there are flakes or failures. I think it's limited to kubernetes/kubernetes and I'm not exactly sure how far back it looks.

is:issue org:kubernetes author:fejta-bot sort:created-desc will show you how that's being consumed or paid attention to

is:issue org:kubernetes label:kind/flake sort:created-desc if you'd like to see how these issues are dealt with overall

FYI, I've been working on BigQuery code for this. The definition of "what is a flake" is complex in query terms. What I have as an algorithm so far is:

Flaky Jobs

  1. The job fails, then passes, then fails, then passes (blocks of failures would be assumed to be actual failures). This would be expressed as ((#status_changes / runs) - threshold). The initial threshold will be 3 or 4; I'm checking what that looks like now.
  2. The job fails multiple times in a row, but on a different test each time. ( #failure_test_differences / runs - threshold )

Flaky Tests

  1. Fail/pass/fail/pass works here too, but only within the same job, exempting tests on flaky jobs in category (j2)
  2. The test fails multiple times in a row, but with a different reason each time, exempting tests on flaky jobs in category (j2)
  3. NOT included: the test failing on some jobs but not others. There's no good way to tell whether those differences are due to the test being flaky or due to job setup.

As you can see, I'm trying to avoid false positives more than I'm trying to find every single flake. My reason for that is that the above finds enough flaky jobs & tests already to keep everyone busy.
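For concreteness, a minimal sketch of heuristic (1) for jobs, assuming a run history is available as an ordered list of pass/fail booleans; the threshold value below is only a placeholder, not the value being tuned in the actual query:

```python
# Sketch of job heuristic (1): a job is "flaky" if its result flips between
# pass and fail often relative to how many times it ran. A solid block of
# failures counts as a single change, so sustained breakage is not flagged.
from typing import Sequence


def status_changes(results: Sequence[bool]) -> int:
    """Count pass<->fail transitions in an ordered run history."""
    return sum(1 for prev, cur in zip(results, results[1:]) if prev != cur)


def is_flaky_job(results: Sequence[bool], threshold: float = 0.3) -> bool:
    """Flag a job whose status flips more often than `threshold` per run."""
    if len(results) < 2:
        return False
    return status_changes(results) / len(results) > threshold


# Alternating pass/fail is flagged, a solid block of failures is not.
print(is_flaky_job([True, False, True, False, True, False]))   # True
print(is_flaky_job([True, True, False, False, False, False]))  # False
```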

Yep, it's tricky to identify... (in my head) the only way something is a flake with certainty is if it passes & fails for the same commit/combination of commits; don't know if that gives you anything different from what you already have.

Yep, it's tricky to identify... (in my head) the only way something is a flake with certainty is if it passes & fails for the same commit/combination of commits; don't know if that gives you anything different from what you already have.

This is what the current (limited) tooling relies on. We have a fair bit of data to this effect. I'm not sure you can do anything other than this without sacrificing accuracy.
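A minimal sketch of that "same commit, both outcomes" definition, assuming runs are available as (commit, passed) pairs; this is illustrative only and not how the existing tooling is implemented:

```python
# Sketch: a test/job is a definite flake for a commit if that exact commit
# has both passing and failing runs on record. The (commit, passed) input
# shape is an assumption for illustration, not the real data model.
from collections import defaultdict
from typing import Iterable, Set, Tuple


def flaky_commits(runs: Iterable[Tuple[str, bool]]) -> Set[str]:
    """Return commits that have both a pass and a fail on record."""
    outcomes = defaultdict(set)
    for commit, passed in runs:
        outcomes[commit].add(passed)
    return {commit for commit, seen in outcomes.items() if len(seen) == 2}


runs = [
    ("abc123", True),
    ("abc123", False),  # same commit, different outcome -> flake
    ("def456", False),
    ("def456", False),  # consistent failure -> not a flake by this definition
]
print(flaky_commits(runs))  # {'abc123'}
```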

The problem with "same commit" for periodic tests is that most of them don't run frequently enough to run on the same commit. We could implement that, but it would result in only noticing something like 10% of flakes, and would never notice flakes for some of the tests at all.

There's also the whole "retry" thing, that I haven't even touched on ...

don't run frequently enough to run on the same commit

The release branches can be useful for this, at least for identifying flakes that persist across versions

(👋 picking this up again)

@jberkus -- it sounded like you had something in the works. Is it at a place where you'd be up for sharing?

An alternative I was thinking of is to create a test dashboard with the purpose of estimating flakiness per test/job. An example implementation could be: check out a commit, run e2e tests x times, calculate flakiness based on the results (e.g. 0 fails or x fails => not flaky, 0 < fails < x => flaky). Repeat for more commits to get a trend on job flakiness.

Similar to what is talked about here: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/flaky-tests.md

That approach has the added benefit (or drawback, depending on how you see it) of separating the 2 purposes of running a test job, i.e. the "flakiness-measurement" purpose from the "CI signal" purpose.
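A rough sketch of that estimation loop, assuming a placeholder test command and run count x; the 0-or-x classification rule is taken directly from the comment above:

```python
# Sketch: run the same test command x times at a fixed commit and classify it.
# 0 or x failures => not flaky (consistently passing or consistently broken);
# anything in between => flaky. The command below is a placeholder.
import subprocess

TEST_CMD = ["go", "test", "-run", "TestSomething", "./..."]  # placeholder command
RUNS = 10  # "x" in the comment above

failures = 0
for _ in range(RUNS):
    result = subprocess.run(TEST_CMD, capture_output=True)
    if result.returncode != 0:
        failures += 1

if failures in (0, RUNS):
    print(f"not flaky ({failures}/{RUNS} failures)")
else:
    print(f"flaky: failed {failures}/{RUNS} runs ({failures / RUNS:.0%})")
```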

@mariantalla

Huh, I hadn't thought of that approach. I think we can gather enough information for the tests that have already been run. But running your battery, maybe once per quarter, as a check on that would be awesome.

/milestone v1.15
I would like to try and get to this in the v1.15 timeframe, but will be thrilled if someone beats me to it

/milestone clear

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

I put together the following dashboards for some of the tests the community uses for kubernetes/kubernetes (ref: https://github.com/kubernetes/test-infra/issues/13879)

It doesn't focus exclusively on flakes, but does show flake rate over time, and top two flakes per job for the week

If someone would like to focus more on flakes, please /reopen
