Per the 1.13 release retro, our release test jobs are flaky across the board. Some are less flaky than others, but overall the sig-release testgrid is 90% flaky, which makes CI Signal really painful and makes troubleshooting by contributors frustrating.
In order to make real progress on this, per @spiffxp, we have to be able to measure it. This means having some kind of UI that will display flakiness across testgrid. I don't know exactly what this should look like, but here are some requirements:
@spiffxp, other requirements?
/kind feature
attn:
@AishSundar @mariantalla @mortent
/milestone v1.14
/sig release
/area deflake
Prior art that exists today that you can use or build on:
The testgrid summary endpoint (e.g. https://testgrid.k8s.io/release-master-blocking/summary), which is machine readable (with the exception of the summary field itself, which is a string computed by an internal tool).
Things that I wanted to try if I had time to do this (I don't, so please, go for it):
Caveats:
Use the build.day and build.week tables before attempting to hit build.all.
Not exactly related to flakiness, but another example of trying to tie these things together: use testgrid + a metric output to generate a list of all test failures with links to all of the relevant testgrid dashboards: https://gist.github.com/spiffxp/1e3ff608a92e8bfc0091a0b2918a11c6
It's bash and jq, so I was too ashamed to share or push it further in the test-infra repo, but hey, maybe it can be a starting point for someone
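For anyone picking this up, here's roughly what consuming that machine-readable summary could look like. A minimal sketch, assuming the endpoint returns JSON keyed by tab name with `overall_status` and `status` fields (those names come from inspecting the response, not a documented API, so verify them):

```python
#!/usr/bin/env python3
# Rough sketch, not a documented API: pull the machine-readable testgrid
# summary for one dashboard and count how many tabs call themselves FLAKY.
# The "overall_status" / "status" field names are assumptions taken from
# inspecting the endpoint; verify against the live response.
import json
import urllib.request

DASHBOARD = "release-master-blocking"
URL = f"https://testgrid.k8s.io/{DASHBOARD}/summary"

with urllib.request.urlopen(URL) as resp:
    summary = json.load(resp)

flaky = []
for tab, info in summary.items():
    status = info.get("overall_status", "UNKNOWN")
    print(f"{tab:60s} {status:10s} {info.get('status', '')}")
    if status == "FLAKY":
        flaky.append(tab)

print(f"\n{len(flaky)}/{len(summary)} tabs on {DASHBOARD} report FLAKY")
```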
/help
@spiffxp:
This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
We have a job that periodically auto-files issues if there are flakes or failures. I think it's limited to kubernetes/kubernetes and I'm not exactly sure how far back it looks.
is:issue org:kubernetes author:fejta-bot sort:created-desc will show you how that's being consumed or paid attention to
is:issue org:kubernetes label:kind/flake sort:created-desc if you'd like to see how these issues are dealt with overall
FYI, I've been working on BigQuery code for this. The definition of "what is a flake" is complex in query terms. What I have as an algorithm so far is:
Flaky Jobs
Flaky Tests
As you can see, I'm trying to avoid false positives more than I'm trying to find every single flake. My reason for that is that the above finds enough flaky jobs & tests already to keep everyone busy.
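For illustration only (not the actual query), a minimal sketch of a flaky-job heuristic along these lines, assuming the kettle-style table mentioned in the caveats above (`k8s-gubernator.build.week`) with `job`, `passed`, and `result` columns; the thresholds are arbitrary and meant to be tuned:

```python
# Illustrative sketch only -- not the real query. Assumes the kettle-style
# BigQuery table `k8s-gubernator.build.week` with `job`, `passed` (BOOL),
# and `result` (STRING) columns; adjust names to the actual schema.
from google.cloud import bigquery  # pip install google-cloud-bigquery

QUERY = """
SELECT
  job,
  COUNT(*) AS runs,
  COUNTIF(passed) AS passes,
  COUNTIF(NOT passed) AS failures,
  SAFE_DIVIDE(COUNTIF(NOT passed), COUNT(*)) AS failure_rate
FROM `k8s-gubernator.build.week`
WHERE result IN ('SUCCESS', 'FAILURE')  -- skip aborted/errored runs
GROUP BY job
-- "flaky" here = fails sometimes but not always; tune thresholds to taste
HAVING failures > 0 AND passes > 0 AND failure_rate < 0.85
ORDER BY failure_rate DESC
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(f"{row.job}: {row.failures}/{row.runs} failed ({row.failure_rate:.0%})")
```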
Yep, it's tricky to identify... (in my head) the only way something is a flake with certainty is if it passes & fails for the same commit/combination of commits; don't know if that gives you anything different from what you already have.
This is what the current (limited) tooling relies on; we have a fair bit of data to this effect. I'm not sure if you can do anything other than this without sacrificing accuracy.
The problem with "same commit" for periodic tests is that most of them don't run frequently enough to run on the same commit. We could implement that, but it would result in only noticing something like 10% of flakes, and would never notice flakes for some of the tests at all.
There's also the whole "retry" thing, that I haven't even touched on ...
don't run frequently enough to run on the same commit
The release branches can be useful for this, at least for identifying flakes that persist across versions
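As a toy illustration of that "same commit, mixed results" definition (and of the limitation above: a failure on a different commit tells you nothing by itself), a self-contained sketch with made-up run records standing in for whatever source you actually pull from:

```python
# Toy sketch of the "passes & fails for the same commit" definition.
# The run records below are made up; swap in real data from testgrid/kettle.
from collections import defaultdict

# (commit, job, test, passed)
runs = [
    ("abc123", "ci-kubernetes-e2e-gce", "[sig-apps] Deployment rollover", True),
    ("abc123", "ci-kubernetes-e2e-gce", "[sig-apps] Deployment rollover", False),
    ("abc123", "ci-kubernetes-e2e-gce", "[sig-node] Pods restart policy", True),
    ("def456", "ci-kubernetes-e2e-gce", "[sig-node] Pods restart policy", False),
]

outcomes = defaultdict(set)  # (commit, job, test) -> set of pass/fail seen
for commit, job, test, passed in runs:
    outcomes[(commit, job, test)].add(passed)

certain_flakes = sorted({(job, test)
                         for (commit, job, test), seen in outcomes.items()
                         if seen == {True, False}})

for job, test in certain_flakes:
    print(f"FLAKE (certain): {job} :: {test}")
# Note the second test is NOT flagged: it failed, but on a different commit,
# so it could just as well be a real regression -- which is exactly why
# infrequent periodics rarely trip this definition.
```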
(👋 picking this up again)
@jberkus -- it sounded like you had something in the works. Is it at a place where you'd be up for sharing?
An alternative I was thinking of is to create a test dashboard with the purpose of estimating flakiness per test/job. An example implementation could be: check out a commit, run e2e tests x times, and calculate flakiness based on the results (e.g. 0 fails or x fails => not flaky, 0 < fails < x => flaky).
Similar to what is talked about here: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/flaky-tests.md
That approach has the added benefit (or drawback, depending on how you see it 🙃) of separating the 2 purposes of running a test job, i.e. the "flakiness-measurement" purpose from the "CI signal" purpose.
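A rough sketch of what such a runner could look like; the command here is a placeholder, so point it at whatever actually runs the e2e test or job of interest:

```python
# Back-of-the-envelope flakiness estimate: run the same test N times at one
# checked-out commit and report the failure fraction. CMD is a placeholder.
import subprocess

RUNS = 10
CMD = ["go", "test", "-run", "TestSomethingSuspicious", "./pkg/..."]  # hypothetical target

failures = 0
for i in range(RUNS):
    proc = subprocess.run(CMD, capture_output=True)
    outcome = "FAIL" if proc.returncode else "PASS"
    if proc.returncode != 0:
        failures += 1
    print(f"run {i + 1}/{RUNS}: {outcome}")

# 0 failures or RUNS failures => consistent; anything in between => flaky
if failures in (0, RUNS):
    print(f"not flaky at this commit ({failures}/{RUNS} failed)")
else:
    print(f"flaky: {failures}/{RUNS} runs failed ({failures / RUNS:.0%})")
```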
@mariantalla
Huh, I hadn't thought of that approach. I think we can gather enough information from the tests that have already been run. But running your battery, maybe once per quarter, as a check on that would be awesome.
/milestone v1.15
I would like to try and get to this in the v1.15 timeframe, but will be thrilled if someone beats me to it
/milestone clear
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I put together the following dashboards for some of the tests the community uses for kubernetes/kubernetes (ref: https://github.com/kubernetes/test-infra/issues/13879)
It doesn't focus exclusively on flakes, but does show flake rate over time, and top two flakes per job for the week
If someone would like to focus more on flakes, please /reopen